In this assignment, you need to provide an in-depth analysis of the HDFS tool or technique for big data analytics.
The following information provides some general guidelines about what should be included in your submitted report.
- Provide some background information about this tool or technique (HDFS) (definition, history, key developers, etc.).
- Explain why HDFS is suitable for big data analytics.
- Explain the concepts/functions/components/architectures/algorithms of this tool or technique.
- Demonstrate how to use HDFS (e.g., installation, data processing/analysis, output results, etc.) in one paragraph only.
- Provide one or more real world cases about how this tool or technique is used in business.
- Benchmark this tool or technique in terms of application and contribution in business.
Answer:
Introduction
In the context of Big Data analytics, HDFS was developed to store very large datasets reliably and to stream those datasets at high bandwidth to client applications. As mentioned by Abdul, Alkathiri and Potdar (2016), in a large cluster, hundreds or thousands of servers are attached to the data storage so that client applications can be executed faster. By distributing the storage and computation resources across many servers, the combined resource can grow with demand while remaining economical at every size.
HDFS provides a distributed file system for the analysis and transformation of very large datasets using the MapReduce paradigm in Big Data analytics. One of the critical characteristics of Hadoop is the partitioning of large datasets and the execution of computations across a large number of hosts, with client applications running in parallel and close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth simply by adding commodity servers.
The following report describes the design of HDFS and reports on the experience of using HDFS to manage the huge amounts of data generated in large business organizations. In addition, the suitability of the tool and the architecture and functionalities it provides are also described.
1. Background information about HDFS
Hadoop is a scalable, fault-tolerant, open-source, virtual-grid operating framework for data storage and processing over high-bandwidth clustered storage (Thomson and Abadi 2015). At the bottom of the Hadoop software stack is HDFS, a distributed file system in which each file appears as a (large) contiguous and randomly addressable sequence of bytes. For batch analysis of data, the middle layer of the stack is the Hadoop MapReduce framework. This layer applies map operations to the data in partitions of an HDFS file, sorts and redistributes the intermediate results according to their keys, and then performs reduce operations on the groups of output items that share a matching key. For key-based record management operations, the HBase store (layered on top of HDFS) is available as a key-value layer in the Hadoop stack.
Similar to other distributed file systems such as PVFS, GFS and Lustre, HDFS stores metadata on a dedicated server known as the "NameNode". The application data itself is stored on other servers known as "DataNodes". All the servers in the cluster are fully connected and communicate with each other (Abdul, Alkathiri and Potdar 2016). Unlike Lustre and PVFS, the DataNodes in HDFS do not use separate data-protection mechanisms such as RAID to secure the data on each server. Instead, like GFS, the file content is replicated on multiple DataNodes for availability and reliability (Gu et al. 2016). This approach ensures durability of the data; it also has the additional advantage that data-transfer bandwidth for applications is multiplied, which opens more opportunities for placing computation close to the data it needs.
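To make the replication model concrete, the following minimal sketch uses the standard Hadoop Java FileSystem API to raise the replication factor of a single file; the path /data/sales.csv and the factor of 3 are hypothetical examples rather than values taken from any cited deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Load cluster settings (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file: ask the NameNode to keep three copies of each block.
        Path file = new Path("/data/sales.csv");
        boolean scheduled = fs.setReplication(file, (short) 3);
        System.out.println("Re-replication scheduled: " + scheduled);

        fs.close();
    }
}
```

The NameNode then schedules additional block copies in the background, which is how the durability and bandwidth benefits described above are obtained.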
In addition, since downtime is a severe risk to the bottom line of many modern organizations, features that limit outages, keep a batch analytic data store operational and feed any online system that depends on its output are welcomed by both IT and business experts.
Furthermore, HDFS is open-source software, which translates into genuine cost savings for its users. As most organizations can attest, expensive proprietary storage solutions can take a large bite out of IT budgets and are typically out of reach for small or startup companies. Adopting HDFS can therefore help organizations save a large part of their IT-related investment.
2. Why HDFS is suitable for big data analytics
The HDFS namespace is a hierarchy of files and directories. These are represented on the NameNode as "inodes", which record properties such as read/write permissions, modification times, namespace quotas and disk-space usage. The content of each file is split into data blocks (typically 128 megabytes in size, although the block size is client-selectable on a file-by-file basis) (Thomson and Abadi 2015). MapReduce jobs apply efficient data-processing techniques in each of their phases, namely mapping, grouping, shuffling, indexing and reducing, and each of these phases matters for the execution of MapReduce tasks over HDFS. A cluster can have thousands of DataNodes and tens of thousands of HDFS clients, because each DataNode may execute multiple application tasks concurrently.
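As a sketch of the per-file block-size choice mentioned above (the path, block size and replication factor below are hypothetical), the FileSystem.create overload in the Hadoop Java API lets a client override the cluster default for one file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output file written with a 256 MB block size and
        // replication factor 3, overriding the cluster-wide defaults for this file only.
        Path out = new Path("/logs/clickstream.log");
        long blockSize = 256L * 1024 * 1024;
        FSDataOutputStream stream = fs.create(out, true, 4096, (short) 3, blockSize);
        stream.writeBytes("example record\n");
        stream.close();

        fs.close();
    }
}
```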
To process collected data with Hadoop, the data is first imported into the Hadoop Distributed File System (HDFS). HDFS uses a master-slave architecture consisting of a single NameNode and several DataNodes (Thomson and Abadi 2015). The NameNode acts as a central server and manages the whole file system namespace and access to the blocks of data stored in HDFS. The actual data is stored on the DataNodes, which act as the slave nodes in the system. The advantage of using HDFS as a storage system is that it offers dynamic scalability, i.e. it can scale up to many nodes depending on client requirements. The other part of the platform is the data processing module; a minimal import step is sketched below.
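A minimal sketch of this import step, assuming hypothetical local and HDFS paths, uses the Hadoop Java FileSystem API; the same step is commonly performed from the command line with hdfs dfs -put.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ImportToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: copy a local CSV file into an HDFS input directory.
        Path local = new Path("/tmp/transactions.csv");
        Path remote = new Path("/user/analytics/input/transactions.csv");
        fs.copyFromLocalFile(local, remote);

        System.out.println("Imported: " + fs.exists(remote));
        fs.close();
    }
}
```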
A client application that opens a file for writing is granted a lease for that file; while the lease is held, no other client can modify the same file concurrently (Gu et al. 2016). Heterogeneity, complexity, scale and security are the issues with Big Data that stand in the way of the technologies and procedures that could otherwise create value from the accumulated data. Most application data is not stored in a structured way. For example, tweets and other social media posts are weakly structured pieces of text, while images and video are structured for display on pages rather than for search; such semantic content must first be transformed into a structured form before it can be analyzed, which is a significant task for the servers.
Big Data is a heterogeneous mix of both structured information (such as XLS and CSV datasets and the rows and columns of relational databases) and unstructured information such as manuals, email attachments, documents, medical records like X-ray reports and MRI and ECG images, forms, rich media like audio and video, contacts and archives. Most organizations are primarily concerned with analyzing the unstructured data generated by their business processes, since more than 75 percent of business data is unstructured and requires large storage space and more effort and resources to analyze. HDFS coordinates this storage through the NameNode, so that the goal can be achieved with a comparatively small investment. When a new block is required, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the new block.
3. Functions, components, architecture and algorithms of HDFS
Functions
HDFS is a block-structured file system running on a cluster: a single file is segmented into blocks, and those blocks are distributed across one or more machines with storage capacity. These machines in the cluster are referred to as DataNodes. A file can consist of several blocks, and they are not necessarily stored on the same machine; the target machines that hold each block of the file are chosen on a block-by-block basis.
Components
Another component of HDFS is the NameNode. It is vital for the file system to store its metadata reliably so that files can be located easily and with minimum retrieval time (Abdul, Alkathiri and Potdar 2016).
Figure 1: Structure of HDFS
(Source: Wang et al., 2013, p.98)
Data Node
The other type of node in the HDFS design is the DataNode, which is also called the compute node and acts as the worker in the system. A deployment may contain many DataNodes, depending on the storage capacity and execution capability required. A DataNode performs two fundamental tasks: (i) storing data blocks in HDFS and (ii) acting as the environment for running application tasks close to that data (Thomson and Abadi 2015). During startup, each DataNode performs a handshake with the NameNode and checks that it has the correct namespace ID. If the namespace ID matches, the DataNode registers with the NameNode; if it does not, the DataNode simply shuts down the connection.
Architecture
JobTracker: This component interfaces with client applications that submit jobs. It distributes the Map and Reduce task load to specific nodes inside the cluster.
TaskTracker: This is a process on each worker node that receives data-processing tasks (Map, Reduce and shuffle) from the JobTracker on the master node and executes them on that node (Gu et al. 2016).
NameNode (NN): It is in charge of keeping track of every file in the Hadoop Distributed File System (HDFS); a client application contacts the NN to locate, delete, copy or add a file.
DataNode (DN): These are in charge of storing the actual data blocks in HDFS; they keep lists of the blocks stored on them and act as the link between client applications and the NN, serving clients the blocks that hold the required information.
Worker nodes: These are the servers responsible for processing tasks; every worker (slave) hosts a DN and a TaskTracker for the applications.
NameNode
The NameNode is the central node in the HDFS architecture and holds the information about the file system namespace. Its main function is to record all attributes and metadata of files together with the exact locations of their blocks on the DataNodes (Abdul, Alkathiri and Potdar 2016). Acting as the master node of the whole system, it knows how every data block is distributed and replicated across the cluster, as the sketch below illustrates.
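The following sketch, assuming a hypothetical file path, uses the standard Hadoop FileSystem API to ask the NameNode where each block of a file is stored, which makes the block-location metadata described above visible to a client:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file: list which DataNodes host each of its blocks.
        Path file = new Path("/user/analytics/input/transactions.csv");
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```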
Algorithms used
For the processing of Big Data, HDFS relies on block replication and the MapReduce algorithm, so that files can be stored, processed and retrieved with minimal request-processing time for client applications; a word-count sketch of the MapReduce algorithm is given below.
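As an illustration of the MapReduce algorithm rather than an excerpt from any cited source, the classic word-count example below is written against the standard Hadoop MapReduce Java API; the class names are chosen here for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input line read from HDFS.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts that the shuffle grouped under each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The grouping and shuffling phases sit between these two classes and are handled by the framework itself, which is why only the map and reduce logic has to be written by the application developer.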
4. Use of HDFS
HDFS (Hadoop Distributed File System), developed along the lines of the Google File System (GFS), is the file system layer of any Hadoop cluster used in an organization (Thomson and Abadi 2015). A typical Hadoop job takes minutes to hours to finish, so Hadoop is not meant for real-time analysis but rather for offline, batch data processing. Recently, Hadoop has undergone a complete overhaul for improved efficiency and manageability; YARN (Yet Another Resource Negotiator) is at the centre of this change (Hua et al. 2014). One notable goal of Hadoop YARN is to decouple Hadoop from the MapReduce paradigm so that it can accommodate other parallel computing models, such as MPI (Message Passing Interface) and Spark. A basic read/write usage sketch is shown below.
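A minimal usage sketch, assuming the cluster is already installed and configured and using a hypothetical path, writes a small file to HDFS and reads it back through the Java FileSystem API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to a hypothetical HDFS path.
        Path path = new Path("/user/demo/hello.txt");
        FSDataOutputStream out = fs.create(path, true);
        out.writeBytes("Hello, HDFS!\n");
        out.close();

        // Read the same file back through the NameNode/DataNode pipeline.
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
        System.out.println(reader.readLine());
        reader.close();

        fs.close();
    }
}
```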
5. Real world cases about the use of HDFS in business
Several business organizations use HDFS for improved and efficient data analytics, including Facebook, Yahoo and The New York Times. The social networking site Facebook has more than 1.86 billion monthly active users worldwide, a number growing at roughly 17% per year. The huge amount of data generated by its users has compelled the world's largest social networking site to adopt Hadoop for analysing the collected data. It began using the Apache Hadoop Distributed File System to store internal logs and to serve as its main tool for data analysis. It currently has two major clusters, one with 300 nodes and the other with 1100 nodes (Abdul, Alkathiri and Potdar 2016). Each of these nodes has 12 TB of data storage capacity and 8 computation cores. Facebook also built Apache Hive, which offers a SQL-like query language called HiveQL that allows its non-Java programmers and analysts to use Hadoop and the Hadoop Distributed File System.
The search engine giant and mail service provider Yahoo also uses HDFS to deliver a better experience to its users, managing and analysing its data with the Hadoop Distributed File System. Large HDFS clusters at Yahoo! contain around 4000 nodes. A typical cluster node has two quad-core Xeon processors running at 2.5 GHz, 4–12 directly attached SATA drives (each holding two terabytes), a 1-gigabit Ethernet connection and 24 GB of RAM. 70% of the disk space is allocated to HDFS (Gu et al. 2016). The remainder is reserved for the operating system (Red Hat Linux), logs, and spill space for the output of Map tasks (MapReduce intermediate data is not stored in HDFS).
The large 4000-node clusters at Yahoo are capable of storing around 80 million blocks and 65 million files. Since each block is typically replicated three times, every DataNode hosts roughly 60,000 block replicas.
6. Benchmarking of HDFS in terms of application and contribution in business
When applications run on remote hosts or nodes to provide faster services to clients, it is important for those applications to be able to retrieve and process the collected data quickly. Today, the need to handle huge amounts of business data outweighs many other business processes. In any business organization, terabyte- and petabyte-scale datasets collected from different sources are becoming ordinary. Analysing these datasets matters to organizations because the huge volumes of unstructured data contain value that can be exploited when new business strategies are created (Abdul, Alkathiri and Potdar 2016). In the business sphere, business intelligence is driven by the capacity to accumulate information from a bewildering array of sources.
The value generated from analysing huge volumes of unstructured business data is useful in increasing business revenue. Data growth is exponential for business organizations as the internet expands, and it will keep increasing as technology changes day by day. Hadoop and its HDFS component are presently the essential tools for the analysis of such huge datasets. In addition, numerous related technologies are being developed to enhance the effectiveness of Hadoop in Big Data analytics.
MapReduce is a parallel programming technique for distributed processing and is implemented on top of HDFS. The MapReduce engine on HDFS consists of one JobTracker and several TaskTrackers. When a MapReduce job is submitted, the JobTracker divides it into smaller tasks (map and reduce tasks) handled by the TaskTrackers. In the map step, the input data is partitioned into smaller splits that are distributed to the worker nodes; each node processes a sub-problem and writes its results as key-value pairs. In the reduce step, the values with the same key are grouped and processed by the same machine to produce the final result for the client application. A driver sketch showing how such a job is configured and submitted follows.
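Building on the word-count mapper and reducer sketched in Section 3, the hypothetical driver below shows how such a job is configured and submitted with the standard Hadoop MapReduce API; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer classes from the earlier sketch.
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS input and output paths.
        FileInputFormat.addInputPath(job, new Path("/user/analytics/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/analytics/output"));

        // Submit the job and wait for the cluster to finish it.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```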
As described by Vijayakumari, Kirankumar and Rao (2014), the layered architecture of HDFS allows it to provide efficient data-mapping services to the applications running on the servers. For example, each DataNode keeps track of the block replicas it currently holds and periodically sends a block report to the NameNode (roughly every hour), so the NameNode always has up-to-date information about each DataNode. During this process the DataNode also sends heartbeats to the NameNode at regular intervals, which allows the NameNode to distinguish which nodes are working properly and which are not (Hua et al. 2014). If the NameNode does not receive any heartbeat from a DataNode, it simply assumes that the node has failed and re-creates replicas of that node's blocks elsewhere in the same cluster.
The NameNode also helps with efficient data retrieval from storage. For this, it maintains information about the free blocks that can be allocated next. Clients contact the NameNode when placing data in the system, and the NameNode is kept informed about data that has recently been added to, modified on or removed from the DataNodes.
Conclusion
HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, strict standards compliance was traded away in favour of improved performance for the applications running on the cluster. HDFS stores file system metadata and application data separately. It provides out-of-the-box failover and redundancy capabilities that require little or no manual maintenance (depending on the deployment). Having such features built into the storage layer allows database administrators and developers to concentrate on other duties instead of writing monitoring systems or programming routines to compensate for a storage stack that lacks those abilities.
References
Chandrasekar, S., Dakshinamurthy, R., Seshakumar, P.G., Prabavathy, B. and Babu, C., 2013, January. A novel indexing scheme for efficient handling of small files in hadoop distributed file system. In Computer Communication and Informatics (ICCCI), 2013 International Conference on (pp. 1-8). IEEE.
Hua, X., Wu, H., Li, Z. and Ren, S., 2014. Enhancing throughput of the Hadoop Distributed File System for interaction-intensive tasks. Journal of Parallel and Distributed Computing, 74(8), pp.2770-2779.
Gu, R., Dong, Q., Li, H., Gonzalez, J., Zhang, Z., Wang, S., Huang, Y., Shenker, S., Stoica, I. and Lee, P.P., 2016. DFS-Perf: A Scalable and Unified Benchmarking Framework for Distributed File Systems.
Islam, N.S., Lu, X., Wasi-ur-Rahman, M. and Panda, D.K., 2013, August. Can parallel replication benefit hadoop distributed file system for high performance interconnects?. In High-Performance Interconnects (HOTI), 2013 IEEE 21st Annual Symposium on (pp. 75-78). IEEE.
Vijayakumari, R., Kirankumar, R. and Rao, K.G., 2014. Comparative analysis of google file system and hadoop distributed file system. ICETS-International Journal of Advanced Trends in Computer Science and Engineering, 3(1), pp.553-558.
Wang, L., Tao, J., Ranjan, R., Marten, H., Streit, A., Chen, J. and Chen, D., 2013. G-Hadoop: MapReduce across distributed data centers for data-intensive computing. Future Generation Computer Systems, 29(3), pp.739-750.
Jayakumar, N., Singh, S., Patil, S.H. and Joshi, S.D., 2015. Evaluation Parameters of Infrastructure Resources Required for Integrating Parallel Computing Algorithm and Distributed File System. IJSTE-Int. J. Sci. Technol. Eng, 1(12), pp.251-254.
Inamdar, S.Y., Jadhav, A.H., Desai, R.B., Shinde, P.S., Ghadage, I.M. and Gaikwad, A.A., 2016. Data Security in Hadoop Distributed File System.
Hsiao, H.C., Chung, H.Y., Shen, H. and Chao, Y.C., 2013. Load rebalancing for distributed file systems in clouds. IEEE transactions on parallel and distributed systems, 24(5), pp.951-962.
Devi, S. and Kamaraj, K., 2014. Architecture for Hadoop Distributed File Systems. Architecture, 3(10).
Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S. and Saha, B., 2013, October. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (p. 5). ACM.
Cho, J.Y., Jin, H.W., Lee, M. and Schwan, K., 2014. Dynamic core affinity for high-performance file upload on Hadoop Distributed File System. Parallel Computing, 40(10), pp.722-737.
Kim, M., Cui, Y., Han, S. and Lee, H., 2013. Towards efficient design and implementation of a hadoop-based distributed video transcoding system in cloud computing environment. International Journal of Multimedia and Ubiquitous Engineering, 8(2), pp.213-224.
Sivaraman, E. and Manickachezian, R., 2014, March. High performance and fault tolerant distributed file system for big data storage and processing using hadoop. In Intelligent Computing Applications (ICICA), 2014 International Conference on (pp. 32-36). IEEE.
Pal, A. and Agrawal, S., 2014, August. An experimental approach towards big data for analyzing memory utilization on a hadoop cluster using HDFS and MapReduce. In Networks & Soft Computing (ICNSC), 2014 First International Conference on (pp. 442-447). IEEE.
Thomson, A. and Abadi, D.J., 2015, February. CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems. In FAST (pp. 1-14).
Liao, C., Squicciarini, A. and Lin, D., 2016, June. LAST-HDFS: Location-Aware Storage Technique for Hadoop Distributed File System. In Cloud Computing (CLOUD), 2016 IEEE 9th International Conference on (pp. 662-669). IEEE.
Abdul, J., Alkathiri, M. and Potdar, M.B., 2016, September. Geospatial Hadoop (GS-Hadoop) an efficient mapreduce based engine for distributed processing of shapefiles. In Advances in Computing, Communication, & Automation (ICACCA)(Fall), International Conference on (pp. 1-7). IEEE.
Chevalier, M., El Malki, M., Kopliku, A., Teste, O. and Tournier, R., 2015, May. Benchmark for OLAP on NoSQL technologies. In 9th IEEE International Conference on Research Challenges in Information Science (IEEE RCIS 2015). IEEE.