A Peek Into the Hadoop Distributed File System
HDFS is a core component of the Hadoop package. It is an open source implementation of a distributed file system along the lines of the Google File System (GFS), based on a paper published by Google. It has a wide range of uses that we will look at in this article.
Before explaining the Hadoop Distributed File System (HDFS), let's try to understand the basics of Distributed File Systems (DFS). In a DFS, one or more central servers store files that can be accessed by any client in the network with proper access authorisation. You need a DFS for two purposes: 1. To hold large data 2. To share and use data among all users
An NFS (Network File System) is one such ubiquitous DFS. While NFS is easy to use and configure, it has its own limitations, the primary one being single-location storage. Other limitations include (but are not limited to) scaling, performance and reliability issues. HDFS is the answer to all these limitations. HDFS is designed to store very large amounts of information, in the range of terabytes and petabytes. It supports storing files across the nodes throughout the cluster. It offers high reliability by storing replicas of the files (divided into blocks) across the cluster. It is highly scalable and configurable: administrators can control the block size, the replication factor and the location of the file in the cluster.

HDFS is primarily designed to support 'map reduce' jobs in Hadoop. Users are expected to store large files and perform more read operations than write operations on them. It is also assumed that clients perform long, sequential, streaming reads from the file.

HDFS is a block-structured file system. A data block is, by default, 128 MB (or 64 MB in older versions) in size; hence, a 300 MB file will be split into 2 x 128 MB blocks and 1 x 44 MB block. All these blocks are copied 'N' times over the cluster, where 'N' is the replication factor (3, by default), which provides resilience and fault tolerance for data in the cluster. These blocks are not necessarily saved on the same machine. Individual machines in the cluster that store the blocks and form part of the HDFS group are called DataNodes.
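The block arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation only, not Hadoop code; the block size and replication factor are the assumed defaults mentioned in the text.

```python
# Minimal sketch (not part of Hadoop): how HDFS would split a file into
# fixed-size blocks, and what replication costs in raw cluster storage.

BLOCK_SIZE_MB = 128   # assumed default block size (64 MB in older versions)
REPLICATION = 3       # assumed default replication factor

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the list of block sizes a file of the given size splits into."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)  # the final block holds the leftover bytes
    return blocks

blocks = split_into_blocks(300)
print(blocks)                      # [128, 128, 44]
print(len(blocks) * REPLICATION)   # 9 block replicas stored cluster-wide
```

So a 300 MB file occupies three blocks, and with the default replication factor the cluster as a whole stores nine block replicas spread across different DataNodes.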
Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system. Thus, typing ls or other Linux commands on a machine running a DataNode daemon will display the contents of the ordinary Linux file system used to host the Hadoop services, but it will not include any of the files stored inside HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of local files. The files and blocks can only be accessed through the DataNode service, which runs in its own daemon. The complexity of accessing files does not end here: data must be available to all compute nodes when requested. To enable this transparency, HDFS must maintain metadata about the data. This metadata is stored in a single location, called the NameNode in Hadoop. The size of the metadata is very small due to the large block size, and hence it can be held in the main memory of the NameNode. This allows fast access to the metadata.
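The NameNode's bookkeeping can be pictured as two in-memory maps: one from file paths to ordered block IDs, and one from block IDs to the DataNodes holding replicas. The sketch below is a hypothetical simplification; the path, block IDs and node names are illustrative, not real Hadoop identifiers.

```python
# Hypothetical, much-simplified model of NameNode metadata, kept in memory.

namespace = {
    # file path -> ordered list of block IDs making up the file
    "/logs/access.log": ["blk_0001", "blk_0002", "blk_0003"],
}

block_locations = {
    # block ID -> DataNodes holding a replica (replication factor 3)
    "blk_0001": ["datanode1", "datanode2", "datanode3"],
    "blk_0002": ["datanode2", "datanode3", "datanode4"],
    "blk_0003": ["datanode1", "datanode3", "datanode4"],
}

def locate(path):
    """Resolve a path to (block, replica locations) pairs, analogous to a
    client asking the NameNode where to read before contacting DataNodes."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

for blk, nodes in locate("/logs/access.log"):
    print(blk, "->", nodes)
```

Note how small the metadata is relative to the data: three dictionary entries describe 300 MB of file content, which is why the whole namespace fits comfortably in the NameNode's main memory.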