
A Peek Into the Hadoop Distributed File System

HDFS is a core component of the Hadoop package. It is an open source implementation of a distributed file system along the lines of the Google File System (GFS), based on a paper released by Google. It has a wide range of uses that we will look at in this article.


Before explaining the Hadoop Distributed File System (HDFS), let's try to understand the basics of Distributed File Systems (DFS). In a DFS, one or more central servers store files that can be accessed by any client in the network with the proper access authorisation. You need a DFS for two purposes:
1. To hold large volumes of data
2. To share and use data among all users

NFS (Network File System) is one such ubiquitous DFS. While NFS is easy to use and configure, it has its own limitations, the primary one being that storage is confined to a single location. Other limitations include (but are not limited to) scaling, performance and reliability issues. HDFS is the answer to all these limitations. HDFS is designed to store very large amounts of information, in the range of terabytes and petabytes. It supports storing files across the nodes throughout the cluster. It offers high reliability by storing replicas of the files (divided into blocks) across the cluster. It is highly scalable and configurable. Administrators can control the block size, the replication factor and the location of the file in the cluster. HDFS is primarily designed to support 'map reduce' jobs in Hadoop. Users are expected to store large files and to perform more read operations than write operations on them, and reads are assumed to be long, sequential, streaming reads.

HDFS is a block-structured file system. A data block is, by default, 128 MB (or 64 MB in older versions) in size; hence, a 300 MB file will be split into 2 x 128 MB blocks and 1 x 44 MB block. All these split blocks are copied 'N' times over the cluster, where 'N' is the replication factor (3, by default), which provides resilience and fault tolerance for data in the cluster. These blocks are not necessarily saved on the same machine. Individual machines in the cluster that store the file blocks and form part of the HDFS group are called DataNodes.
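To illustrate this configurability, here is a minimal sketch (not taken from the article) of a client writing a file with an explicit per-file replication factor and block size, using Hadoop's Java FileSystem API. The NameNode address and file path are illustrative assumptions; in a real cluster they would come from the site configuration.

// Write a small file to HDFS with a chosen replication factor and block size.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally picked up from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path

        short replication = 3;                    // replication factor 'N'
        long blockSize = 128L * 1024 * 1024;      // 128 MB blocks
        int bufferSize = 4096;

        // create() lets the client override the cluster defaults per file
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}

Setting these values per file, rather than cluster-wide, is what lets administrators and applications tune resilience and block layout for individual workloads.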

Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system. Thus, typing ls or other Linux commands on a machine running a DataNode daemon will display the contents of the ordinary Linux file system being used to host the Hadoop services, but it will not include any of the files stored inside HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of local files. The files and blocks can only be accessed through the DataNode service, which runs as its own separate daemon. The complexity of accessing files does not end here. Data must be available to all compute nodes when requested. To enable this transparency, HDFS must maintain metadata about the data. This metadata is stored in a single location, called the NameNode in Hadoop. The size of the metadata in HDFS is very small, owing to the large block size, and hence it can be stored in the main memory of the NameNode. This allows fast access to metadata.
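To see this metadata in action, the following is a minimal sketch, under the same assumptions as above (illustrative NameNode address and file path), that asks the NameNode for a file's block layout through Hadoop's Java FileSystem API. Each BlockLocation reports a block's offset, length and the DataNodes that hold its replicas.

// Query the NameNode for the block locations of a file stored in HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfoExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");   // hypothetical file

        // The NameNode answers this from its in-memory metadata
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Note that only the metadata comes from the NameNode; the block contents themselves are read directly from the DataNodes listed in each location.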

