What is HDFS?
Hadoop Distributed File System (HDFS) is a distributed file system designed to store and manage vast amounts of data across clusters of commodity hardware. It is a key component of the Apache Hadoop framework, enabling the storage and processing of very large datasets.
Where to use HDFS
Very Large Files: HDFS is designed for files of hundreds of megabytes, gigabytes, or more.
Streaming Data Access: The time to read the whole dataset matters more than the latency of reading the first record. HDFS is built around a write-once, read-many-times pattern.
Commodity Hardware: It runs on low-cost, commonly available hardware.
Where not to use HDFS
Low-Latency Data Access: Applications that need very fast access to the first record should not use HDFS, because it is optimized for high throughput over the whole dataset rather than for the time to fetch a single record.
Lots of Small Files: The NameNode holds the metadata of every file in memory, so a very large number of small files consumes a disproportionate amount of NameNode memory, which does not scale.
Multiple Writes: HDFS should not be used when files must be modified repeatedly or by multiple writers; a file is written once and can afterwards only be appended to.
Key Characteristics:
- Scalability: HDFS is designed to scale horizontally, allowing the addition of more nodes to handle growing amounts of data.
- Fault Tolerance: HDFS ensures data durability by replicating each block of data across multiple nodes. If a node fails, data can still be retrieved from replicas.
- Data Locality: HDFS aims to move computation close to data by storing data on the same nodes where computation is likely to occur. This reduces data transfer time.
HDFS Architecture: Components of Hadoop Distributed File System
NameNode: The NameNode is the master in HDFS's master-slave architecture. It stores the file system metadata, including file names, permissions, and the locations of each file's blocks; it does not store the actual contents of files, only metadata, and it coordinates all data access operations. Because it knows the state and metadata of every file, the NameNode serves as HDFS's controller and manager. The metadata is compact enough to be held in the NameNode's memory, which makes lookups fast; this matters because many clients use the cluster concurrently and a single machine serves all of their metadata requests. The NameNode is also responsible for file system actions such as opening, closing, and renaming files and directories.
Description: The master server that manages the metadata and namespace of the file system.
Functionality: Keeps track of the structure of the file system, metadata, and the location of data blocks.
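To make the NameNode's role concrete, here is a minimal Java sketch using Hadoop's `FileSystem` client API. It assumes the cluster configuration is on the classpath, and the path `/user/demo` is a placeholder. Every field it prints comes from NameNode metadata; no DataNode is contacted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml and hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle that talks to the NameNode
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            // Path, size, replication factor, and permissions are all NameNode metadata.
            System.out.printf("%s size=%d replication=%d perms=%s%n",
                    status.getPath(), status.getLen(),
                    status.getReplication(), status.getPermission());
        }
        fs.close();
    }
}
```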
DataNode: A DataNode stores and retrieves blocks as instructed by clients or the NameNode, and it periodically sends the NameNode a report listing the blocks it is storing. DataNodes run on commodity hardware and perform block creation, deletion, and replication as directed by the NameNode.
Description: Worker nodes that store actual data blocks.
Functionality: Perform read and write operations as instructed by the NameNode.
Data Replication:
Blocks: Files are divided into fixed-size blocks (typically 128 MB or 256 MB).
Replication: Each block is replicated across multiple DataNodes (default replication factor is 3). This ensures fault tolerance and data durability.
A block is the smallest unit of data that HDFS reads or writes. The default HDFS block size is 128 MB and is configurable. HDFS divides files into block-sized chunks that are stored independently of one another. Unlike in a local disk file system, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage: a 5 MB file stored with a 128 MB block size consumes only 5 MB of space. The block size is kept large primarily to reduce seek costs relative to transfer time.
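As an illustration, this minimal Java sketch asks the NameNode for the block layout of a hypothetical file `/user/demo/big.log` and prints each block's offset, length, and replica hosts, assuming the same classpath configuration as above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log")); // placeholder path
        // The NameNode answers with one BlockLocation per block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

A 300 MB file with 128 MB blocks would show three blocks (128 MB, 128 MB, and 44 MB), each listing as many hosts as the replication factor.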
Replication Management: HDFS provides fault tolerance through replication: it copies each block and stores the copies on different DataNodes. The replication factor determines how many copies of each block are kept; it can be set to any value but defaults to 3. The NameNode collects block reports from every DataNode in order to maintain the replication factor, adding replicas when a block is under-replicated and removing them when it is over-replicated.
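The replication factor can also be changed per file after the fact. Here is a minimal sketch using `FileSystem.setReplication`, with a placeholder path; the shell equivalent is `hadoop fs -setrep 5 /user/demo/big.log`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise the replication factor of one file to 5. The NameNode will see
        // its blocks as under-replicated and schedule DataNodes to copy them
        // until five replicas of each block exist.
        fs.setReplication(new Path("/user/demo/big.log"), (short) 5);
        fs.close();
    }
}
```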
Rack Awareness: In production clusters, DataNode machines are housed on multiple racks, and HDFS uses a rack awareness mechanism to distribute block replicas across them, providing fault tolerance along with low latency. Assume the configured replication factor is 3: the placement policy puts the first replica on the local rack and the other two replicas on a different rack, storing no more than two replicas on any single rack where possible. Placing replicas on different racks minimizes cross-rack network traffic while ensuring that if an entire rack or its network switch fails, the data remains accessible from replicas on other racks.
Secondary Name Node: Despite its name, the Secondary NameNode is not a standby for the primary NameNode; it acts as an assistant. Its main purpose is checkpointing: it periodically merges the edit log of namespace changes with the existing file system image to produce a new, up-to-date image. This keeps the edit log from growing without bound, shortens the NameNode's recovery time after a failure, and offloads the merge work from the NameNode, improving overall system performance. Although it does not take part in real-time operations the way the primary NameNode does, the Secondary NameNode is an essential part of dependable and seamless metadata management in HDFS.
HDFS Operations:
Write Operation:
- Client Request: The client communicates with the NameNode to create a new file.
- Block Allocation: The NameNode allocates data blocks and provides a list of DataNodes to the client.
- Write Data: The client writes the data directly to the identified DataNodes.
- Block Replication: Data is replicated across multiple DataNodes for fault tolerance.
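The same flow can be seen from the client's point of view. Here is a minimal Java sketch of a write, assuming cluster configuration on the classpath and a placeholder path: `create()` performs the NameNode request and block allocation, and the returned stream writes to the DataNode pipeline.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the NameNode to enter the file into the namespace;
        // the returned stream sends data to a pipeline of DataNodes, which
        // replicate each block among themselves as it is written.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```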
Read Operation:
- Client Request: The client communicates with the NameNode to read a file.
- Block Location: The NameNode provides the client with the locations of the required data blocks.
- Read Data: The client reads data directly from the identified DataNodes.
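A corresponding read sketch, under the same assumptions: `open()` obtains block locations from the NameNode, and the stream then pulls each block directly from a DataNode holding a replica.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
            // Copy the file's contents to stdout, 4 KB at a time.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```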
HDFS Commands:
Uploading a File:
hadoop fs -copyFromLocal local_file_path hdfs://namenode_address/hdfs_file_path
Listing Files:
hadoop fs -ls hdfs://namenode_address/hdfs_directory_path
Creating a Directory:
hadoop fs -mkdir hdfs://namenode_address/hdfs_directory_path
Reading a File:
hadoop fs -cat hdfs://namenode_address/hdfs_file_path
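The same operations are also available programmatically. Below is a minimal sketch of Java equivalents of the commands above, with placeholder paths.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -mkdir
        fs.mkdirs(new Path("/user/demo/reports"));

        // hadoop fs -copyFromLocal
        fs.copyFromLocalFile(new Path("/tmp/local.csv"),
                             new Path("/user/demo/reports/local.csv"));

        // hadoop fs -ls
        for (FileStatus s : fs.listStatus(new Path("/user/demo/reports"))) {
            System.out.println(s.getPath());
        }
        fs.close();
    }
}
```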
HDFS Configuration:
Configuration Files:
- `hdfs-site.xml`: Contains configuration settings for the HDFS service.
- `core-site.xml`: Contains core configuration settings used by Hadoop components.
Key Configuration Parameters:
- dfs.replication: Specifies the default replication factor for file blocks.
- dfs.blocksize: Defines the size of each data block.
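A minimal sketch of how a client sees these settings: Hadoop's `Configuration` class loads `core-site.xml` and `hdfs-site.xml` from the classpath, and individual properties can be read or overridden per client (overrides affect only files that this client creates). The fallback values shown are the usual defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // loads core-site.xml, hdfs-site.xml
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
        System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize", "134217728")); // 128 MB

        // Override per client: new files created through this handle use 2 replicas.
        conf.set("dfs.replication", "2");
        FileSystem fs = FileSystem.get(conf);
        fs.close();
    }
}
```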
HDFS Security:
Authentication: HDFS supports Kerberos-based authentication to verify the identity of users and services that communicate with the cluster.
Access Control: Access control mechanisms can be enforced to restrict file access based on user permissions.
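As a rough illustration only: on a Kerberos-secured cluster whose configuration is on the classpath, a client might authenticate from a keytab and tighten permissions on a directory as below. The principal, keytab path, and directory are placeholders, and real deployments will differ.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        UserGroupInformation.setConfiguration(conf);
        // Authenticate to Kerberos before making any HDFS call (placeholders).
        UserGroupInformation.loginUserFromKeytab("analyst@EXAMPLE.COM",
                                                 "/etc/security/analyst.keytab");

        FileSystem fs = FileSystem.get(conf);
        // POSIX-style permissions: owner rwx, group r-x, others none (0750).
        fs.setPermission(new Path("/user/analyst/private"), new FsPermission((short) 0750));
        fs.close();
    }
}
```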
Monitoring and Maintenance:
Web Interfaces: The Hadoop Web UI provides interfaces for monitoring cluster health and NameNode and DataNode status.
Balancing Data: HDFS provides a balancer tool that redistributes blocks across DataNodes to keep disk utilization uniform.
Hadoop Distributed File System (HDFS) is a fundamental component of the Hadoop ecosystem, providing a scalable and fault-tolerant storage solution for big data. Understanding its architecture, operations, and configuration is essential for effectively managing and utilizing large datasets in a distributed environment.