Paper Review: The Hadoop Distributed File System

Paper Review:

HDFS is heavily inspired by GFS in many ways. However, it does not focus on append-only large files, but offers a more general and flexible way to handle data. Some of its approaches, like the configurable block placement policy and block size, are great for common users, but they don't really solve any of the problems that appeared in GFS seven years earlier. (On the other hand, Colossus somewhat overcomes the issue of centralized control/coordination by a single master machine and brings scalability to a new level. At least that's what they said.)

Strong points:

  1. Compared to GFS, HDFS doesn't make too many assumptions about the data and operations it's likely to handle. The block size is 128 MB by default but still user-configurable on a per-file basis. This could be considered a more general solution for a distributed file system. On the other hand (since I won't write this again in the weak points), the fact that it makes no assumptions could result in under-utilization in many situations. For example, HDFS is likely to be much less efficient than GFS when it comes to massive amounts of concurrent append operations;
  2. The fact that HDFS verifies the namespace ID and software version during node startup is really good when it comes to software debugging and upgrades. It prevents uncoordinated snapshots and other possible data-integrity issues.
  3. The hflush operation is really interesting. I was just about to list the packet buffer as a weak point because buffered writes could be invisible to readers. hflush flushes everything down the pipeline and waits for ACKs, so that recent write operations become visible on all replicas. By the way, since TCP already guarantees in-order delivery, we don't have to worry about packet ordering or wait for ACKs when sending a sequence of packets.
  4. The configurable block placement policy is awesome. I can't wait to see some experiments analyzing the performance of different block placement policies. I feel that in many aspects, HDFS is similar to GFS but more flexible and configurable — the block size, rack identification, and this one. (Or I'm just a little biased because this is an open-source project.)
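To make point 3 concrete, here is a toy model (entirely my own sketch, not the real HDFS client API) of why buffered writes can be invisible until hflush forces the outstanding packets through the pipeline and waits for their ACKs:

```python
# Toy model of HDFS write pipelining and hflush. All names and
# structure here are assumptions for illustration, not HDFS code.

class PipelineWriter:
    def __init__(self, replicas):
        self.replicas = replicas     # simulated DataNode storage
        self.outstanding = []        # packets sent but not yet ACKed
        self.buffer = b""

    def write(self, data, packet_size=4):
        self.buffer += data
        # Push full packets without waiting for ACKs; TCP preserves order.
        while len(self.buffer) >= packet_size:
            pkt, self.buffer = self.buffer[:packet_size], self.buffer[packet_size:]
            self.outstanding.append(pkt)

    def hflush(self):
        # Flush the partial packet too, then block until everything is
        # ACKed (modeled here as applying the packets to every replica).
        if self.buffer:
            self.outstanding.append(self.buffer)
            self.buffer = b""
        for pkt in self.outstanding:
            for r in self.replicas:
                r.extend(pkt)
        self.outstanding = []

replicas = [bytearray(), bytearray(), bytearray()]
w = PipelineWriter(replicas)
w.write(b"hello world")
# Before hflush, the data sits in unacknowledged packets in this model.
w.hflush()
assert all(bytes(r) == b"hello world" for r in replicas)
```

In the real system the ACKs come back over the pipeline asynchronously; the point of the sketch is only the visibility contract: nothing is guaranteed visible until hflush (or close) returns.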

Weak points:

  1. Most of the weak points of GFS are repeated here: the centralized coordination of the NameNode can cause unavailability when it's down, and scalability is limited.
  2. The heartbeat interval is three seconds by default. However, the NameNode only considers a DataNode out of service — and the replicas it hosts unavailable — after ten minutes without heartbeats. The heartbeat interval and the out-of-service timeout don't really match: there is a long window (9 min 57 s) during which the DataNode is down but the NameNode still assumes it might be alive.
  3. The snapshot is not really effective. While I can't think of a better way to back up local data on a DataNode, I don't think a copy of the current directory tree with hard links into those directories helps much. I can only roll back to a "state" of the previous storage by restoring all directories; individual files cannot be rolled back even with checkpoints. I think GFS does better here because most files are append-only and files can be restored from a list of operations. GFS also seems to allow local copies of files during snapshots. Again, I know this kind of backup requires a lot of storage, but the fact that HDFS doesn't offer the option to users could definitely be a downside.
  4. In HDFS the balancer runs as an extra application that moves replicas around to balance disk utilization, without reducing the number of replicas of a block or the number of racks it uses. However, unlike GFS, which handles balancing mostly at placement time, HDFS treats balancing as an extra feature that consumes extra computation and network traffic. It's likely that a new block is placed on a poor choice of node (one with high utilization) and the balancer then has to move it to another DataNode. Even though utilization-aware placement could direct too much traffic to a newly joined DataNode, it's not that hard to come up with an algorithm to cope with that (like an interval between consecutive placements on a single DataNode). A sophisticated placement decision is way better than rebalancing after the fact.
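The "interval between placements" idea in point 4 could look something like this. This is a hypothetical sketch of my own; nothing like it exists in HDFS:

```python
# Hypothetical utilization-aware placement with a per-node cooldown.
# Every name and parameter here is an assumption for illustration.

def choose_target(nodes, now, min_interval=10):
    """Pick the least-utilized node that hasn't received a new
    placement within `min_interval` seconds, which protects a newly
    joined (empty) node from absorbing a flood of writes."""
    eligible = [n for n in nodes if now - n["last_placed"] >= min_interval]
    if not eligible:
        eligible = nodes  # fall back rather than fail the write
    target = min(eligible, key=lambda n: n["utilization"])
    target["last_placed"] = now
    return target["name"]

nodes = [
    {"name": "dn1", "utilization": 0.90, "last_placed": -100},
    {"name": "dn2", "utilization": 0.10, "last_placed": -100},  # new node
    {"name": "dn3", "utilization": 0.50, "last_placed": -100},
]
# The first block goes to the emptiest node; the next one (1 s later)
# is steered elsewhere because dn2 is still cooling down.
first = choose_target(nodes, now=0)
second = choose_target(nodes, now=1)
assert first == "dn2" and second == "dn3"
```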



Paper Outline:

  1. Introduction and related work:

    • HDFS is the file system component of Hadoop;
    • Stores metadata (on NameNode) and data (on DataNodes) separately;
    • Data is replicated on different machines to:
      • ensure reliability and durability;
      • multiply data-transfer bandwidth;
    • Multiple metadata servers;
  2. Architecture:

    • NameNode:

      • files and directories are represented on the NameNode by inodes:
        • record attributes like permissions, modification and access times, namespace and disk space quotas;
        • files are split into big blocks (typically 128 MB) and replicated on DataNodes (typically three replicas);
      • maintains namespace tree and mapping of file blocks to DataNodes;
      • read request from HDFS client:
        • contact NameNode first for the file locations;
        • read contents from the closest DataNode;
      • write request from HDFS client:
        • nominate a suite of three DataNodes;
        • write to all three in a pipeline fashion;
      • single NameNode for each cluster;
      • image: the inode data plus the list of block mappings:
        • kept entirely in RAM and persisted as a checkpoint on local disk;
        • the NameNode also stores a modification log of the image, called the journal, on local disk;
        • the namespace is restored by reading the checkpoint and replaying the journal;
    • DataNodes:

      • each block replica on a DataNode is represented by two files in local disk:
        • the first file contains the data itself;
        • the second file is block’s metadata:
          • checksums and generation stamp;
      • handshake with NameNode during startup;
        • verify namespace ID and software version;
        • namespace ID is assigned to the file system instance when it is formatted. Nodes with different namespace IDs will not be able to join the cluster to preserve the data integrity;
      • register with the NameNode after handshake:
        • persistently stores its unique storage ID (assigned on first registration);
        • this makes the node recognizable even if it restarts with a different IP address;
      • identifies its block replicas and sends a block report to the NameNode:
        • contains block IDs and generation stamps;
        • the first report is sent immediately after registration; subsequent reports are sent hourly;
      • DataNode send heartbeats to the NameNode:
        • to confirm its normal operation;
        • default interval is three seconds;
        • NameNode marks out of service after 10 minutes without heartbeats;
        • heartbeats also carry information on storage capacity, fraction of storage in use, and the number of data transfers in progress;
        • NameNode replies to heartbeats with instructions like:
          • replicate blocks to other nodes;
          • remove local block replicas;
          • re-register or to shut down the node;
          • send an immediate block report;
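The heartbeat bookkeeping above can be sketched as follows. The names are mine; the 3-second interval and 10-minute timeout are from the paper:

```python
# Minimal sketch of NameNode-side heartbeat tracking (illustrative
# only; this is not the actual NameNode code).

HEARTBEAT_INTERVAL = 3          # seconds, default per the paper
OUT_OF_SERVICE_TIMEOUT = 600    # 10 minutes without heartbeats

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}     # DataNode id -> time of last heartbeat

    def heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """DataNodes considered out of service at time `now`."""
        return [n for n, t in self.last_seen.items()
                if now - t > OUT_OF_SERVICE_TIMEOUT]

m = HeartbeatMonitor()
m.heartbeat("dn1", now=0)
m.heartbeat("dn2", now=0)
m.heartbeat("dn1", now=500)                # dn2 stops heartbeating
assert m.dead_nodes(now=601) == ["dn2"]    # dn2 exceeded 10 minutes
assert m.dead_nodes(now=300) == []         # nobody timed out yet
```

The gap between the two constants is exactly the weak point raised in the review: a node can be dead for almost ten minutes before anyone reacts.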
    • HDFS client:

      • supports read, write and delete files; create and delete directories;
      • when an application reads a file:
        • client asks NameNode for the list of DataNodes with replicas;
        • client contacts a DataNode directly for the file;
      • when an application writes a file:
        • client asks NameNode to choose DataNodes to host replicas;
        • client organizes a pipeline from node to node and sends data;
        • the client contacts the NameNode for a new set of DataNodes once the previous block is filled;
      • special APIs:
        • one that exposes the locations of a file's blocks;
        • one that changes the replication factor;
    • Image and journal:

      • image describes organization of application data as directories and files:
        • the persistent record stored on local disk is called a checkpoint;
      • journal is a write-ahead commit log for changes to the file system:
        • must be persistent;
        • every transaction is recorded before the change is committed;
      • checkpoint file:
        • cannot be changed by the NameNode;
        • could be replaced with a new checkpoint (derived from old checkpoint and journal) after startup or by CheckpointNode;
      • checkpoint and journal are stored in multiple storage directories:
        • on different volumes, including remote servers;
        • the NameNode batches multiple transactions from different clients so that disk access doesn't become a bottleneck for many threads;
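The checkpoint-plus-journal recovery described above boils down to "load the checkpoint, replay the log". A minimal sketch, assuming the namespace is just a dict (the real NameNode structures are of course richer):

```python
# Write-ahead recovery sketch: namespace = checkpoint + replayed journal.
# Operation names and the dict representation are my own simplification.

def apply(namespace, txn):
    op, path = txn
    if op == "mkdir":
        namespace[path] = "dir"
    elif op == "create":
        namespace[path] = "file"
    elif op == "delete":
        namespace.pop(path, None)
    return namespace

def restore(checkpoint, journal):
    """Recover the namespace: load the last checkpoint, then replay
    every transaction journaled after that checkpoint was taken."""
    namespace = dict(checkpoint)
    for txn in journal:
        apply(namespace, txn)
    return namespace

checkpoint = {"/a": "dir", "/a/x": "file"}
journal = [("create", "/a/y"), ("delete", "/a/x"), ("mkdir", "/b")]
ns = restore(checkpoint, journal)
assert ns == {"/a": "dir", "/a/y": "file", "/b": "dir"}
```

This also shows why the CheckpointNode's periodic merge matters: the longer the journal grows, the longer the replay at restart.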
    • CheckpointNode:

      • periodically downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint to the NameNode;
      • lets the NameNode truncate the journal, which protects against journal loss/corruption and saves restart time;
    • BackupNode:

      • in addition to periodical checkpoint creation, it maintains an in memory image of the namespace that is always synchronized with NameNode;
      • accepts the stream of transactions from the NameNode and saves them to its own storage:
        • can create checkpoints without downloading;
        • can be viewed as a read-only NameNode that holds all metadata except block locations (which only the NameNode tracks);
    • Upgrades, file system snapshots:

      • only one snapshot can exist;
      • the NameNode first reads the checkpoint and journal files and merges them in memory;
      • the NameNode instructs DataNodes to create a local snapshot during the handshake:
        • which is merely a copy of storage directory;
      • cluster admin can choose to roll back when restarting the system;
        • no option to roll forward;
      • layout version is also compared during node startup;
  3. File I/O operations and replica management

    • File read and write:

      • an HDFS client requests a lease in order to write to a file:
        • renew the lease periodically;
          • if the soft limit expires, another client can preempt the lease;
          • if the hard limit expires, the lease is revoked and the file is closed on the writer's behalf;
        • no other client can write to the file while the lease is held;
        • the lease is released when the file is closed;
      • writing to a block in pipeline from node to node:
        • 64 KB packet buffers are pushed into the pipeline as they fill;
        • subsequent packets can be sent without waiting for ACKs of previous ones;
        • HDFS does not guarantee that changes are visible to readers until the file is closed (a consequence of this buffering);
        • the hflush operation pushes the current packet down the pipeline, returns once it is ACKed, and makes the changes visible immediately;
      • checksums are used while reading:
        • a DataNode stores checksums in a metadata file separately;
        • the client computes checksums as it receives data and verifies them against the checksums stored by the DataNodes;
      • when a client opens a file to read:
        • it selects the closest replica first and falls back to the next closest if the read fails;
        • HDFS permits read during write;
      • HDFS I/O is particularly optimized for batch-processing systems:
        • high throughput for sequential reads/writes;
        • at the expense of read/write response time;
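The lease mechanics at the top of this subsection can be sketched as a small table keyed by file path. A toy model; the limit values below are illustrative, not necessarily the real defaults:

```python
# Toy single-writer lease table with soft and hard limits.
# The constants and class names are assumptions for illustration.

SOFT_LIMIT = 60      # after this, another client may claim the lease
HARD_LIMIT = 3600    # after this, the lease is revoked outright

class LeaseTable:
    def __init__(self):
        self.leases = {}   # path -> (holder, last_renewal_time)

    def acquire(self, path, client, now):
        """Grant (or renew) a write lease; enforce single-writer."""
        if path in self.leases:
            holder, t = self.leases[path]
            if holder != client and now - t < SOFT_LIMIT:
                return False           # single-writer rule holds
            # soft limit expired, or same client renewing: take over
        self.leases[path] = (client, now)
        return True

    def expire_hard(self, now):
        """Revoke leases whose hard limit has passed."""
        for path, (holder, t) in list(self.leases.items()):
            if now - t >= HARD_LIMIT:
                del self.leases[path]

lt = LeaseTable()
assert lt.acquire("/f", "alice", now=0)
assert not lt.acquire("/f", "bob", now=30)    # within soft limit
assert lt.acquire("/f", "bob", now=61)        # soft limit expired
lt.expire_hard(now=61 + 3600)
assert "/f" not in lt.leases                  # hard limit revoked it
```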
    • Block placement:

      • spreading data across multiple racks improves reliability, at the cost of extra cross-rack network round trips;
      • HDFS allows an administrator to configure a script that returns a node’s rack identification given a node’s address;
        • rack identifications are assigned to DataNodes by the NameNode during registration;
      • HDFS offers a configurable block placement policy interface:
        • by default the first replica is placed on the writer's node and the remaining replicas are placed randomly across the racks, with restrictions:
        • no DataNode contains more than one replica of any block;
        • no rack contains more than two replicas of the same block, provided there are enough racks;
      • reading from closest replica reduces the inter-node and inter-rack traffic;
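The two restrictions above (at most one replica per node, at most two per rack) can be sketched as a rejection-sampling loop. This is a simplification of the outline's description, not the actual BlockPlacementPolicy implementation:

```python
# Simplified placement honoring the two default restrictions.
# Function and variable names are my own assumptions.
import random

def place_replicas(writer, nodes, racks, n=3, seed=0):
    """nodes: list of node names; racks: node -> rack id.
    First replica goes to the writer's node; the rest are drawn
    randomly subject to the per-node and per-rack limits."""
    rng = random.Random(seed)
    chosen = [writer]
    while len(chosen) < n:
        cand = rng.choice(nodes)
        rack_count = sum(1 for c in chosen if racks[c] == racks[cand])
        if cand not in chosen and rack_count < 2:
            chosen.append(cand)
    return chosen

racks = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
nodes = list(racks)
placed = place_replicas("dn1", nodes, racks)
assert len(set(placed)) == 3                   # one replica per node
from collections import Counter
assert max(Counter(racks[n] for n in placed).values()) <= 2
```

Rejection sampling is fine for a sketch; a real policy would enumerate candidates so it cannot loop when the constraints are tight.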
    • Replication management:

      • the NameNode detects over- or under-replicated blocks from block reports:
        • removes surplus replicas of an over-replicated block, preferring DataNodes with the least available disk space;
        • replicates blocks if they are under-replicated:
          • blocks down to a single replica are replicated first, with the new replica placed on a different rack;
          • if two replicas are already on different racks, the third is placed on one of those two racks to reduce inter-rack cost;
        • treats the situation where all replicas are on the same rack as under-replicated:
          • create a replica on a different rack;
          • remove one of the three replicas in the original rack;
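The over-/under-replication rules above, including the all-on-one-rack case, reduce to a small classifier. Names and structure here are my own:

```python
# Sketch of classifying a block's replication state from the rack
# ids of its live replicas (a simplification, not NameNode code).

def replication_state(replica_racks, target=3):
    """replica_racks: rack id of each live replica of one block."""
    n = len(replica_racks)
    if n > target:
        return "over-replicated"
    if n < target:
        return "under-replicated"
    if len(set(replica_racks)) == 1:
        # All replicas on one rack: treated as under-replicated so
        # a copy gets created on another rack (and one removed later).
        return "under-replicated"
    return "ok"

assert replication_state(["r1", "r2", "r2"]) == "ok"
assert replication_state(["r1", "r1", "r1"]) == "under-replicated"
assert replication_state(["r1", "r2"]) == "under-replicated"
assert replication_state(["r1", "r1", "r2", "r2"]) == "over-replicated"
```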
    • Balancer:

      • the HDFS block placement strategy doesn't take DataNode disk-space utilization into account:
        • this avoids funneling all new data onto a small set of newly added nodes;
      • balancer is a tool that balances disk space usage:
        • takes a threshold value as input parameter;
        • deployed as an application program that can be run by admin;
        • iteratively moves replicas from DataNodes with higher utilization to lower ones;
        • guarantees that the decision does not reduce the number of replicas or the number of racks used for a particular block;
        • optimizes the balancing by minimizing the inter-rack traffic;
        • configured to limit the bandwidth consumed by rebalancing;
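One balancer pass, as described above, can be sketched as pairing nodes above the cluster-average utilization (plus the threshold) with nodes below it. A toy model of the real iterative tool; the function and its return shape are my own:

```python
# Sketch of one balancer iteration: propose moves from over-utilized
# DataNodes to under-utilized ones relative to the cluster average.

def balance_step(util, threshold=0.10):
    """util: node -> fraction of disk in use.
    Returns (source, destination) move proposals for nodes that sit
    more than `threshold` away from the cluster-average utilization."""
    avg = sum(util.values()) / len(util)
    over = sorted((n for n in util if util[n] > avg + threshold),
                  key=lambda n: -util[n])     # most loaded first
    under = sorted((n for n in util if util[n] < avg - threshold),
                   key=lambda n: util[n])     # emptiest first
    return list(zip(over, under))

util = {"dn1": 0.95, "dn2": 0.20, "dn3": 0.55}
# avg is about 0.57: dn1 is far above it, dn2 far below, so exactly
# one move is proposed; dn3 is within the threshold and left alone.
assert balance_step(util) == [("dn1", "dn2")]
```

The real balancer additionally checks that a move never reduces a block's replica or rack count, prefers intra-rack moves, and throttles bandwidth, as the outline notes.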
    • Block scanner:

      • each DataNode runs a block scanner that:
        • periodically scans all replicas and verifies their checksums:
          • checksums verified during client reads also count toward verification;
        • the verification time of each block is stored in a human-readable log file;
        • if a checksum fails, the NameNode marks the replica as corrupt:
          • the NameNode starts replicating a good copy immediately;
          • only then is the corrupt replica scheduled for deletion;
          • this lets users retrieve data from the corrupt replica if no good copy survives;
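The data-file/metadata-file pairing above can be sketched with a CRC standing in for whatever checksum HDFS actually uses:

```python
# Block-scanner sketch: each replica pairs a data file with a
# metadata file holding its checksum; a scan recomputes and compares.
# zlib.crc32 is a stand-in, and the dict layout is my simplification.
import zlib

def make_replica(data):
    return {"data": data, "checksum": zlib.crc32(data)}

def scan(replica):
    """Return True if the stored checksum still matches the data."""
    return zlib.crc32(replica["data"]) == replica["checksum"]

good = make_replica(b"block contents")
assert scan(good)

corrupt = make_replica(b"block contents")
corrupt["data"] = b"block c0ntents"        # simulate bit rot on disk
assert not scan(corrupt)                   # NameNode would re-replicate
```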
    • Decommissioning:

      • the administrator maintains include and exclude lists that control which nodes may join the cluster;
      • once a DataNode is marked as decommissioning:
        • it won’t be selected as replica target;
        • continues to serve read requests;
        • NameNode replicates the blocks on this DataNode;
        • once all its blocks are replicated elsewhere, the node enters the decommissioned state and can be safely removed from the cluster;
    • Inter-cluster data copy:

      • DistCp for large inter/intra-cluster parallel copying:
        • a MapReduce job;
  4. Practice at Yahoo;

  5. Future work:

    • the cluster is effectively unavailable when its NameNode is down;
    • scalability of the NameNode has been a key struggle;
      • one proposed solution is to let multiple namespaces (and NameNodes) share the physical storage, but this costs a lot in terms of management.



