Paper Review: MapReduce Simplified Data Processing on Large Clusters

Paper review:

MapReduce is a distributed data processing approach. One machine acts as master and assign map/reduce tasks to all the machines in the cluster. Map generated intermediate k/v pairs and feed to the reduce workers using underlying filesystem. Reduce workers will merge the data with the same keys and return multiple output file (in the number same as reduce tasks) to the user.

Strong points:

  1. The system is inspired from map and reduce operations from functional programming languages, which is very much different from the way I thought about distributed data processing. This approach divided the programs into two smaller pieces. This division is easy to do in some situations and very important for the task distribution.
  2. The master in this system has relatively light-weighted task to do than workers. It only has to maintain very small amount of data and little computation as well. This means that one master could be used to support a large amount of workers.
  3. The output of distributed computation is guaranteed to be the same as a non-faulty sequential execution of the program when the map and/or reduce operators are non-deterministic. This is ensured by the master control as well as the underlying filesystem. I’m not sure about other data processing models (like spark) but this seems like a strong and necessary point to me.
  4. All the refinements in section 4 are quite useful in real world engineering. I can only imagine people get saved by bad records skipping and debug with local execution all the time.

Weak points:

  1. Each idle worker machine will be assigned with map or reduce task which means that one machine can only do a certain kind of job during a single MapReduce function. Does this mean that reduce workers kinda have to wait for the map workers to finish first? Also the map workers might be idle when reduce workers are busy. Even with the best coordination, we would still waiting time and wasted computing resources. Moreover, reduce workers have to load data from the disk of map workers.
  2. From the parallel programming’s point of view, if one worker node is slow but still in-progress (which means that it could be assigned with work), it will slow down the whole MapReduce process. Lol I wrote this down long before I read the section 3.6 and they think it’s one of the common causes that lengthens the processing time and call it straggler. They have a solution that’s simple and practical so I guess this one should go to the strong points.
  3. If the Map task splits are relatively the same in terms of computation and we have a bunch of node with same hardware and configuration. Map tasks running on different machines are very likely to be finished at the exact same time and this will cause the network bandwidth consumption to be extremely bursty since all the intermediate data will be moved from M workers to R workers using the same network media in the data center. It will significantly slows down the data transfer and the total running time.
  4. Based on the previous three points, why don’t we divide the input into N splits which is hundreds times of the number of machines in the cluster and then assign the inputs to workers with both M and R tasks? In that way if a node is slow, it would be assigned withe less splits and there’s virtually no waiting time for intermediate data transfer as well as waste of computation resource.
  5. MapReduce operation will be aborted if master fails. Even though it’s unlikely to happen considering there is only one master machine, the failure of master might cause huge loss on the computation progress. I don’t think it’s hard to keep a replica of some sort in different network and power supply. The replica serves as data backups under normal circumstances but will act as master if the original one fails. However, this will also bring overheads to master operations since we have to backup the computation states and other master data structures constantly before commit. I guess it’s more of a trade-off between reliability and performance than a weak point.
  6. MapReduce is not really flexible enough for all the big data processing situations. For example, there are cases when my operations couldn’t be expressed in the way of map and reduce. What if we have a huge data set that is used by all but in different computation method (in this case the data passed into each computer should be the same but the program tasks are not)? MapReduce has a rigid way of seeing programs/functions/tasks as a completely different thing from data. Can I use lambda calculus and see data and function as the same thing in distributed computing? That way the computation will no longer be a constraint of distribution. I can pass functions and/or data to all the workers with coordination and load-balancing without worrying about the trouble to figure out the map and reduce functions.


Paper Outline:

  1. Introduction:

    • Straightforward computation but large data;
    • Inspired from Lisp and other functional languages:
      • map: compute a set of intermediate k/v pairs;
      • reduce: merge all intermediate values with the same key;
  2. Programming model:

    • Example:

      • map takes a document as (name, content) pair and return a list of (word, 1) pairs to the reduce function;
      • reduce takes intermediate pairs with the same keys, which is word in this case, and increment to the total count of that word;
    • Types:

      • map: (k1, v1) -> list(k2, v2);
      • reduce: (k2, list(v2)) -> list(v2);
      • input k/v are drawn from a different domain than the output k/v;
      • intermediate k/v are from the same domain as output k/v;
    • More examples:

      • distributed grep;
      • count for URL access frequency;
      • reverse web-link graph;
      • term-vector per host;
      • inverted index;
      • distributed sort;
  3. Implementation:

    • Environment:

      • a cluster with thousands with commodity machine connected with commodity networking hardware;
      • supported with distributed file system using replication to provide availability and reliability;
      • a scheduling system takes orders from users;
    • Execution overview:

      • input will be divided (in parallel by different machines) into M splits, with a typical size of 16 to 64 MB per split;
      • one of the program copy acts as master and assign work to the workers. It picks idle workers and assigns each one a M or R task;
      • the workers with M task read the content of corresponding input splits and buffer the intermediate output k/v pairs in the memory;
      • the buffered pairs are saved to local disks periodically, partitioned into R regions. The locations of those pairs are passed to the master to perform R tasking assignment;
      • reduce workers are notified by the master about the data locations. They read the buffered data from the local disks of map workers;
      • reduce workers sorts the data so that the pairs with the same keys are grouped together. Sorting is necessary to reduce the workload and will be external if needed;
      • reduce workers iterate though the data and append the reduce function output to a final output;
      • master wakes up user program and returns;
    • Master data structures:

      • state (idle, in-progress or completed) and identities of the workers;
      • locations of intermediate file regions;
    • Fault tolerance:

      • worker failure:
        • detected with periodical ping from the master;
        • any finished tasks will be reassigned to other nodes upon failure;
        • finished map tasks will be reassigned to other nodes upon failure;
          • because the intermediate data is stored in local disk and therefore inaccessible. On the other hand, completed reduce tasks output will be stored globally;
          • all R workers will be notified about the reassignment;
      • master failure:
        • periodic checkpoints of the master data structures;
        • computation will be aborted if master fails;
          • highly unlikely for single master machine;
      • semantics in the presence of failures:
        • same as sequential execution if M/R is deterministic because:
          • atomic commits of map and reduce task output;
          • master ignores duplicated commit for map operations;
          • underlying file system guarantees to eliminate duplicated reduce operation output;
        • weak but reasonable semantics if M/R is non-deterministic;
          • due to different execution orderings;
    • Locality:

      • input data is stored on the local disks of the machines in the cluster;
      • files are divided into blocks of 64 MB and stores multiple copies (typically three) over the cluster;
      • master will try to save bandwidth by:
        • assign machines with data replications as M workers;
        • assign machines near data replicas as M workers;
    • Task granularity

      • the number of M and R tasks should be significantly bigger than the number of machines to improve dynamic load balancing and speed up recovery upon node failure;
      • O(M+R) scheduling decisions and O(M*R) task states should be handled by the master machine;
      • R shouldn’t be too large otherwise it might result in fragmentation of the output file;
    • Backup tools:

      • straggler: a machine which takes too long to complete the task;
      • the master perform reassignment of the in-progress tasks when it comes close to the end of MapReduce operation;
      • little overhead in theory but works in large MapReduce operations;
  4. Refinements:

    • Partitioning function:

      • “hash(key) mod R” as partitioning function;
      • other ones with special needs;
    • Ordering guarantees:

      • within a given partition, intermediate k/v pairs are processed in increasing key order;
        • easier to sort output file per partition;
    • Combiner function:

      • combiner function can partially merge intermediate data;
        • executed on each machine that performs a map task;
        • typically the same as reduce function;
    • Input and output types:

      • user can make their own support for input type by providing reader interface;
    • Side-effects:

      • no support for multiple output files from a single task;
    • Skipping bad records:

      • optional mode of execution to ignore bad records and move on;
    • Local execution;

    • Status Information;

    • Counters;




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s