Paper Review: Pig Latin: A Not-So-Foreign Language for Data Processing

Summary:

Pig is a scripting language built on top of MapReduce. It tries to save data scientists from the hand-coded MapReduce programs they struggle with. It combines SQL-style querying with a procedural language, and compiles highly abstract data flow programs into MapReduce executions, so developers don’t have to worry much about parallelism, fault tolerance, pipelining and so on.

Strong points:

  1. Pig sits between SQL-style querying and low-level MapReduce (high-level queries plus low-level manipulations), which gives it some of the benefits of both. Programmers can easily write “sequential (not declarative)” programs without worrying about database layout or MapReduce parallelism.
  2. Pig is flexible in many ways: it supports various data models and user-defined functions (UDFs) written in Java and, in later versions, Python. You can specify what the program does at each step, just as in any other procedural language, while still focusing on the data flow.
  3. Pig also differs from relational databases and OLAP systems: it can deal with large unstructured datasets and nested data structures. At the same time, it enjoys parallelism just like a parallel database, thanks to the compilation into MapReduce.
  4. Each step is pipelined and optimized for execution. Because Pig runs on MapReduce, it inherits fault tolerance, scalability and the other MR features for free. As a developer you don’t have to worry much about the lower levels of the system.
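
To make the "data flow compiled into MapReduce" idea a bit more concrete, here is a minimal Python sketch of how a filter, group and aggregate sequence could map onto a map phase, a shuffle and a reduce phase. This is not Pig's actual compiler or API; the `VISITS` data, the function names and the `min_seconds` threshold are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical input: (user, url, seconds_on_page) records.
VISITS = [
    ("amy", "a.com", 12),
    ("bob", "a.com", 3),
    ("amy", "b.com", 40),
    ("cat", "b.com", 25),
    ("bob", "c.com", 1),
]


def map_phase(records, min_seconds):
    """Filter step folded into the map side: emit (url, seconds) pairs."""
    for _user, url, seconds in records:
        if seconds >= min_seconds:
            yield url, seconds


def shuffle(pairs):
    """Group step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Per-group aggregation: average time on page for each url."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def run_dataflow(records, min_seconds=5):
    """Run the three steps in order, as one simulated MapReduce job."""
    return reduce_phase(shuffle(map_phase(records, min_seconds)))


if __name__ == "__main__":
    print(run_dataflow(VISITS))  # {'a.com': 12.0, 'b.com': 32.5}
```

The user only describes the three steps; where the map/reduce boundary falls is decided by the system, which is roughly the convenience the strong points above describe.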

Weak points:

  1. First of all, Pig programs are still compiled into MapReduce, which limits both efficiency and flexibility. Data must be materialized and replicated on distributed storage between successive MapReduce steps, and this makes things much slower even with pipelining. Bandwidth might also become an issue when pipelined Pig programs are running: at the end of each step, all the intermediate data is transferred over the network almost simultaneously.
  2. The actual execution order is not necessarily the sequential order in which the Pig program is written. This is also mentioned in the “future work” part of the paper: some steps may be reordered for better performance. While this is a good feature in most scenarios, it could be dangerous, and hard to debug and fix, in some cases.
  3. The flexibility can be a bit of a challenge for developers. For example, only a small set of parallelized primitives is included in Pig, and users are responsible for implementing the rest, as well as for its efficiency. I can imagine it would be a pain to develop an efficient UDF without support from the community.
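
To illustrate the kind of efficiency burden a UDF author takes on, here is a small Python sketch contrasting a naive average with one split into initial, intermediate and final steps so that partial results can be combined per partition before the final merge. It is only an illustration under my own assumptions: the function names and the list-of-lists `partitions` input are hypothetical and are not Pig's UDF interface.

```python
from functools import reduce


def naive_average(values):
    """Needs every value of the group in one place before it can run."""
    values = list(values)
    return sum(values) / len(values)


def initial(value):
    """Turn one input value into a partial result: (sum, count)."""
    return (value, 1)


def intermediate(a, b):
    """Merge two partial results; associative, so it can run per partition."""
    return (a[0] + b[0], a[1] + b[1])


def final(partial):
    """Turn the merged partial result into the answer."""
    total, count = partial
    return total / count


def algebraic_average(partitions):
    """Combine within each non-empty partition first, then across partitions."""
    partials = [
        reduce(intermediate, map(initial, part)) for part in partitions if part
    ]
    return final(reduce(intermediate, partials))


if __name__ == "__main__":
    parts = [[1, 2, 3], [4, 5], [6]]
    flat = [v for part in parts for v in part]
    print(naive_average(flat), algebraic_average(parts))  # 3.5 3.5
```

Both versions give the same answer, but only the second lets each partition ship a single (sum, count) pair instead of all its values. Getting that decomposition right is left to whoever writes the function.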