Monday, September 2, 2013

MapReduce Tutorial - getting started

Hello, guys!!

Maybe some of you have heard about 'MapReduce' but did not have time to look into it, or did not get the main idea after reading its definition in the original paper published by Google. And when you search for a 'MapReduce' tutorial, the first page of results is full of complicated tutorials, nothing you can read in 5 minutes. So let's see, in a simple way, what MapReduce is!

MapReduce is a distributed and parallel approach to handling large amounts of data (datasets). This approach has 2 components - Map and Reduce (as the name suggests). Instead of a deep and boring technical explanation, let's make an analogy - we can compare MapReduce with making popcorn. How do we make delicious popcorn? If you don't remember, take a look at the image below:



Easy, isn't it? So, when we make popcorn, we take an ear of corn and turn it into popcorn. The corn has a different form at the beginning and at the end of the process, but we deal with corn and its intermediate states the whole time. Map and Reduce are similar to this:




Technically speaking, Map takes a set of input data (input data == ear of corn) and generates a set of key-value pairs (key-value pairs == corn grains). Reduce applies a new function over these key-value pairs (this function == adding the corn grains to the pan) and transforms them into data with a new format (data with new format == popcorn).
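To make the two steps a bit more concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration, not how Hadoop or Google's implementation works internally. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the tiny word-counting input are assumptions made up for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: turn one input record into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all values that share a key (done between Map and Reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: apply a function (here, a sum) to the values of one key."""
    return key, sum(values)

def run_job(documents):
    """Run Map on every record, group by key, then Reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical input: two tiny "documents".
counts = run_job(["pop corn pop", "corn pan"])
print(counts)
```

In the analogy, each document is an ear of corn, the `(word, 1)` pairs are the corn grains, and the final dictionary is the popcorn.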

This is it! In future posts I'll show what we can do with MapReduce and when we can use this approach.

If you are interested, you can take a look at this tutorial or at the Apache Hadoop MapReduce tutorial.
Recently, Fabiane Nardon gave a nice talk at QConSP about Big Data and published nice slides about MapReduce. Slides here.

This is it! Hope you have enjoyed the explanation.

3 comments:

  1. I like the analogy, but I think it is a bit too generic for the MapReduce 'world'. This is especially true since not all MapReduce jobs require both mapping and reducing, but that's another story. The best and simplest example is the hello world of MapReduce (which is what I assume you're following up with): 'word count'.

    But by staying within this analogy, a better example of the MapReduce algorithm is like having a big bowl of corn kernels, green peas, chick peas, and numerous other lentils. Mapping involves going through the bowl in handfuls or one-by-one and determining what you find. The task of the mapper could involve assigning each edible object a 'key' and 'value', like you mentioned. For example, the value could be size, mass, nutritional data, or simply its count (1 corn kernel = 1 count).

    The 'key' is the identifier the reducer uses to organize the mess of data coming from the bowl, so this could be some of the 'values' mentioned before (i.e. those that are distinct), but let's say we go with the bowl element's name. Now, with key-value pairs like ('corn', 4) and ('pea', 2), where I've chosen to assign values of mass (grams), the reducer could handle the task of grouping all my keyed foods and computing the total mass of each group. In the end, this would be like having my messy bowl of data separated into new serving bowls. More importantly, we'd have new and useful data about each bowl.

    The true power of MapReduce comes from its parallelism. In short, this means having multiple mappers and/or reducers running. Within the analogy, this would mean inviting my friends over and having them do the tasks of mapper and reducer. The more friends, the faster it can possibly be done. Fun :-)
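The 'food bowl' idea in the comment above can be sketched in a few lines of Python. This is a hedged illustration only: the handfuls, the masses, and the names `mapper`, `reducer`, and `total_mass_by_food` are all made up for the example, and a thread pool stands in for the "friends". A real MapReduce cluster spreads this work over many machines, and Python threads will not actually speed up CPU-bound work like this.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each handful is a list of (food name, mass in grams).
handfuls = [
    [("corn", 4), ("pea", 2)],
    [("corn", 3), ("chickpea", 5)],
    [("pea", 1), ("corn", 2)],
]

def mapper(handful):
    """Map: look at one handful and emit a (key, value) pair per item."""
    return [(name, mass) for name, mass in handful]

def reducer(name, masses):
    """Reduce: compute the total mass for one kind of food."""
    return name, sum(masses)

def total_mass_by_food(handfuls, workers=3):
    # Each "friend" (worker thread) maps one handful at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, handfuls))
    # Group the pairs by key: one serving bowl per food name.
    bowls = defaultdict(list)
    for pairs in mapped:
        for name, mass in pairs:
            bowls[name].append(mass)
    # Reduce each serving bowl to its total mass.
    return dict(reducer(name, masses) for name, masses in bowls.items())

totals = total_mass_by_food(handfuls)
print(totals)
```

Each entry of `totals` corresponds to one serving bowl, together with the new piece of data computed about it (its total mass in grams).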

    1. Hello, Sean! Thanks for the comment!

      I wanted to write a post for absolute beginners, so I thought the popcorn analogy was the simplest one! I totally agree with your example of word count, but I think it is one step ahead - it's simple, but a little more complex than the popcorn one. I wanted to put some emphasis on data transformation before talking about parallelism - we are releasing a small series of MapReduce posts (feel free to take a look at the other posts of the blog =D ) and we want to talk about all the main ideas, including the power of parallelism.

      I really enjoyed the ideas you gave here for further analogies! Mind if we use them as inspiration for the next examples?

      Thanks again for the comment!

  2. Of course, the 'food bowl' is yours to use :-)
