Hello, guys! A few days ago we gave you a quick introduction to what MapReduce is. Today we will show you in which scenarios we can use it.
MapReduce typically transforms data. This makes it perfect for log analysis over large amounts of data. First, what do people call a large amount of data? Can we consider lots of megabytes or gigabytes a large amount? Usually not. In general, when we mention terabytes or even petabytes we are talking about a large amount of data. MapReduce strategies prove very useful in these scenarios: you can map the log entries to key/value pairs and apply a reduce function to them to make the data compatible with an entry of your processing system.
Imagine that we have some terabytes of log recording each user id and the website that user accessed. The log would have the following format:
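The original sample is not reproduced here, but a hypothetical two-column format (user id, then the website visited, with repeated visits appearing as repeated lines) could look like:

```
user1 www.site-a.com
user1 www.site-b.com
user2 www.site-a.com
user1 www.site-a.com
```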
Consider that you have the task of determining how many different websites each user visited. How could we do that? We would need to obtain the following result:
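For example, assuming a log where user1 visited two distinct websites and user2 visited only one, the result would look something like (user id, then the count of distinct websites):

```
user1 2
user2 1
```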
Basically, we need to transform data, just like we transformed a raw ear of corn into popcorn in the previous post. There are several ways to solve this. One possible alternative would be:
1. Group by users:
2. Remove duplicate websites:
3. Finally, count the lines in each group:
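The full Hadoop version is the subject of an upcoming post, but the three steps above can be sketched in plain Java. This is just an illustration, a minimal sketch assuming the hypothetical `"<userId> <website>"` log format; the class and method names are our own, not part of any library:

```java
import java.util.*;

public class DistinctSiteCounter {

    // Count how many distinct websites each user visited.
    // Assumes each log line has the hypothetical format: "<userId> <website>".
    public static Map<String, Integer> countDistinctSites(List<String> logLines) {
        // Step 1: group websites by user.
        Map<String, Set<String>> sitesByUser = new HashMap<>();
        for (String line : logLines) {
            String[] parts = line.trim().split("\\s+");
            String user = parts[0];
            String website = parts[1];
            // Step 2: a Set discards duplicate websites automatically.
            sitesByUser.computeIfAbsent(user, k -> new TreeSet<>()).add(website);
        }
        // Step 3: count the entries left in each group.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : sitesByUser.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }
}
```

Grouping is the "map" side of the job (emitting (user, website) pairs keyed by user), and the deduplicate-and-count step is the "reduce" side, which is exactly the sequence we just described.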
Guess what we just did: a Map and Reduce sequence!
In the next post we will introduce Hadoop and present its basic concepts. Soon we will implement the Java code that can handle this task.
That's it! We hope you have gotten the main idea of when we can use MapReduce!
You can check extra info about this here and here.