Hadoop

Map/Reduce and Hadoop performance

Plan

The Map/Reduce model

Two performance problems and ways out

Blocking steps in the Map/Reduce processing pipeline

Non-selective data access

Conclusion: toward Map/Reduce optimization?

The Map/Reduce model

         Problem:

     How to apply a given computation, in a distributed fashion, to a very large amount of data

         Map/Reduce solution:

    Programmer: express the processing as

     Splitting the data

     Extract (key, value) pairs from each partition (MAP)

     Compute partial results for each key (REDUCE)

    Map/Reduce platform (e.g., Hadoop):

     Distributes partitions, runs one MAP task per partition

     Runs one or several REDUCE tasks, each in charge of a set of keys

     Sends data across machines from MAP to REDUCE
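The division of labor above can be sketched in a few lines of Python: a toy, single-machine simulation of the MAP, shuffle, and REDUCE phases on a word-count job (illustrative only, not Hadoop code):

```python
from collections import defaultdict

# MAP: extract (key, value) pairs from one partition of the input.
def map_task(partition):
    for line in partition:
        for word in line.split():
            yield (word, 1)

# SHUFFLE: the platform groups pairs by key across all MAP outputs
# (in Hadoop, this is the data sent across machines from MAP to REDUCE).
def shuffle(map_outputs):
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

# REDUCE: compute the partial result (here, a sum) for one key.
def reduce_task(key, values):
    return (key, sum(values))

# Simulate two partitions of a text dataset.
partitions = [["the cat sat"], ["the dog sat", "the cat"]]
map_outputs = [list(map_task(p)) for p in partitions]
result = dict(reduce_task(k, vs) for k, vs in shuffle(map_outputs).items())
print(result)  # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1}
```

In the real platform, each `map_task` runs on the machine holding its partition, and the shuffle moves data over the network.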

Map/Reduce in detail

Hadoop Map/Reduce

Performance problem 1:
Idle CPU due to blocking steps

Hadoop resource usage

Hadoop benchmark

[Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011]
CPU stalls during I/O intensive Merge

Reduce strictly follows Merge

Hash-based algorithms to improve Hadoop performance

[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Main idea: use non-blocking hash-based algorithms to group items by keys during Map.LocalSort and Reduce.Merge

Principle of hashing:
         Hash partitions can be kept in memory or flushed to disk

         If the reduce function works incrementally, partial results can be sent early
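The hashing principle can be sketched as follows (a toy in-memory aggregator, not the paper's implementation): incoming (key, value) pairs update a hash table as they arrive, so no blocking sort or merge step is needed, and a partial result per key is available at any time:

```python
# Non-blocking, hash-based aggregation: each incoming pair immediately
# updates the running aggregate for its key in a hash table.
def hash_aggregate(pairs, reduce_fn, init):
    table = {}
    for key, value in pairs:
        table[key] = reduce_fn(table.get(key, init), value)
    return table

# Example stream of (key, value) pairs from the MAP phase.
stream = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
totals = hash_aggregate(stream, lambda acc, v: acc + v, 0)
print(totals)  # {'a': 4, 'b': 6}
```

Contrast with sort-merge: there, all pairs must be sorted (a blocking step) before any reduce output can be produced.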

Performance problem 2:
non-selective data access

Data access in Hadoop

       Basic model: read all the data

   If the tasks are selective, we don't really need to!

       Database indexes? But:

    Map/Reduce works on top of a file system (e.g., the Hadoop Distributed File System, HDFS)

   Data is stored only once

   Hard to foresee all future processing

     "Exploratory nature" of Hadoop

Accelerating data access in Hadoop

       Idea 1: Hadoop++
[Jindal, Quiané-Ruiz, Dittrich,
ACM SoCC, 2011]

   Add header information to each data split, summarizing split attribute values

    Modify the RecordReader of HDFS, used by Map(), so that it prunes irrelevant splits
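The split-pruning idea can be sketched as follows (the min/max headers and `scan_with_pruning` are illustrative stand-ins, not the actual Hadoop++ format):

```python
# Each split carries a small header summarizing its attribute values
# (here, the min and max of one numeric attribute).
splits = [
    {"min": 0,  "max": 9,  "records": [3, 7]},
    {"min": 10, "max": 19, "records": [12, 15]},
    {"min": 20, "max": 29, "records": [21, 28]},
]

# A selective scan: skip any split whose header proves it cannot
# contain a record in the requested range [lo, hi].
def scan_with_pruning(splits, lo, hi):
    out = []
    for s in splits:
        if s["max"] < lo or s["min"] > hi:
            continue  # header rules the split out: never read its records
        out.extend(r for r in s["records"] if lo <= r <= hi)
    return out

print(scan_with_pruning(splits, 10, 19))  # [12, 15]
```

Only one of the three splits is actually read; the other two are pruned from the header alone.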

Accelerating data access in Hadoop

       Idea 2: HAIL
[Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012]

   Each storage node builds an
in-memory, clustered index of the data in its split

    Three copies of each split are kept for reliability → build three different indexes!

   Customize RecordReader
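A hedged sketch of the per-replica idea (toy records, with a sorted list and binary search standing in for HAIL's clustered indexes): each replica is ordered on a different attribute, and a selection is routed to the replica sorted on its filter attribute:

```python
import bisect

records = [{"a": 3, "b": 9}, {"a": 1, "b": 5}, {"a": 2, "b": 7}]

# One replica per attribute, each clustered (sorted) on that attribute.
replicas = {
    attr: sorted(records, key=lambda r: r[attr])
    for attr in ("a", "b")
}

def lookup(attr, value):
    replica = replicas[attr]               # route to the matching replica
    keys = [r[attr] for r in replica]
    i = bisect.bisect_left(keys, value)    # binary search in sorted order
    return [r for r in replica[i:] if r[attr] == value]

print(lookup("b", 7))  # [{'a': 2, 'b': 7}]
```

A query filtering on `a` uses one replica, a query filtering on `b` another; the redundancy required for reliability is reused for access-path diversity.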

Hadoop, Hadoop++ and HAIL

Conclusion

Toward Map/Reduce optimization

Optimization defined as Hadoop tuning

Hadoop parameters:

[Herodotou, technical report, Duke Univ, 2011]

Hadoop tuning

[Babu, ACM SoCC 2010]

Hadoop performance model

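Model-based tuning can be illustrated with a deliberately simplified sketch: evaluate a cost model over candidate settings of one Hadoop-style parameter and keep the cheapest. The cost function below is made up for illustration; real models (e.g., Herodotou's) are far more detailed:

```python
# Toy cost model: a bigger sort buffer means fewer spills to disk,
# but memory itself has a (made-up) cost. All constants are invented.
def job_cost(sort_buffer_mb, data_mb=1024):
    spills = max(1, data_mb // sort_buffer_mb)
    return spills * 2.0 + sort_buffer_mb * 0.01

# Tuning = search over candidate parameter settings under the model.
candidates = [64, 128, 256, 512]
best = min(candidates, key=job_cost)
print(best)  # 512
```

The point is the structure, not the numbers: a performance model turns parameter tuning into a search problem that an optimizer can automate.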

References

         Shivnath Babu. "Towards automatic optimization of MapReduce programs", ACM SoCC, 2010

         Herodotos Herodotou. "Hadoop Performance Models". Duke University, 2011

         Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. "A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011

         Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad. "Only Aggressive Elephants are Fast Elephants", PVLDB 2012

         Alekh Jindal, Jorge-Arnulfo Quiané-Ruiz, Jens Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant", ACM SoCC, 2011

         Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows" PVLDB 2012





