Plan
• The Map/Reduce model
• Two performance problems and ways out
– Blocking steps in the Map/Reduce processing pipeline
– Non-selective data access
• Conclusion: toward Map/Reduce optimization?
Map/Reduce
and Hadoop performance
Plan
• The Map/Reduce model
• Two performance problems and ways out
– Blocking steps in the Map/Reduce processing pipeline
– Non-selective data access
• Conclusion: toward Map/Reduce optimization?
The Map/Reduce model
• Problem:
– How to compute in a distributed fashion a given processing to a very large amount of data
• Map/Reduce solution:
– Programmer: express the processing as
• Splitting the data
• Extract (key, value pairs) from each partition (MAP)
• Compute partial results for each key (REDUCE)
– Map/Reduce platform (e.g., Hadoop):
• Distributes partitions, runs one MAP task per partition
• Runs one or several REDUCE tasks per key
• Sends data across machines from MAP to REDUCE
Map/Reduce in detail
Hadoop Map/Reduce
Performance
problem 1:
Idle CPU due to blocking steps
Hadoop resource usage
Hadoop benchmark
[Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011]
CPU stalls during I/O intensive Merge
Reduce strictly follows Merge
Hash-based algorithms to improve Hadoop performance
[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]
Main idea: use non-blocking hash-based algorithms to group items by keys during Map.LocalSort and Reduce.Merge
Principle of hashing:
• Partitions can be in memory or flushed to disk
• If the reduce works incrementally, early send
Performance
problem 2:
non-selective data access
Data access in Hadoop
• Basic model: read all the data
– If the tasks are selective, we don't really need to!
• Database indexes? But:
– Map/Reduce works on top of a file system (e.g. Hadoop file system, HDFS)
– Data is stored only once
– Hard to foresee all future processing
• "Exploratory nature" of Hadoop
Accelerating data access in Hadoop
• Idea 1:
Hadop++
[Jindal, Quiané-Ruiz, Dittrich,
ACM SOCC, 2011]
– Add header information to each data split, summarizing split attribute values
– Modify the RecordReader of HDFS, used by the Map() will prune irrelevant splits
Accelerating data access in Hadoop
• Idea 2: HAIL
[Dittrich, Quiané-Ruiz, Richter,
Schuh, Jindal, Schad,
PVLDB 2012]
– Each storage
node builds an
in-memory,
clustered index of the data in its split
– There are
three copies of each split for reliability à
Build three
different indexes! J
– Customize RecordReader
Hadoop, Hadoop++ and HAIL
Conclusion
Toward Map/Reduce optimization
Optimization defined as Hadoop tuning
Hadoop parameters:
[Herodotou, technical report, Duke Univ, 2011]
Hadoop tuning
[Babu, ACM SoCC 2010]
Hadoop performance model
Hadoop performance model
References
• Shivnath Babu. "Towards automatic optimization of MapReduce programs", ACM SoCC, 2010
• Herodotos Herodotou. "Hadoop Performance Models". Duke University, 2011
• Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. "A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011
• Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jorg Schad. "Only Aggressive Elephants are Fast Elephants", VLDB 2012
• A.Jindal,J.-A.Quiané-Ruiz,andJ.Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant" SOCC, 2011
• Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows" PVLDB 2012
'myPPT' 카테고리의 다른 글
사회 갈등과 분쟁을 해결하는 수단으로서의 법의 역할 (0) | 2013.09.13 |
---|---|
Values & Vision of the siemens ::Highest performance with the highest ethics & our aspirations for the future (0) | 2013.09.09 |
직접높임과 간접높임의 차이 (3) | 2013.07.25 |
국어 문법:::안은문장과 안긴문장 (5) | 2013.07.24 |
学中文Learning Chinese중국어 입문 (0) | 2013.07.22 |