Hadoop

myPPT 2013. 8. 21. 11:47

Map/Reduce and Hadoop performance

Plan

• The Map/Reduce model

• Two performance problems and ways out

– Blocking steps in the Map/Reduce processing pipeline

– Non-selective data access

• Conclusion: toward Map/Reduce optimization?

Map/Reduce and Hadoop performance

Plan

• The Map/Reduce model

• Two performance problems and ways out

– Blocking steps in the Map/Reduce processing pipeline

– Non-selective data access

• Conclusion: toward Map/Reduce optimization?

The Map/Reduce model

• Problem:

– How to compute in a distributed fashion a given processing to a very large amount of data

• Map/Reduce solution:

– Programmer: express the processing as

• Splitting the data

• Extract (key, value pairs) from each partition (MAP)

• Compute partial results for each key (REDUCE)

– Map/Reduce platform (e.g., Hadoop):

• Distributes partitions, runs one MAP task per partition

• Runs one or several REDUCE tasks per key

• Sends data across machines from MAP to REDUCE

Map/Reduce in detail

Hadoop Map/Reduce

Performance problem 1:
Idle CPU due to blocking steps

Hadoop resource usage

Hadoop benchmark

[Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011]

CPU stalls during I/O intensive Merge

Reduce strictly follows Merge

Hash-based algorithms to improve Hadoop performance

[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Main idea: use non-blocking hash-based algorithms to group items by keys during Map.LocalSort and Reduce.Merge

Principle of hashing:

• Partitions can be in memory or flushed to disk

• If the reduce works incrementally, early send

Performance problem 2:
non-selective data access

Data access in Hadoop

• Basic model: read all the data

– If the tasks are selective, we don't really need to!

• Database indexes? But:

– Map/Reduce works on top of a file system (e.g. Hadoop file system, HDFS)

– Data is stored only once

– Hard to foresee all future processing

• "Exploratory nature" of Hadoop

Accelerating data access in Hadoop

• Idea 1: Hadop++
[Jindal, Quiané-Ruiz, Dittrich,
ACM SOCC, 2011]

– Add header information to each data split, summarizing split attribute values

– Modify the RecordReader of HDFS, used by the Map() will prune irrelevant splits

Accelerating data access in Hadoop

• Idea 2: HAIL
[Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012]

– Each storage node builds an
in-memory, clustered index of the data in its split

– There are three copies of each split for reliability à
Build three different indexes! J

– Customize RecordReader

Hadoop, Hadoop++ and HAIL

Conclusion

Toward Map/Reduce optimization

Optimization defined as Hadoop tuning

Hadoop parameters:

[Herodotou, technical report, Duke Univ, 2011]

Hadoop tuning

[Babu, ACM SoCC 2010]

Hadoop performance model

References

• Shivnath Babu. "Towards automatic optimization of MapReduce programs", ACM SoCC, 2010

• Herodotos Herodotou. "Hadoop Performance Models". Duke University, 2011

• Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. "A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011

• Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jorg Schad. "Only Aggressive Elephants are Fast Elephants", VLDB 2012

• A.Jindal,J.-A.Quiané-Ruiz,andJ.Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant" SOCC, 2011

• Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows" PVLDB 2012

'myPPT' 카테고리의 다른 글

사회 갈등과 분쟁을 해결하는 수단으로서의 법의 역할 (0)	2013.09.13
Values & Vision of the siemens ::Highest performance with the highest ethics & our aspirations for the future (0)	2013.09.09
직접높임과 간접높임의 차이 (3)	2013.07.25
국어 문법:::안은문장과 안긴문장 (5)	2013.07.24
学中文Learning Chinese중국어 입문 (0)	2013.07.22

Posted by MSNU

Hadoop

Map/Reduce and Hadoop performance

Plan

• The Map/Reduce model

• Two performance problems and ways out

– Blocking steps in the Map/Reduce processing pipeline

– Non-selective data access

• Conclusion: toward Map/Reduce optimization?

The Map/Reduce model

• Problem:

– How to compute in a distributed fashion a given processing to a very large amount of data

• Map/Reduce solution:

– Programmer: express the processing as

• Splitting the data

• Extract (key, value pairs) from each partition (MAP)

• Compute partial results for each key (REDUCE)

– Map/Reduce platform (e.g., Hadoop):

• Distributes partitions, runs one MAP task per partition

• Runs one or several REDUCE tasks per key

• Sends data across machines from MAP to REDUCE

Map/Reduce in detail

Hadoop Map/Reduce

Performance problem 1: Idle CPU due to blocking steps

Hadoop resource usage

Hadoop benchmark

[Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011]

CPU stalls during I/O intensive Merge

Reduce strictly follows Merge

Hash-based algorithms to improve Hadoop performance

[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Main idea: use non-blocking hash-based algorithms to group items by keys during Map.LocalSort and Reduce.Merge

Principle of hashing:

• Partitions can be in memory or flushed to disk

• If the reduce works incrementally, early send

Performance problem 2: non-selective data access

Data access in Hadoop

• Basic model: read all the data

– If the tasks are selective, we don't really need to!

• Database indexes? But:

– Map/Reduce works on top of a file system (e.g. Hadoop file system, HDFS)

– Data is stored only once

– Hard to foresee all future processing

• "Exploratory nature" of Hadoop

Accelerating data access in Hadoop

• Idea 1: Hadop++ [Jindal, Quiané-Ruiz, Dittrich, ACM SOCC, 2011]

– Add header information to each data split, summarizing split attribute values

– Modify the RecordReader of HDFS, used by the Map() will prune irrelevant splits

Accelerating data access in Hadoop

• Idea 2: HAIL [Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012]

– Each storage node builds an in-memory, clustered index of the data in its split

– There are three copies of each split for reliability à Build three different indexes! J

– Customize RecordReader

Hadoop, Hadoop++ and HAIL

Conclusion Toward Map/Reduce optimization

Optimization defined as Hadoop tuning

Hadoop parameters:

[Herodotou, technical report, Duke Univ, 2011]

Hadoop tuning

[Babu, ACM SoCC 2010]

Hadoop performance model

Hadoop performance model

References

• Shivnath Babu. "Towards automatic optimization of MapReduce programs", ACM SoCC, 2010

• Herodotos Herodotou. "Hadoop Performance Models". Duke University, 2011

• Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. "A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011

• Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jorg Schad. "Only Aggressive Elephants are Fast Elephants", VLDB 2012

• A.Jindal,J.-A.Quiané-Ruiz,andJ.Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant" SOCC, 2011

• Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows" PVLDB 2012

'myPPT' 카테고리의 다른 글

티스토리툴바

Map/Reduce and Hadoop performance

Performance problem 1:
Idle CPU due to blocking steps

Performance problem 2:
non-selective data access

• Idea 1: Hadop++
[Jindal, Quiané-Ruiz, Dittrich,
ACM SOCC, 2011]

• Idea 2: HAIL
[Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012]

– Each storage node builds an
in-memory, clustered index of the data in its split

– There are three copies of each split for reliability à
Build three different indexes! J

Conclusion

Toward Map/Reduce optimization