Vorsprung durch Technik

블로그 이미지

MSNU

Hadoop

myPPT 2013. 8. 21. 11:47

Map/Reduce and Hadoop performance

Plan

• The Map/Reduce model

• Two performance problems and ways out

– Blocking steps in the Map/Reduce processing pipeline

– Non-selective data access

• Conclusion: toward Map/Reduce optimization?


























Map/Reduce and Hadoop performance

Plan

•       The Map/Reduce model

•       Two performance problems and ways out

–   Blocking steps in the Map/Reduce processing pipeline

–   Non-selective data access

•       Conclusion: toward Map/Reduce optimization?

 

The Map/Reduce model

•         Problem:

–     How to compute in a distributed fashion a given processing to a very large amount of data

•         Map/Reduce solution:

–    Programmer: express the processing as

•     Splitting the data

•     Extract (key, value pairs) from each partition (MAP)

•     Compute partial results for each key (REDUCE)

–    Map/Reduce platform (e.g., Hadoop):

•     Distributes partitions, runs one MAP task per partition

•     Runs one or several REDUCE tasks per key

•     Sends data across machines from MAP to REDUCE

Map/Reduce in detail

Hadoop Map/Reduce

Performance problem 1:
Idle CPU due to blocking steps

Hadoop resource usage

Hadoop benchmark

[Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011]

 

 

 

 

 

 

 

 

CPU stalls during I/O intensive Merge

Reduce strictly follows Merge

Hash-based algorithms to improve Hadoop performance

[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Main idea: use non-blocking hash-based algorithms to group items by keys during Map.LocalSort and Reduce.Merge

Principle of hashing:

 

 

 

•         Partitions can be in memory or flushed to disk

•         If the reduce works incrementally, early send

Performance problem 2:
non-selective data access

Data access in Hadoop

•       Basic model: read all the data

–   If the tasks are selective, we don't really need to!

•       Database indexes? But:

–   Map/Reduce works on top of a file system (e.g. Hadoop file system, HDFS)

–   Data is stored only once

–   Hard to foresee all future processing

•     "Exploratory nature" of Hadoop

Accelerating data access in Hadoop

•       Idea 1: Hadop++
[Jindal, Quiané-Ruiz, Dittrich,
ACM SOCC, 2011]

–   Add header information to each data split, summarizing split attribute values

–   Modify the RecordReader of HDFS, used by the Map() will prune irrelevant splits

Accelerating data access in Hadoop

•       Idea 2: HAIL
[Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012]

–   Each storage node builds an
in-memory, clustered index of the data in its split

–   There are three copies of each split for reliability à
Build
three different indexes! J

–   Customize RecordReader

Hadoop, Hadoop++ and HAIL

Conclusion

Toward Map/Reduce optimization

Optimization defined as Hadoop tuning

Hadoop parameters:

[Herodotou, technical report, Duke Univ, 2011]

Hadoop tuning

[Babu, ACM SoCC 2010]

Hadoop performance model

Hadoop performance model

References

•         Shivnath Babu. "Towards automatic optimization of MapReduce programs", ACM SoCC, 2010

•         Herodotos Herodotou. "Hadoop Performance Models". Duke University, 2011

•         Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. "A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011

•         Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jorg Schad. "Only Aggressive Elephants are Fast Elephants", VLDB 2012

•         A.Jindal,J.-A.Quiané-Ruiz,andJ.Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant" SOCC, 2011

•         Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows" PVLDB 2012






'myPPT' 카테고리의 다른 글

사회 갈등과 분쟁을 해결하는 수단으로서의 법의 역할  (0) 2013.09.13
Values & Vision of the siemens ::Highest performance with the highest ethics & our aspirations for the future  (0) 2013.09.09
직접높임과 간접높임의 차이  (3) 2013.07.25
국어 문법:::안은문장과 안긴문장  (5) 2013.07.24
学中文Learning Chinese중국어 입문  (0) 2013.07.22
Posted by MSNU






favicon

Vorsprung durch Technik

  • 태그
  • 링크 추가
  • 방명록

관리자 메뉴

  • 관리자 모드
  • 글쓰기
  • 분류 전체보기 (993)
    • myPPT (813)
    • 시리즈 (164)
      • 연소 (14)
      • 경제 (5)

카테고리

PC화면 보기 티스토리 Daum

티스토리툴바