Spark, Spark Streaming & Tachyon

  • Published on
    20-Aug-2015

Transcript

  1. Spark, Spark Streaming & Tachyon: Solving big data problems without programming for big data
  2. Who am I? What do we do? Name: Johan Hong (johan.hong@pearson.com), Software Architect at Pearson Higher Education. We deliver personalized and connected learning at scale, and build an assessment platform with micro-services to serve internal and public services and applications.
  3. Definitions: Apache Spark is a fast and general engine for large-scale data processing. Apache Spark is a cluster computing platform designed to be fast and general-purpose. Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks.
  4. Spark Stack
  5. Stack with Tachyon
  6. Distributed Execution
  7. HDFS Architecture
  8. HDFS Block Replication
  9. Limitations of MapReduce
  10. Spark Runtime
  11. RDD is an Interface (Advanced Spark Internals and Tuning)
  12. Sample Application
  13. Narrow & Wide Dependencies
  14. Tachyon System Architecture (Tachyon: Memory Throughput I/O for Cluster Computing Frameworks)
  15. Spark Execution Plan
  16. Spark Executor on Worker Node
  17. Fault Tolerance in Spark Streaming: Could data be lost if the receiving node crashes before it replicates incoming data to other data node(s)? It happens. Ooyala loses 1% of its data, but that is considered acceptable. What can we do to prevent data loss? We could persist events before they reach the Spark Streaming receiver, then replay the events/messages after the receiver crashes and recovers.
  18. Spark Internal Operations
  19. Data Locality
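
The replay idea from slide 17 (persist events before they reach the receiver, then replay after a crash) can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark Streaming's API: `DurableLog`, `run_receiver`, `demo`, and the `checkpoint` dictionary are hypothetical names, the in-memory list stands in for a durable store such as a message queue, and the crash is simulated with an exception.

```python
from typing import Dict, List, Optional


class DurableLog:
    """Hypothetical stand-in for a durable, replayable event store."""

    def __init__(self) -> None:
        self._events: List[str] = []

    def append(self, event: str) -> int:
        """Persist an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> List[str]:
        """Return a copy of all events at or after the given offset."""
        return self._events[offset:]


def run_receiver(
    log: DurableLog,
    checkpoint: Dict[str, int],
    sink: List[str],
    crash_after: Optional[int] = None,
) -> None:
    """Process events starting at the checkpointed offset.

    `checkpoint` is assumed to survive a receiver crash (in a real system
    it would live in durable storage). `crash_after` simulates a crash
    after that many events have been handled in this run.
    """
    pending = log.read_from(checkpoint["offset"])
    for handled, event in enumerate(pending):
        if crash_after is not None and handled == crash_after:
            raise RuntimeError("simulated receiver crash")
        sink.append(event)          # process the event
        checkpoint["offset"] += 1   # record progress only after processing


def demo() -> List[str]:
    """Persist four events, crash midway, then recover by replaying."""
    log = DurableLog()
    for event in ["e1", "e2", "e3", "e4"]:
        log.append(event)           # events are persisted before delivery

    checkpoint = {"offset": 0}
    sink: List[str] = []
    try:
        run_receiver(log, checkpoint, sink, crash_after=2)
    except RuntimeError:
        pass                        # the receiver went down after two events

    # A restarted receiver resumes from the checkpoint and replays the rest.
    run_receiver(log, checkpoint, sink)
    return sink
```

In this sketch the crash happens between events, so nothing is processed twice. If a receiver crashed after processing an event but before advancing the checkpoint, the replay would deliver that event again, so downstream processing would need to tolerate duplicates.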