
MEETUP

Monday, June 18, 2018

All things Spark - Machine Learning, Atlas integration, ORC & Hive EDW updates


Apache Spark has become one of the most popular in-memory compute engines due to its elegant and expressive development APIs combined with enterprise readiness. At the meetup we will focus on machine and deep learning use cases and performance; Apache Atlas integration to enable governance and metadata; performance improvements and Parquet parity with Apache ORC (high performance columnar storage); and finally we will cover Apache Hive EDW connector enabling data warehouse initiatives for advanced business analytics.


Talks

SparkML – Pyspark performance, image integration, and Deep Learning use cases – Yanbo Liang and Mingjie Tang (20 min)
Spark Atlas integration – Yanbo Liang and Mingjie Tang (20 min)
Spark + ORC – Dongjoon Hyun (20 min)
Spark + HiveEDW connector – Eric Wohlstadter (20 min)

Bios

Robert Hryniewicz (host)
Robert is a Data Evangelist with over 11 years of experience working on a variety of technologies from AI and robotics to IoT and blockchain. He’s part of the Hortonworks community team, driving data science sandbox product strategy, thought leadership on AI, delivering crash courses and lectures on Spark, data science + deep learning, and making sure that the community has all the resources needed to build kickass next-gen products. Robert will be your host for the evening.

Arun Iyer
Arun Iyer has been involved with the design and development of various streaming analytics platforms at Hortonworks. He contributes to the Apache Storm project and is currently a committer and PMC member of the project. Prior to Hortonworks, he was involved in the development of various streaming and distributed systems at Informatica and Yahoo.

Jerry Shao
Jerry Shao is a member of technical staff at Hortonworks, focused mainly on Spark, especially Spark core, Spark on YARN, and Spark Streaming. He is an Apache Spark committer and an Apache Livy (incubating) PPMC member. Prior to Hortonworks, he was a software engineer at Intel working on performance tuning and optimization of Hadoop and Spark.

Yanbo Liang
Yanbo is a staff software engineer at Hortonworks. His main interests center on implementing effective machine learning and deep learning algorithms and models. He is an Apache Spark PMC member and contributes to many open source projects, such as TensorFlow, Apache MXNet, and XGBoost. He delivered the implementation of some core Spark MLlib algorithms. Prior to Hortonworks, he was a software engineer at Yahoo! and France Telecom working on machine learning and distributed systems.

Mingjie Tang
Mingjie Tang is an engineer at Hortonworks. He works on Spark SQL, Spark MLlib, and Spark Streaming. He has broad research interests in database management systems, similarity query processing, data indexing, big data computation, data mining, and machine learning. Mingjie completed his PhD in Computer Science at Purdue University.

Dongjoon Hyun
Dongjoon Hyun is an Apache REEF PMC member and committer. Currently, he works for Hortonworks and is focusing on Apache Spark and Apache ORC.

Eric Wohlstadter
Eric is a principal engineer at Hortonworks. He works on Hive, Tez, and Spark-Hive interoperability. His interests are in database systems and distributed query execution. Eric completed his PhD in Computer Science at the University of California, Davis.


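Apache Spark has become one of the most popular in-memory compute engines because its elegance (?) and rich development APIs are combined with enterprise readiness. This meetup will focus on machine learning and deep learning, covering their use cases and performance. The Apache Atlas integration is there to enable governance and metadata. The talks will also cover performance improvements and Parquet parity for Apache ORC (Optimized Row Columnar, a high-performance columnar storage format), and finally the Apache Hive EDW connector, which makes it possible to start data warehouse initiatives for advanced business analytics.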


