Spark is one of the most widely used large-scale data processing engines, known for its speed. It is a framework whose tools are equally useful to application developers and data scientists.
This Learning Path begins with an introduction to Apache Spark. We first cover the basics of Spark, introduce SparkR, explore Python's charting and plotting features in conjunction with Spark data processing, and then examine Spark's data processing libraries. We then develop a real-world Spark application. Finally, we help you become comfortable and confident working with Spark for data science by exploring Spark's data science libraries on a dataset of tweets.
Begin your journey into fast, large-scale, and distributed data processing using Spark with this Learning Path.
About the Authors
Rajanarayanan Thottuvaikkatumana, Raj, is a seasoned technologist with more than 23 years of software development experience at various multinational companies. He has lived and worked in India, Singapore, and the USA, and is presently based in the UK. His experience includes architecting, designing, and developing software applications. He has worked on various technologies, including major databases, application development platforms, web technologies, and big data technologies. Since 2000, he has been working mainly with Java-related technologies, doing heavy-duty server-side programming in Java and Scala. He has worked on highly concurrent, highly distributed, high-transaction-volume systems. Currently, he is building a next-generation Hadoop YARN-based data processing platform and an application suite built with Spark using Scala.
Raj holds a master's degree in Mathematics and a master's degree in Computer Information Systems, and has many certifications in ITIL and cloud computing to his credit. Raj is the author of Cassandra Design Patterns - Second Edition, published by Packt.
When not working on the assignments his day job demands, Raj is an avid listener of classical music and watches a lot of tennis.
Eric Charles has 10 years’ experience in the field of Data Science and is the founder of Datalayer (http://datalayer.io/docker), a social network for Data Scientists. He is passionate about using software and mathematics to help companies get insights from data.
His typical day includes building efficient data processing pipelines with advanced machine learning algorithms, SQL, streaming, and graph analytics. He also focuses heavily on visualization and sharing results.
He is passionate about open source and is an active Apache Member. He regularly gives talks to corporate clients and at open source events. He can be contacted on Twitter at @echarles.
- Requires basic knowledge of either Python or R
- Get to know the fundamentals of Spark 2.0 and the Spark programming model using Scala and Python
- Know how to use Spark SQL and DataFrames using Scala and Python
- Get an introduction to Spark programming using R
- Perform Spark data processing, charting, and plotting using Python
- Get acquainted with Spark stream processing using Scala and Python
- Be introduced to machine learning with Spark using Scala and Python
- Get started with graph processing with Spark using Scala
- Develop a complete Spark application
- Understand the Spark programming model and its ecosystem of packages in Data Science
- Obtain and clean data before processing it
- Understand Spark's machine learning algorithms and build a simple pipeline
- Work with interactive visualization packages in Spark
- Apply data mining techniques on the available data sets
- Build a recommendation engine