Spark Intro
Apache Spark is the enfant terrible of Big Data solutions and architectures: a distributed computing framework, the unified analytics engine for large-scale data processing, and more. There's not much you cannot do with Spark (besides storing data, of course, but that's not its aim). Supporting Scala, Python, Java, R and SQL as programming languages, and both batch and streaming data processing, Apache Spark has become one of the most popular processing frameworks in the seven years since it became an Apache project.
In this workshop we aim to introduce Spark through several hands-on exercises (note that we will work in both Scala and SQL):
- How to read data into Spark and write data out of Spark (see the batch I/O sketch after this list)
- From batch storage: CSV, JSON, Parquet, Avro formats
- From streaming engines: Kafka topics (using Spark Structured Streaming; a streaming sketch follows the list)
- Possible operations on data (working with DataFrames; see the transformations sketch after this list)
- Transformations vs actions
- Caching of data
- Partitioning
- Analyze data in Spark using Spark SQL
- Register Spark objects as tables and let SQL do its magic (a short SQL sketch closes this list)
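
To give a flavour of the batch I/O part, here is a minimal sketch of reading a CSV file and writing it back out as Parquet. The local SparkSession settings and the file paths (/data/sales.csv, /data/sales_parquet) are placeholders for illustration, not the workshop's actual datasets.

```scala
import org.apache.spark.sql.SparkSession

object ReadWriteSketch {
  def main(args: Array[String]): Unit = {
    // A local SparkSession for illustration; in the workshop we will run on a cloud cluster.
    val spark = SparkSession.builder()
      .appName("read-write-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read a CSV file into a DataFrame (path is a placeholder).
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/sales.csv")

    // Write the same data back out as Parquet, a columnar format.
    sales.write
      .mode("overwrite")
      .parquet("/data/sales_parquet")

    spark.stop()
  }
}
```

The same DataFrameReader/DataFrameWriter API covers JSON and Avro as well, by swapping the format; Avro additionally needs the spark-avro package on the classpath.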
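For the streaming part, a sketch of consuming a Kafka topic with Structured Streaming might look like the following. The broker address (broker:9092) and topic name (events) are placeholders, and the spark-sql-kafka connector has to be available on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-sketch")
      .master("local[*]")
      .getOrCreate()

    // Subscribe to a Kafka topic (broker address and topic name are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Kafka keys and values arrive as bytes; cast them to strings before processing.
    val messages = events.select(col("key").cast("string"), col("value").cast("string"))

    // Print each micro-batch to the console; in practice the sink would be storage or another topic.
    val query = messages.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```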
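The DataFrame exercises revolve around the difference between lazy transformations and actions, plus caching and partitioning. A minimal sketch, using a small in-memory DataFrame rather than real workshop data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object TransformActionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transform-action-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny DataFrame built in memory, just for illustration.
    val orders = Seq(("books", 12.0), ("games", 55.0), ("books", 7.5)).toDF("category", "amount")

    // Transformations are lazy: nothing runs yet, Spark only builds an execution plan.
    val expensive = orders.filter(col("amount") > 10.0)

    // cache() marks the DataFrame for reuse; it is materialized on the first action.
    expensive.cache()

    // Actions such as count() or show() trigger actual execution.
    println(expensive.count())
    expensive.show()

    // repartition() controls how many partitions (and therefore parallel tasks) the data is split into.
    val repartitioned = expensive.repartition(4)
    println(repartitioned.rdd.getNumPartitions)

    spark.stop()
  }
}
```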
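Finally, for the Spark SQL part, registering a DataFrame as a temporary view makes it queryable with plain SQL. Again, the data and view name below are made up for the sketch:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small in-memory DataFrame standing in for real workshop data.
    val orders = Seq(("books", 12.0), ("games", 55.0), ("books", 7.5)).toDF("category", "amount")

    // Registering the DataFrame as a temporary view exposes it to the SQL engine.
    orders.createOrReplaceTempView("orders")

    val totals = spark.sql(
      "SELECT category, SUM(amount) AS total FROM orders GROUP BY category ORDER BY total DESC")
    totals.show()

    spark.stop()
  }
}
```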
Prerequisites for this workshop: a good understanding of distributed systems and of technologies such as HDFS or NoSQL solutions. Although we will work in Scala and SQL, previous experience with them is not mandatory; we will focus on what can be done with Spark rather than on the programming side (functional programming).
Requirements: a computer that can connect to the public cloud (no VPN), with Google Chrome and an SSH client installed. There will be no local installations; we will work in the cloud.