Spark Intro

Apache Spark is the enfant terrible of Big Data architectures: a distributed computing framework, a unified analytics engine for large-scale data processing, and more. There is not much that you cannot do with Spark (besides storing data, of course, but that is not its aim). Supporting Scala, Python, Java, R and SQL as programming languages, and both batch and streaming data processing, Apache Spark has become, in the 7 years since becoming an Apache project, one of the most popular processing frameworks.

In this workshop we aim to introduce Spark through several hands-on exercises (note that we will work in both Scala and SQL):

  • How to read data into Spark and write data out of Spark
    • From batch storage: CSV, JSON, Parquet, Avro formats 
    • From streaming engines: Kafka streams (using Spark Structured Streaming)
  • Possible operations on data (working with DataFrames)
    • Transformations vs actions
    • Caching of data
    • Partitioning
  • Analyze data in Spark using Spark SQL 
    • Register Spark objects as tables and let SQL do its magic

Prerequisites for this workshop: a good understanding of distributed systems and of storage systems such as HDFS or NoSQL solutions. Although we will work in Scala and SQL, previous experience with them is not mandatory: we will focus on what can be done with Spark rather than on the programming side (functional programming).

Requirements: a computer that can connect to the public cloud (no VPN), with Google Chrome and an SSH client installed. There will be no local installations; we will work in the cloud.
