Optimizing Large-Scale Batch ETL Pipelines in Adobe Experience Platform
Processing data at scale is hard; doing it while keeping costs in check is even harder.
In this session, we will look at a series of techniques and practices for operating a high-throughput batch ETL pipeline built on the Apache Spark distributed computing framework.
We will address topics such as:
– dealing with input data structures
– shaping the data for improved performance (see the sketch after this list)
– weighing tradeoffs in data access patterns
– keeping large computing clusters stable
– reducing the compute footprint to achieve significant operating cost savings
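To make the data-shaping topic concrete, here is a minimal Spark (Scala) sketch of the kind of pattern the session covers: pruning columns, filtering early so Parquet can push the predicate down, bounding the number of output files, and laying the result out by a partition column so downstream readers can prune. The paths, column names (tenantId, timestamp), and the coalesce factor are hypothetical placeholders, not Adobe Experience Platform specifics.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object ShapeForPerformance {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shape-for-performance")
          .getOrCreate()

        val daily = spark.read
          .parquet("/data/raw/events")             // hypothetical input path
          .select("tenantId", "timestamp")         // column pruning: read only what is used
          .where(col("timestamp") >= "2020-01-01") // filter early; Parquet pushes this down
          .groupBy(col("tenantId"), to_date(col("timestamp")).as("day"))
          .agg(count(lit(1)).as("eventCount"))

        daily
          .coalesce(32)         // bound the output file count (avoids the small-files problem)
          .write
          .mode("overwrite")
          .partitionBy("day")   // lay the output out by day so downstream readers can prune
          .parquet("/data/curated/daily_counts")   // hypothetical output path

        spark.stop()
      }
    }

Shaping the layout at write time like this is a one-time cost paid by the producer that every downstream consumer benefits from, which is usually the right tradeoff in a high-fan-out pipeline.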