Reducing Apache Spark Application Dependencies Upload by 99%

LinkedIn · March 9, 2023
Co-authors: Shu Wang, Biao He, and Minchu Yang

At LinkedIn, Apache Spark is our primary compute engine for offline data analytics such as data warehousing, data science, machine learning, A/B testing, and metrics reporting. We execute nearly 100,000 Spark applications daily on our Apache Hadoop YARN clusters (more on how we scaled YARN clusters here). These applications rely heavily on dependencies (JAR files) for their computation needs. This is especially true for machine learning pipelines, which often pull in a particularly large number of dependencies.
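
To make the upload cost concrete, here is a minimal sketch of how a Spark application is typically submitted to YARN. Every JAR passed via --jars, plus the application JAR itself, is shipped to the cluster on each submission; the class name, JAR names, and paths below are hypothetical placeholders, not LinkedIn's actual pipeline.

```bash
# Hypothetical spark-submit invocation (all names are placeholders).
# Each JAR listed under --jars, along with the application JAR itself,
# is uploaded to the YARN cluster's distributed cache on every run.
# That per-submission upload is the cost this article sets out to reduce.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MetricsReport \
  --jars deps/ml-utils.jar,deps/json-parser.jar \
  metrics-report-app.jar
```

With tens of thousands of submissions per day, even modestly sized dependency lists like this one add up to substantial repeated network and storage traffic.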