Understanding Common Performance Issues in Apache Spark
Apache Spark is a unified analytics engine for large-scale distributed data processing. It can be used for both batch and stream processing. Typical use cases are often related to machine or deep learning and to ETL (extract-transform-load).
Developers build applications in Apache Spark because of the speed it achieves by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting. However, in some cases applications unexpectedly take a long time or even fail at runtime.
Problem Categories
In this article series we will look into some of the most common problem categories that can cause such behavior. Keep in mind that this is not an exhaustive list of areas causing a decrease in performance. I might add more categories over time.
It is important to note that these categories usually do not occur in isolation. For example, if your data is skewed, processing it can cause a memory spill. Likewise, spilled data might be shuffled during processing because it does not fit into a single Executor.
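To make the link between skew and uneven load more tangible, here is a minimal sketch in plain Scala (no Spark required). It distributes keys across partitions with a simple hash-modulo rule and counts the records per partition. The object name `SkewSketch`, the helper names, and the synthetic keys are my own assumptions for illustration, and the hash rule is a deliberate simplification that only loosely resembles how Spark assigns rows to partitions.

```scala
object SkewSketch {
  // Simplified hash partitioning: non-negative modulo of the key's hashCode.
  // This is an illustrative assumption, not Spark's actual implementation.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Counts how many records end up in each partition.
  def partitionSizes(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
    keys
      .groupBy(key => partitionOf(key, numPartitions))
      .map { case (partition, members) => partition -> members.size }

  def main(args: Array[String]): Unit = {
    // Hypothetical data set: one "hot" key dominates, the rest are unique keys.
    val skewedKeys = Seq.fill(9000)("hot_key") ++ (1 to 1000).map(i => s"key_$i")

    val sizes = partitionSizes(skewedKeys, numPartitions = 8)
    sizes.toSeq.sortBy(_._1).foreach { case (partition, count) =>
      println(s"partition $partition: $count records")
    }
  }
}
```

Because every record with the same key lands in the same partition, the partition holding `hot_key` receives at least 9000 of the 10000 records. In a real job, the task processing such a partition is the one most likely to run out of memory and spill.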
Analysis of the Problem Categories
You may have noticed that the title of this article series does not contain the word “guide”. This is intentional: I would rather convey a deeper understanding that enables you to work through your specific issue yourself. I am convinced that there is no one-size-fits-all, step-by-step guide that applies to the broad variety of applications, data structures, cluster hardware and network set-ups.
Databricks offers some helpful training material for this purpose: the “Spark UI Simulator”. It contains many experiments, along with the code and the static content of the Spark Web UI, showing how Spark applications behave under certain circumstances. Among others, experiments #1596 (Skew Join) and #6518 (Spill) provide good insights and code snippets, which I will use throughout this article series to answer the following questions for each of the problem categories:
- How can it happen?
- How can it be analyzed?
- How can it be mitigated?
Answering those questions for each category will provide a good understanding and, as a consequence, a starting point for finding a solution when your Spark application suffers from unexpectedly low performance.
Analysis Approach
To answer the above-mentioned questions for each problem category, the deep-dive articles show how to create some dummy data or explain which publicly available data is used. Because Spark requires an “action” to get a job running, and to keep things short, the following three actions are typically called on a DataFrame to perform benchmarks:
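As a minimal sketch of what such a benchmark could look like, the snippet below times three actions on a small generated DataFrame. The choice of `count()`, a no-op `foreach`, and a write to the `noop` format is my assumption of three actions commonly used for this purpose; the object name `BenchmarkActions`, the helper `timedMs`, and the generated data are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object BenchmarkActions {
  // Evaluates `body` once and returns the elapsed wall-clock time in milliseconds.
  def timedMs[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  // Triggers three actions on the same DataFrame (assumed selection, see text above).
  def run(df: DataFrame): Map[String, Long] = Map(
    // count(): only needs the number of rows, so Spark may skip reading columns.
    "count" -> timedMs(df.count()),
    // foreach with an empty function: every row is produced and then discarded.
    "foreach" -> timedMs(df.foreach((_: Row) => ())),
    // "noop" data source (Spark 3.0+): runs the full write path without output.
    "noop" -> timedMs(df.write.format("noop").mode("overwrite").save())
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("benchmark-actions")
      .master("local[*]") // local mode, assumed for this sketch
      .getOrCreate()

    // Hypothetical dummy data: one million consecutive ids.
    val df = spark.range(0L, 1000000L).toDF("id")

    run(df).foreach { case (action, ms) => println(s"$action: $ms ms") }
    spark.stop()
  }
}
```

The measured times depend heavily on the machine and on JVM warm-up, so a single run like this is only a rough indication and not a rigorous benchmark.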
In some cases I might run the code in a Databricks notebook with Runtime 8.1 (you can get a free trial here). Otherwise, I am working on my private computer with
- Spark 3.1.1,
- Zeppelin 0.9.0,
- Ubuntu 20.04.2 LTS,
- Intel® Core™ i5-9600K CPU @ 3.70GHz × 6, and
- 16GB RAM.
Throughout the article series we mainly stick to Scala and deal with text-based files (CSV, JSON) and Parquet, which is the default data format in Spark.
What’s next?
The first deep-dive will tackle the problem category Spill. Second, we look into Skew.
The reason why I am writing this article series is best described by a quote from Flannery O’Connor, an American novelist, short story writer and essayist (wallpaper from Quotefancy.com):