Apache Spark VS Pandas VS Koalas

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark. Pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing.

My Thoughts

As a data specialist who has worked extensively with Pandas and PySpark (PySpark is the Python interface for Apache Spark), I will share my experience.

First, let's talk about data structures (DS) and syntax. In this area, I prefer Pandas over PySpark because Pandas follows Python's own data structures more closely than PySpark does.
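To make the syntax difference concrete, here is a minimal sketch of the same query (keep positive values, then average per group) written both ways. The toy rows, the column names `group` and `value`, and the function names are made up for illustration. The imports are local to each function so that the Pandas half can run on a machine without Spark or Java.

```python
# Toy rows; the column names "group" and "value" are hypothetical.
ROWS = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", -5.0)]


def mean_positive_pandas(rows):
    """Pandas: index with a boolean mask, access columns like dict keys."""
    import pandas as pd

    df = pd.DataFrame(rows, columns=["group", "value"])
    positive = df[df["value"] > 0]  # boolean-mask indexing, as in NumPy
    return positive.groupby("group")["value"].mean().to_dict()


def mean_positive_pyspark(rows):
    """PySpark: chain methods over column expressions, then call an action."""
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("syntax-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["group", "value"])
        collected = (
            df.filter(F.col("value") > 0)
            .groupBy("group")
            .agg(F.avg("value").alias("mean_value"))
            .collect()  # the query only executes at this action
        )
        return {row["group"]: row["mean_value"] for row in collected}
    finally:
        spark.stop()
```

Calling `mean_positive_pandas(ROWS)` needs only Pandas. `mean_positive_pyspark(ROWS)` should give the same per-group means, but it needs a working PySpark and Java installation and starts a local Spark session first.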

Second, for ease of installation and setup, I go with Pandas: PySpark depends on Java, which has to be installed as an extra step.

Third, for API features, I prefer Pandas. This point is subjective as I don't use every API from Pandas or PySpark.

So far, it seems Pandas is my choice. However, Pandas and PySpark are equally important to me, because PySpark wins on performance and resource consumption.

For data work, what matters most is memory (RAM). When a dataset is bigger than a single machine's RAM, you need a cluster of machines, and Pandas is out of the question because it only runs on a single machine.

Experiment

Here is my experiment conducted on PySpark, Pandas, and Koalas. Using Google Cloud Platform (GCP), I set up an instance with machine type n1-standard-1 (1 vCPU, 3.75 GB memory). The dataset I am loading is 1.7 GB. Even though the RAM is bigger than the dataset, you have to factor in the RAM used by the OS. What I am trying to do here is to see how far I can push the machine with each of these libraries.
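The RAM budget can be worked through as back-of-the-envelope arithmetic. In the sketch below, only the 1.7 GB dataset size and the 3.75 GB of RAM come from the experiment. The 0.5 GB reserved for the OS and the 2x in-memory expansion factor are assumptions picked for illustration, not measurements, and `fits_in_ram` is a hypothetical helper.

```python
def fits_in_ram(dataset_gb, ram_gb, os_reserved_gb, expansion_factor):
    """Return (fits, needed_gb, usable_gb) for a read that loads everything at once."""
    usable_gb = ram_gb - os_reserved_gb
    needed_gb = dataset_gb * expansion_factor
    return needed_gb <= usable_gb, needed_gb, usable_gb


# From the experiment: 1.7 GB dataset on a 3.75 GB machine.
# ASSUMED, not measured: 0.5 GB held by the OS, and the data taking
# twice its on-disk size once loaded into memory.
fits, needed, usable = fits_in_ram(
    1.7, 3.75, os_reserved_gb=0.5, expansion_factor=2.0
)
print(f"needed ~{needed:.2f} GB, usable ~{usable:.2f} GB, fits: {fits}")
```

Under these assumed numbers an eager load would not fit. With an expansion factor of 1.0 it would, so the outcome depends heavily on how much larger the data becomes once it is in memory.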

Pandas caused the machine to hang due to insufficient RAM.

Koalas took 48.4 s of wall time, and PySpark took 36.5 s.
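How these wall times were captured is not stated here. One simple way to take a comparable measurement is a small timing helper like the sketch below, where `wall_time` is a hypothetical name and not part of any of the three libraries.

```python
import time


def wall_time(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds) from a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with a stand-in workload; swap in the real load step.
total, seconds = wall_time(sum, range(1_000_000))
print(f"result={total}, wall time={seconds:.3f}s")
```

With a lazy library such as PySpark or Koalas, the timed function should end in an action (for example a row count). Otherwise the measurement only covers building the query, not reading the data.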

The outcome is as expected. Koalas can load the data because it runs on the PySpark API (lazy evaluation), but it took longer than PySpark because it is a wrapper library.
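The idea behind lazy evaluation can be shown with plain Python generators. This is only an analogy, not Spark's actual mechanism: building the pipeline describes the work, and no data is read until something consumes it. All names and the three sample lines are invented for the sketch.

```python
def tracked(lines, stats):
    """Yield lines one at a time, counting how many have actually been read."""
    for line in lines:
        stats["read"] += 1
        yield line


def parse(lines):
    """Turn 'key,value' strings into (key, int) pairs, one at a time."""
    for line in lines:
        key, value = line.split(",")
        yield key, int(value)


def keep_large(rows, threshold):
    """Pass through only the rows whose value exceeds the threshold."""
    for key, value in rows:
        if value > threshold:
            yield key, value


LINES = ["a,10", "b,20", "c,30"]
stats = {"read": 0}

# Building the pipeline only describes the work; no line is read yet.
pipeline = keep_large(parse(tracked(LINES, stats)), threshold=15)
reads_before_action = stats["read"]

# Consuming the pipeline plays the role of a Spark action such as count().
matches = sum(1 for _ in pipeline)
reads_after_action = stats["read"]

print(reads_before_action, matches, reads_after_action)
```

Because rows flow through one at a time, the pipeline never holds the whole input in memory. An eager load, by contrast, has to materialise everything first.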

Even though Koalas sounds promising because it seems to offer the best of both worlds, I think it might have the worst of both worlds. It doesn't have the full API of either Pandas or PySpark, and you never know when that gap will break your work.

I think you can use PySpark for every use case, but the same is not true of Pandas. So as a long-term investment, I will pick PySpark.

Shawn Ng