In the world of data science and AI, tools such as Jupyter notebooks, pandas, scikit-learn and NumPy make up the fundamental toolkit of any data scientist or data engineer. Each of these tools is the de facto standard in its own category.
In the case of data processing and manipulation, Pandas is by far the most used library. It offers a balance of a simple-to-use API, a rich feature set and a powerful, optimized engine. The library can also be installed with a single command and is widely supported, with a massive community behind it. This makes Pandas the go-to library for newcomers as well as experienced data scientists.
Despite all those positives, Pandas has fallen short in one area since the day it came out: distributed processing. By design, Pandas is made to process small to medium datasets on a single machine, mostly in a single-threaded fashion. It is therefore not just slow on large datasets, it is incapable of processing them. Pandas relies on the system’s memory to store data, and when memory is full, the process crashes. Knowing that the majority of systems have 16 GB to 64 GB of memory, you can only do so much with the library before it falls short, especially on projects that need to process hundreds of gigabytes, if not terabytes, of data.
On the other hand, technologies like Apache Spark offer an incredibly efficient engine to distribute calculations across multiple nodes, in clusters that can scale up to hundreds of machines. However, setting up the environment and using Spark can present a steep learning curve, especially for beginners.
Koalas is a fairly new library, introduced by Databricks to bridge the gap between Pandas and Spark (since Spark 3.2 it also ships inside PySpark as the pyspark.pandas sub-module). It mimics the Pandas API, but runs on the Spark distributed engine in the background! This makes it easy for data scientists to get started with distributed processing and, more importantly, makes it easier for companies with large projects to migrate to Spark with little code change beyond the one-line import statement.
In this article, we will see how easy it is to install and use koalas, and compare its performance with normal Spark code, as well as Pandas.
Since Koalas runs on Spark, a Spark instance needs to be running on the machine. We will not cover how to set that up in this article, but it’s a requirement. To install pyspark and koalas using pip, run the following commands:
# Installing pyspark
pip install pyspark
# Installing koalas
pip install koalas
Then, importing koalas can be done as follows.
# Importing Koalas
import databricks.koalas as ks
After that, koalas can be used normally like Pandas.
Koalas aims to mimic the Pandas API one to one. Thus, the majority of the library’s functions are called exactly as in Pandas. To create a DataFrame with Pandas and Koalas, we do the following.
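Since the original snippet is not reproduced here, the following is a minimal sketch of what creating the same dataframe in both libraries could look like. The column names and values are made up for illustration, and build_frame is a hypothetical helper, not part of either library. Only the Pandas call is executed so that the snippet runs without Spark; the Koalas call is left as a comment and assumes a working Spark installation.

```python
import pandas as pd

# Made-up sample data, purely for illustration.
data = {
    "car": ["Ford Torino", "Honda Civic", "Fiat 128"],
    "mpg": [17.0, 33.0, 29.0],
}


def build_frame(lib):
    """Build a dataframe with whichever pandas-like module is passed in."""
    return lib.DataFrame(data)


# Pandas: the dataframe lives in the local process memory.
pandas_df = build_frame(pd)
print(pandas_df)

# Koalas (needs Spark): same constructor call, distributed execution.
# import databricks.koalas as ks
# koalas_df = build_frame(ks)
```

The point of passing the module as a parameter is only to stress that the constructor call is identical in both libraries.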
Koalas shines when datasets are large. The library supports all the major data formats of Spark, such as CSV, Parquet and ORC. To read files using Koalas, we do the following.
# Reading CSV
koalas_csv_df = ks.read_csv("path_to_file.csv")
# Reading Parquet
koalas_parquet_df = ks.read_parquet("path_to_file.parquet")
# Reading ORC
koalas_orc_df = ks.read_spark_io("path_to_file.orc", format="orc")
Please note that the function read_spark_io is not only for reading ORC format files, but any Spark IO supported file format.
A big advantage that Koalas offers compared to Pandas is bulk imports: the ability to import multiple files using a pattern instead of a single filename. This is inherited from Spark and is very important, since large datasets are generally split into small “partitions” instead of being stored as a single block. You can import all CSV files in a given directory by running the following.
After running analytics on a dataset, we write the resulting dataframe to disk with functions that mirror the ones available in Pandas, as follows.
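The original snippet is missing here, so this is a hedged sketch rather than the author's code. It builds two tiny made-up CSV "partitions" in a temporary directory and shows what a pattern-based import takes in plain Pandas, where the pattern has to be expanded by hand. The Koalas one-liner is shown as a comment because it needs a running Spark; it uses the same read_csv call with a glob pattern that appears later in this article.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two small partitions with made-up rows.
    pd.DataFrame({"id": [1, 2], "value": [10, 20]}).to_csv(
        os.path.join(tmp_dir, "part-0.csv"), index=False
    )
    pd.DataFrame({"id": [3, 4], "value": [30, 40]}).to_csv(
        os.path.join(tmp_dir, "part-1.csv"), index=False
    )

    pattern = os.path.join(tmp_dir, "*.csv")

    # Pandas: expand the pattern ourselves, read each file, then concatenate.
    paths = sorted(glob.glob(pattern))
    pandas_df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

    # Koalas (needs Spark): the pattern is passed straight to read_csv.
    # import databricks.koalas as ks
    # koalas_df = ks.read_csv(pattern)

print(pandas_df)
```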
#Writing CSV
koalas_csv_df.to_csv("file_name.csv")
#Writing Parquet
koalas_parquet_df.to_parquet("file_name.parquet")
#Writing ORC
koalas_orc_df.to_spark_io("file_name.orc", format="orc")
It’s great to be able to use the Pandas API with Koalas. Sometimes, however, the only available input is a Pandas dataframe, for example when using libraries that return their results as Pandas dataframes (meteostat, for example). In such cases, Koalas offers a way to create a dataframe directly from a Pandas one, as follows.
# Reading a Pandas DF
koalas_dataframe = ks.from_pandas(pandas_dataframe)
In addition to the already impressive feature set, Koalas offers plotting like Pandas, encapsulated in the plot() method.
# Plotting with Pandas
pandas_dataframe.plot()
# Plotting with Koalas
koalas_dataframe.plot()
This way, you can use the same plot parameters as with Pandas.
Comparing Pandas to Koalas is not really an apples-to-apples comparison, because of the design differences between the two and how they’re meant to be used. For example, when instructions to read a file and then count its rows are written in both libraries, Pandas executes each instruction as soon as it is issued, whereas Spark only builds an execution plan and runs it when a result is required. Differences like this one make it difficult to compare, for example, the speed of data parsing alone. That’s why, to analyze the advantages and shortcomings of each library, we limit our focus to the final performance of a complete workload, with a particular spotlight on parallelization.
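The difference between immediate and deferred execution can be illustrated with a toy example in pure Python. This is only a sketch of the idea: LazyPipeline is a hypothetical class written for this illustration and has nothing to do with how Spark builds or optimizes its execution plans.

```python
class LazyPipeline:
    """Toy lazy engine: records steps and runs them only on collect()."""

    def __init__(self, rows):
        self._rows = rows
        self._steps = []

    def map(self, fn):
        # Only the plan grows here; no row is touched yet.
        self._steps.append(fn)
        return self

    def collect(self):
        # The recorded plan is executed when a result is requested.
        rows = list(self._rows)
        for fn in self._steps:
            rows = [fn(row) for row in rows]
        return rows


calls = []  # records every time the function below actually runs


def double(x):
    calls.append(x)
    return x * 2


# Eager (Pandas-like): the work happens as soon as the line runs.
eager_result = [double(x) for x in [1, 2, 3]]
calls_after_eager = len(calls)

# Lazy (Spark-like): building the plan does not run anything.
plan = LazyPipeline([1, 2, 3]).map(double)
calls_after_plan = len(calls)

# Asking for the result triggers the execution.
lazy_result = plan.collect()
calls_after_collect = len(calls)

print(calls_after_eager, calls_after_plan, calls_after_collect)
```

Timing the line that builds the plan would therefore say nothing about the real cost of the work, which is why only complete workloads are compared.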
We’re going to start the tests on a small dataset of 2MB and 400 rows:
We’re going to read the dataset, keep only cars with an MPG of at least 18, then group the results by manufacturer and count them. In Pandas, the code looks like the following.
import pandas as pd
#Read the dataset
pandas_df = pd.read_csv('data/cars.csv', delimiter=';')
#Skipping the first row (metadata); copy() avoids a chained-assignment warning below
pandas_df = pandas_df.iloc[1:].copy()
#Converting the MPG column to numeric
pandas_df["MPG"] = pd.to_numeric(pandas_df["MPG"])
#Keeping only cars with an MPG of 18 or more, then grouping by car and counting
pepe = pandas_df[pandas_df["MPG"] >= 18].groupby('Car').count()
#Sorting by count and showing the top rows
pepe.sort_values(by='MPG', ascending=False).head()
For Koalas, as you can see, the code looks the same, except for the column cast to numeric that only Pandas needs. This is quite impressive: the code is nearly identical but, behind the scenes, the execution is completely different!
#Read the dataset
koalas_df = ks.read_csv('data/cars.csv', sep=';')
#Skipping the first row (metadata)
koalas_df = koalas_df.iloc[1:koalas_df.shape[0]]
#Keeping only cars with an MPG of 18 or more, then grouping by car and counting
pepe = koalas_df[koalas_df["MPG"] >= 18].groupby('Car').count()
#Sorting by count and showing the top rows
pepe.sort_values(by='MPG', ascending=False).head()
From the get-go, we noticed an important difference in how the two libraries behave during execution. As mentioned above, Pandas is single-threaded and relies entirely on memory, and our monitoring confirmed this, as shown below: only one core is saturated while the rest sit idle. This is inefficient, especially if you have a powerful machine to take advantage of.
Koalas, even on a single node, uses all the available cores of the machine, and relies on memory as well as on disk when memory runs out. The figure below shows all the cores in action during computations.
In terms of results, Pandas finished the processing in 100 milliseconds, whereas Koalas took 800 milliseconds. This is quite surprising, especially after introducing Koalas as the efficient distributed twin of Pandas.
The reason is that Koalas adds significant overhead in order to set up the parallel processing. We should not forget that it runs on top of Spark, which needs to set up node communication, partitioning and so on. This overhead is negligible when working on big datasets, but for small ones it’s noticeable, as we see below.
For the next test, we bump up the challenge and use a dataset of 470MB and 6.5 million rows. The dataset is intended for training ML models for fraud detection, and looks like the one below.
The analysis we will run is pretty simple: counting the number of transactions for each combination of transaction type and nameDest. The meaning of the analysis itself isn’t important; it was chosen mainly to use string columns instead of numeric ones, which makes the computations harder.
# Pandas Fraud Analysis
pandas_df = pd.read_csv('data/Fraud.csv')
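# size() counts the rows in each group (count() would return no columns here, since both selected columns are grouping keys)
pandas_df[['type','nameDest']].groupby(['type','nameDest']).size()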
For Koalas the code is exactly the same.
# Koalas Fraud Analysis
koalas_df = ks.read_csv('data/Fraud.csv')
# size() counts the rows in each (type, nameDest) group
koalas_df[['type','nameDest']].groupby(['type','nameDest']).size()
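One detail of this kind of group-and-count is worth checking on toy data. In Pandas, count() reports the non-null values of each remaining column, so when every selected column is a grouping key there is nothing left to count, whereas size() reports the number of rows per group. The rows below are made up; only the two column names come from the dataset.

```python
import pandas as pd

# Made-up rows using the two column names from the fraud dataset.
toy_df = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "PAYMENT"],
    "nameDest": ["M1", "M1", "C7", "M2"],
})

keys = ["type", "nameDest"]

# count() works on the non-key columns; none are left after the selection.
count_result = toy_df[keys].groupby(keys).count()

# size() counts rows per group, whatever columns remain.
size_result = toy_df.groupby(keys).size()

print(count_result.shape)
print(size_result)
```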
As for the results, Pandas finished the processing in 24 seconds, while Koalas finished in 6 seconds! As we can see, Koalas is much faster than Pandas here, and the added overhead is negligible even with a dataset of “only” 470MB.
For a final test, we use a much bigger dataset of 8GB and a massive 48 million rows. The dataset is published by NYC and contains parking tickets issued from 2013 to 2017 (New Yorkers get a lot of tickets!).
The computations we will run consist of counting the tickets issued per registration state and violation code, sorted from the most tickets to the fewest.
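To put the two measurements side by side, the ratio can be computed directly from the timings reported in this article. This is plain arithmetic on those four numbers; speedup is a hypothetical helper name.

```python
# Wall-clock timings reported in this article, in seconds.
timings = {
    "small (2MB)": {"pandas": 0.1, "koalas": 0.8},
    "medium (470MB)": {"pandas": 24.0, "koalas": 6.0},
}


def speedup(run):
    """Return the Pandas time divided by the Koalas time (>1: Koalas is faster)."""
    return run["pandas"] / run["koalas"]


for name, run in timings.items():
    print(f"{name}: Koalas speedup x{speedup(run):.3f}")
```

A value below 1 means Koalas was slower: on the small dataset it was 8 times slower, while on the medium one it was 4 times faster.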
# Loading the dataset
koalas_df = ks.read_csv('data/*.csv')
# Caching the dataset
koalas_df.spark.cache()
# The Computations
koalas_df = koalas_df.assign(count=lambda x: 0)[["Registration State", "Violation Code", "count"]]
koalas_df = koalas_df.groupby(["Registration State", "Violation Code"]).count().sort_values(by=["count"], ascending=False)
For this test, to put it plainly, Pandas is simply incapable of loading the dataset, let alone processing it. The library keeps everything in memory and, once memory is full, the whole kernel crashes. The machine we are using is an AWS EC2 instance with 8GB of RAM and 4 vCores.
On the other hand, Koalas loads and processes the dataset without any issue. The memory is still not enough to hold everything, but in that case Spark spills the least used partitions to disk efficiently and transparently to the user. As a result, Koalas ran the computations above in 95 seconds, which is really fast! Another interesting advantage of Koalas is memory caching: Spark keeps the most used datasets in memory, so if multiple instructions are run on the same dataset while it is still in the cache, we get better performance.
Before moving forward, let’s quickly look at the results of the analysis, because they’re quite interesting. The cars that got the most tickets are registered in NY, and most of these infractions are code 21.
After looking the code up, it turns out that the top code (21) is “NO PARKING-STREET CLEANING”.
Going back to our performance comparison: while monitoring the job, we noticed that only half of the available memory was used, because the Java heap is configured by default at 4GB. We increased it to 7.5GB, which cut the execution time by almost 50%. After consecutive executions, the results improved even further thanks to caching.
In this article, we introduced Koalas, how it works and why it’s important in the data science field. We tested it against Pandas with a small, a medium and a large dataset, and analysed the results.
For small projects where dataset size isn’t expected to grow, there is no need to migrate to Koalas; Pandas is sufficient, especially since Koalas doesn’t implement 100% of the Pandas API, but around 80% of it (the most used part). However, for new large projects (future proofing), or existing ones using Pandas, Koalas can be a good alternative to writing native Spark code for teams that already master Pandas.
It is clear from our testing that Koalas is here to stay. Koalas offers the power of Spark without the complexity or the learning curve required to use it. This gives an impressive performance boost on large datasets, even on a single node, to the point that it’s not even comparable to Pandas.
To close the article, we ran the same final analysis on a much bigger machine with 16 cores, and here are the results using Koalas! Isn’t it beautiful?
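The article does not say how the heap size was changed. One common way, shown here as an assumption rather than as the setup actually used, is to set Spark's spark.driver.memory option before the session starts. build_session and the application name are hypothetical, and "7500m" only approximates the 7.5GB mentioned above.

```python
def build_session(driver_memory="7500m"):
    """Create (or reuse) a SparkSession with a larger driver heap.

    Assumption: this runs before any Spark session exists in the process;
    the setting has no effect once the driver JVM has already started.
    """
    # Imported lazily so that defining the helper does not require pyspark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("koalas-benchmark")  # hypothetical application name
        .config("spark.driver.memory", driver_memory)
        .getOrCreate()
    )
```

Calling build_session() before the first Koalas operation should be enough, since Koalas reuses the Spark session that already exists in the process.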