XGBoost with PySpark on AWS EMR

Compatible versions to train and deploy XGBoost using PySpark

Vijay Patil
4 min read · Sep 3, 2022

This article gives a walkthrough of the steps required to use XGBoost with PySpark on AWS EMR.

https://xgboost.readthedocs.io/en/stable/#

XGBoost does not provide a PySpark API; its Spark support exposes only Scala/JVM APIs. Hence we will be using a custom Python wrapper for XGBoost from this PR (dmlc/xgboost#4656).

We will be using Spark 2.4.5 with XGBoost 0.90, as it is one of the known working version pairs.

To install Spark on your machine, refer to these articles for Windows and for Linux. If you are using a cloud service such as AWS or Databricks, use a cluster with Spark 2.4.5 installed. I used EMR release 5.30.2, as it comes with Spark 2.4.5.

Note: To make XGBoost work on a Windows machine, additional files and steps are required.

Getting required files

XGBoost Package files

In order to use XGBoost with Spark, we need the package jar files. These are available on the Maven repository here.

We need two jar files for XGBoost 0.90. Use the following commands to download them.

wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.90/xgboost4j-0.90.jar
wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.90/xgboost4j-spark-0.90.jar

You can also use the curl commands below in case wget is not available.

curl -o xgboost4j-0.90.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.90/xgboost4j-0.90.jar
curl -o xgboost4j-spark-0.90.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.90/xgboost4j-spark-0.90.jar

We will give the locations of these jar files to Spark when we start a session, so download them on the machine where you will use them and note their paths.

spark-xgboost wrapper

Download the zipped version of the wrapper code from here: https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296

wget -O pyspark-xgboost_0.90.zip https://github.com/dmlc/xgboost/files/3384356/pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip

The zipped module contains the wrapper code. We will add this package to the Spark session using the command below.

spark.sparkContext.addPyFile("/home/hadoop/pyspark-xgboost_0.90.zip")

Below is my home directory on the EMR cluster with the wrapper module and the jar files:

/home/hadoop/pyspark-xgboost_0.90.zip
/home/hadoop/xgboost4j-0.90.jar
/home/hadoop/xgboost4j-spark-0.90.jar

We are now ready to use XGBoost with Spark.

Reference Model training code

Starting the Spark session with the Jar files

Depending on whether you are using the CLI or a Jupyter notebook for model training, use one of the following two commands to start your Spark session.

CLI

pyspark --executor-cores=1 --executor-memory=7g --driver-memory=4g --name vp --jars /home/hadoop/xgboost4j-0.90.jar,/home/hadoop/xgboost4j-spark-0.90.jar

Jupyter

Start your Jupyter notebook and then use the Spark session builder below to start the session:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("xgboost") \
    .config("spark.executor.memory", "7g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.jars", "/home/hadoop/xgboost4j-0.90.jar,/home/hadoop/xgboost4j-spark-0.90.jar") \
    .getOrCreate()

You can check whether the jars have been added to the session in the Spark UI Environment tab.

Spark Environment

Importing Model modules

Add the zipped module to the Spark session; the file can be loaded from an S3 location as well (see the sketch below).
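
A minimal sketch of this step, assuming the file layout shown earlier; the S3 bucket name is purely illustrative:

# Make the wrapper module importable on the driver and executors
spark.sparkContext.addPyFile("/home/hadoop/pyspark-xgboost_0.90.zip")

# The zip can also be loaded from S3 (illustrative bucket name):
# spark.sparkContext.addPyFile("s3://your-bucket/pyspark-xgboost_0.90.zip")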

Import the Classifier and Regressor from the wrapper module.
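
Assuming the zip exposes the wrapper as a sparkxgb module (as in the linked PR), the imports would look like this:

from sparkxgb import XGBoostClassifier, XGBoostRegressor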

Training

Training Steps

Use a Pipeline to be able to save the trained model object later (see the sketch below).
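
A minimal training sketch, assuming the wrapper follows the Spark ML Estimator API; the input DataFrame train_df, the feature column names, and the parameters are illustrative:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Combine the raw feature columns into the single vector column
# that XGBoost expects (column names are illustrative)
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")

# Classifier from the wrapper; parameter names assume the wrapper's
# camelCase Spark ML convention
xgb = XGBoostClassifier(
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction"
)

# Wrapping both stages in a Pipeline lets us save the fitted model later
pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(train_df)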

Saving the model

Directly saving the model will throw a Java error. The workaround for this is to use a Pipeline from Spark ML and save the fitted PipelineModel instead.

PipelineModel save and load
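
A sketch of the workaround, with an illustrative output path:

from pyspark.ml import PipelineModel

# Saving the fitted PipelineModel avoids the Java error thrown when
# saving the XGBoost model object directly
model.write().overwrite().save("/home/hadoop/xgb_pipeline_model")

# Load it back for scoring
loaded_model = PipelineModel.load("/home/hadoop/xgb_pipeline_model")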

Scoring

Now we can load this saved model and use it to get predictions.

Get scores from the model
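
A scoring sketch; test_df and the selected columns are illustrative, assuming the classifier adds the usual probability and prediction columns:

# transform() appends the prediction columns to the input DataFrame
predictions = loaded_model.transform(test_df)
predictions.select("label", "probability", "prediction").show(5)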

The snippets above cover a classification model, but roughly the same steps apply for regression.

You can find the full code flow for both classification and regression in 4_xgboost.py on my git repo here.

Thank you for reading this article. You can subscribe below to receive email notifications for my new articles.

Please reach out to me via the comments in case you have any questions or inputs.

You can find Python/PySpark related reference material on my git repo here.
