XGBoost with PySpark on AWS EMR

Compatible versions to train and deploy XGBoost using PySpark

Vijay Patil
4 min read · Sep 3, 2022

This article gives a walkthrough of the steps required to use XGBoost with PySpark on AWS EMR.

https://xgboost.readthedocs.io/en/stable/#

XGBoost does not provide a PySpark API; its Spark support exposes only Scala/JVM APIs. Hence we will be using a custom Python wrapper for XGBoost from this PR (dmlc/xgboost#4656).

We will be using Spark 2.4.5 with XGBoost 0.90, as it is one of the known working version pairs.

To install Spark on your machine, refer to these articles for Windows and for Linux. If you are using a cloud service such as AWS or Databricks, use a cluster with Spark 2.4.5 installed. I used EMR release 5.30.2, as it comes with Spark 2.4.5.

Note: To make XGBoost work on a Windows machine, additional files and steps are required.

Getting required files

XGBoost Package files

In order to use XGBoost with Spark, we need the package jar files. These are available on the Maven repository here.

We need two jar files for XGBoost 0.90. Use the following commands to download them.

wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.90/xgboost4j-0.90.jar
wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.90/xgboost4j-spark-0.90.jar

You can also use the curl commands below in case wget is not available.

curl -o xgboost4j-0.90.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.90/xgboost4j-0.90.jar
curl -o xgboost4j-spark-0.90.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.90/xgboost4j-spark-0.90.jar

We will give the locations of these jar files to Spark when we start a session, so download them on the machine where you will use them and note their paths.

spark-xgboost wrapper

Download the zipped version of the wrapper code from here: https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296

wget -O pyspark-xgboost_0.90.zip https://github.com/dmlc/xgboost/files/3384356/pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip

The zipped module contains the wrapper code. We will add this package to the Spark session using the command below.

spark.sparkContext.addPyFile("/home/hadoop/pyspark-xgboost_0.90.zip")

Below is my home directory on the EMR cluster with the wrapper module and the jar files:

/home/hadoop/pyspark-xgboost_0.90.zip
/home/hadoop/xgboost4j-0.90.jar
/home/hadoop/xgboost4j-spark-0.90.jar

We are now ready to use XGBoost with Spark.

Reference Model training code

Starting the Spark session with the Jar files

Depending on whether you are using the CLI or a Jupyter notebook for model training, use one of the following two commands to start your Spark session.

CLI

pyspark --executor-cores=1 --executor-memory=7g --driver-memory=4g --name vp --jars /home/hadoop/xgboost4j-0.90.jar,/home/hadoop/xgboost4j-spark-0.90.jar

Jupyter

Start your Jupyter notebook and then use the Spark session builder below to start the session:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("xgboost") \
    .config("spark.executor.memory", "7g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.jars", "/home/hadoop/xgboost4j-0.90.jar,/home/hadoop/xgboost4j-spark-0.90.jar") \
    .getOrCreate()

You can check whether the jars have been added to the session in the Spark UI Environment tab.

Spark Environment

Importing Model modules

Add the zipped module to the Spark session; the file can be loaded from an S3 location as well (see the sketch below).
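
A minimal sketch of this step, assuming the file layout shown earlier; the S3 bucket name is purely illustrative:

# Make the wrapper module importable on the driver and executors
spark.sparkContext.addPyFile("/home/hadoop/pyspark-xgboost_0.90.zip")

# The zip can also be loaded from S3 (illustrative bucket name):
# spark.sparkContext.addPyFile("s3://your-bucket/pyspark-xgboost_0.90.zip")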

Import the Classifier and Regressor from the wrapper module.
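
Assuming the zip exposes the wrapper as a sparkxgb module (as in the linked PR), the imports would look like this:

from sparkxgb import XGBoostClassifier, XGBoostRegressor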

Training

Training Steps

Use a Pipeline to be able to save the trained model object later (see the sketch below).
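
A minimal training sketch, assuming the wrapper follows the Spark ML Estimator API; the input DataFrame train_df, the feature column names, and the parameters are illustrative:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Combine the raw feature columns into the single vector column
# that XGBoost expects (column names are illustrative)
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")

# Classifier from the wrapper; parameter names assume the wrapper's
# camelCase Spark ML convention
xgb = XGBoostClassifier(
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction"
)

# Wrapping both stages in a Pipeline lets us save the fitted model later
pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(train_df)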

Saving the model

Directly saving the model will throw a Java error. The workaround for this is to use a Pipeline from Spark ML and save the fitted PipelineModel instead.

PipelineModel save and load
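
A sketch of the workaround, with an illustrative output path:

from pyspark.ml import PipelineModel

# Saving the fitted PipelineModel avoids the Java error thrown when
# saving the XGBoost model object directly
model.write().overwrite().save("/home/hadoop/xgb_pipeline_model")

# Load it back for scoring
loaded_model = PipelineModel.load("/home/hadoop/xgb_pipeline_model")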

Scoring

Now we can load this saved model and use it to get predictions.

Get scores from the model
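
A scoring sketch; test_df and the selected columns are illustrative, assuming the classifier adds the usual probability and prediction columns:

# transform() appends the prediction columns to the input DataFrame
predictions = loaded_model.transform(test_df)
predictions.select("label", "probability", "prediction").show(5)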

The snippets above cover a classification model, but roughly the same steps apply for regression.

You can find the full code flow for both classification and regression in 4_xgboost.py on my git repo here.

Thank you for reading this article. You can subscribe below to receive email notifications for my new articles.

Please reach out to me via the comments in case you have any questions or inputs.

You can find Python/PySpark related reference material on my git repo here.
