XGBoost with PySpark on AWS EMR
Compatible versions to train and deploy XGBoost using PySpark
This article walks through the steps needed to use XGBoost with PySpark on AWS EMR.
At the version used here, XGBoost does not provide a PySpark API for Spark; it only provides Scala and other APIs. We will therefore use a custom Python wrapper for XGBoost from this PR.
We will use Spark 2.4.5 with XGBoost 0.90, as this is one of the working version pairs.
To install Spark on your machine, refer to these articles for Windows and for Linux. If you are using a cloud service such as AWS or Databricks, use a cluster with Spark 2.4.5 installed. I used EMR release 5.30.2, which comes with Spark 2.4.5.
Note: Making XGBoost work on a Windows machine requires additional files and steps.
The steps below cover the setup on AWS EMR.
Getting required files
XGBoost Package files
To use XGBoost on Spark, we need the package jar files, which are available from the Maven repository here.
We need two jar files for XGBoost v0.90. Use the following commands to download them.
wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.90/xgboost4j-0.90.jar
wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.90/xgboost4j-spark-0.90.jar
You can also use curl if wget does not work.
curl -o xgboost4j-0.90.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.90/xgboost4j-0.90.jar
curl -o xgboost4j-spark-0.90.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.90/xgboost4j-spark-0.90.jar
We will pass the locations of these jar files to Spark when we start a session. Download them on the machine where you want to use them and note the paths.
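Since the two paths are reused later as one comma-separated string, it can help to build them in one place. The following is a minimal sketch, not part of the original setup: the helper names (`jar_url`, `jar_paths`, `spark_jars_value`) are hypothetical, and `/home/hadoop` is assumed as the download directory because that is the path used in the session commands in this article.

```python
from pathlib import PurePosixPath

MAVEN_BASE = "https://repo1.maven.org/maven2/ml/dmlc"
XGB_VERSION = "0.90"
ARTIFACTS = ("xgboost4j", "xgboost4j-spark")


def jar_url(artifact, version=XGB_VERSION):
    """Maven URL for one XGBoost jar (same pattern as the wget commands)."""
    return f"{MAVEN_BASE}/{artifact}/{version}/{artifact}-{version}.jar"


def jar_paths(jar_dir, version=XGB_VERSION):
    """Expected local paths of the downloaded jars (cluster paths are POSIX)."""
    return [str(PurePosixPath(jar_dir) / f"{a}-{version}.jar") for a in ARTIFACTS]


def spark_jars_value(jar_dir, version=XGB_VERSION):
    """Comma-separated string in the form expected by --jars / spark.jars."""
    return ",".join(jar_paths(jar_dir, version))


if __name__ == "__main__":
    for artifact in ARTIFACTS:
        print(jar_url(artifact))
    # Assumed download directory; change it to wherever the jars were saved.
    print(spark_jars_value("/home/hadoop"))
```

The string returned by `spark_jars_value` has the same form as the value passed to `--jars` on the CLI or to `spark.jars` in the session builder.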
spark-xgboost wrapper
Download the zipped wrapper code from this PR comment: https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296
wget -O pyspark-xgboost_0.90.zip https://github.com/dmlc/xgboost/files/3384356/pyspark-xgboost_0.90_261ab52e07bec461c711d209b70428ab481db470.zip
The zipped module contains the wrapper code. We will add this package to the Spark session using the command below, passing the path of the downloaded zip.
spark.sparkContext.addPyFile("/home/hadoop/pyspark-xgboost_0.90.zip")
At this point, my home directory on the EMR cluster (/home/hadoop) contains the wrapper zip and the two jar files.
We are now ready to use XGBoost with Spark.
Reference Model training code
Starting the Spark session with the Jar files
Depending on whether you are using the CLI or a Jupyter Notebook for model training, use one of the following two commands to start your Spark session.
CLI
pyspark --executor-cores=1 --executor-memory=7g --driver-memory=4g --name vp --jars /home/hadoop/xgboost4j-0.90.jar,/home/hadoop/xgboost4j-spark-0.90.jar
Jupyter
Start your Jupyter notebook and then use the session builder below to start the Spark session:
from pyspark.sql import SparkSession

# spark.jars takes both XGBoost jars as one comma-separated string
spark = SparkSession\
    .builder\
    .appName("xgboost")\
    .config("spark.executor.memory", "7g")\
    .config("spark.driver.memory", "4g")\
    .config("spark.jars", "/home/hadoop/xgboost4j-0.90.jar,/home/hadoop/xgboost4j-spark-0.90.jar")\
    .getOrCreate()
You can check whether the jars have been added to the session in the Environment tab of the Spark UI.
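The same check can also be done from code by reading the jar settings back from the session configuration. This is a minimal sketch under a few assumptions: `spark` is the session started above, `configured_jars` and `missing_jars` are hypothetical helper names, and I look at both `spark.jars` and `spark.yarn.dist.jars` because, as far as I understand, jars passed with `--jars` on YARN may be recorded under the latter. It only inspects configuration values, not whether the classes actually loaded on the JVM.

```python
def configured_jars(spark, keys=("spark.jars", "spark.yarn.dist.jars")):
    """Collect jar entries from the session's Spark configuration."""
    conf = spark.sparkContext.getConf()
    entries = []
    for key in keys:
        value = conf.get(key, "")  # empty string when the key is not set
        entries.extend(p.strip() for p in value.split(",") if p.strip())
    return entries


def missing_jars(spark, expected_names):
    """Return the expected jar file names that are not among the configured jars."""
    # Compare on the file name only, so local paths and URIs are treated alike.
    found = {entry.rsplit("/", 1)[-1] for entry in configured_jars(spark)}
    return [name for name in expected_names if name not in found]


# Usage with a live session:
# print(missing_jars(spark, ["xgboost4j-0.90.jar", "xgboost4j-spark-0.90.jar"]))
```

An empty list means both jars are listed in the configuration; the Environment tab remains the reference check.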
Importing Model modules
Add the zipped module to the Spark session. The file can also be loaded from an S3 location.
Import Classifier and Regressor
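A minimal sketch of these two steps follows. It assumes the zip was saved as `/home/hadoop/pyspark-xgboost_0.90.zip` and that the package inside it is importable as `sparkxgb` and exposes `XGBoostClassifier` and `XGBoostRegressor`. Those module and class names are my assumption about the wrapper from the PR, so check the contents of the zip if the import fails. `load_xgboost_wrapper` is a hypothetical helper name.

```python
import importlib


def load_xgboost_wrapper(spark, zip_path):
    """Add the wrapper zip to the session, then import the two estimators."""
    # addPyFile ships the archive to the executors and also makes it
    # importable on the driver.
    spark.sparkContext.addPyFile(zip_path)
    # Assumed package name inside the zip; adjust if the archive differs.
    sparkxgb = importlib.import_module("sparkxgb")
    return sparkxgb.XGBoostClassifier, sparkxgb.XGBoostRegressor


# Usage with a live session:
# XGBoostClassifier, XGBoostRegressor = load_xgboost_wrapper(
#     spark, "/home/hadoop/pyspark-xgboost_0.90.zip")
```

The import has to come after `addPyFile`, because the package is not on the Python path before the zip is added.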
Training
Wrap the estimator in a Spark ML Pipeline so that the trained model object can be saved.
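As an illustration, a training pipeline could look roughly like the sketch below. Everything specific in it is an assumption on my side: `train_df` and the `label` column are hypothetical, and the keyword names passed to the estimator (`featuresCol`, `labelCol`, `objective`, `numRound`, `maxDepth`, `eta`) are assumed to mirror the Scala XGBoost4J-Spark parameters, which the wrapper may or may not expose in exactly this form. The estimator class is passed in as an argument so the function does not depend on the wrapper's module name.

```python
def feature_columns(all_columns, label_col):
    """Treat every column except the label as a numeric feature."""
    return [c for c in all_columns if c != label_col]


def build_xgb_pipeline(estimator_cls, feature_cols, label_col, **xgb_params):
    """Assemble the features into one vector column and append the XGBoost stage."""
    # Imported here so the pure-Python helper above works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    # Parameter names are assumptions; check them against the wrapper's classes.
    xgb = estimator_cls(featuresCol="features", labelCol=label_col, **xgb_params)
    return Pipeline(stages=[assembler, xgb])


# Usage with a live session (train_df is a hypothetical DataFrame):
# cols = feature_columns(train_df.columns, "label")
# pipeline = build_xgb_pipeline(XGBoostClassifier, cols, "label",
#                               objective="binary:logistic",
#                               numRound=100, maxDepth=4, eta=0.1)
# model = pipeline.fit(train_df)  # returns a PipelineModel
```

Fitting the pipeline returns a `PipelineModel`, which is the object that gets persisted rather than the bare XGBoost model.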
Saving the model
Saving the model directly throws a Java error. The workaround is to use Pipeline from Spark ML and save the fitted pipeline instead.
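A minimal sketch of the save step through the standard Spark ML writer is shown below. `save_pipeline_model` is a hypothetical helper name, `model` stands for the fitted `PipelineModel`, and the target path in the usage comment is only a placeholder.

```python
def save_pipeline_model(pipeline_model, path, overwrite=True):
    """Persist a fitted PipelineModel using the Spark ML writer."""
    writer = pipeline_model.write()
    if overwrite:
        # Without overwrite(), save() fails if the path already exists.
        writer = writer.overwrite()
    writer.save(path)
    return path


# Usage with a fitted pipeline (the path is a placeholder):
# save_pipeline_model(model, "/tmp/xgboost_pipeline_model")
```

The path can point to HDFS or S3 as well, depending on how the cluster's file systems are configured.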
Scoring
Now we can load this saved model and use it to get predictions.
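Loading and scoring can be sketched as follows. `load_and_score` is a hypothetical helper name, `score_df` is a hypothetical DataFrame with the same feature columns used in training, and the path is a placeholder. I assume the jars and the wrapper zip are already attached to the session before loading, since the saved stages refer to the XGBoost classes.

```python
def load_and_score(model_path, df):
    """Load a saved PipelineModel and return df with the model's output columns."""
    # Imported here because it is only usable with Spark installed.
    from pyspark.ml import PipelineModel

    model = PipelineModel.load(model_path)
    # transform() runs every pipeline stage, so raw feature columns are enough.
    return model.transform(df)


# Usage with a live session (placeholder path, hypothetical DataFrame):
# scored = load_and_score("/tmp/xgboost_pipeline_model", score_df)
# scored.show(5)
```

Because the feature assembler is saved as part of the pipeline, the scoring data does not need a pre-built features column.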
The snippets above are for a classification model, but roughly the same steps apply to regression.
You can find full code flow for both Classification and Regression on my git repo here in 4_xgboost.py.
Please reach out to me via comments in case you have any questions or any inputs.
You can find Python/PySpark related reference material on my git repo here.