Installing and using PySpark on Linux machine
Installation steps simplified
Below steps have been tried on WSL on a Windows 10 laptop, with two different Spark versions (2.4.5 and 3.1.2). I have used WSL, but the steps will work on an Ubuntu machine as well.
PySpark requires Java version 7 or later, and Python version 2.6–3.7 for Spark 2.x.x or Python 3.8 or later for Spark 3.x.x.
To check if Java is already available and find its version, open a terminal and type the following command.
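The standard version check is:

```shell
java -version
```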
If the above command gives an output like below, then you already have Java and hence can skip the installation steps below.
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
You can also check the location where Java is installed with below command.
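For example, `which` shows the binary on your PATH, and `readlink -f` follows the symlink chain to the actual installation directory:

```shell
which java
# Resolve symlinks to find the actual installation directory
readlink -f "$(which java)"
```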
Run following commands as a user with sudo privileges or root to update the packages index and install OpenJDK 11 JDK package:
sudo apt update
sudo apt install openjdk-11-jdk
In case you do not have sudo access to your machine, download the .tar.gz file of Java from the Oracle website (e.g. jre-8u271-linux-x64.tar.gz). Then, un-gzip and un-tar the downloaded file and you have a Java JRE or JDK installation.
Run below command in a terminal on the downloaded file to extract it:
tar -xvzf jre-8u271-linux-x64.tar.gz
Make a note of where Java is extracted as we will need the path later.
Use below command to check the version of Python.
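For example:

```shell
python --version
# or, if python points to Python 2 on your system
python3 --version
```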
Note: Spark 2.x.x doesn't support Python 3.8. Please install Python 3.7.x. For more information, refer to this stackoverflow question. Spark 3.x.x supports Python 3.8.
Spark gets installed independently of Python and can be used within any environment once activated.
Installing python virtual environments can be done via virtualenv package or by using Python via Anaconda. You can refer to this stackoverflow question in case you want to use virtualenv and have a different version of python installed in an environment.
Anaconda is easier to manage and is built and maintained for data science applications, hence I would recommend using Anaconda. Refer to this article to install Anaconda on Linux. Once you have followed the steps in the above article, you might want to run the below command to prevent conda from auto-activating on your terminal:
conda config --set auto_activate_base false
Getting the Spark files
Download the required spark version file from the Apache Spark Downloads website. Get the ‘spark-x.x.x-bin-hadoop2.7.tgz’ file, e.g. spark-3.1.2-bin-hadoop2.7.tgz.
Spark 3.x.x also comes with Hadoop 3.2, but this Hadoop version causes errors when writing Parquet files, so it is recommended to use Hadoop 2.7.
Make corresponding changes to the remaining steps for the chosen Spark version.
Downloading the file on the terminal:
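For example, with wget (the exact mirror URL may differ; copy the link shown on the downloads page for your version):

```shell
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
```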
Run below command on the downloaded file to extract it:
tar -xvzf spark-3.1.2-bin-hadoop2.7.tgz
Putting everything together
Create a folder for spark installation at the location of your choice. e.g.
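For example, a spark folder in your home directory (the name and location here are just an illustration):

```shell
mkdir -p ~/spark
```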
Extract the Spark file and move the extracted folder into the chosen folder:
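Assuming a ~/spark folder was created and the archive was extracted in the current directory, this might look like:

```shell
mv spark-3.1.2-bin-hadoop2.7 ~/spark/
```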
Setting environment variables
We have to set up the below environment variables to let Spark know where the required files are:
- SPARK_HOME and HADOOP_HOME (both can point to the extracted Spark folder, since the prebuilt Spark package ships with its own Hadoop libraries)
- Add SPARK_HOME/bin to PATH
- JAVA_HOME (set this only if you have extracted Java into a folder in the installation step)
To temporarily set an environment variable in Linux, run an export command:
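For example (the Spark path below is an illustration; use your own extraction location):

```shell
export SPARK_HOME=~/spark/spark-3.1.2-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
```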
Check if it worked:
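For example:

```shell
echo $SPARK_HOME
```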
To set these permanently, add the lines to your .bashrc file. Run the vi command to open the file in edit mode, or use any text editor.
Add the below lines, updating the paths as per the location where you have the Java and Spark files.
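A sketch of what these lines might look like, assuming Spark was moved to ~/spark as above and pointing HADOOP_HOME at the same folder (the prebuilt package bundles the Hadoop libraries it needs):

```shell
export SPARK_HOME=~/spark/spark-3.1.2-bin-hadoop2.7
export HADOOP_HOME=$SPARK_HOME
export PATH=$SPARK_HOME/bin:$PATH
```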
# export JAVA_HOME=/path/to/extracted/java
Then source your updated file:
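For the .bashrc file:

```shell
source ~/.bashrc
```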
Check if they have been set correctly:
echo $SPARK_HOME $HADOOP_HOME
Optional variables: set the below variables if you want to use PySpark with Jupyter notebook. If these are not set, the PySpark session will start on the console. This requires Jupyter notebook to be installed in your Python environment.
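These are the two variables in question; they tell PySpark to launch Jupyter as the driver's Python frontend:

```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```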
Alternatively, if you will be using both the terminal and Jupyter, you can skip these in the .bashrc file and instead run the export commands in your terminal whenever required.
Using PySpark in standalone mode
Run each command in a separate terminal. Activate the required Python environment in the terminals before running the below commands.
1. Deploying Master
spark-class org.apache.spark.deploy.master.Master -h 127.0.0.1
Open your browser and navigate to: http://localhost:8080/. This is the SparkUI.
2. Deploying Worker
spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
SparkUI will show the worker status.
3. PySpark shell
pyspark --master spark://127.0.0.1:7077 --num-executors 1 --executor-cores 1 --executor-memory 4g --driver-memory 2g --conf spark.dynamicAllocation.enabled=false
Adjust num-executors, executor-cores, executor-memory and driver-memory as per machine config. SparkUI will show the list of PySparkShell sessions.
The above command will open Jupyter Notebook instead of the pyspark shell if you have set the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables as well.
Run below command to start pyspark (shell or jupyter) session using all resources available on your machine. Activate the required python environment before running this command.
pyspark --master local[*]
Note: For WSL, if the Spark session shows a warning saying "Initial job has not accepted any resources", please run WSL with Administrator access.
Thank you for reading this article. You can subscribe below to receive email notifications for my new articles.
Please reach out to me via comments in case you have any questions or any inputs.
You can find python/pyspark related reference material on my git repo here.