Installing and using PySpark on Linux machine

Installation steps simplified

Vijay Patil
5 min read · Mar 20, 2022

The steps below have been tried on WSL on a Windows 10 laptop, with two different Spark versions (2.4.5 and 3.1.2). I used WSL, but the steps will work on a regular Ubuntu machine as well.

Image source: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1280px-Apache_Spark_logo.svg.png

Installing Prerequisites

PySpark requires Java 8 or later, along with Python 2.7–3.7 for Spark 2.x.x or Python 3.6 or later (including 3.8) for Spark 3.x.x.

Java

To check if Java is already available and find its version, open a terminal and type the following command.

java -version

If the command prints output like the following, Java is already installed and you can skip the installation steps below.

openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

You can also check where Java is installed with the command below.

which java

Output:

/usr/bin/java

Run the following commands as a user with sudo privileges (or as root) to update the package index and install the OpenJDK 11 JDK package:

sudo apt update
sudo apt install openjdk-11-jdk

In case you do not have sudo access to your machine, download the Java .tar.gz file from the Oracle website (e.g. jre-8u271-linux-x64.tar.gz). Un-gzipping and un-tarring the downloaded file gives you a Java JRE or JDK installation.
Run the following command in a terminal to extract the downloaded file:

tar -xvzf jre-8u271-linux-x64.tar.gz

Make a note of where Java is extracted, as we will need the path later for JAVA_HOME.
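For example, here is a minimal sketch, assuming the archive extracted into a folder named jre1.8.0_271 in the current directory (the exact folder name depends on your Java version) and that you want to keep it under ~/java:

mkdir -p ~/java
mv jre1.8.0_271 ~/java/
ls ~/java/jre1.8.0_271/bin/java   # ~/java/jre1.8.0_271 becomes JAVA_HOME later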

Python

Use the following command to check your Python version.

python --version
# or
python3 --version

Output

Python 3.8.10

Note: Spark 2.x.x does not support Python 3.8; install Python 3.7.x instead (for more information, refer to this Stack Overflow question). Spark 3.x.x supports Python 3.8.

Virtual Environments

Spark is installed independently of Python and can be used from any Python environment once that environment is activated.

You can create Python virtual environments with the virtualenv package or with Anaconda. Refer to this Stack Overflow question if you want to use virtualenv with a different Python version inside an environment.

Anaconda is easier to manage and is built and maintained for data science applications, so I recommend using it. Refer to this article to install Anaconda on Linux. Once you have followed those steps, you might want to run the command below to prevent conda from auto-activating in your terminal:

conda config --set auto_activate_base false
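If you go the Anaconda route, a minimal sketch of creating a dedicated environment for PySpark could look like the following (the environment name pyspark_env and Python 3.8 are just example choices; use Python 3.7 for Spark 2.x.x):

conda create -n pyspark_env python=3.8 jupyter -y
conda activate pyspark_env
python --version   # should report Python 3.8.x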

Getting the Spark files

Download the file for the required Spark version from the Apache Spark downloads website. Get the ‘spark-x.x.x-bin-hadoop2.7.tgz’ file, e.g. spark-3.1.2-bin-hadoop2.7.tgz.

Spark 3.x.x is also available bundled with Hadoop 3.2, but that Hadoop version caused errors for me when writing Parquet files, so I recommend the Hadoop 2.7 build.

Adjust the file names in the remaining steps to match the Spark version you choose.

To download the file from the terminal:

wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz

Run the following command to extract the downloaded file:

tar -xvzf spark-3.1.2-bin-hadoop2.7.tgz

Putting everything together

Setup folder

Create a folder for the Spark installation at a location of your choice, e.g. ./spark.

mkdir spark
cd spark

Extract the Spark archive and move the extracted folder into the chosen folder, so you end up with ./spark/spark-3.1.2-bin-hadoop2.7.
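For example, assuming the downloaded archive is in your home directory and ./spark is the folder created above (run this from the directory that contains ./spark):

tar -xvzf ~/spark-3.1.2-bin-hadoop2.7.tgz -C ./spark
ls ./spark/spark-3.1.2-bin-hadoop2.7   # this is the folder SPARK_HOME will point to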

Setting environment variables

We have to set up the environment variables below so Spark knows where the required files are:

  • SPARK_HOME
  • HADOOP_HOME
  • Add SPARK_HOME/bin to PATH
  • JAVA_HOME (set this only if you have extracted java into a folder in the installation step)

To temporarily set an environment variable in Linux, run an export command:

export SPARK_HOME=/home/vijayp/spark/spark-3.1.2-bin-hadoop2.7

Check if it worked with echo $SPARK_HOME; it should print the path you just exported.

To set these permanently, add the lines to your ~/.bashrc file. Open it with vi or any other text editor.

vi ~/.bashrc

Add the lines below, updating the paths to match where your Java and Spark files are located.

export SPARK_HOME=/home/vijayp/spark/spark-3.1.2-bin-hadoop2.7
export HADOOP_HOME=$SPARK_HOME/hadoop
export PATH="$SPARK_HOME/bin:$PATH"
# export JAVA_HOME=/path/to/extracted/java

Then source your updated file:

source ~/.bashrc

Check if they have been set correctly:

echo $SPARK_HOME $HADOOP_HOME
echo $PATH
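You can also confirm that the Spark binaries are now on your PATH; spark-submit --version prints the Spark version banner:

which pyspark
spark-submit --version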

Optional variables: set the variables below if you want to use PySpark with Jupyter Notebook. If they are not set, the PySpark session will start on the console. This requires Jupyter Notebook to be installed in your Python environment.

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

Alternatively, if you will be using both the console shell and Jupyter, skip these lines in .bashrc and instead run the export commands directly in your terminal whenever required.
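For instance, a one-off Jupyter session might look like this (a sketch; run the exports in the same terminal from which you start PySpark):

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --master local[*]   # starts Jupyter Notebook instead of the console shell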

Using PySpark in standalone mode

Commands

Run each of the following commands in a separate terminal, and activate the required Python environment in each terminal before running them.

  1. Deploying Master
    spark-class org.apache.spark.deploy.master.Master -h 127.0.0.1
    Open your browser and navigate to: http://localhost:8080/. This is the Spark UI.
  2. Deploying Worker
    spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
    The Spark UI will show the worker status.
  3. PySpark shell
    pyspark --master spark://127.0.0.1:7077 --num-executors 1 --executor-cores 1 --executor-memory 4g --driver-memory 2g --conf spark.dynamicAllocation.enabled=false

Adjust num-executors, executor-cores, executor-memory and driver-memory as per your machine configuration. The Spark UI will show the list of PySparkShell sessions.

The above command will open a Jupyter Notebook instead of the pyspark shell if you have also set the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables.

Alternative

Run the command below to start a pyspark session (shell or Jupyter) using all the resources available on your machine. Activate the required Python environment before running this command.

pyspark --master local[*]

Note: on WSL, if the Spark session shows a warning saying "Initial job has not accepted any resources", run WSL with administrator access.
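To verify the whole setup end to end without an interactive shell, here is a minimal sketch that submits a tiny job with spark-submit (the file name smoke_test.py is just an example; swap --master for spark://127.0.0.1:7077 to test the standalone cluster instead):

cat > smoke_test.py <<'EOF'
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and run a trivial job
spark = SparkSession.builder.appName("smoke-test").getOrCreate()
print(spark.range(100).count())  # should print 100
spark.stop()
EOF

spark-submit --master local[*] smoke_test.py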

Thank you for reading this article. You can subscribe below to receive email notifications for my new articles.

Please reach out to me via comments in case you have any questions or any inputs.

You can find python/pyspark related reference material on my git repo here.
