Deploy Spark Using Docker

In this blog, we will learn how to deploy Spark using Docker. We will use this environment in the future for learning Spark and solving data engineering problems.

With Spark, we also want to install a few packages (connectors) so that we don't need to install them separately. As we learn Spark, we will mostly need the packages below (a short usage sketch follows the list).

  • Delta Lake
  • AWS S3 (s3a://)
  • Google Cloud Storage (gs://)
  • Azure Blob Storage (wasbs://)
  • Azure Data Lake Storage Gen1 (adl://)
  • Azure Data Lake Storage Gen2 (abfss://)
  • Snowflake
  • Hadoop cloud magic committer for AWS
  • PostgreSQL
  • MySQL
  • MongoDB
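
These connectors come bundled in the image we will pick, so they can be used straight from PySpark once the container is running. Below is a minimal sketch of what that looks like; the bucket, host, table, and credential values are placeholders, not values from the repo.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("connector-check").getOrCreate()

# Delta Lake table stored on S3 (bucket and path are placeholders)
delta_df = spark.read.format("delta").load("s3a://my-bucket/path/to/delta-table")

# PostgreSQL over JDBC (host, database, table, and credentials are placeholders)
pg_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")
    .option("dbtable", "public.my_table")
    .option("user", "spark_user")
    .option("password", "secret")
    .load()
)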

If we check Docker Hub, we see that there are multiple images available for Spark.

We will select the image below, as by default it provides:

datamechanics/spark

  • All the connectors we need
  • Jupyter Notebook
  • Built-in Python and PySpark support
  • pip and conda (so it is easy to install additional packages)

Creating the Image and Container for Spark

We want to install a few packages and start a JupyterLab session, so please use the GitHub repo below for deploying Spark:

https://github.com/shahkalpan/Spark

First, the build pulls the datamechanics/spark base image; after that, it installs the Python packages and starts the JupyterLab service.

docker build . -t sparkhome
docker image ls

Now, we will start the container:

docker run -p 8888:8888 --name spark -d sparkhome

The container logs (for example, docker logs spark) will contain the link for JupyterLab.

Copy and paste it into a browser to start a PySpark session from the browser.

Running PySpark from browser terminal

We will open a Terminal from the JupyterLab launcher and check whether PySpark is properly installed.
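
For example, after running the pyspark command in that terminal, a couple of quick checks confirm the installation; the spark session object is created automatically by the interactive shell.

# Inside the interactive pyspark shell, `spark` is already defined.
spark.version            # shows the Spark version bundled in the image
spark.range(5).show()    # builds and displays a tiny DataFrame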

Running PySpark from Jupyter notebook (Browser)

Also, create a sample notebook to check whether Spark works from a notebook.
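
A sketch of such a notebook cell is shown below; the column names and rows are only illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-check").getOrCreate()

# Tiny DataFrame to confirm the Spark session works end to end
df = spark.createDataFrame(
    [("alice", 1), ("bob", 2), ("carol", 3)],
    ["name", "id"],
)
df.show()
print(spark.version)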

Running PySpark from VS Code

We can also connect to the Spark session from VS Code. For that, we need to install the Jupyter notebook plugin, connect to a remote Jupyter server, and specify our Spark link (with the token).

Install the Jupyter plugin below.

Now open our sample notebook from the GitHub repo and connect to our Spark instance.

Once connected, the status bar at the bottom will show Jupyter remote instead of Jupyter local.
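
One optional way to confirm that notebook cells are really executing inside the container, and not on the local machine, is to print the hostname from a cell:

# Docker sets the container hostname to the container ID by default,
# so this should not match the local machine's name.
import socket
print(socket.gethostname())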

Running PySpark from container

Go to a terminal and exec into the container:

docker exec -it spark /bin/sh
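
Once inside, we can also run a standalone PySpark script with spark-submit. The file name and contents below are purely illustrative.

# Hypothetical /tmp/word_count.py, run with: spark-submit /tmp/word_count.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Classic word count on a tiny in-memory RDD
words = spark.sparkContext.parallelize(["spark", "docker", "spark", "jupyter"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())

spark.stop()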

We now have Spark installed and ready to use. In the next session, we will discuss basic Spark concepts and start writing Spark programs.

Also, find the YouTube video about installing Spark below.


