Introduction
In this blog post, we set up a Data Engineering tool suite in our local environment using Docker. For now, the suite includes the tools listed below. We will update the Docker files and add more tools in the future.
- Apache Spark
- Jupyter Lab
- Package for Delta Lake
- Package for AWS S3 (s3a://)
- Package for Google Cloud Storage (gs://)
- Package for Azure Blob Storage (wasbs://)
- Package for Azure Data Lake Storage Gen1 (adl://)
- Package for Azure Data Lake Storage Gen2 (abfss://)
- Snowflake
- Hadoop cloud magic committer for AWS
- PostgreSQL
- MySQL
- MongoDB
Clone the GitHub repo below to your local system:
https://github.com/shahkalpan/DataEngineeringSuite
Deploying Tool Suite using Docker
Once you have cloned the GitHub repo, you will see the files below.

The configuration in the docker-compose file is as follows:
- You can change the passwords for MongoDB, PostgreSQL and MySQL.
- We have exposed all the required ports so that we can easily access the services from our laptop.
- For Spark, we have exposed the ports for Jupyter Lab and the Spark job UIs.
- For MySQL, we have exposed the port for accessing the database.
- For MongoDB, we have exposed the port for accessing it.
- For PostgreSQL, we have exposed the port for accessing it.

The next step is to build the images and start the containers. Use the command below:
docker-compose up --build

Once the Docker images are built and the containers are up, we can check their status as below:
docker-compose ps

We can see that all the containers are up, along with the ports each one exposes.
We can also check this from Docker Desktop.
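If you prefer a quick scripted check, the minimal sketch below tries to open a TCP connection to each exposed port from the laptop. Only port 8888 for Jupyter Lab appears in this post. The other port numbers are the tools' usual defaults and are assumptions here, so adjust them to match the ports in your docker-compose file. The helper names `port_is_open` and `check_ports` are ours, made up for illustration.

```python
import socket

# Port 8888 is the Jupyter Lab port used in this post. The other values are
# ASSUMED defaults for each tool, not confirmed from the docker-compose file.
ASSUMED_PORTS = {
    "Jupyter Lab": 8888,
    "Spark job UI": 4040,
    "MySQL": 3306,
    "MongoDB": 27017,
    "PostgreSQL": 5432,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_open in check_ports("127.0.0.1", ASSUMED_PORTS).items():
        print(f"{name}: {'reachable' if is_open else 'not reachable'}")
```

An open port only tells us that something is listening. It does not confirm that the service accepts our credentials, which is why we still connect with a client for each tool in the next steps.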

The next step is to check whether all the tools are installed and configured correctly.
Spark
We can go inside the container and check whether Spark is properly configured.

We can also check from a Jupyter notebook:
http://127.0.0.1:8888

Once we start a Spark session, it provides a link to the Spark UI, where we can check the running jobs.
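As a rough sketch of that check, the notebook cell below starts a Spark session and collects the Spark version, the UI link and the result of a tiny job. It assumes `pyspark` is importable inside the Jupyter Lab container. The function names `start_spark` and `describe_session` and the app name are made up for illustration and do not come from the repository.

```python
def start_spark(app_name="tool-suite-check"):
    """Create (or reuse) a SparkSession. The app name is an arbitrary example."""
    # Imported lazily so this cell can be defined even where pyspark is missing.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def describe_session(spark):
    """Collect a few facts that show the session is actually working."""
    return {
        "version": spark.version,
        # Link to the Spark UI for this session.
        "ui_url": spark.sparkContext.uiWebUrl,
        # Runs a small job: counts the rows of a 5-row DataFrame.
        "row_count": spark.range(5).count(),
    }
```

In a notebook, run `spark = start_spark()` and then `describe_session(spark)`. The `ui_url` value holds the address Spark reports from inside the container, so open the same port on `127.0.0.1` if the hostname does not resolve from your laptop.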

MongoDB
We used MongoDB Compass to connect to MongoDB. The connection succeeded, and we can see the collections.

MySQL
We are able to connect to MySQL using MySQL Workbench.

PostgreSQL
We are also able to connect to the PostgreSQL server. We can use pgAdmin or a VS Code database plugin to check that.

We now have all the required tools installed on our system. Starting with our next blog post, we will learn Data Engineering concepts and solve Data Engineering problems.
Please also find a video which explains this.