Data Engineering Tool Suite

 

Introduction

In this blog post we will set up a Data Engineering tool suite in our local environment using Docker. For now, the suite includes the tools below; in future posts we will update the Docker files and add more tools.

  • Apache Spark
    • Jupyter Lab
    • Package for Delta Lake
    • Package for AWS S3 (s3a://)
    • Package for Google Cloud Storage (gs://)
    • Package for Azure Blob Storage (wasbs://)
    • Package for Azure Data Lake Storage Gen1 (adl://)
    • Package for Azure Data Lake Storage Gen2 (abfss://)
    • Snowflake connector
    • Hadoop S3A "magic" committer for AWS
  • PostgreSQL
  • MySQL
  • MongoDB

Clone the GitHub repo below onto your local system:

https://github.com/shahkalpan/DataEngineeringSuite

Deploying Tool Suite using Docker

Once you have cloned the GitHub repo, you will see the repository files, including the docker-compose file.

The configuration in the docker-compose file is as follows:

  • You can change the passwords for MongoDB, PostgreSQL, and MySQL.
  • All required ports are exposed so the services can be accessed directly from your laptop.
  • For Spark, the ports for Jupyter Lab and the Spark job UIs are exposed.
  • For MySQL, MongoDB, and PostgreSQL, the database ports are exposed for client access.
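The layout described above looks roughly like the sketch below. This is illustrative only: the service names, images, port numbers, and placeholder passwords are assumptions, so check the repo's actual docker-compose.yml for the real values.

```yaml
# Illustrative sketch -- service names, images, ports, and passwords
# are assumptions; see the repo's docker-compose.yml for the real values.
version: "3.8"
services:
  spark:
    build: ./spark              # custom image with Spark + Jupyter Lab
    ports:
      - "8888:8888"             # Jupyter Lab
      - "4040-4045:4040-4045"   # Spark job UIs
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example    # change me
    ports:
      - "5432:5432"
  mysql:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: example  # change me
    ports:
      - "3306:3306"
  mongo:
    image: mongo:6
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    ports:
      - "27017:27017"
```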

The next step is to build the images and start the containers with the command below:

docker-compose up --build

Once the Docker images are built and the containers are up, we can verify them as below:

docker-compose ps 

We can see that all the containers are up, along with their exposed ports.

We can also check this from Docker Desktop.

The next step is to check whether all the tools are installed and configured correctly.

Spark

We can go inside the container and check whether Spark is properly configured.

We can also check via Jupyter Lab at:

http://127.0.0.1:8888

Once we start a Spark session, it provides a link to the Spark UI for verification.
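Starting a session in Jupyter Lab would look roughly like the sketch below. The package coordinates and versions are assumptions, not the ones baked into the repo's Dockerfile, so treat this as a template and check the repo for the real list.

```python
# Sketch: assemble the spark.jars.packages value for the suite's connectors.
# Coordinates/versions below are assumptions -- check the repo's Dockerfile.
PACKAGES = [
    "io.delta:delta-core_2.12:2.4.0",           # Delta Lake
    "org.apache.hadoop:hadoop-aws:3.3.4",       # s3a://
    "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.21",  # gs://
    "org.apache.hadoop:hadoop-azure:3.3.4",     # wasbs:// and abfss://
    "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4",       # Snowflake
]

def packages_conf(packages=PACKAGES):
    """spark.jars.packages expects one comma-separated string."""
    return ",".join(packages)

# Inside Jupyter Lab the session would then be started roughly like this:
# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .appName("suite-check")
#          .config("spark.jars.packages", packages_conf())
#          .getOrCreate())
```

The commented `SparkSession` part is left out of the runnable portion because it needs the container's Spark installation; the helper just shows how the connector packages are wired in.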

MongoDB

We used MongoDB Compass to connect to MongoDB; the connection succeeds and we can see the collections.
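The connection string Compass (or pymongo) needs can be sketched as below; the username, password, and port are placeholder assumptions, so match them to your docker-compose settings.

```python
# Sketch: build the MongoDB connection URI for Compass or pymongo.
# Credentials and port are assumptions -- match them to docker-compose.yml.
def mongo_uri(user="root", password="example", host="127.0.0.1", port=27017):
    return f"mongodb://{user}:{password}@{host}:{port}/"

# Once the container is up, this URI could be used with e.g.:
# pymongo.MongoClient(mongo_uri())
```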

MySQL

We are able to connect to MySQL using MySQL Workbench.
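The same container is also reachable from Spark over JDBC. A minimal sketch, assuming the default MySQL port and a hypothetical database name:

```python
# Sketch: the JDBC URL Spark (or Workbench's connection form) points at.
# Port and database name are assumptions -- match them to docker-compose.yml.
def mysql_jdbc_url(host="127.0.0.1", port=3306, db="test"):
    return f"jdbc:mysql://{host}:{port}/{db}"

# In a Spark session this would be used roughly like:
# spark.read.format("jdbc").option("url", mysql_jdbc_url()) \
#      .option("dbtable", "some_table").load()
```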

PostgreSQL

We are able to connect to the PostgreSQL server as well; we can use pgAdmin or the VS Code database plugin to check it.
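For scripted access, the DSN those tools use can be sketched as below; the user, password, and database name are placeholder assumptions to be matched against the docker-compose settings.

```python
# Sketch: build the PostgreSQL DSN used by pgAdmin, psycopg2, or SQLAlchemy.
# Credentials and database name are assumptions -- match docker-compose.yml.
def postgres_dsn(user="postgres", password="example",
                 host="127.0.0.1", port=5432, db="postgres"):
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

# Once the container is up, this DSN could be used with e.g.:
# psycopg2.connect(postgres_dsn())  or  sqlalchemy.create_engine(postgres_dsn())
```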

We now have all the required tools installed on our system. Starting with the next blog post, we will learn Data Engineering concepts and solve Data Engineering problems.

Please also find below a video that explains this setup.


 
