Data Engineering Problem 0 (Employees with salary more than 100K)

Introduction

As we discussed earlier, we will start solving Data Engineering problems using SQL (PostgreSQL and MySQL), NoSQL (MongoDB or Cassandra), and Apache Spark (PySpark and Spark SQL).

We will start with very easy SQL problems and work up to difficult ones, and we will also solve problems involving data loads (batch, replication, and streaming).

Please find our plan below:

https://developershome.blog/2023/01/28/data-engineering-learning-plan-system-setup/

Problem Statement

We have a table of employees and their salaries. Write queries to solve the problems below:

  1. List all the employees whose salary is more than 100K
  2. Provide distinct department id
  3. Provide first and last name of employees
  4. Provide all the details of the employees whose last name is 'Johnson'

Problem Difficulty Level: Easy

Data Structure

The sample data for this problem is provided in CSV format.

Please also clone the repo below from GitHub, which contains the problem statement, sample data, and solutions.

https://github.com/developershomes/DataEngineeringProblems

Solve using SQL (PostgreSQL)

First, we will solve this problem using PostgreSQL. Open PostgreSQL and create a table using the script below so that we can load the sample data. This way we will also learn how to create tables in PostgreSQL and load data from a CSV file.

CREATE TABLE IF NOT EXISTS public.employee_salary
(
    id bigint,
    first_name character varying(100) COLLATE pg_catalog."default",
    last_name character varying(100) COLLATE pg_catalog."default",
    salary bigint,
    department_id bigint
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.employee_salary
    OWNER to postgres;

Once we execute this query, a table named "employee_salary" will be created. Now we will load data into this table.

Right-click on the table and select Import/Export Data.

In the dialog, select Import (not Export), choose the CSV file, set the format to csv, and click OK.

You will see a popup showing that the data load is in progress, and after a few seconds the data will be loaded into the table.
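
If you prefer to load the file from SQL instead of the pgAdmin dialog, a COPY statement does the same job. This is a minimal sketch that assumes the CSV has a header row and sits at a path readable by the server; the file path here is only a placeholder.

-- Load the CSV directly into the table (adjust the path to your file)
COPY public.employee_salary (id, first_name, last_name, salary, department_id)
    FROM '/path/to/employee_salary.csv'
    WITH (FORMAT csv, HEADER true);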

The first step is to check the data in the table. For that, run the query below:

SELECT * FROM public.employee_salary;

Now, let's go through the problems one by one.

  1. List all the employees whose salary is more than 100K

For this, we will use a WHERE clause and filter on salary greater than 100K. We need all the details, so we can use '*' or list every column name. Our query is below.

SELECT id, first_name, last_name, salary, department_id
	FROM public.employee_salary
	WHERE salary > 100000 ;

As output, we get all the employees whose salary is more than 100K; in total we get 39 rows.

2. Provide distinct department id

To get the unique (distinct) department IDs, we will use DISTINCT in SQL. Our query is below.

SELECT DISTINCT department_id
	FROM public.employee_salary ;

As output, we get the unique department IDs.

3. Provide first and last name of employees

We only need the first and last names, so we select just those two columns.

SELECT first_name, last_name
	FROM public.employee_salary ;

4. Provide all the details of the employees whose last name is 'Johnson'

Here, we again use a filter, this time on the last_name column. Our query is below.

SELECT id, first_name, last_name, salary, department_id
	FROM public.employee_salary 
	WHERE last_name = 'Johnson' ;

Solve the Problem using Spark

In Spark, we will solve this problem using PySpark functions and also using Spark SQL.

For both approaches, we first need to load the data into a DataFrame. Open JupyterLab, create a file, and paste the CSV data into it.

Now, create a Jupyter notebook with Python as the kernel.

The first step is to load the data into a DataFrame.

Once the data is loaded, also check the schema so that we are sure all the columns have the proper data types.

Also check the sample data using the show() function.
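
Here is a minimal sketch of those steps; it assumes the CSV was saved as employee_salary.csv in the notebook's working directory (the file name is just a placeholder) and that a SparkSession is available as spark.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("EmployeeSalaryProblem").getOrCreate()

# Load the CSV into a DataFrame, letting Spark infer the column types
df = spark.read.csv("employee_salary.csv", header=True, inferSchema=True)

# Check that every column got a sensible data type
df.printSchema()

# Look at a few sample rows
df.show(5)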

Solve using PySpark functions

  1. List all the employees whose salary is more than 100K

For this, we can use the filter() function or the where() function, as sketched below.
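
A sketch using the DataFrame df from the load step above; both forms give the same result.

# Employees earning more than 100K, using filter()
df.filter(df.salary > 100000).show()

# where() is an alias of filter(), so this is equivalent
df.where(df["salary"] > 100000).show()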

2. Provide distinct department id

For this, we can use the distinct() function in PySpark, as sketched below.
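
For example, with the same df:

# Unique department ids
df.select("department_id").distinct().show()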

3. Provide first and last name of employees

For this, we can use the select() function and pass the list of column names, as sketched below.
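
For example:

# Only the first and last name columns
df.select("first_name", "last_name").show()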

4. Provide all the details of the employees whose last name is 'Johnson'
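
This is again a filter, this time on the last_name column; a sketch with the same df:

# All columns for employees whose last name is 'Johnson'
df.filter(df.last_name == "Johnson").show()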

Solve using Spark SQL

Before using Spark SQL, we need to create a temporary view (or a Hive table) from the DataFrame, as shown below.
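
A sketch, assuming the DataFrame from the load step is still named df:

# Register the DataFrame as a temporary view so Spark SQL can query it
df.createOrReplaceTempView("tmpEmployee")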

Now we have a table named 'tmpEmployee' available, and we can write queries similar to the ones we wrote for PostgreSQL; Spark SQL syntax is very close to ANSI SQL.

  1. List all the employees whose salary is more than 100K
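
For example, using the spark session and the tmpEmployee view registered above:

# Employees with salary greater than 100K
spark.sql("SELECT * FROM tmpEmployee WHERE salary > 100000").show()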

2. Provide distinct department id
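
For example:

# Distinct department ids
spark.sql("SELECT DISTINCT department_id FROM tmpEmployee").show()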

3. Provide first and last name of employees
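
For example:

# First and last name of all employees
spark.sql("SELECT first_name, last_name FROM tmpEmployee").show()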

4. Provide all the details of the employees whose last name is 'Johnson'
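
For example:

# All details for employees whose last name is 'Johnson'
spark.sql("SELECT * FROM tmpEmployee WHERE last_name = 'Johnson'").show()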

Also watch the video below to understand in more detail. 

