Introduction:
As we discussed earlier, we will start solving Data Engineering problems using SQL (PostgreSQL and MySQL), NoSQL (MongoDB or Cassandra) and Apache Spark (PySpark and Spark SQL).
We will start from very easy SQL problems and work up to difficult ones, and we will also solve problems regarding data loads (batch, replication and streaming).
Please find our plan below.
Problem Statement
We have a table with employees and their salaries. Write queries to solve the problems below:
- List all the employees whose salary is more than 100K
- Provide distinct department id
- Provide first and last name of employees
- Provide all the details of the employees whose last name is 'Johnson'
Problem Difficulty Level: Easy
Data Structure
The employee_salary table has five columns: id, first_name, last_name, salary and department_id (see the CREATE TABLE script below).
Data for this problem
Please also clone the repo below from GitHub, where you will find the problem statement, sample data and the solution.
https://github.com/developershomes/DataEngineeringProblems
Solve using SQL (PostgreSQL)
First, we will solve this problem using PostgreSQL. Open PostgreSQL and create a table using the script below so that we can load the sample data. This way we will also learn how to create tables in PostgreSQL and load data from a CSV file.
CREATE TABLE IF NOT EXISTS public.employee_salary
(
    id bigint,
    first_name character varying(100) COLLATE pg_catalog."default",
    last_name character varying(100) COLLATE pg_catalog."default",
    salary bigint,
    department_id bigint
)
TABLESPACE pg_default;
ALTER TABLE IF EXISTS public.employee_salary
OWNER to postgres;
Once we execute this query, a table named "employee_salary" will be created. Now we will load data into this table.
Right-click on the table and select Import/Export Data.

In the Import/Export dialog, choose Import, select the file, select CSV as the file format, and click OK.

You will see a popup showing that the data load is in progress, and after a few seconds you will see that the data is loaded into the table.

The first step is to check the data in the table. For that, write the query below:
SELECT * FROM public.employee_salary;

Now, as per the problem:
1. List all the employees whose salary is more than 100K
For this, we will use a WHERE condition and filter on salary greater than 100K. Since we need all the data, we can either use '*' or list all the column names. Our query is below.
SELECT id, first_name, last_name, salary, department_id
FROM public.employee_salary
WHERE salary > 100000 ;
As output, we get all the employees whose salary is more than 100K; in total, we get 39 rows.

2. Provide distinct department id
To get the unique (distinct) department ids, we use the DISTINCT keyword in SQL, and our query is below.
SELECT DISTINCT department_id
FROM public.employee_salary ;
As output, we get the unique department ids.

3. Provide first and last name of employees
We only need the first and last names, so we select just those two columns in the query.
SELECT first_name, last_name
FROM public.employee_salary ;

4. Provide all the details of the employees whose last name is 'Johnson'
Here, we again use a filter, this time on the last_name column. Our query is below.
SELECT id, first_name, last_name, salary, department_id
FROM public.employee_salary
WHERE last_name = 'Johnson' ;

Solve the Problem using Spark
In Spark, we will solve this problem using PySpark functions and also using Spark SQL.
For both approaches, we first need to load the data into a DataFrame. Open JupyterLab, create a file, and paste the CSV data into that file.

Now, create a Jupyter notebook using Python as the kernel.

The first step is to load the data into a DataFrame.
Once the data is loaded, check the schema so we are sure all columns have the proper data types.
Also check sample data by using the show() function; a sketch of these steps is shown below.

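A minimal sketch of these steps (the file name employee_salary.csv is an assumption; adjust the path to wherever you saved the CSV):

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("EmployeeSalary").getOrCreate()

# Load the CSV into a DataFrame; header=True uses the first row as column names,
# inferSchema=True lets Spark detect the numeric columns
df = spark.read.csv("employee_salary.csv", header=True, inferSchema=True)

# Verify that every column has the proper data type
df.printSchema()

# Look at a few sample rows
df.show(5)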
Solve using PySpark functions
1. List all the employees whose salary is more than 100K
For this, we can use either the filter() function or the where() function, as in the sketch below.


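A minimal sketch, assuming the DataFrame from the previous step is named df (both calls return the same result):

# Employees whose salary is more than 100K, using filter()
df.filter(df.salary > 100000).show()

# Equivalent, using where()
df.where(df.salary > 100000).show()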
2. Provide distinct department id
For this, we can use the distinct() function in PySpark.

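A sketch of the same idea with df:

# Unique department ids
df.select("department_id").distinct().show()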
3. Provide first and last name of employees
For this, we can use the select() function and pass the list of column names.

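For example:

# Only the first and last name columns
df.select("first_name", "last_name").show()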
4. Provide all the details of the employees whose last name is 'Johnson'

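Here we again apply a filter, this time on the last_name column. A sketch:

# All details for employees whose last name is 'Johnson'
df.filter(df.last_name == "Johnson").show()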
Solve using Spark SQL
Before using Spark SQL, we need to create a temporary view (or Hive table) from the DataFrame.

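A minimal sketch, assuming the DataFrame is named df as above:

# Register the DataFrame as a temporary view named 'tmpEmployee'
df.createOrReplaceTempView("tmpEmployee")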
Now we have a view named 'tmpEmployee' available, and we can write queries similar to the ones we wrote for PostgreSQL. Spark SQL syntax is very similar to ANSI SQL.
1. List all the employees whose salary is more than 100K

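A sketch of this query (assuming the Spark session object is named spark):

spark.sql("""
    SELECT id, first_name, last_name, salary, department_id
    FROM tmpEmployee
    WHERE salary > 100000
""").show()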
2. Provide distinct department id

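For example:

spark.sql("SELECT DISTINCT department_id FROM tmpEmployee").show()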
3. Provide first and last name of employees

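For example:

spark.sql("SELECT first_name, last_name FROM tmpEmployee").show()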
4. Provide all the details of the employees whose last name is 'Johnson'

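A sketch of this last query:

spark.sql("""
    SELECT id, first_name, last_name, salary, department_id
    FROM tmpEmployee
    WHERE last_name = 'Johnson'
""").show()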