Introduction:
As discussed earlier, we will solve Data Engineering problems using SQL (PostgreSQL and MySQL), NoSQL (MongoDB or Cassandra), and Apache Spark (PySpark and Spark SQL).
We will start with very easy SQL problems and work up to difficult ones, and we will also solve problems involving data loads (batch, replication, and streaming).
Problem Statement
We have a table with employees and their salaries, however, some of the records are old and contain outdated salary information. Find the current salary of each employee assuming that salaries increase each year. Output their id, first name, last name, department ID, and current salary. Order your list by employee ID in ascending order.
Problem Difficulty Level: Medium
Data Structure

Data for this problem
Please also clone the GitHub repo below for the SQL queries and the Jupyter notebook.
https://github.com/developershomes/DataEngineeringProblems/tree/main/Problem%201
Please see the earlier blog to learn how to load CSV data into PostgreSQL and JupyterLab.
Solve problem using PostgreSQL
Once the table is created and the data is loaded into it, the table looks as below.

We can see that the table has 95 rows in total, and each employee (first and last name) appears with two salary entries.
Now, let's break down the problem and understand it:
- We need each employee's latest salary, and per the problem statement, the latest salary is always the higher one.
- So for each employee we need to find the maximum salary.
- Along with the max salary, we also need to display the id, first name, last name, and department ID.
- Finally, the results must be ordered by employee ID.
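The breakdown above can be sketched in plain Python before touching the database. This is just an illustration of the "max salary per employee" idea; the rows below are invented sample data, not the real table contents.

```python
# Invented sample rows: (id, first_name, last_name, department_id, salary)
rows = [
    (10, "Todd", "Wilson", 1006, 100000),
    (10, "Todd", "Wilson", 1006, 110000),
    (11, "Justin", "Simon", 1005, 120000),
    (11, "Justin", "Simon", 1005, 130000),
]

# For each employee, keep only the maximum salary (the "current" one).
current = {}
for emp_id, first, last, dept, salary in rows:
    key = (emp_id, first, last, dept)
    current[key] = max(current.get(key, 0), salary)

# Order the result by employee id.
result = sorted(
    (emp_id, first, last, dept, salary)
    for (emp_id, first, last, dept), salary in current.items()
)
print(result)
```

Each duplicate pair collapses to a single row carrying the higher salary, which is exactly what the SQL will do with MAX() and GROUP BY.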
Let's start with the easy part: listing the columns we need to display.

We can see that all the other column values repeat. For example, for Todd, the id, last name, and department ID are the same in both rows.
To get the maximum salary, we use the MAX() function. When we use an aggregate function, every non-aggregated column in the SELECT list must also appear in the GROUP BY clause; in this case, that is all the remaining columns.
SELECT id, first_name, last_name, department_id , MAX(salary) AS latest_salary
FROM public.employee
GROUP BY id, first_name, last_name, department_id;
If we call MAX() without the GROUP BY clause, PostgreSQL raises an error that points out the missing grouping.

Now, with 'group by'

The last step is to order the result by employee ID, using the ORDER BY clause.
Our final query will be as below.
SELECT id, first_name, last_name, department_id , MAX(salary) AS latest_salary
FROM public.employee
GROUP BY id, first_name, last_name, department_id
ORDER BY id;
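The shape of this query can be verified locally without a PostgreSQL server, using Python's built-in sqlite3 module. This is a sketch on invented sample rows (the `public.` schema prefix is dropped because SQLite has no schemas):

```python
import sqlite3

# In-memory SQLite database with the same columns as the employee table
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        id INTEGER, first_name TEXT, last_name TEXT,
        department_id INTEGER, salary INTEGER
    )
""")
# Invented sample rows, two salary entries per employee
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [
        (2, "Ava", "Lee", 1003, 90000),
        (2, "Ava", "Lee", 1003, 95000),
        (1, "Max", "Ford", 1001, 70000),
        (1, "Max", "Ford", 1001, 75000),
    ],
)

# Same GROUP BY / ORDER BY query as the PostgreSQL version
rows = conn.execute("""
    SELECT id, first_name, last_name, department_id,
           MAX(salary) AS latest_salary
    FROM employee
    GROUP BY id, first_name, last_name, department_id
    ORDER BY id
""").fetchall()
print(rows)
```

Each employee comes back once, with the higher of the two salaries, ordered by id.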

Solve problem using Spark
As discussed in the earlier blog, we will create the CSV file in JupyterLab and load the CSV data into a Spark DataFrame as below.
# First, load the required libraries and start a Spark session
from pyspark.sql import SparkSession

# Start the Spark session
spark = SparkSession.builder.appName("problem1").getOrCreate()
# The SparkSession itself can run SQL queries; keep the old name as an alias
sqlContext = spark
# Don't show warnings, only errors
spark.sparkContext.setLogLevel("ERROR")
# Load the CSV file into a DataFrame, inferring column types from the data
employeedf = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("employee.csv")
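Before handing the file to Spark, the CSV can be sanity-checked with Python's standard-library csv module. A minimal sketch: the header layout matches what the Spark loader above expects, and the rows written here are invented sample data.

```python
import csv

# Write a tiny sample employee.csv (invented rows, same header layout
# the Spark loader expects), then read it back.
with open("employee.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "first_name", "last_name", "department_id", "salary"])
    writer.writerow([1, "Max", "Ford", 1001, 70000])
    writer.writerow([1, "Max", "Ford", 1001, 75000])

with open("employee.csv", newline="") as f:
    records = list(csv.DictReader(f))

print(len(records), records[0]["first_name"])
```

Note that DictReader returns every field as a string; it is Spark's inferSchema option that turns the numeric columns into integers.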

Once the data is loaded into the DataFrame, we will check its schema and contents.

Solve problem using PySpark functions
Here too, we will break the problem down and do the easy part first: selecting the required columns and ordering by employee ID with PySpark's orderBy() function.

In the next step, we will use the max() aggregate to get each employee's maximum salary. Just as in SQL, the aggregate has to be paired with a groupBy().
So our final code will be as below
# Now we will use group by function and get max salary for each employee
employeedf.groupBy("id","first_name","last_name","department_id").max("salary").orderBy("id").show(n=100)
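The groupBy / max / orderBy chain maps onto plain Python as sort-then-group, which can make the logic easier to reason about. A sketch on invented sample rows:

```python
from itertools import groupby

# Invented sample rows: (id, first_name, last_name, department_id, salary)
rows = [
    (2, "Ava", "Lee", 1003, 90000),
    (1, "Max", "Ford", 1001, 75000),
    (2, "Ava", "Lee", 1003, 95000),
    (1, "Max", "Ford", 1001, 70000),
]

# Sort by the grouping key (this also gives us the ORDER BY id for free),
# then take the max salary within each group.
rows.sort(key=lambda r: r[:4])
result = [
    key + (max(r[4] for r in group),)
    for key, group in groupby(rows, key=lambda r: r[:4])
]
print(result)
```

Spark does the same thing, except the grouping happens in parallel across partitions and the sort is applied at the end.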

Solve problem using Spark SQL
For this, we will first register the DataFrame as a temporary view (or a Hive table), and then use a SELECT query to check some sample data in it.
# Now we will solve the same problem using Spark SQL
# Register the DataFrame as a temporary view
employeedf.createOrReplaceTempView("tmpEmployee")
# Now we have a SQL view and can write SQL queries on top of it,
# for example a simple SELECT
spark.sql("SELECT * FROM tmpEmployee").show()

We have already solved this problem in PostgreSQL, and the same query works here. Let's run it through Spark SQL.
# Now we will write the query to get the max salary for each employee,
# using SQL GROUP BY and ORDER BY
spark.sql("SELECT id, first_name, last_name, MAX(salary) AS LatestSalary, department_id \
FROM tmpEmployee \
GROUP BY id, first_name, last_name, department_id \
ORDER BY id").show(n=100)

We have successfully solved this problem and seen how to write the same query in PostgreSQL, PySpark, and Spark SQL.
Please follow for learning more about Data Engineering.