Data Engineering Problem 1 (Get Max Salary of each employee)

Introduction:

As discussed earlier, we will start solving Data Engineering problems using SQL (PostgreSQL and MySQL), NoSQL (MongoDB or Cassandra), and Apache Spark (PySpark and Spark SQL).

We will start from very easy SQL problems and work up to difficult ones. We will also solve problems regarding data loads (batch, replication, and streaming).

Problem Statement

We have a table with employees and their salaries, however, some of the records are old and contain outdated salary information. Find the current salary of each employee assuming that salaries increase each year. Output their id, first name, last name, department ID, and current salary. Order your list by employee ID in ascending order.

Problem Difficulty Level: Medium

Data Structure

[Image: employee table structure — id, first_name, last_name, salary, department_id]

Data for this problem

In CSV Format

Please also clone the GitHub repo below for the SQL queries and the Jupyter notebook.

https://github.com/developershomes/DataEngineeringProblems/tree/main/Problem%201

Please see the earlier blog post to learn how to load CSV data into PostgreSQL and JupyterLab.

https://developershome.blog/2023/02/05/data-engineering-problem-0-employees-with-salary-more-than-100k/

Solve problem using PostgreSQL

Once the table is created and the data is loaded, we can view the data in the table as below.

We can see that the table has a total of 95 rows, and each employee (identified by first and last name) has two salary entries.

Now, let's break down the problem:

  1. We need each employee's latest salary, and per the problem statement the latest salary is always the highest one.
  2. So, for each employee, we need to find the maximum salary.
  3. Along with the max salary, we also need to display the id, first name, last name, and department ID.
  4. Finally, we need to order the results by employee ID.
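The breakdown above amounts to a group-by-max. As a quick plain-Python sketch of the idea, on a few made-up rows (the names and salaries here are hypothetical, not the actual dataset):

```python
# Plain-Python sketch of "max salary per employee", on made-up rows
rows = [
    {"id": 1, "first_name": "Todd", "last_name": "Wilson", "department_id": 10, "salary": 106119},
    {"id": 1, "first_name": "Todd", "last_name": "Wilson", "department_id": 10, "salary": 110000},
    {"id": 2, "first_name": "Justin", "last_name": "Simon", "department_id": 20, "salary": 128922},
    {"id": 2, "first_name": "Justin", "last_name": "Simon", "department_id": 20, "salary": 130000},
]

latest = {}
for r in rows:
    # The non-salary columns identify the employee, exactly like the GROUP BY columns
    key = (r["id"], r["first_name"], r["last_name"], r["department_id"])
    # Keep the maximum salary seen so far for each employee
    latest[key] = max(latest.get(key, 0), r["salary"])

# Order by employee id, as the problem asks
for key in sorted(latest, key=lambda k: k[0]):
    print(*key, latest[key])
```

This is exactly what the SQL and PySpark solutions below do, just expressed with a dictionary keyed by the grouping columns.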

Let's start with the easy bit: deciding which columns we need to display.

Looking at the duplicate rows, all the other column values are the same. For example, for Todd, the id, last name, and department ID match across both rows.

To get the maximum salary, we use the MAX() function. When we use an aggregate function like MAX(), we also need to GROUP BY every non-aggregated column we want to display — in this case, all of them.

SELECT id, first_name, last_name, department_id , MAX(salary) AS latest_salary
	FROM public.employee
	GROUP BY id, first_name, last_name, department_id;

If we use MAX() without the GROUP BY clause, PostgreSQL raises an error that suggests the fix, along the lines of: column "employee.id" must appear in the GROUP BY clause or be used in an aggregate function.

Now, with GROUP BY, the query returns one row per employee.

One last thing: we need to order the result by employee ID, and for that we use the ORDER BY clause.

Our final query will be as below.

SELECT id, first_name, last_name, department_id , MAX(salary) AS latest_salary
	FROM public.employee
	GROUP BY id, first_name, last_name, department_id
	ORDER BY id;

Solve problem using Spark

As discussed in the earlier blog post, we will create the CSV file in JupyterLab and load it into a Spark DataFrame as below.

# First, load the required libraries and start a Spark session
from pyspark.sql import SparkSession

# Start the Spark session
spark = SparkSession.builder.appName("problem1").getOrCreate()
# Since Spark 2.x the SparkSession itself runs SQL; keep the old name as an alias
sqlContext = spark
# Show only errors, not warnings
spark.sparkContext.setLogLevel("ERROR")

# Load the CSV file into a DataFrame
employeedf = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("employee.csv")

Once the data is loaded into the DataFrame, we can check its schema and contents with printSchema() and show().

Solve problem using PySpark functions

Here too, we will break the problem down and do the easy things first: selecting the required columns and ordering the result with PySpark's orderBy() function.

In the next step, we use the max() aggregate function to get each employee's maximum salary. Just as in SQL, an aggregate function must be paired with a grouping, which in PySpark is groupBy().

So our final code will be as below.

# Now we will use group by function and get max salary for each employee 
employeedf.groupBy("id","first_name","last_name","department_id").max("salary").orderBy("id").show(n=100)

Solve problem using Spark SQL

For this, we will first create a temporary view (or a Hive table). Then, using a SELECT query, we will check some sample data from the table.

# Now we will solve the same problem using Spark SQL
# Create a temporary view (or a Hive table)
employeedf.createOrReplaceTempView("tmpEmployee")

# Now we have a SQL table and can write SQL queries on top of it
# For example, a SELECT on the table
spark.sql("SELECT * FROM tmpEmployee").show()

We already solved this problem in PostgreSQL; the same query works here, written in Spark SQL form.

# Now we will write the query to get the max salary for each employee,
# using SQL GROUP BY and ORDER BY
spark.sql("SELECT id, first_name, last_name, MAX(salary) AS latest_salary, department_id \
           FROM tmpEmployee \
           GROUP BY id, first_name, last_name, department_id \
           ORDER BY id").show(n=100)

We have successfully solved this problem and learned how to write the same query in PostgreSQL, PySpark, and Spark SQL.

Please follow to learn more about Data Engineering.
