In this article, we are discussing the process of getting the latest records for each group using SQL.
For the demonstration, we have records of rain data for different cities, on different dates(check the table below).
We want to find the rain measurement of the latest entry date for each city.
Sample Data
Here is the sample data we are using here for this demo-
Table name: rain_data
In the table, there is entry_id, which is the primary key. There is city which is the name of the city, there is date, and rain_measurement.
entry_id | city | date | rain_measurement |
---|---|---|---|
1 | city1 | 2012-01-01 | 1.0 |
2 | city1 | 2012-02-02 | 1.0 |
3 | city1 | 2012-03-03 | 1.3 |
4 | city2 | 2012-01-01 | 1.0 |
5 | city2 | 2012-02-02 | 3.6 |
6 | city2 | 2012-03-03 | 1.8 |
7 | city3 | 1012-01-01 | 1.0 |
8 | city3 | 2012-02-02 | 2.1 |
9 | city3 | 2012-03-03 | 1.0 |
10 | city3 | 2012-03-04 | 0.9 |
Use the following query to create the table and populate data, if you want to follow along.
CREATE TABLE `rain_data` (
`entry_id` int(11) NOT NULL AUTO_INCREMENT,
`city` varchar(100) DEFAULT NULL,
`date` date DEFAULT NULL,
`rain_measurement` decimal(10,1) DEFAULT NULL,
PRIMARY KEY (`entry_id`)
) ENGINE=InnoDB;
INSERT INTO `rain_data`(`entry_id`,`city`,`date`,`rain_measurement`) values
(1,'city1','2012-01-01',1.0),
(2,'city1','2012-02-02',1.0),
(3,'city1','2012-03-03',1.3),
(4,'city2','2012-01-01',1.0),
(5,'city2','2012-02-02',3.6),
(6,'city2','2012-03-03',1.8),
(7,'city3','1012-01-01',1.0),
(8,'city3','2012-02-02',2.1),
(9,'city3','2012-03-03',1.0),
(10,'city3','2012-03-04',0.9);
Expected Result
Here is the final result we are expecting, for the above sample data-
entry_id | city | date | rain_measurement |
---|---|---|---|
3 | city1 | 2012-03-03 | 1.3 |
6 | city2 | 2012-03-03 | 1.8 |
10 | city3 | 2012-03-04 | 0.9 |
Here are the methods we are discussing here, for getting the latest records for each group-
- Method #1: Using Join and Subquery
- Method #2: Using ROW_NUMBER()
- Method #3: Using LEFT JOIN with the Same Table
Let’s check the solution step-by-step.
Method #1: Using Join and Subquery
In this method, we will use GROUP BY
and JOIN
. This is a general process and will work on any relational database system.
The following queries are written for MySQL, but the same process and steps will work for any database.
Let’s discuss the process step by step.
Step #1: Group by and get the MAX date for each
If we do a GROUP BY with any(one or more) column (in this case, we are performing a GROUP BY city), we will get distinct cities, and using MAX(date) we will get the maximum date for that city:
SELECT city, MAX(date) AS max_date FROM rain_data GROUP BY city;
Using this query we will get results like the below:
city | date |
---|---|
city1 | 2012-03-03 |
city2 | 2012-03-03 |
city3 | 2012-03-04 |
This query gives us each city and the MAX date for that city.
Step #2: Join the main table with the subquery result
Now we will use it as a subquery and INNER JOIN with the main table. The join needs to be based on the columns that we selected in the previous query (city and max_date).
The final query will be like the below:
SELECT rd.*
FROM rain_data rd
INNER JOIN (SELECT city, MAX(`date`) AS max_date FROM rain_data GROUP BY city) grd
ON rd.city=grd.city
AND rd.date=grd.max_date;
Output
We will get entry_id, city, date, and rain_easurement for the last date for each city. The output will be something like this:
entry_id | city | date | rain_measurement |
---|---|---|---|
3 | city1 | 2012-03-03 | 1.3 |
6 | city2 | 2012-03-03 | 1.8 |
10 | city3 | 2012-03-04 | 0.9 |
Method #2: Using ROW_NUMBER()
In this method, we will use MySQL 8 Window Function ROW_NUMBER() to rank the rows of data for each city and based on the rank we will filter the latest row.
This method will not work in MySQL versions lower than 8. As ROW_NUMBER() window function is available in MySQL 8 (and above).
Let’s discuss the process step by step.
Step #1: Rank data using ROW_NUMBER()
Use the following query to rank data for each city based on date.
SELECT *,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY `date` DESC) AS rank_no
FROM rain_data
ORDER BY city ASC,
rank_no ASC
We are rating rows for each city based on the date, and ranking it in descending order, so the row which has the latest date will have the lowest rank (rank_no = 1) for each city. This query will give you result like the below:
entry_id | city | date | rain_measurement | rank_no |
---|---|---|---|---|
3 | city1 | 2012-03-03 | 1.3 | 1 |
2 | city1 | 2012-02-02 | 1.0 | 2 |
1 | city1 | 2012-01-01 | 1.0 | 3 |
6 | city2 | 2012-03-03 | 1.8 | 1 |
5 | city2 | 2012-02-02 | 3.6 | 2 |
4 | city2 | 2012-01-01 | 1.0 | 3 |
10 | city3 | 2012-03-04 | 0.9 | 1 |
9 | city3 | 2012-03-03 | 1.0 | 2 |
8 | city3 | 2012-02-02 | 2.1 | 3 |
7 | city3 | 1012-01-01 | 1.0 | 4 |
Step #2: Filter and take rows with rank_no= 1
Now use the filter rank_no = 1
in the select query, like below.
SELECT *
FROM
(SELECT *,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY `date` DESC) AS rank_no
FROM rain_data) rd
WHERE rd.rank_no = 1
Or you can write it in the format below if you prefer.
WITH ranked_rain_data AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY `date` DESC) AS rank_no
FROM rain_data
)
SELECT * FROM ranked_rain_data WHERE rank_no= 1;
Both will give the same result, like below.
Output
You will get the following output:
entry_id | city | date | rain_measurement | rank_no |
---|---|---|---|---|
3 | city1 | 2012-03-03 | 1.3 | 1 |
6 | city2 | 2012-03-03 | 1.8 | 1 |
10 | city3 | 2012-03-04 | 0.9 | 1 |
Method #3: Using LEFT JOIN with Same Table
Here is another approach we can take to achieve the same result. We are using LEFT JOIN here, and joining the same table.
The trick is to add a condition in the where clause that is date_from_first_table_ref < date_from_second_table_ref, and then filter only those rows that do not have any entry on the right.
Let’s check step by step-
Step #1: LEFT JOIN the same table
LEFT JOIN rain_data table with the same table. The join will be based on the “city” column, if the city column is the same then we are joining.
SELECT
rd1.*, rd2.*
FROM
rain_data AS rd1
LEFT JOIN
rain_data AS rd2
ON
rd1.city = rd2.city
The output will be as below. The number of rows will be large, that’s why we have omitted some rows in the table below-
entry_id | city | date | rain_measurement | entry_id | city | date | rain_measurement |
---|---|---|---|---|---|---|---|
1 | city1 | 2012-01-01 | 1.0 | 3 | city1 | 2012-03-03 | 1.3 |
1 | city1 | 2012-01-01 | 1.0 | 2 | city1 | 2012-02-02 | 1.0 |
1 | city1 | 2012-01-01 | 1.0 | 1 | city1 | 2012-01-01 | 1.0 |
2 | city1 | 2012-02-02 | 1.0 | 3 | city1 | 2012-03-03 | 1.3 |
2 | city1 | 2012-02-02 | 1.0 | 2 | city1 | 2012-02-02 | 1.0 |
2 | city1 | 2012-02-02 | 1.0 | 1 | city1 | 2012-01-01 | 1.0 |
… | … | … | … | … | … | … | … |
… | … | … | … | … | … | … | … |
… | … | … | … | … | … | … | … |
9 | city3 | 2012-03-03 | 1.0 | 7 | city3 | 1012-01-01 | 1.0 |
10 | city3 | 2012-03-04 | 0.9 | 10 | city3 | 2012-03-04 | 0.9 |
10 | city3 | 2012-03-04 | 0.9 | 9 | city3 | 2012-03-03 | 1.0 |
10 | city3 | 2012-03-04 | 0.9 | 8 | city3 | 2012-02-02 | 2.1 |
10 | city3 | 2012-03-04 | 0.9 | 7 | city3 | 1012-01-01 | 1.0 |
We are getting a large and weird result set here, that does not make any sense. As no filter was applied to the query.
Check the following steps, as we gradually achieve our desired result, by applying filters on this JOIN query.
Step #2: Add condition rd1.date < rb2.date
On the join condition, add condition rd1.date < rd2.date –
SELECT
rd1.*, rd2.*
FROM
rain_data AS rd1
LEFT JOIN
rain_data AS rd2
ON
rd1.city = rd2.city AND
rd1.date < rd2.date
We get a smaller result set as below, as we are only getting results that have smaller dates for the left part-
entry_id | city | date | rain_measurement | entry_id | city | date | rain_measurement |
---|---|---|---|---|---|---|---|
1 | city1 | 2012-01-01 | 1.0 | 3 | city1 | 2012-03-03 | 1.3 |
1 | city1 | 2012-01-01 | 1.0 | 2 | city1 | 2012-02-02 | 1.0 |
2 | city1 | 2012-02-02 | 1.0 | 3 | city1 | 2012-03-03 | 1.3 |
3 | city1 | 2012-03-03 | 1.3 | (NULL) | (NULL) | (NULL) | (NULL) |
4 | city2 | 2012-01-01 | 1.0 | 6 | city2 | 2012-03-03 | 1.8 |
4 | city2 | 2012-01-01 | 1.0 | 5 | city2 | 2012-02-02 | 3.6 |
5 | city2 | 2012-02-02 | 3.6 | 6 | city2 | 2012-03-03 | 1.8 |
6 | city2 | 2012-03-03 | 1.8 | (NULL) | (NULL) | (NULL) | (NULL) |
7 | city3 | 1012-01-01 | 1.0 | 10 | city3 | 2012-03-04 | 0.9 |
7 | city3 | 1012-01-01 | 1.0 | 9 | city3 | 2012-03-03 | 1.0 |
7 | city3 | 1012-01-01 | 1.0 | 8 | city3 | 2012-02-02 | 2.1 |
8 | city3 | 2012-02-02 | 2.1 | 10 | city3 | 2012-03-04 | 0.9 |
8 | city3 | 2012-02-02 | 2.1 | 9 | city3 | 2012-03-03 | 1.0 |
9 | city3 | 2012-03-03 | 1.0 | 10 | city3 | 2012-03-04 | 0.9 |
10 | city3 | 2012-03-04 | 0.9 | (NULL) | (NULL) | (NULL) | (NULL) |
The most important thing to notice here are the rows that have NULL on the right. As those rows do not have any large dates for the right part. Those rows are our actual target rows.
Step #3: Filter out rows which have NULL on the right
Let’s add a WHERE clause and only get rows which has null on the right site. Use the following query-
SELECT
rd1.*, rd2.*
FROM
rain_data AS rd1
LEFT JOIN
rain_data AS rd2
ON
rd1.city = rd2.city AND
rd1.date < rd2.date
WHERE
rd2.entry_id IS NULL
We will get results like the below-
entry_id | city | date | rain_measurement | entry_id | city | date | rain_measurement |
---|---|---|---|---|---|---|---|
3 | city1 | 2012-03-03 | 1.3 | (NULL) | (NULL) | (NULL) | (NULL) |
6 | city2 | 2012-03-03 | 1.8 | (NULL) | (NULL) | (NULL) | (NULL) |
10 | city3 | 2012-03-04 | 0.9 | (NULL) | (NULL) | (NULL) | (NULL) |
Step #4: Final Query
Only take the left part of the result using the following query.
This is the final query of this approach.
SELECT
rd1.*
FROM
rain_data AS rd1
LEFT JOIN
rain_data AS rd2
ON
rd1.city = rd2.city AND
rd1.date < rd2.date
WHERE
rd2.entry_id IS NULL
Output
Here is the final output-
entry_id | city | date | rain_measurement |
---|---|---|---|
3 | city1 | 2012-03-03 | 1.3 |
6 | city2 | 2012-03-03 | 1.8 |
10 | city3 | 2012-03-04 | 0.9 |
Lifesaver! Thank You.
Happy to help, Andre