Limits of DISTINCT ON
I had a problem recently where I needed to modify a script that would update 1 row per distinct values of a column to instead update 2 rows per grouping. The data in the table looked something like this.
Id | TypeA | TypeB | Flag |
---|---|---|---|
1 | A1 | B1 | |
2 | A1 | B1 | |
3 | A2 | B1 | |
4 | A2 | B1 | |
5 | A3 | B1 | |
6 | A3 | B1 | |
7 | A1 | B2 | |
8 | A1 | B2 | |
9 | A2 | B2 | |
10 | A2 | B2 | |
11 | A3 | B2 | |
12 | A3 | B2 |
This meant that the DISTINCT ON
in the following SQL was too limiting, and I would have to do something more creative.
UPDATE "Table"
SET "Flag" = '✔️'
WHERE "Id" IN (
SELECT DISTINCT ON (s."TypeB") s."Id"
FROM "Table" s
WHERE s."TypeA" = 'A1'
);
UPDATE "Table"
SET "Flag" = '✔️'
WHERE "Id" IN (
SELECT DISTINCT ON (s."TypeB") s."Id"
FROM "Table" s
WHERE s."TypeA" = 'A2'
);
This would update the table to look like this
Id | TypeA | TypeB | Flag |
---|---|---|---|
1 | A1 | B1 | ✔️ |
2 | A1 | B1 | |
3 | A2 | B1 | ✔️ |
4 | A2 | B1 | |
5 | A3 | B1 | |
6 | A3 | B1 | |
7 | A1 | B2 | ✔️ |
8 | A1 | B2 | |
9 | A2 | B2 | ✔️ |
10 | A2 | B2 | |
11 | A3 | B2 | |
12 | A3 | B2 |
But what I really wanted was to update 2 rows per TypeB
grouping, so that the table would look like this. Notice we don’t want to touch the A3
rows, nor any other TypeA
values other than A1
and A2
.
Id | TypeA | TypeB | Flag |
---|---|---|---|
1 | A1 | B1 | ✔️ |
2 | A1 | B1 | ✔️ |
3 | A2 | B1 | ✔️ |
4 | A2 | B1 | ✔️ |
5 | A3 | B1 | |
6 | A3 | B1 | |
7 | A1 | B2 | ✔️ |
8 | A1 | B2 | ✔️ |
9 | A2 | B2 | ✔️ |
10 | A2 | B2 | ✔️ |
11 | A3 | B2 | |
12 | A3 | B2 |
Using Window Functions
To update more than 1 row per TypeB
grouping, we’ll need both of these Postgres functions
ROW_NUMBER()
(see Window Functions Docs docs)PARTIION BY
(see Window Functions Tutorial tutorial)
To start we can number each row by TypeB
like this
SELECT ROW_NUMBER() OVER (PARTITION BY "TypeB"), "TypeB"
FROM "Table"
This returns something like
ROW_NUMBER | TypeB |
---|---|
1 | B1 |
2 | B1 |
… | B1 |
1 | B2 |
2 | B2 |
… | B2 |
Knowing that PARTITION BY
creates subsets of each group, and that ROW_NUMBER()
can index each row in each group starting with 1
, then we can do this in our update statement
UPDATE "Table"
SET "Flag" = '✔️'
WHERE "Id" IN (
SELECT t."Id"
FROM (
SELECT "Id", ROW_NUMBER() OVER (PARTITION BY "TypeB") AS row_number
FROM "Table"
WHERE "TypeA" = 'A1'
) t
WHERE t.row_number < 3
);
UPDATE "Table"
SET "Flag" = '✔️'
WHERE "Id" IN (
SELECT t."Id"
FROM (
SELECT "Id", ROW_NUMBER() OVER (PARTITION BY "TypeB") AS row_number
FROM "Table"
WHERE "TypeA" = 'A2'
) t
WHERE t.row_number < 3
);
This does it!
Reducing Repetition
How can I combine the two UPDATE
statements for both A1
and A2
? Postgres does have a FOR
loop that looks like this:
FOR i IN 1..10 LOOP
-- ...
END LOOP;
The FOR
loop does work, but using it is not as simple as you would hope. Postgres can run the FOR
inside of a DO
block that runs code for a given language. In the following snippet, the language is plpgsql
which is Postgres’ own procedural language. The DECLARE
block is where we can declare variables, and the BEGIN
block is where we can run our FOR
loop.
DO $$
DECLARE a_idx integer;
BEGIN
FOR a_idx IN 1..2 LOOP
UPDATE "Table"
SET "Flag" = '✔️'
WHERE "Id" IN (
SELECT s."Id"
FROM (
SELECT "Id", ROW_NUMBER() OVER (PARTITION BY "TypeB") AS row_number
FROM "Table"
WHERE "TypeA" = 'A' || a_idx
) s
WHERE s.row_number < 6
);
END LOOP;
END $$;
This is much better than having to write out the UPDATE
statement twice, and it’s also more flexible. If we wanted to update more rows per grouping, we could just change the WHERE
clause to WHERE s.row_number < 6
and so on. And if we wanted to update more TypeA
values, we could just change the FOR
loop to FOR a_idx IN 1..3 LOOP
and so on. If we wanted to loop over a list of strings instead of integers we could do that too, though, counterintuitively it does require a bit more verbosity.
DO $$
DECLARE
a_arr text[] := ARRAY['A1', 'A2'];
a text;
BEGIN
FOREACH a IN ARRAY a_arr LOOP
UPDATE "Table"
SET "Flag" = '✔️'
WHERE "Id" IN (
SELECT s."Id"
FROM (
SELECT "Id", ROW_NUMBER() OVER (PARTITION BY "TypeB") AS row_number
FROM "Table"
WHERE "TypeA" = a
) s
WHERE s.row_number < 6
);
END LOOP;
END $$;
Conclusion
DISTINCT ON
can be used to select 1 row per distinct values of a columnROW_NUMBER()
is a window function that can index rows in a given groupPARTITION BY
can be used to create groups of rowsFOR
loops can be used to run the same code multiple timesDO $$
can be used to run code in a given languageDECLARE
blocks can be used to declare variablesFOR ... IN ... LOOP
can be used to iterate over a range of valuesFOREACH ... IN ARRAY ... LOOP
can be used to iterate over an array of values||
can be used to concatenate stringstext[] := ARRAY[...]
can be used to declare and create an array of strings