How do I efficiently select the previous non-null value?
Postgresql Problem Overview
I have a table in Postgres that looks like this:
# select * from p;
id | value
----+-------
1 | 100
2 |
3 |
4 |
5 |
6 |
7 |
8 | 200
9 |
(9 rows)
And I'd like to query to make it look like this:
# select * from p;
id | value | new_value
----+-------+----------
1 | 100 |
2 | | 100
3 | | 100
4 | | 100
5 | | 100
6 | | 100
7 | | 100
8 | 200 | 100
9 | | 200
(9 rows)
I can already do this with a correlated subquery in the SELECT list, but on my real data (20k rows or more) it gets quite slow.
Is this possible with a window function? I'd love to use lag(), but Postgres doesn't support the IGNORE NULLS option.
select id, value, lag(value, 1) over (order by id) as new_value from p;
id | value | new_value
----+-------+-----------
1 | 100 |
2 | | 100
3 | |
4 | |
5 | |
6 | |
7 | |
8 | 200 |
9 | | 200
(9 rows)
Postgresql Solutions
Solution 1 - Postgresql
I found this answer for SQL Server that also works in Postgres. Having never done it before, I thought the technique was quite clever. Basically, a nested query builds a custom partition for the window function: a running sum over a CASE expression increments by 1 whenever the value is non-null and stays the same otherwise. As a result, every run of nulls lands in the same partition as the preceding non-null value, and first_value() then picks that value out. Here's the query:
SELECT
    id, value, value_partition,
    first_value(value) OVER (PARTITION BY value_partition ORDER BY id)
FROM (
    SELECT
        id,
        value,
        sum(CASE WHEN value IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY id) AS value_partition
    FROM p
) AS q
ORDER BY id
And the results:
id | value | value_partition | first_value
----+-------+-----------------+-------------
1 | 100 | 1 | 100
2 | | 1 | 100
3 | | 1 | 100
4 | | 1 | 100
5 | | 1 | 100
6 | | 1 | 100
7 | | 1 | 100
8 | 200 | 2 | 200
9 | | 2 | 200
(9 rows)
Solution 2 - Postgresql
You can create a custom aggregate function in Postgres. Here's an example for the int type:
CREATE FUNCTION coalesce_agg_sfunc(state int, value int) RETURNS int AS
$$
SELECT coalesce(value, state);
$$ LANGUAGE SQL;
CREATE AGGREGATE coalesce_agg(int) (
SFUNC = coalesce_agg_sfunc,
STYPE = int);
Then query as usual.
SELECT *, coalesce_agg(b) over w, sum(b) over w FROM y
WINDOW w AS (ORDER BY a);
a b coalesce_agg sum
- - ------------ ---
a 0 0 0
b ∅ 0 0
c 2 2 2
d 3 3 5
e ∅ 3 5
f 5 5 10
(6 rows)
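Applied to the table p from the question (assuming the coalesce_agg aggregate above has been created), the same idea produces the desired new_value column. One adjustment: the question wants each row to see only earlier rows, so the frame below ends at the previous row rather than the current one.

```sql
-- Carry the last non-null value forward. The frame ends at
-- 1 PRECEDING so each row only sees earlier values, matching
-- the question's desired new_value output.
SELECT id, value,
       coalesce_agg(value) OVER (ORDER BY id
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS new_value
FROM p;
```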
Solution 3 - Postgresql
Well, I can't guarantee this is the most efficient way, but it works:
SELECT id, value, (
SELECT p2.value
FROM p p2
WHERE p2.value IS NOT NULL AND p2.id <= p1.id
ORDER BY p2.id DESC
LIMIT 1
) AS new_value
FROM p p1 ORDER BY id;
The following index can improve the sub-query for large datasets:
CREATE INDEX idx_p_idvalue_nonnull ON p (id, value) WHERE value IS NOT NULL;
Assuming the value column is sparse (i.e. there are a lot of nulls), it will run fine.
Solution 4 - Postgresql
In my case I needed to maintain a running balance across non-trading days, which are only weekends and the occasional three-day weekend around a non-trading holiday.
If the number of empty days is pretty low, you can solve this problem via a CASE statement and a series of LAG window functions:
SELECT
    CASE
        WHEN balance IS NULL THEN
            -- A non-null balance must be found within the first 3 preceding rows
            CASE
                WHEN LAG(balance, 1) OVER () IS NOT NULL
                    THEN LAG(balance, 1) OVER ()
                WHEN LAG(balance, 2) OVER () IS NOT NULL
                    THEN LAG(balance, 2) OVER ()
                WHEN LAG(balance, 3) OVER () IS NOT NULL
                    THEN LAG(balance, 3) OVER ()
            END
        ELSE balance
    END
FROM daily_data;
Not practical for an unbounded problem, but a nice solution to be aware of for minor gaps. Simply add more "WHEN LAG(balance, n) OVER () IS NOT NULL" branches if your gaps can be longer.
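The cascade of WHEN branches can also be written more compactly with COALESCE, which returns its first non-null argument. A sketch under the same assumptions (a daily_data table; the trade_date ordering column is hypothetical, adjust to your schema):

```sql
-- COALESCE falls back through up to three preceding rows,
-- equivalent to the nested CASE above but shorter.
-- trade_date is an assumed column name for window ordering.
SELECT
    COALESCE(
        balance,
        LAG(balance, 1) OVER w,
        LAG(balance, 2) OVER w,
        LAG(balance, 3) OVER w
    ) AS balance
FROM daily_data
WINDOW w AS (ORDER BY trade_date);
```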
Solution 5 - Postgresql
with p (id, value) as (
values (1, 100),
(2, null),
(3, null),
(4, null),
(5, null),
(6, null),
(7, null),
(8, 200),
(9, null))
select *
, (json_agg(value) filter (where value notnull) over (order by id) ->> -1)::int
from p
;
Here json_agg with a FILTER clause accumulates all non-null values seen so far (in id order), and ->> -1 extracts the most recent one as text, which is then cast back to int.
Solution 6 - Postgresql
Another possibility is to build a running sum:
WITH CTE_Data("Company", "ValueDate", "Amount")
AS(
    SELECT 'Company', '2021-05-01', 1000 UNION
    SELECT 'Company', '2021-05-02', 1250 UNION
    SELECT 'Company', '2021-05-03', NULL UNION
    SELECT 'Company', '2021-05-04', NULL UNION
    SELECT 'Company', '2021-05-05', 7500 UNION
    SELECT 'Company', '2021-05-06', NULL UNION
    SELECT 'Company', '2021-05-07', 3200 UNION
    SELECT 'Company', '2021-05-08', 3400 UNION
    SELECT 'Company', '2021-05-09', NULL UNION
    SELECT 'Company', '2021-05-10', 7800
)
SELECT
    d."Company"
   ,d."ValueDate"
   ,d."Amount"
   ,d."Partition"
   ,SUM(d."Amount") OVER(PARTITION BY d."Company", d."Partition") AS "Missing"
FROM(
    SELECT
        d."Company"
       ,d."ValueDate"
       ,d."Amount"
       ,SUM(CASE WHEN d."Amount" IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY d."Company" ORDER BY d."ValueDate") AS "Partition"
    FROM CTE_Data AS d
) AS d
Solution 7 - Postgresql
You can use a last aggregate with FILTER to achieve what you need (FILTER requires at least PG 9.4). Note that this is an aggregate function, not the built-in last_value window function: FILTER only works with aggregates.
WITH base AS (
SELECT 1 AS id , 100 AS val
UNION ALL
SELECT 2 AS id , null AS val
UNION ALL
SELECT 3 AS id , null AS val
UNION ALL
SELECT 4 AS id , null AS val
UNION ALL
SELECT 5 AS id , 200 AS val
UNION ALL
SELECT 6 AS id , null AS val
UNION ALL
SELECT 7 AS id , null AS val
)
SELECT id, val,
       last(val) FILTER (WHERE val IS NOT NULL)
           OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS new_val
FROM base
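Since last is not built into Postgres, it has to be defined first. A minimal sketch of such a custom aggregate (the Postgres wiki carries a similar first/last snippet):

```sql
-- The state transition function simply keeps the most recent
-- value fed to it; combined with FILTER (WHERE val IS NOT NULL)
-- the aggregate returns the last non-null value in the frame.
CREATE OR REPLACE FUNCTION last_agg_sfunc(state anyelement, value anyelement)
RETURNS anyelement AS $$
    SELECT value;
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE last (anyelement) (
    SFUNC = last_agg_sfunc,
    STYPE = anyelement
);
```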