How do I efficiently select the previous non-null value?

Postgresql

Postgresql Problem Overview


I have a table in Postgres that looks like this:

# select * from p;
 id | value 
----+-------
  1 |   100
  2 |      
  3 |      
  4 |      
  5 |      
  6 |      
  7 |      
  8 |   200
  9 |          
(9 rows)

And I'd like to query to make it look like this:

# select * from p;
 id | value | new_value
----+-------+----------
  1 |   100 |    
  2 |       |    100
  3 |       |    100
  4 |       |    100
  5 |       |    100
  6 |       |    100
  7 |       |    100
  8 |   200 |    100
  9 |       |    200
(9 rows)

I can already do this with a subquery in the select, but in my real data I have 20k or more rows and it gets to be quite slow.

Is this possible to do in a window function? I'd love to use lag(), but it doesn't seem to support the IGNORE NULLS option.

select id, value, lag(value, 1) over (order by id) as new_value from p;
 id | value | new_value
----+-------+-----------
  1 |   100 |      
  2 |       |       100
  3 |       |      
  4 |       |
  5 |       |
  6 |       |
  7 |       |
  8 |   200 |
  9 |       |       200
(9 rows)

Postgresql Solutions


Solution 1 - Postgresql

I found this answer for SQL Server that also works in Postgres. Having never done it before, I thought the technique was quite clever. Basically, he creates a custom partition for the windowing function by using a case statement inside of a nested query that increments a sum when the value is not null and leaves it alone otherwise. This allows one to delineate every null section with the same number as the previous non-null value. Here's the query:

SELECT
  id, value, value_partition, first_value(value) over (partition by value_partition order by id)
FROM (
  SELECT
    id,
    value,
    sum(case when value is null then 0 else 1 end) over (order by id) as value_partition

  FROM p
  ORDER BY id ASC
) as q

And the results:

 id | value | value_partition | first_value
----+-------+-----------------+-------------
  1 |   100 |               1 |         100
  2 |       |               1 |         100
  3 |       |               1 |         100
  4 |       |               1 |         100
  5 |       |               1 |         100
  6 |       |               1 |         100
  7 |       |               1 |         100
  8 |   200 |               2 |         200
  9 |       |               2 |         200
(9 rows)

Solution 2 - Postgresql

You can create a custom aggregate function in Postgres. Here's an example for the int type:

CREATE FUNCTION coalesce_agg_sfunc(state int, value int) RETURNS int AS
$$
    SELECT coalesce(value, state);
$$ LANGUAGE SQL;

CREATE AGGREGATE coalesce_agg(int) (
    SFUNC = coalesce_agg_sfunc,
    STYPE  = int);

Then query as usual.

SELECT *, coalesce_agg(b) over w, sum(b) over w FROM y
  WINDOW w AS (ORDER BY a);

a b coalesce_agg sum 
- - ------------ ---
a 0            0   0
b0   0
c 2            2   2
d 3            3   5
e ∅            3   5
f 5            5  10
(6 rows)

Solution 3 - Postgresql

Well, I can't guarantee this is the most efficient way, but works:

SELECT id, value, (
    SELECT p2.value
    FROM p p2
    WHERE p2.value IS NOT NULL AND p2.id <= p1.id
    ORDER BY p2.id DESC
    LIMIT 1
) AS new_value
FROM p p1 ORDER BY id;

The following index can improve the sub-query for large datasets:

CREATE INDEX idx_p_idvalue_nonnull ON p (id, value) WHERE value IS NOT NULL;

Assuming the value is sparse (e.g. there are a lot of nulls) it will run fine.

Solution 4 - Postgresql

In my case I needed to maintain a running balance on non-trading days, which is only the weekends and occasionally a three-day weekend in the case of a non-trading holiday

If the number of empty days is pretty low, you can solve this problem via a CASE statement and a series of LAG window functions:

SELECT
    CASE
        WHEN balance IS NULL THEN
            -- A non-null balance must be found within the first 3 preceding rows
            CASE
                WHEN LAG(balance, 1) OVER () IS NOT NULL
                  THEN LAG(balance, 1) OVER ()
                WHEN LAG(balance, 2) OVER () IS NOT NULL
                  THEN LAG(d.balance, 2) OVER ()
                WHEN LAG(balance, 3) OVER () IS NOT NULL
                  THEN LAG(balance, 3) OVER ()
                END
        ELSE balance
    END
FROM daily_data;

Not practical for an unbounded problem, but a nice solution to be aware of for minor gaps. Simply add more "WHEN LAG(, x) ..." clauses if necessary. I was fortunate that I only need to do this with one column and that this solution unblocked me from my aim

Solution 5 - Postgresql

with p (id, value) as (
    values (1, 100),
           (2, null),
           (3, null),
           (4, null),
           (5, null),
           (6, null),
           (7, null),
           (8, 200),
           (9, null))
select *
     , (json_agg(value) filter (where value notnull) over (order by id) ->> -1)::int
from p
;

Then we will use aggregate function with filter option.

Solution 6 - Postgresql

Another possibility is to built a sum:

WITH CTE_Data(Company, ValueDate, Amount)
AS(
	SELECT 'Company', '2021-05-01', 1000	UNION
	SELECT 'Company', '2021-05-02', 1250	UNION
	SELECT 'Company', '2021-05-03', NULL	UNION
	SELECT 'Company', '2021-05-04', NULL	UNION
	SELECT 'Company', '2021-05-05', 7500	UNION
	SELECT 'Company', '2021-05-06', NULL	UNION
	SELECT 'Company', '2021-05-07', 3200	UNION
	SELECT 'Company', '2021-05-08', 3400	UNION
	SELECT 'Company', '2021-05-09', NULL	UNION
	SELECT 'Company', '2021-05-10', 7800
)

SELECT 
	 d.[Company]
	,d.[ValueDate]
	,d.[Amount]
	,d.[Partition]
	,SUM(d.[Amount]) OVER(PARTITION BY d.[Company], d.[Partition]) AS [Missing]
FROM(
	SELECT
		 d.[Company]
		,d.[ValueDate]
		,d.[Amount]
		,SUM(CASE WHEN d.[Amount] IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY d.[Company] ORDER BY d.[ValueDate]) AS [Partition]
	FROM CTE_Data AS d 
) AS d

Solution 7 - Postgresql

You can use LAST_VALUE with FILTER to achieve what you need (at least in PG 9.4)

WITH base AS (
SELECT 1 AS id , 100 AS val
UNION ALL
SELECT 2 AS id , null AS val
UNION ALL
SELECT 3 AS id , null AS val
UNION ALL
SELECT 4 AS id , null AS val
UNION ALL
SELECT 5 AS id , 200 AS val
UNION ALL
SELECT 6 AS id , null AS val
UNION ALL
SELECT 7 AS id , null AS val
)
SELECT id, val, last(val) FILTER (WHERE val IS NOT NULL) over(ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) new_val
  FROM base

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionadamlamarView Question on Stackoverflow
Solution 1 - PostgresqladamlamarView Answer on Stackoverflow
Solution 2 - PostgresqlSlobodan PejicView Answer on Stackoverflow
Solution 3 - PostgresqlMatheusOlView Answer on Stackoverflow
Solution 4 - PostgresqlAuthorOfTheSurfView Answer on Stackoverflow
Solution 5 - PostgresqlzoolusView Answer on Stackoverflow
Solution 6 - PostgresqlBettelburscheView Answer on Stackoverflow
Solution 7 - PostgresqlZoran StipanicevView Answer on Stackoverflow