PostgreSQL - transpose columns into rows
We receive a bunch of CSV files from our customers, import them, and transform the data. Quite often they arrive in a human-readable format in which a set of columns represents aggregate values over an organization. The following example illustrates this quite accurately:
school_id,grade_1_count,grade_2_count,grade_3_count
XX1,336,614,447
XX2,560,335,141
XX3,819,397,806
This is a very convenient format to read and to do some spot checking, but our API expects entries in a completely different format, with a single count per row. Hence we need to transform the data into the following:
school_id,grade_level,count
XX1,grade_1,336
XX1,grade_2,614
XX1,grade_3,447
XX2,grade_1,560
XX2,grade_2,335
XX2,grade_3,141
XX3,grade_1,819
XX3,grade_2,397
XX3,grade_3,806
Our SQL for this was cumbersome:
SELECT school_id, 'grade_1_count' AS key, grade_1_count AS count
FROM grade_counts
UNION ALL
SELECT school_id, 'grade_2_count' AS key, grade_2_count AS count
FROM grade_counts
UNION ALL
SELECT school_id, 'grade_3_count' AS key, grade_3_count AS count
FROM grade_counts;
We looked for an improvement and found that UNNEST can serve our purpose. It expands an array of values into rows:
SELECT UNNEST(ARRAY['grade_1_count', 'grade_2_count', 'grade_3_count']) AS key;
and produces the following output:
key
-------------
grade_1_count
grade_2_count
grade_3_count
With UNNEST, our original SQL transformation becomes clearer:
SELECT school_id,
       UNNEST(ARRAY['grade_1_count', 'grade_2_count', 'grade_3_count']) AS key,
       UNNEST(ARRAY[grade_1_count, grade_2_count, grade_3_count]) AS count
FROM grade_counts;
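This works because, since PostgreSQL 10, multiple set-returning functions in the SELECT list expand in lockstep, so the two arrays pair up row by row. A quick illustration with throwaway values:

SELECT UNNEST(ARRAY['a', 'b', 'c']) AS key,
       UNNEST(ARRAY[1, 2, 3]) AS count;

returns the pairs (a, 1), (b, 2), (c, 3). On versions before 10 the functions expand to the least common multiple of their lengths, which gives the same result here since both arrays are the same length.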
To test performance, we populated grade_counts with 1M dummy records.
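A minimal sketch of how such a table can be populated, assuming integer count columns; the random value range is arbitrary:

CREATE TABLE grade_counts (
    school_id     text PRIMARY KEY,
    grade_1_count integer,
    grade_2_count integer,
    grade_3_count integer
);

-- Generate 1M rows with random counts between 0 and 999.
INSERT INTO grade_counts
SELECT 'XX' || i,
       (random() * 999)::integer,
       (random() * 999)::integer,
       (random() * 999)::integer
FROM generate_series(1, 1000000) AS i;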
Let's run EXPLAIN ANALYZE to compare the two approaches:
UNION ALL approach
QUERY PLAN
Append  (cost=0.00..46218.00 rows=3000000 width=40) (actual time=0.021..855.042 rows=3000000 loops=1)
  ->  Seq Scan on grade_counts  (cost=0.00..15406.00 rows=1000000 width=40) (actual time=0.021..195.652 rows=1000000 loops=1)
  ->  Seq Scan on grade_counts grade_counts_1  (cost=0.00..15406.00 rows=1000000 width=40) (actual time=0.018..199.136 rows=1000000 loops=1)
  ->  Seq Scan on grade_counts grade_counts_2  (cost=0.00..15406.00 rows=1000000 width=40) (actual time=0.015..210.113 rows=1000000 loops=1)
Planning time: 0.109 ms
Execution time: 985.687 ms
UNNEST approach

QUERY PLAN
ProjectSet  (cost=0.00..525406.00 rows=100000000 width=40) (actual time=0.020..2134.683 rows=3000000 loops=1)
  ->  Seq Scan on grade_counts  (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.014..107.268 rows=1000000 loops=1)
Planning time: 0.061 ms
Execution time: 2263.288 ms
As we can see, UNNEST takes roughly twice as long (2263 ms vs 986 ms).
Postgres supports indexes on array columns, but unfortunately that does not help us here: our arrays are computed on the fly in the SELECT list rather than stored in a column.
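For reference, a stored array column can get a GIN index; this is a minimal sketch with a hypothetical table, not something applicable to our computed arrays:

-- Hypothetical table with a stored array column.
CREATE TABLE tagged_items (
    id   serial PRIMARY KEY,
    tags integer[]
);

-- GIN indexes support array operators such as @> (contains) and && (overlaps).
CREATE INDEX tagged_items_tags_idx ON tagged_items USING GIN (tags);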
Another alternative is the crosstab table function from the tablefunc extension. It accepts SQL as an input parameter, which in turn can be built dynamically. A crosstab-based solution would work and is much more scalable, but it is way too complex, so we didn't even consider it.
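For a sense of its shape, here is a minimal crosstab call; note that crosstab pivots in the opposite direction (category rows into columns), and the long_grade_counts table and the output column list below are hypothetical:

CREATE EXTENSION IF NOT EXISTS tablefunc;

-- Pivot (school_id, grade_level, count) rows into one column per grade.
SELECT *
FROM crosstab(
    $$ SELECT school_id, grade_level, count
       FROM long_grade_counts
       ORDER BY 1, 2 $$
) AS ct(school_id text, grade_1 integer, grade_2 integer, grade_3 integer);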
We made a compromise: on one side, a hundred lines of cumbersome SQL; on the other, a performance decrease. Fortunately, we run this SQL quite rarely - just several times a month. Besides, we operate on much smaller datasets, hence proportionally less total execution time. These two arguments tipped the scales in favor of UNNEST. So we got cleaner, more maintainable code in exchange for a reasonable performance penalty.