Igor Dobryn
about IT

PostgreSQL - transpose columns into rows

ETL

We receive a bunch of CSV files from our customers, import them, and transform the data. Quite often we receive them in a human-readable format where a set of columns represents aggregate values for an organization. The following example illustrates this quite accurately:

school_id,grade_1_count,grade_2_count,grade_3_count
XX1,336,614,447
XX2,560,335,141
XX3,819,397,806

 
This is a very convenient format to read and spot-check, but our API expects entries in a completely different format, with a single count per row. Hence we need to transform the data into the following:

school_id,grade_level,count
XX1,grade_1,336
XX1,grade_2,614
XX1,grade_3,447
XX2,grade_1,560
XX2,grade_2,335
XX2,grade_3,141
XX3,grade_1,819
XX3,grade_2,397
XX3,grade_3,806
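For illustration only, the same wide-to-long reshaping can be sketched in plain Python (the column names below are taken from the example above; this is not our production ETL code):

```python
import csv
import io

# Wide-format input as received from customers (same rows as the example above)
wide_csv = """school_id,grade_1_count,grade_2_count,grade_3_count
XX1,336,614,447
XX2,560,335,141
XX3,819,397,806
"""

def melt(rows, id_col, value_cols):
    """Turn one wide row per school into one (school_id, grade_level, count) row per grade."""
    for row in rows:
        for col in value_cols:
            # Strip the trailing "_count" to get the grade_level key,
            # e.g. "grade_1_count" -> "grade_1"
            yield {"school_id": row[id_col],
                   "grade_level": col[:-len("_count")],
                   "count": row[col]}

reader = csv.DictReader(io.StringIO(wide_csv))
long_rows = list(melt(reader, "school_id",
                      ["grade_1_count", "grade_2_count", "grade_3_count"]))
print(long_rows[0])  # {'school_id': 'XX1', 'grade_level': 'grade_1', 'count': '336'}
```

The same reshaping in the database itself is what the rest of this post is about.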

 
Our SQL for this transformation was cumbersome:

SELECT school_id, 'grade_1' AS grade_level, grade_1_count AS count FROM grade_counts
  UNION ALL
  SELECT school_id, 'grade_2' AS grade_level, grade_2_count AS count FROM grade_counts
  UNION ALL
  SELECT school_id, 'grade_3' AS grade_level, grade_3_count AS count FROM grade_counts;

 
 

Solution

We researched for an improvement and found that UNNEST can serve our purpose. It expands an array of values into a set of rows:

SELECT UNNEST(ARRAY['grade_1_count', 'grade_2_count', 'grade_3_count']) AS key;

 
and produces the following output:

key
grade_1_count
grade_2_count
grade_3_count

 
With UNNEST, our original transformation becomes much clearer:

SELECT school_id,
       UNNEST(ARRAY['grade_1', 'grade_2', 'grade_3']) AS grade_level,
       UNNEST(ARRAY[grade_1_count, grade_2_count, grade_3_count]) AS count
FROM grade_counts;
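A subtlety worth noting: this query relies on the two UNNEST calls advancing in lockstep. In PostgreSQL 10 and later, several set-returning functions in the same SELECT list are expanded row by row in parallel (with NULL padding if the arrays differ in length); before version 10, they were expanded to the least common multiple of their lengths, which could multiply rows unexpectedly. A minimal illustration:

```sql
-- Both arrays have three elements, so the two UNNESTs pair up element by element:
SELECT UNNEST(ARRAY['a', 'b', 'c']) AS key,
       UNNEST(ARRAY[1, 2, 3])       AS value;
-- key | value
-- a   | 1
-- b   | 2
-- c   | 3
```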

 
 

Performance

To test performance, we populated grade_counts with 1M dummy records and ran EXPLAIN ANALYZE on both approaches:

UNION ALL approach
QUERY PLAN
Append  (cost=0.00..46218.00 rows=3000000 width=40) (actual time=0.021..855.042 rows=3000000 loops=1)
  ->  Seq Scan on grade_counts  (cost=0.00..15406.00 rows=1000000 width=40) (actual time=0.021..195.652 rows=1000000 loops=1)
  ->  Seq Scan on grade_counts grade_counts_1  (cost=0.00..15406.00 rows=1000000 width=40) (actual time=0.018..199.136 rows=1000000 loops=1)
  ->  Seq Scan on grade_counts grade_counts_2  (cost=0.00..15406.00 rows=1000000 width=40) (actual time=0.015..210.113 rows=1000000 loops=1)
Planning time: 0.109 ms
Execution time: 985.687 ms

 

UNNEST approach
QUERY PLAN
ProjectSet  (cost=0.00..525406.00 rows=100000000 width=40) (actual time=0.020..2134.683 rows=3000000 loops=1)
  ->  Seq Scan on grade_counts  (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.014..107.268 rows=1000000 loops=1)
Planning time: 0.061 ms
Execution time: 2263.288 ms

 
As we can see, the UNNEST approach takes roughly twice as long.
PostgreSQL supports indexes on array columns, but unfortunately not on arrays computed on the fly, as in our query.
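For reference, a table like grade_counts can be populated with 1M dummy rows using generate_series; the table definition below is an assumption matching the CSV layout, not our actual schema:

```sql
-- Hypothetical table definition matching the CSV example
CREATE TABLE grade_counts (
    school_id     text,
    grade_1_count integer,
    grade_2_count integer,
    grade_3_count integer
);

-- Populate with 1M rows of random counts
INSERT INTO grade_counts
SELECT 'XX' || i,
       (random() * 1000)::integer,
       (random() * 1000)::integer,
       (random() * 1000)::integer
FROM generate_series(1, 1000000) AS i;
```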

 

Alternate solutions

There is also the crosstab table function, from the tablefunc extension. It accepts SQL as an input parameter, which in turn can be built dynamically. A crosstab-based solution would work and is much more scalable, but it is far more complex, so we did not even consider it.

 

Conclusions

We made a compromise: on one side we had a hundred lines of cumbersome SQL, on the other a performance decrease. Fortunately, we run this SQL quite rarely - just a few times a month - and on much smaller datasets, so the total execution time is proportionally smaller. These two arguments tip the scales in favor of UNNESTing: we get cleaner, more maintainable code in exchange for a reasonable performance cost.

PostgreSQL SQL