Count multiple columns in PostgreSQL
A short while ago we needed to display a report containing student counts across different demographics. Put simply, we wanted to show how many students are male, how many are female, how many require special education, how many participate in additional courses, plus a count breakdown by grade (in reality there was much more, about 35 aggregates).
We reviewed several approaches, but first things first. To ground the approaches below, here is the very simplistic schema of the students table:
CREATE TABLE students (
  id SERIAL NOT NULL,
  grade CHARACTER VARYING NOT NULL,
  gender BOOLEAN NOT NULL,
  additional_courses BOOLEAN NOT NULL,
  special_education BOOLEAN NOT NULL
);
grade holds the student's grade as one of the values '1', '2', or '3'.
All the measurements were made on the test data set of 1M records.
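The post doesn't include the script that seeded the table, but a minimal sketch that produces a comparable 1M-row dataset (the value distributions here are assumptions) might be:

INSERT INTO students (grade, gender, additional_courses, special_education)
SELECT
  (1 + floor(random() * 3))::int::varchar, -- grade: '1', '2' or '3'
  random() < 0.5,                          -- gender
  random() < 0.5,                          -- additional_courses
  random() < 0.5                           -- special_education
FROM generate_series(1, 1000000);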
COUNT + UNION
Select the column, count it, and group by it. A single-column count looks like the following:
SELECT grade, COUNT(*) AS count FROM students GROUP BY grade
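On the test dataset this returns one row per grade, something like this (the counts are made up for illustration):

 grade | count
-------+--------
 1     | 333552
 2     | 333225
 3     | 333224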
If we want to count multiple columns in a single query, we can UNION the individual counts:
SELECT 'grade_' || grade AS name, COUNT(*) AS count
FROM students
GROUP BY grade
UNION
SELECT 'gender_' || gender::VARCHAR AS name, COUNT(*) AS count
FROM students
GROUP BY gender
UNION
SELECT 'additional_courses' AS name, COUNT(*) AS count
FROM students
WHERE additional_courses
GROUP BY additional_courses
UNION
SELECT 'special_education' AS name, COUNT(*) AS count
FROM students
WHERE special_education
GROUP BY special_education
Planning time: 0.170 ms
Execution time: 1366.129 ms
Which will produce the following output:
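(The exact numbers of the original run are not all preserved; the two boolean counts below are taken from the plan that follows, while the grade and gender splits are illustrative.)

        name        | count
--------------------+--------
 grade_1            | 333552
 grade_2            | 333225
 grade_3            | 333224
 gender_true        | 500123
 gender_false       | 499878
 additional_courses | 499503
 special_education  | 499506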
With the following query plan:
HashAggregate  (cost=76618.47..76618.56 rows=9 width=40) (actual time=1366.022..1366.024 rows=7 loops=1)
  Group Key: "*SELECT* 1".name, "*SELECT* 1".count
  ->  Append  (cost=20406.01..76618.42 rows=9 width=40) (actual time=409.654..1366.002 rows=7 loops=1)
        ->  Subquery Scan on "*SELECT* 1"  (cost=20406.01..20406.08 rows=3 width=40) (actual time=409.654..409.658 rows=3 loops=1)
              ->  HashAggregate  (cost=20406.01..20406.05 rows=3 width=42) (actual time=409.652..409.656 rows=3 loops=1)
                    Group Key: students.grade
                    ->  Seq Scan on students  (cost=0.00..15406.01 rows=1000001 width=2) (actual time=0.014..96.769 rows=1000001 loops=1)
        ->  Subquery Scan on "*SELECT* 2"  (cost=20406.01..20406.06 rows=2 width=40) (actual time=381.279..381.282 rows=2 loops=1)
              ->  HashAggregate  (cost=20406.01..20406.04 rows=2 width=41) (actual time=381.277..381.278 rows=2 loops=1)
                    Group Key: students_1.gender
                    ->  Seq Scan on students students_1  (cost=0.00..15406.01 rows=1000001 width=1) (actual time=0.026..92.299 rows=1000001 loops=1)
        ->  Subquery Scan on "*SELECT* 3"  (cost=17910.68..17910.72 rows=2 width=40) (actual time=283.634..283.636 rows=1 loops=1)
              ->  HashAggregate  (cost=17910.68..17910.70 rows=2 width=41) (actual time=283.633..283.633 rows=1 loops=1)
                    Group Key: students_2.additional_courses
                    ->  Seq Scan on students students_2  (cost=0.00..15406.01 rows=500934 width=1) (actual time=0.013..158.632 rows=499503 loops=1)
                          Filter: additional_courses
                          Rows Removed by Filter: 500498
        ->  Subquery Scan on "*SELECT* 4"  (cost=17895.51..17895.56 rows=2 width=40) (actual time=291.421..291.422 rows=1 loops=1)
              ->  HashAggregate  (cost=17895.51..17895.53 rows=2 width=41) (actual time=291.421..291.422 rows=1 loops=1)
                    Group Key: students_3.special_education
                    ->  Seq Scan on students students_3  (cost=0.00..15406.01 rows=497901 width=1) (actual time=0.015..165.494 rows=499506 loops=1)
                          Filter: special_education
                          Rows Removed by Filter: 500495
Predictably, it does a Seq Scan for every column, loading all the aggregates in 1366.129 ms, which is relatively bearable. (Note the top-level HashAggregate: plain UNION also de-duplicates the combined rows, so UNION ALL would avoid that extra step.)
COUNT + GROUPING SETS
PostgreSQL allows specifying a set of grouping columns with GROUPING SETS, available since version 9.5. Let's build a query:
SELECT
  'grade_' || grade AS grade,
  'gender_' || gender::VARCHAR AS gender,
  additional_courses,
  special_education,
  COUNT(*)
FROM students
GROUP BY GROUPING SETS (
  (grade),
  (gender),
  (additional_courses),
  (special_education)
);
Planning time: 0.078 ms
Execution time: 4033.279 ms
The output is not as convenient: each row fills only the columns of its own grouping set and leaves the rest NULL, but it can still be processed in a high-level language:
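A sketch of the shape, with counts illustrative and consistent with the UNION example above:

  grade  |    gender    | additional_courses | special_education | count
---------+--------------+--------------------+-------------------+--------
 grade_1 |              |                    |                   | 333552
 grade_2 |              |                    |                   | 333225
 grade_3 |              |                    |                   | 333224
         | gender_false |                    |                   | 499878
         | gender_true  |                    |                   | 500123
         |              | f                  |                   | 500498
         |              | t                  |                   | 499503
         |              |                    | f                 | 500495
         |              |                    | t                 | 499506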
Let's take a look at the query plan:
GroupAggregate  (cost=128738.46..498736.47 rows=36 width=77) (actual time=1482.305..4025.664 rows=9 loops=1)
  Group Key: grade
  Sort Key: special_education
    Group Key: special_education
  Sort Key: additional_courses
    Group Key: additional_courses
  Sort Key: gender
    Group Key: gender
  ->  Sort  (cost=128738.46..131238.46 rows=1000001 width=5) (actual time=1114.102..1598.111 rows=1000001 loops=1)
        Sort Key: grade
        Sort Method: external merge  Disk: 14624kB
        ->  Seq Scan on students  (cost=0.00..15406.01 rows=1000001 width=5) (actual time=0.014..196.834 rows=1000001 loops=1)
It does a single Seq Scan, but it also has to Sort the whole table (an external merge spilling 14 MB to disk). Overall it is about 3 times slower than the multiple COUNT + UNION approach.
COUNT + FILTER
There is an ability to FILTER what is sent to an aggregate function; the FILTER clause was added in PostgreSQL 9.4:
SELECT
  COUNT(id) FILTER (WHERE grade = '1') AS grade_1,
  COUNT(id) FILTER (WHERE grade = '2') AS grade_2,
  COUNT(id) FILTER (WHERE grade = '3') AS grade_3,
  COUNT(id) FILTER (WHERE gender) AS male,
  COUNT(id) FILTER (WHERE NOT gender) AS female,
  COUNT(id) FILTER (WHERE additional_courses) AS additional_courses,
  COUNT(id) FILTER (WHERE special_education) AS special_education
FROM students
Planning time: 0.063 ms
Execution time: 745.993 ms
Query results are very convenient to read:
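Everything comes back as a single row with one column per aggregate (values illustrative, consistent with the examples above):

 grade_1 | grade_2 | grade_3 |  male  | female | additional_courses | special_education
---------+---------+---------+--------+--------+--------------------+-------------------
  333552 |  333225 |  333224 | 500123 | 499878 |             499503 |            499506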
Aggregate  (cost=40406.03..40406.04 rows=1 width=56) (actual time=745.939..745.939 rows=1 loops=1)
  ->  Seq Scan on students  (cost=0.00..15406.01 rows=1000001 width=9) (actual time=0.013..102.350 rows=1000001 loops=1)
A single Seq Scan, fantastic! It's the fastest and simplest query so far.
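As an aside (not from the original measurements): if you are stuck on a version older than 9.4, the same single-scan shape can be achieved with conditional aggregation, since COUNT ignores NULLs. A minimal sketch:

SELECT
  COUNT(CASE WHEN grade = '1' THEN 1 END) AS grade_1,
  COUNT(CASE WHEN gender THEN 1 END) AS male,
  COUNT(CASE WHEN additional_courses THEN 1 END) AS additional_courses
  -- ...and so on for the remaining aggregates
FROM students;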
Every approach listed above can serve its own purpose. For our goal of counting 35 aggregates, COUNT + UNION was acceptable but had several disadvantages:
- complex SQL building;
- result processing is not as straightforward;
- weak performance.
After switching to COUNT + FILTER we won on several fronts: the high-level SQL building became simpler, result processing got easier, and the query ran about 4x faster on our production database.