Igor Dobryn
about IT

Count multiple columns in PostgreSQL

A short while ago we needed to display a report containing student counts with different demographics. Simply, we wanted to display how many males, females, students who require special education and participate additional courses and counts breakdown by a grade (actually there was much more, about 35 aggregates).

We reviewed several approaches, but first things first. Just to describe the approaches below is typical dataset:

id grade gender additional_courses special_education
1 1 false true false
2 2 true false true
3 2 false false true
4 3 true true true
5 1 true true true

and also the very simplistic schema of students table:

CREATE TABLE students
(
  id                 SERIAL            NOT NULL,
  grade              CHARACTER VARYING NOT NULL,
  gender             BOOLEAN           NOT NULL,
  additional_courses BOOLEAN           NOT NULL,
  special_education  BOOLEAN           NOT NULL
);

Where the grade is an enumeration of student grade containing values 1, 2 or 3.

All the measurements were made on the test data set of 1M records.

 

COUNT + UNION

Select column, count and group it. Single column count looks like the following:

SELECT grade, COUNT(*) AS count FROM students GROUP BY grade

 
If we want to count multiple ones in a single query we should UNION multiple SELECTs:

SELECT 'grade_' || grade AS name, COUNT(*) AS count FROM students GROUP BY grade
UNION
SELECT 'gender_' || gender::VARCHAR AS name, COUNT(*) AS count FROM students GROUP BY gender
UNION
SELECT 'additional_courses' AS name, COUNT(*) AS count FROM students WHERE additional_courses GROUP BY additional_courses
UNION
SELECT 'special_education' AS name, COUNT(*) AS count FROM students WHERE special_education GROUP BY special_education

Planning time: 0.170 ms
Execution time: 1366.129 ms

Which will produce the following output:

name count
special_education 499506
grade_3 333465
additional_courses 499503
grade_2 333368
gender_false 498461
grade_1 333168
gender_true 501540

 
With the following query plan:

HashAggregate  (cost=76618.47..76618.56 rows=9 width=40) (actual time=1366.022..1366.024 rows=7 loops=1)
"  Group Key: ""*SELECT* 1"".name, ""*SELECT* 1"".count"
  ->  Append  (cost=20406.01..76618.42 rows=9 width=40) (actual time=409.654..1366.002 rows=7 loops=1)
        ->  Subquery Scan on "*SELECT* 1"  (cost=20406.01..20406.08 rows=3 width=40) (actual time=409.654..409.658 rows=3 loops=1)
              ->  HashAggregate  (cost=20406.01..20406.05 rows=3 width=42) (actual time=409.652..409.656 rows=3 loops=1)
                    Group Key: students.grade
                    ->  Seq Scan on students  (cost=0.00..15406.01 rows=1000001 width=2) (actual time=0.014..96.769 rows=1000001 loops=1)
        ->  Subquery Scan on "*SELECT* 2"  (cost=20406.01..20406.06 rows=2 width=40) (actual time=381.279..381.282 rows=2 loops=1)
              ->  HashAggregate  (cost=20406.01..20406.04 rows=2 width=41) (actual time=381.277..381.278 rows=2 loops=1)
                    Group Key: students_1.gender
                    ->  Seq Scan on students students_1  (cost=0.00..15406.01 rows=1000001 width=1) (actual time=0.026..92.299 rows=1000001 loops=1)
        ->  Subquery Scan on "*SELECT* 3"  (cost=17910.68..17910.72 rows=2 width=40) (actual time=283.634..283.636 rows=1 loops=1)
              ->  HashAggregate  (cost=17910.68..17910.70 rows=2 width=41) (actual time=283.633..283.633 rows=1 loops=1)
                    Group Key: students_2.additional_courses
                    ->  Seq Scan on students students_2  (cost=0.00..15406.01 rows=500934 width=1) (actual time=0.013..158.632 rows=499503 loops=1)
                          Filter: additional_courses
                          Rows Removed by Filter: 500498
        ->  Subquery Scan on "*SELECT* 4"  (cost=17895.51..17895.56 rows=2 width=40) (actual time=291.421..291.422 rows=1 loops=1)
              ->  HashAggregate  (cost=17895.51..17895.53 rows=2 width=41) (actual time=291.421..291.422 rows=1 loops=1)
                    Group Key: students_3.special_education
                    ->  Seq Scan on students students_3  (cost=0.00..15406.01 rows=497901 width=1) (actual time=0.015..165.494 rows=499506 loops=1)
                          Filter: special_education
                          Rows Removed by Filter: 500495

 
Predictably it makes Seq Scan for every column loading all aggregates in 1366.129 ms which is relatively bearable.

 

COUNT + GROUPING SETS

PostgeSQL allows specifying a set of grouping columns - GROUPING SETS. It became available since the 9.5 version. Let's build a query:

SELECT 'grade_' || grade            AS grade,
       'gender_' || gender::VARCHAR AS gender,
       additional_courses,
       special_education,
       COUNT(*)
FROM students
GROUP BY
  GROUPING SETS ( (grade),
                  (gender),
                  (additional_courses),
                  (special_education)
  );

Planning time: 0.078 ms
Execution time: 4033.279 ms

The output is not so convenient but it still can be processed in high-level language:

grade gender additional_courses special_education count
grade_1 NULL NULL NULL 333168
grade_2 NULL NULL NULL 333368
grade_3 NULL NULL NULL 333465
NULL NULL NULL false 500495
NULL NULL NULL true 499506
NULL NULL false NULL 500498
NULL NULL true NULL 499503
NULL gender_false NULL NULL 498461
NULL gender_true NULL NULL 501540

Let's take a look at the query plan:

GroupAggregate  (cost=128738.46..498736.47 rows=36 width=77) (actual time=1482.305..4025.664 rows=9 loops=1)
  Group Key: grade
  Sort Key: special_education
    Group Key: special_education
  Sort Key: additional_courses
    Group Key: additional_courses
  Sort Key: gender
    Group Key: gender
  ->  Sort  (cost=128738.46..131238.46 rows=1000001 width=5) (actual time=1114.102..1598.111 rows=1000001 loops=1)
        Sort Key: grade
        Sort Method: external merge  Disk: 14624kB
        ->  Seq Scan on students  (cost=0.00..15406.01 rows=1000001 width=5) (actual time=0.014..196.834 rows=1000001 loops=1)

 
It does single Seq Scan and also it does a Sort. In general 3 times worse than multiple SELECTs.

 

COUNT + FILTER

There is an ability to FILTER what is sent to the aggregate function. It was added on PostgreSQL 9.4

SELECT COUNT(id) FILTER (WHERE grade = '1')        AS grade_1,
       COUNT(id) FILTER (WHERE grade = '2')        AS grade_2,
       COUNT(id) FILTER (WHERE grade = '3')        AS grade_3,
       COUNT(id) FILTER (WHERE gender)             AS male,
       COUNT(id) FILTER (WHERE NOT gender)         AS female,
       COUNT(id) FILTER (WHERE additional_courses) AS additional_courses,
       COUNT(id) FILTER (WHERE special_education)  AS special_education
FROM students

Planning time: 0.063 ms
Execution time: 745.993 ms

Query results are very convenient to read:

grade_1 grade_2 grade_3 male female additional_courses special_education
333168 333368 333465 501540 498461 499503 499506
QUERY PLAN
Aggregate  (cost=40406.03..40406.04 rows=1 width=56) (actual time=745.939..745.939 rows=1 loops=1)
  ->  Seq Scan on students  (cost=0.00..15406.01 rows=1000001 width=9) (actual time=0.013..102.350 rows=1000001 loops=1)

 
Single Seq Scan, fantastic! It's the most robust query so far.

 

Conclusions

Every approach listed above can serve its own goal. According to our goal of counting 35 aggregates COUNT + UNION was acceptable but had several disadvantages:

  • complex SQL building;
  • result processing is not as straightforward;
  • weak performance.

After updating to COUNT + FILTER we won several gains. High-level SQL building became simpler, as well as result processing and we also got 4x faster query on our production database.

PostgreSQL Performance optimization COUNT Query analysis