PostgreSQL: array_agg() Function

Background

I was assigned the task of creating a summary table that involved the aggregation of strings into an array of strings. For the purposes of this article lets assume this table contains contact information. The table has four columns: name_first, name_last, email, and contact_categories.

The contact_categories column is of datatype VARCHAR(60)[], i.e. an array of strings. This column gave me the most trouble.

For my first (and very naive) attempt, I wrote a SQL function that aggregated categories row by row using the PostgreSQL function. For each row that matched my SELECT criteria, I performed an array_append(array_variable, string_to_add) and used the array_variable to insert data into row.

Iterating over hundreds of thousands of rows is never performant, and my wiser coworker let me in on an amazing PostgreSQL secret: the array_agg() method.

Executing the array_agg() function in a subquery (in place of my row-by-row function) took about five seconds to complete. Embarrasingly, the function I wrote as a first attempt took about three hours to execute.

Needless to say, the array_agg() function is powerful.

Demo Data

In order to explain PostgreSQL's array_agg() function, we'll use three tables of WWF wrestler data.

Below is a description of each table.

  1. The contacts table has contact information (name, email, etc.) for WWF wrestlers.
  2. The categories table has adjectives describing/categorize wrestlers.
  3. In the contacts_categories table, each row relates a wrestler to a category. Each wrestler can belong to more than one category, and each category can be applied to more than one wrestler.

Here's the data we'll work with:


contacts

contacts table: see sql at bottom of post for text data


categories

categories table: see sql at bottom of post for text data


contact_categories

contacts_categories table: see sql at bottom of post for text data


If you'd like to recreate these tables, I've provided the necessary SQL at the very bottom of this post. Paste the code into query tool (or in your command line, if that's more your flavor), and the tables above, along with the data, will be created in your database.

Using array_agg()

Below is the query I used when creating my summary table.

SELECT cc.contact_id,  
       con.name_first,
       con.name_last,
       array_agg(cat.name) as categories
FROM contacts_categories cc  
INNER JOIN categories cat ON cat.id = cc.category_id  
INNER JOIN contacts con ON con.id = cc.contact_id  
GROUP BY cc.contact_id,  
         con.name_first,
         con.name_last;

There's isn't much magic in the query. Here's the breakdown:

  1. SELECT desired columns — lines 1 - 4
  2. JOIN tables — lines 5 - 7
  3. GROUP data — lines 8 - 10

The interesting part is line 4, where PostgresSQL creates an array of all the categories (each is a string from the categories.name column) that are related to a contact_id.

The resulting table is below.

summary table

Benefits of Using the Array Datatype

Using the array datatype in summary tables is can be very helpful as applications frequently need arrays of data, and will create them after a database call if the database doesn't provide the array ahead of time.

By creating the arrays ahead of time in a summary table, you can reap performance gains and reduce the complexity of your backend code.


SQL to Generate Demo Data

CREATE TABLE contacts (  
    id SERIAL PRIMARY KEY,
    name_first VARCHAR(60),
    name_last VARCHAR(80),
    email VARCHAR(60)
);

CREATE TABLE categories (  
    id SERIAL PRIMARY KEY,
    name VARCHAR(60)
);

CREATE TABLE contacts_categories (  
    id SERIAL PRIMARY KEY,
    contact_id INTEGER,
    category_id INTEGER
);

INSERT INTO categories (id, name) VALUES (1, 'strong');  
INSERT INTO categories (id, name) VALUES (2, 'nimble');  
INSERT INTO categories (id, name) VALUES (3, 'sophisticated');  
INSERT INTO categories (id, name) VALUES (4, 'wild');  
INSERT INTO categories (id, name) VALUES (5, 'undead');


INSERT INTO contacts (id, name_first, name_last, email) VALUES (1, 'Ted', 'Dibiase', '[email protected]');  
INSERT INTO contacts (id, name_first, name_last, email) VALUES (2, 'Jimmy', 'Snuka', '[email protected]');  
INSERT INTO contacts (id, name_first, name_last, email) VALUES (4, 'James', 'Hellwig', '[email protected]');  
INSERT INTO contacts (id, name_first, name_last, email) VALUES (5, 'Randy', 'Savage', '[email protected]');  
INSERT INTO contacts (id, name_first, name_last, email) VALUES (6, 'Mark', 'Calaway', '[email protected]');  
INSERT INTO contacts (id, name_first, name_last, email) VALUES (7, 'Jake', 'Roberts', '[email protected]');  
INSERT INTO contacts (id, name_first, name_last, email) VALUES (3, 'Roddy', 'Piper', '[email protected]');


INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (1, 1, 2);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (2, 1, 3);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (3, 2, 2);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (4, 2, 4);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (5, 3, 1);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (6, 3, 4);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (7, 4, 1);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (8, 4, 4);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (9, 5, 1);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (10, 5, 4);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (11, 6, 1);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (13, 7, 1);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (14, 7, 4);  
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (12, 6, 5);

Check out my free videos on

Written by

Loren Stewart

Writing code with the wonderful devs at Speaker, a data-driven influencer marketing company. Also an educator for Codemore.

Los Angeles, CA

Tags
postgres