Background
I was assigned the task of creating a summary table that involved the aggregation of strings into an array of strings. For the purposes of this article lets assume this table contains contact information. The table has four columns: name_first
, name_last
, email
, and contact_categories
.
The contact_categories
column is of datatype VARCHAR(60)[]
, i.e. an array of strings. This column gave me the most trouble.
For my first (and very naive) attempt, I wrote a SQL function that aggregated categories row by row using the PostgreSQL function. For each row that matched my SELECT
criteria, I performed an array_append(array_variable, string_to_add)
and used the array_variable
to insert data into row.
Iterating over hundreds of thousands of rows is never performant, and my wiser coworker let me in on an amazing PostgreSQL secret: the array_agg()
method.
Executing the array_agg()
function in a subquery (in place of my row-by-row function) took about five seconds to complete. Embarrasingly, the function I wrote as a first attempt took about three hours to execute.
Needless to say, the array_agg()
function is powerful.
Demo Data
In order to explain PostgreSQL's array_agg()
function, we'll use three tables of WWF wrestler data.
Below is a description of each table.
- The contacts table has contact information (name, email, etc.) for WWF wrestlers.
- The categories table has adjectives describing/categorize wrestlers.
- In the contacts_categories table, each row relates a wrestler to a category. Each wrestler can belong to more than one category, and each category can be applied to more than one wrestler.
Here's the data we'll work with:
contacts
categories
contact_categories
If you'd like to recreate these tables, I've provided the necessary SQL at the very bottom of this post. Paste the code into query tool (or in your command line, if that's more your flavor), and the tables above, along with the data, will be created in your database.
Using array_agg()
Below is the query I used when creating my summary table.
SELECT cc.contact_id,
con.name_first,
con.name_last,
array_agg(cat.name) as categories
FROM contacts_categories cc
INNER JOIN categories cat ON cat.id = cc.category_id
INNER JOIN contacts con ON con.id = cc.contact_id
GROUP BY cc.contact_id,
con.name_first,
con.name_last;
There's isn't much magic in the query. Here's the breakdown:
SELECT
desired columns —lines 1 - 4
JOIN
tables —lines 5 - 7
GROUP
data —lines 8 - 10
The interesting part is line 4
, where PostgresSQL creates an array of all the categories (each is a string from the categories.name
column) that are related to a contact_id
.
The resulting table is below.
Benefits of Using the Array Datatype
Using the array
datatype in summary tables is can be very helpful as applications frequently need arrays of data, and will create them after a database call if the database doesn't provide the array ahead of time.
By creating the arrays ahead of time in a summary table, you can reap performance gains and reduce the complexity of your backend code.
SQL to Generate Demo Data
CREATE TABLE contacts (
id SERIAL PRIMARY KEY,
name_first VARCHAR(60),
name_last VARCHAR(80),
email VARCHAR(60)
);
CREATE TABLE categories (
id SERIAL PRIMARY KEY,
name VARCHAR(60)
);
CREATE TABLE contacts_categories (
id SERIAL PRIMARY KEY,
contact_id INTEGER,
category_id INTEGER
);
INSERT INTO categories (id, name) VALUES (1, 'strong');
INSERT INTO categories (id, name) VALUES (2, 'nimble');
INSERT INTO categories (id, name) VALUES (3, 'sophisticated');
INSERT INTO categories (id, name) VALUES (4, 'wild');
INSERT INTO categories (id, name) VALUES (5, 'undead');
INSERT INTO contacts (id, name_first, name_last, email) VALUES (1, 'Ted', 'Dibiase', '[email protected]');
INSERT INTO contacts (id, name_first, name_last, email) VALUES (2, 'Jimmy', 'Snuka', '[email protected]');
INSERT INTO contacts (id, name_first, name_last, email) VALUES (4, 'James', 'Hellwig', '[email protected]');
INSERT INTO contacts (id, name_first, name_last, email) VALUES (5, 'Randy', 'Savage', '[email protected]');
INSERT INTO contacts (id, name_first, name_last, email) VALUES (6, 'Mark', 'Calaway', '[email protected]');
INSERT INTO contacts (id, name_first, name_last, email) VALUES (7, 'Jake', 'Roberts', '[email protected]');
INSERT INTO contacts (id, name_first, name_last, email) VALUES (3, 'Roddy', 'Piper', '[email protected]');
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (1, 1, 2);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (2, 1, 3);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (3, 2, 2);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (4, 2, 4);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (5, 3, 1);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (6, 3, 4);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (7, 4, 1);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (8, 4, 4);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (9, 5, 1);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (10, 5, 4);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (11, 6, 1);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (13, 7, 1);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (14, 7, 4);
INSERT INTO contacts_categories (id, contact_id, category_id) VALUES (12, 6, 5);