Cleaning Whitespace in Postgres

School TeacherAs part of our recent release, we added every survey submitted about our agents to our agent profiles. To prep for this, we needed to do a mail merge with each survey response, which we exported from a Postgres database. The problem was, when our users submitted their answers, they used the skills their grade 3 English teacher taught them and wrote in paragraphs. But all the common export formats indicate new records by using line-breaks, so we needed a way to clean the whitespace from these surveys.

As it turns out, Postgres has some very useful regex functions that make string operations a breeze. But since no one wants to have to reconstruct the appropriate syntax every time they need to clean whitespace, you can make a Postgres function that wraps the functionality:

CREATE OR REPLACE FUNCTION clean_whitespace(to_clean text) RETURNS text AS $$
    BEGIN
        RETURN regexp_replace(to_clean, E'[ tnr]+', ' ', 'g');
    END;
$$ LANGUAGE plpgsql IMMUTABLE;

This replaces each group of whitespace in the argument with a single space. The immutable flag indicates that the function will have no side-effects, and thus allows it to be used in indices. Also notice that we only want to match occurrences of length at least 1 (by using “+” rather than “*”), because otherwise you end up with a space between every character!

Thanks to Thomas Kellerer on the postgres message board for pointing us in the right direction with regards to the arguments we needed.

(Photo credits: jamesdale10 on Flickr)

Discussion