Database Stored Procedures for an Automated Changelog

Ritesh Agrawal
4 min read · Jan 31, 2021

Assume you have a table where each row contains various features about customers. These features can be static (e.g., account creation time, gender) and/or dynamic (e.g., “number of searches”, “last visited on”). One of the challenges in maintaining such a table is that different processes update these features at different times. For instance, some of the dynamic features may be updated in real time, whereas others might come from Hadoop with a 24-hour latency. From a debugging perspective, a table that gets updated by different processes can be a nightmare: there is no way to travel back in time and determine how feature values changed. An obvious solution is to keep a changelog.

While there are many ways to create such a changelog, this post focuses on leveraging database stored procedures to create changelog entries automatically whenever features are inserted or updated. A stored procedure lets you define custom processing on data inside the database itself, and it can be triggered by regular CRUD (Create, Read, Update, Delete) operations. (In PostgreSQL, which is used in the examples below, this is done with a trigger function.) I particularly like this approach because it is tightly integrated with the database and keeps the application layer cleaner and leaner.

There are three steps to get the stored procedure working, illustrated by the example below. The aim is to observe every insert or update operation on one table, user_features, and create an entry in a changelog table so that we can keep track of how features in user_features changed over time.

Step 1: Define the table to observe
To demonstrate how stored procedures work, let’s create two tables. As defined below, the user_features table contains various features about customers. We also define another table, changelog, to capture changes (when and how) in customer features.

-- Define customer feature table
CREATE TABLE user_features (
    id INTEGER PRIMARY KEY,     -- customer id
    account_created_at DATE,    -- account creation date
    search_ct INTEGER,          -- number of searches
    click_ct INTEGER            -- number of clicks
);

-- Define changelog table to capture changes
CREATE TABLE changelog (
    customer_id INTEGER NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    data JSONB NOT NULL         -- store features that were updated
);

Step 2: Define the task to perform
Next, as shown below, we create a trigger function that will respond to changes in the user_features table and add an entry to the changelog table. The function has access to several useful variables, the two most important being OLD and NEW. As the names suggest, OLD contains the row as it was before the change (it is only defined for UPDATE and DELETE operations), while NEW contains a copy of the row after the change. Below, we convert the new row into a JSON blob and remove the “id” key, since we store it separately in the customer_id column. This will become clearer once we insert rows into our customer feature table below.

CREATE OR REPLACE FUNCTION add_changelog ()
RETURNS TRIGGER
AS $BODY$
BEGIN
    IF TG_OP IN ('UPDATE', 'INSERT') THEN
        INSERT INTO changelog (customer_id, data)
        VALUES (
            NEW.id,
            JSON_STRIP_NULLS(ROW_TO_JSON(NEW))::JSONB - 'id'
        );
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql
VOLATILE
SECURITY DEFINER
COST 100;

Step 3: Define triggers to connect the observed table and the task to perform
Lastly, we create separate triggers to bind the above function to the insert and the update operations on the user_features table.

CREATE TRIGGER log_insert
    AFTER INSERT ON user_features
    FOR EACH ROW
    EXECUTE PROCEDURE add_changelog ();

CREATE TRIGGER log_update
    AFTER UPDATE ON user_features
    FOR EACH ROW
    WHEN (OLD.* IS DISTINCT FROM NEW.*)
    EXECUTE PROCEDURE add_changelog ();
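If you later want to pause or remove the logging, the bindings can be torn down independently of the table (a sketch using standard PostgreSQL commands; the names match those defined above):

-- Remove the triggers, then the function they invoke
DROP TRIGGER IF EXISTS log_insert ON user_features;
DROP TRIGGER IF EXISTS log_update ON user_features;
DROP FUNCTION IF EXISTS add_changelog ();

Dropping only the triggers keeps the function around for re-use; dropping the function without the triggers would cause subsequent inserts and updates to fail.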

Now, the fun begins. Inserting a new record or updating an existing row in the user_features table automatically creates a new record in the changelog table.

INSERT INTO user_features VALUES (1, '2020-10-01', 1, 0);
SELECT * FROM changelog;

 customer_id | created_at       | data
-------------+------------------+---------------------------------------------------------------------
           1 | 2021-01-22 23... | {"click_ct": 0, "search_ct": 1, "account_created_at": "2020-10-01"}   <-- note this record was automatically created on insert

UPDATE user_features SET search_ct = 3 WHERE id = 1;
SELECT * FROM changelog;

 customer_id | created_at       | data
-------------+------------------+---------------------------------------------------------------------
           1 | 2021-01-22 23... | {"click_ct": 0, "search_ct": 1, "account_created_at": "2020-10-01"}
           1 | 2021-01-22 23... | {"click_ct": 0, "search_ct": 3, "account_created_at": "2020-10-01"}   <-- note this record was automatically created on update
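With the changelog populated, we get the time travel we were missing. For example, queries like the following (a sketch, using the tables defined above) reconstruct how a single customer’s features evolved, or fetch the last recorded value of one feature:

-- Full feature history for customer 1, oldest first
SELECT created_at, data
FROM changelog
WHERE customer_id = 1
ORDER BY created_at;

-- Latest recorded value of a single feature
SELECT data ->> 'search_ct' AS search_ct
FROM changelog
WHERE customer_id = 1
ORDER BY created_at DESC
LIMIT 1;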

That’s pretty awesome. You can get quite creative with trigger functions. For instance, you might have noticed that each changelog entry above contains a copy of all the features, even those that haven’t changed. The function can easily be modified, as below, to capture only the features that actually changed. Be mindful, however, that complex stored procedures increase the latency of your CRUD operations and consume more CPU on the database server.

-- THIS VERSION TRIES TO BE SPACE EFFICIENT:
-- ON UPDATE, WE ONLY STORE FEATURES THAT HAVE CHANGED
CREATE OR REPLACE FUNCTION add_changelog ()
RETURNS TRIGGER
AS $BODY$
BEGIN
    IF TG_OP = 'UPDATE' THEN
        INSERT INTO changelog (customer_id, data)
        SELECT
            NEW.id,
            JSON_OBJECT(ARRAY_AGG(NEW2.key), ARRAY_AGG(NEW2.value))::JSONB
        FROM
            JSON_EACH_TEXT(ROW_TO_JSON(OLD)) OLD2
            RIGHT JOIN JSON_EACH_TEXT(ROW_TO_JSON(NEW)) NEW2
                ON OLD2.key = NEW2.key
        WHERE
            ((OLD2.value IS NOT NULL AND NEW2.value IS NULL)
                OR (OLD2.value IS NULL AND NEW2.value IS NOT NULL)
                OR OLD2.value <> NEW2.value)
            AND NEW2.key <> 'id';
    ELSIF TG_OP = 'INSERT' THEN
        INSERT INTO changelog (customer_id, data)
        VALUES (NEW.id, JSON_STRIP_NULLS(ROW_TO_JSON(NEW))::JSONB - 'id');
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql
VOLATILE
SECURITY DEFINER
COST 100;
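With this version installed in place of the original function, an update writes only the keys whose values changed (a sketch; note that JSON_EACH_TEXT yields text values, so the diffed features are stored as JSON strings rather than numbers):

UPDATE user_features SET click_ct = 5 WHERE id = 1;
SELECT data FROM changelog ORDER BY created_at DESC LIMIT 1;
-- data now contains only the changed feature, e.g. {"click_ct": "5"}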


Ritesh Agrawal

Senior Machine Learning Engineer, Varo Money; Contributor and Maintainer of sklearn-pandas library