The Problem

This was one of the challenges I had to handle recently. I wanted to model a Definitive Finite Automata (DFA) using SQL. This reminded me of one of the courses I took in my bachelors (+). This is where I understood how to use DFA for a SQL problem which at first, seems unsolvable.

Let’s define the problem clearly. Note that this is a real-life business problem. I’m not talking about theory and a made-up problem.

Suppose that we have a table where we store contract events for each store. Customers can create, pause, resume or cancel their contracts.

So each contract has three states:

  • Active
  • Paused
  • Canceled

And these are the transitions between states via events:

I had a corresponding table as well for these state changes. Let’s call it contract_events.

So, this would be our table. Let’s name it contract_events as it captures events per shop, per customer and per contract.

shop_idcustomer_idcontract_idevent_dateevent
10155501C12024-01-01contract_created
10155501C12024-01-15contract_paused
10155501C22024-01-20contract_created
10155501C12024-02-01contract_resumed
10155501C12024-02-01contract_resumed
10155501C12024-03-01contract_canceled
10155501C22024-03-10contract_canceled
10299902C32024-01-05contract_created
10299902C32024-01-10contract_resumed
10299902C32024-02-15contract_paused
10299902C32024-03-10contract_paused
10299902C32024-03-20contract_canceled

Each shop has multiple customers and each of the customers may create multiple contracts. So, there are one-to-many relationships so far.

What do we needed?

Basically, we wanted to look at events table from subscriber point of view. Not at contract level. But at subscriber level.

Why? Because we wanted to count active subscribers. We wanted to be able to calculate churn subscriber or see how many new subscribers used the app at specific duration.

Also, we wanted to have those events stored in an actual database so that we can track, watch and monitor them. We wanted to write automated tests as well. Also, having physical tables can make the debug process much easier.

This is because sometimes our user doesn’t care about how many contract he has. He want to know how many people use his system. They would even if one of them has one contract and the other has 10 active contracts.

So, we had to model subscriber entity (instead of contract).

Each subscriber has also three states:

  • Active (If they have at least one active contract)
  • Frozen (if they have paused all of their contracts)
  • Unsubscribed (if they have canceled all of their contracts)

Also, we also want to mark new active customers differently from old active customers using subscriber_new_activated and subscriber_reactivated events.

This is very useful if we want to see how many new customers we have in a given period.

In essence, we wanted to create a subscriber events table which outputs events like this:

shop_idcustomer_idevent_dateeventLogic Explanation
101555012024-01-01subscriber_new_activatedFirst activation.
101555012024-01-15subscriber_frozenActive count hits 0 (C1 paused).
101555012024-01-20subscriber_reactivatedBecomes active via C2 while C1 still frozen.
skip555012024-02-01no rowC1 resumes. Active count 1 -> 2. Still ‘activated’.
skip555012024-03-01no rowC1 cancels. Active count 2 -> 1. Still ‘activated’.
101555012024-03-10subscriber_unsubscribedLast active contract (C2) cancels.
102999022024-01-05subscriber_new_activatedFirst activation.
102999022024-02-15subscriber_frozenMoves from active to frozen.
102999022024-03-20subscriber_unsubscribedFinal churn.

Now the problem seems well-defined. You can now think about how to get to subscribers_events table.

First Solution

My First solution was to simply keep a counter per event.

For example: In active state, we can say that created or resume event have +1 impact. and pause or cancel have -1 impact. (We can do this for any other state as well). We may as well increase the numbers of counters to whatever we want (it won’t make fundamental difference).

The code would be written as below.

WITH event_weights  
     AS (SELECT shop_id,  
                customer_id,  
                contract_id,  
                event_date,  
                event,  
                CASE  
         WHEN event IN ( 'contract_created', 'contract_resumed' ) THEN  1  
         WHEN event IN ( 'contract_paused', 'contract_canceled' ) THEN  -1  
                  ELSE 0  
                END AS active_weight
         FROM   contract_events),  
     running_totals  
     AS (SELECT *,  
                SUM(active_weight)  
                  over (  
                    PARTITION BY shop_id, customer_id  
                    ORDER BY event_date ROWS BETWEEN unbounded preceding AND  
                  CURRENT ROW ) AS  
                current_active_count
         FROM   event_weights)  
SELECT *  
FROM   running_totals;

I thought about these kinds of solutions a lot. Modifying counters via different case statements won’t work. This can not be a solution since the contract can have many many flows.

It can go between paused and canceled event indefinitely and the current active count would be negative! Also, the data isn’t perfect. Sometimes, we may have two create events at the same time for the same contract (one is redundant). In rare cases, we would have resume event after create event.

This is real data that we are talking about. Maybe there was a batch operation and the user does that. We can not blame him. But we should write the code so that in handles every scenario. Every scenario that a DFA supports.

How can we count how many active, paused and cancelled contract we have chronologically (at any given time)?

Think about this. How would you handle this? In general, how can you model any DFA using SQL?

My Solution

If you think about the problem a little, you would soon understand that we need to somehow record the previous state in order to understand what impact our action has.

We can’t simply count contract events.

The idea here is that for each event, we have to have its previous event. This can be done using a simple lag operation over contract_id (and shop_id for safety).

Then, we can calculate the Impact by aggregating the Inflow (entries) and subtracting the Outflow (exits) for each specific state. This is how we can correctly know the total count of each state at any given time.

For example for Active state this would be our state machine.

Previous EventCurrent EventLogicActive Impact
NULLcontract_createdEntry: Starts lifecycle as active.+1
contract_pausedcontract_resumedEntry: Returning to active from frozen.+1
contract_canceledcontract_resumedEntry: Returning to active from churned.+1
contract_createdcontract_pausedExit: Moving to frozen.-1
contract_resumedcontract_pausedExit: Moving to frozen.-1
contract_createdcontract_canceledExit: Moving to churned.-1
contract_resumedcontract_canceledExit: Moving to churned.-1
Same as CurrentAnyStutter: Redundant event; ignored.0
For paused state:
Previous EventCurrent EventLogicPaused Impact
contract_createdcontract_pausedEntry: Active contract is put on hold.+1
contract_resumedcontract_pausedEntry: Active contract is put on hold.+1
contract_canceledcontract_pausedEntry: Re-activating a churned user into frozen.+1
contract_pausedcontract_resumedExit: Moving back to active.-1
contract_pausedcontract_canceledExit: Moving from frozen to churned.-1
NULLcontract_createdNeutral: Created contracts skip this bucket.0

We can write similar state machines for canceled state. (But I won’t waste your time).

Then we convert these transition tables to simple case statements in SQL.

WITH distinct_contract_events
     AS (SELECT DISTINCT shop_id,
                         shop_currency,
                         shopify_domain,
                         customer_id,
                         contract_id,
                         event_date,
                         event
         FROM   contract_events
         GROUP  BY shop_id,
                   shop_currency,
                   shopify_domain,
                   customer_id,
                   contract_id,
                   event_date,
                   event),
     previous_events
     AS (SELECT DISTINCT shop_id,
                         shop_currency,
                         shopify_domain,
                         customer_id,
                         contract_id,
                         event_date,
                         event,
                         Row_number()
                           OVER (
                             partition BY shop_id, contract_id
                             ORDER BY event_date ) AS event_row_number,
                         Lag(event)
                           OVER (
                             partition BY shop_id, contract_id
                             ORDER BY event_date ) AS previous_event,
         FROM   distinct_contract_events)
SELECT shop_id,
       shop_currency,
       shopify_domain,
       customer_id,
       contract_id,
       event_date,
       event_row_number,
       event,
       previous_event,
       CASE
         WHEN event = previous_event THEN 0
         WHEN event = 'contract_resumed'
              AND previous_event = 'contract_created' THEN 0
         WHEN event IN ( 'contract_created', 'contract_resumed' ) THEN 1
         WHEN previous_event IN ( 'contract_created', 'contract_resumed' ) THEN
         -1
         ELSE 0
       END AS active_impact,
       CASE
         WHEN event = previous_event THEN 0
         WHEN event = 'contract_paused' THEN 1
         WHEN previous_event = 'contract_paused' THEN -1
         ELSE 0
       END AS paused_impact,
       CASE
         WHEN event = previous_event THEN 0
         WHEN event = 'contract_canceled' THEN 1
         WHEN previous_event = 'contract_canceled' THEN -1
         ELSE 0
       END AS canceled_impact
FROM   previous_events 
shop_idcustomer_idcontract_idevent_dateeventprev_eventactive_impactpaused_impactcanceled_impact
10155501C12024-01-01contract_creatednull100
10155501C12024-01-15contract_pausedcontract_created-110
10155501C22024-01-20contract_creatednull100
10155501C12024-02-01contract_resumedcontract_paused1-10
10155501C12024-02-01contract_resumedcontract_resumed000
10155501C12024-03-01contract_canceledcontract_resumed-101
10155501C22024-03-10contract_canceledcontract_created-101
10299902C32024-01-05contract_creatednull100
10299902C32024-01-10contract_resumedcontract_created000
10299902C32024-02-15contract_pausedcontract_resumed-110
10299902C32024-03-10contract_pausedcontract_paused000
10299902C32024-03-20contract_canceledcontract_paused0-11

At this state, we know the impact of each event for each state. If we sum those impacts we would know the running total per state. As simple as that.

SELECT   shop_id,  
         shopify_domain,  
         customer_id,  
         contract_id,  
         event_date,  
         event_row_number,  
         event,  
         previous_event,  
         sum(active_impact) OVER ( partition BY shop_id, customer_id ORDER BY event_date rows BETWEEN UNBOUNDED PRECEDING AND      CURRENT row )   AS running_active_contracts,  
         sum(paused_impact) OVER ( partition BY shop_id, customer_id ORDER BY event_date rows BETWEEN UNBOUNDED PRECEDING AND      CURRENT row )   AS running_paused_contracts,  
         sum(canceled_impact) OVER ( partition BY shop_id, customer_id ORDER BY event_date rows BETWEEN UNBOUNDED PRECEDING AND      CURRENT row ) AS running_canceled_contracts,  
         sum(  CASE  WHEN event = 'contract_created' THEN 1  ELSE 0 END ) 
         OVER ( partition BY shop_id, customer_id ORDER BY event_date rows BETWEEN UNBOUNDED PRECEDING AND      CURRENT row ) AS running_total_created,  
FROM contract_state_history
shop_idcustomer_idevent_datecontracteventActive CountPaused CountCanceled CountTotal Created
101555012024-01-01C1created1001
101555012024-01-15C1paused0101
101555012024-01-20C2created1102
101555012024-02-01C1resumed2002
101555012024-02-01C1resumed2002
101555012024-03-01C1canceled1012
101555012024-03-10C2canceled0022
102999022024-01-05C3created1001
102999022024-01-10C3resumed1001
102999022024-02-15C3paused0101
102999022024-03-10C3paused0101
102999022024-03-20C3canceled0011

Look at how beautiful the above table is. In chronological order, for each state, we know how many events are active, paused or cancelled. Total created column is also a sort of checksum column. The number of created events should be equal to sum of active + paused+canceled at any given time.

But wait. We needed to have subscriber events. We wanted to have subscriber information.

But The rest is just easy. Chronologically, for each date, we can check running active and paused and cancelled value to see what state the subscriber belongs to.

Also, we do need lag operation again to see if the customer is new or old! Here we use our modeling again to reach into results.

WITH subscriber_states  
     AS (SELECT shop_id,  
                shopify_domain,  
                customer_id,  
                event_date,  
                CASE  
                  WHEN running_active_contracts > 0 THEN 'activated'  
                  WHEN running_paused_contracts > 0 THEN 'frozen'  
                  ELSE 'unsubscribed'  
                END AS customer_state,  
         FROM   subscriber_contract_count),  
     subscriber_state_changes  
     AS (SELECT shop_id,  
                shopify_domain,  
                customer_id,  
                event_date,  
                customer_state  AS current_subscriber_state,  
                Lag(customer_state)  
                OVER (  
                    partition BY shop_id, customer_id  
                    ORDER BY event_date ) AS previous_subscriber_state  
         FROM   subscriber_states)  
SELECT shop_id,  
       shopify_domain,  
       customer_id,  
       CASE  
         WHEN current_subscriber_state = 'activated'  
              AND previous_subscriber_state IS NULL THEN  
         'subscriber_new_activated'  
         WHEN current_subscriber_state = 'activated'  
              AND previous_subscriber_state IS NOT NULL THEN  
         'subscriber_reactivated'  
         WHEN current_subscriber_state = 'frozen' THEN 'subscriber_frozen'  
         ELSE 'subscriber_unsubscribed'  
       END AS event,  
       event_date,  
FROM   subscriber_state_changes  
WHERE  previous_subscriber_state IS NULL  
        OR current_subscriber_state != previous_subscriber_state
shop_idcustomer_idevent_dateeventLogic Explanation
101555012024-01-01subscriber_new_activatedFirst activation (C1).
101555012024-01-15subscriber_frozenActive count hits 0 (C1 paused).
101555012024-01-20subscriber_reactivatedActive count > 0 via C2.
101555012024-03-10subscriber_unsubscribedActive & Paused counts hit 0.
102999022024-01-05subscriber_new_activatedFirst activation (C3).
102999022024-02-15subscriber_frozenActive count hits 0 (C3 paused).
102999022024-03-20subscriber_unsubscribedPaused count hits 0.

The beauty of this solution is not that we can apply this to any DFA. Any DFA has a set of states and a state transition diagram. We can simply do a lag operation to maintain memory. Then we map state transition diagram via simple case statements. Then we sum over them. This is how we know at each time, what entities are in what state.

Note: This Code was written using BigQuery Syntax. But I’m pretty sure It would easily apply to other databases as well.