Modeling DAU forecasts using cohort matrices



One problem in forecasting DAU for a product is that different groups of users can exhibit meaningfully different retention rates. An obvious example of this is users acquired from different channels, but it can also be true across geographies, across platforms (e.g., iOS vs. Android), and over time, with retention often degrading with each subsequent cohort.

In order to accommodate this effect, retention rates should be applied to DAU projections at the group level, with the projections then aggregated into a global forecast. This is the purpose of Theseus, my open-source Python library for marketing cohort analysis. In this post, I'll unpack the analytical logic behind how Theseus works and provide an example of how to implement it in a one-off analysis in Python.

The atomic units of a forecast are the group's cohort sizes (e.g., the number of people from some group that onboarded to the product during some period of time) and the historical retention curve for that group. Each of these atomic units is represented as a vector over some timeline. The cohort vector represents the number of users from the group onboarding onto the product; the retention curve vector represents the historical retention rates for that group on certain days from onboarding. Each of these timelines can be arbitrarily long, and they are independent of each other (the cohort timeline doesn't need to match the retention curve timeline). The notation for these atomic units can be represented as:

\[ \mathbf{c} = \begin{bmatrix} c_1 & c_2 & \cdots & c_{D_c} \end{bmatrix}, \qquad \mathbf{r} = \begin{bmatrix} r_1 & r_2 & \cdots & r_{D_r} \end{bmatrix} \]

where \( D_c \) is the length of the cohort timeline and \( D_r \) is the length of the retention timeline.

Note here that the retention rate vector would likely be generated by fitting a retention model to historical retention data for the group. More on that in this post.
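As a minimal sketch of what that fitting step might look like, here is a simple power-law retention model fit in log-log space with numpy. The observed values and the choice of model are invented for illustration; Theseus's own fitting logic is not shown here.

```python
import numpy as np

# Hypothetical observed retention rates for a group on days 1..7 from onboarding
days = np.array([1, 2, 3, 4, 5, 6, 7])
observed = np.array([1.0, 0.72, 0.55, 0.42, 0.35, 0.30, 0.27])

# Fit a power-law retention model, r(d) = a * d**slope, via a log-log linear fit
slope, intercept = np.polyfit(np.log(days), np.log(observed), 1)

# Project the fitted curve over an arbitrary retention timeline and
# normalize so that day-1 retention is exactly 100%
D_r = 14
r = np.exp(intercept) * np.arange(1, D_r + 1) ** slope
r = r / r[0]
```

The projected curve can then be arbitrarily longer than the observed data, which is what makes forward-looking DAU projections possible.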

With these components, it's possible to construct a DAU matrix over the retention timeline, \( D_r \), that captures the cohort decay in that period. A useful starting point is an upper-triangular Toeplitz matrix, \( \mathbf{Z} \), of size \( D_r \times D_r \), with the retention rate vector running along the diagonals:

\[ \mathbf{Z} = \begin{bmatrix} r_1 & r_2 & r_3 & \cdots & r_{D_r} \\ 0 & r_1 & r_2 & \cdots & r_{D_r - 1} \\ 0 & 0 & r_1 & \cdots & r_{D_r - 2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & r_1 \end{bmatrix} \]
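As a sketch of how \( \mathbf{Z} \) could be built in numpy (toy retention values, not Theseus's API):

```python
import numpy as np

r = np.array([1.0, 0.75, 0.5, 0.3, 0.2, 0.15, 0.12])  # retention rates, r[0] = 100%
D_r = len(r)

# Upper-triangular Toeplitz matrix: Z[i, j] = r[j - i] for j >= i, zero below the diagonal
Z = np.zeros((D_r, D_r))
for i in range(D_r):
    Z[i, i:] = r[: D_r - i]
```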

\( \mathbf{Z} \) here simply populates a matrix with the retention rates. In practical terms, the main diagonal is 1, or 100%, since, tautologically, 100% of the cohort is present on the day of the cohort's onboarding. In order to get to DAU, the cohort sizes need to be broadcast to \( \mathbf{Z} \). This can be done by constructing a diagonal matrix, \( \mathbf{diag}(\mathbf{c}) \), from \( \mathbf{c} \):

\[ \mathbf{diag}(\mathbf{c}) = \begin{bmatrix} c_1 & 0 & \cdots & 0 \\ 0 & c_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & c_{D_r} \end{bmatrix} \]

It's important to note here that, in order to broadcast the cohort sizes against the retention rates, \( \mathbf{diag}(\mathbf{c}) \) must be of size \( D_r \times D_r \). So if the cohort size vector is longer than the retention rate vector, it needs to be truncated; conversely, if it's shorter, it needs to be padded with zeroes. The toy example above assumes that \( D_c \) is equal to \( D_r \), but note that this isn't a constraint.

Now, a third matrix of DAU values, \( \mathbf{DAU_{D_r}} \), can be created by multiplying \( \mathbf{diag}(\mathbf{c}) \) and \( \mathbf{Z} \):

\[ \mathbf{DAU_{D_r}} = \mathbf{diag}(\mathbf{c}) \, \mathbf{Z} \]

This produces a square matrix of size \( D_r \times D_r \) (again, assuming \( D_c = D_r \)) that adjusts each cohort size by its corresponding daily retention curve value, with day 1 retention being 100%. Here, each column in the matrix represents a calendar day and each row captures the DAU values of a cohort. Summing each column provides the total DAU on that calendar day, across all cohorts.
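Putting the two pieces together, a minimal numpy sketch of this step (toy values, with \( D_c = D_r \); not Theseus's API):

```python
import numpy as np

r = np.array([1.0, 0.75, 0.5, 0.3, 0.2, 0.15, 0.12])  # retention rates
c = np.array([500, 600, 1000, 400, 350, 200, 150])     # cohort sizes, D_c == D_r
D_r = len(r)

# Upper-triangular Toeplitz retention matrix
Z = np.zeros((D_r, D_r))
for i in range(D_r):
    Z[i, i:] = r[: D_r - i]

# DAU matrix: row i is cohort i's projected DAU; column j is calendar day j
DAU = np.diag(c) @ Z

# Total DAU per calendar day, across all cohorts
total_DAU = DAU.sum(axis=0)
```

On the first calendar day only the first cohort is present, so the first column sum is just that cohort's size; on the second day it is the first cohort decayed by day-2 retention plus the second cohort at 100%.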

While this is helpful data, and it is a projection, it only captures DAU over the length of the retention timeline, \( D_r \), starting from when the first cohort was onboarded. What would be more helpful is a forecast across the full retention timeline for each cohort; in other words, each cohort's DAU projected for the same number of days, regardless of when that cohort was onboarded. This is a banded cohort matrix, which provides a calendar view of per-cohort DAU.

This matrix has a shape of \( D_c \times (D_r + D_c - 1) \), where each row is that cohort's full \( D_r \)-length DAU projection, padded with a zero for each cohort that preceded it. In order to arrive at this, the banded retention rate matrix, \( \mathbf{Z}_\text{banded} \), should be constructed, which stacks the retention curve \( D_c \) times but pads each row \( i \) with \( i - 1 \) zeroes on the left and \( D_c - i \) zeroes on the right such that each row is of length \( D_r + D_c - 1 \). To do this, we can define a shift-and-pad operator \( S^{(i)} \):

\[ S^{(i)}(\mathbf{r}) = [\, \underbrace{0, \ldots, 0}_{i - 1},\; r_1, r_2, \ldots, r_{D_r},\; \underbrace{0, \ldots, 0}_{D_c - i} \,] \]

Again, this results in a matrix, \( \mathbf{Z}_\text{banded} \), of shape \( D_c \times (D_r + D_c - 1) \), where each row \( i \) has \( i - 1 \) zeroes padded on the left and \( D_c - i \) zeroes padded on the right so that every cohort's full \( D_r \)-length retention curve is represented.

In order to derive the banded DAU matrix, \( \mathbf{DAU}_\text{banded} \), the banded retention matrix, \( \mathbf{Z}_\text{banded} \), is scaled row-wise by the cohort sizes vector \( \mathbf{c} \) (equivalently, left-multiplied by \( \mathbf{diag}(\mathbf{c}) \)). This works because \( \mathbf{Z}_\text{banded} \) has \( D_c \) rows:

\[ \mathbf{DAU}_\text{banded} = \mathbf{diag}(\mathbf{c}) \, \mathbf{Z}_\text{banded} \]

Implementing this in Python is straightforward. The crux of the implementation is below (the full code can be found here).

import numpy as np

## create the retention curve and cohort size vectors
r = np.array( [ 1, 0.75, 0.5, 0.3, 0.2, 0.15, 0.12 ] )  ## retention rates
c = np.array( [ 500, 600, 1000, 400, 350 ] )  ## cohort sizes

D_r = len( r )
D_c = len( c )
calendar_days = D_c + D_r - 1

## create the banded retention matrix, Z_banded
Z_banded = np.zeros( ( D_c, calendar_days ) )  ## shape ( D_c, D_c + D_r - 1 )
for i in range( D_c ):
    start_idx = i
    end_idx = min( i + D_r, calendar_days )
    Z_banded[ i, start_idx:end_idx ] = r[ :end_idx - start_idx ]

## create the DAU_banded matrix and get the total DAU per calendar day
DAU_banded = c[ :, np.newaxis ] * Z_banded
total_DAU = DAU_banded.sum( axis=0 )
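One quick sanity check on this projection: because every cohort plays out its full retention curve in the banded view, the grand total of \( \mathbf{DAU}_\text{banded} \) should equal the sum of the cohort sizes multiplied by the sum of the retention rates. A self-contained sketch, repeating the setup above so it runs on its own:

```python
import numpy as np

r = np.array([1, 0.75, 0.5, 0.3, 0.2, 0.15, 0.12])  # retention rates
c = np.array([500, 600, 1000, 400, 350])            # cohort sizes

D_r, D_c = len(r), len(c)
calendar_days = D_c + D_r - 1

Z_banded = np.zeros((D_c, calendar_days))
for i in range(D_c):
    Z_banded[i, i:i + D_r] = r  # row i shifted right by i zeroes

DAU_banded = c[:, np.newaxis] * Z_banded
total_DAU = DAU_banded.sum(axis=0)

# each cohort i contributes c_i * sum(r) over its lifetime
assert np.isclose(DAU_banded.sum(), c.sum() * r.sum())
```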

The retention and cohort size values used are arbitrary. Graphing the stacked cohorts produces the following chart:
