How I Built a Churn Prediction System That My Colleagues Actually Used

Written by hacker89770727 | Published 2026/01/20
Tech Story Tags: churn-modeling | ml-for-product-teams | customer-retention-analytics | product-analytics | ml-governance | product-led-growth | behavioral-feature-scoring | data-contract-pipeline

TL;DR: This article breaks down how we built a churn prediction system focused on trust, interpretability, and action, prioritizing data contracts, simple models, and workflow integration over model novelty.

When we first released a new mobile app, the dashboard everyone had their eyes on was downloads. That number rose fast enough to keep leadership happy. A few weeks later, a quieter signal rose to the top: users who simply stopped coming back.


We already had retention metrics, but the problem was timing. By the time churn showed up in reports, campaigns had already run, onboarding flows were frozen, and support teams were reacting instead of prioritizing.


This article is about how we built a churn prediction system that shortened that feedback loop and made its way into how product and marketing organized their work. The project lived in an automotive context, but the structure holds up for any app that counts on ongoing engagement rather than one-off conversion.


I’ll write less about model novelty and more about the decisions that held, the ones we fixed, and what changed once other teams trusted the output.

Start With a Data Contract, Not a Notebook

Before I ever opened a notebook, we agreed on the shape of the data the system would depend on. That mattered more than the model choice.


Two upstream sources fed into the system:

  • App event logs
  • Customer and lead data from internal systems


We defined a base table with one row per user per snapshot date, and we wrote down its contract: required columns, definitions, and labeling rules.
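
To make that concrete, here is a minimal sketch of what the contract amounted to as executable checks. The column names mirror the training snippet later in this article; user_id and snapshot_date are assumed key columns rather than our exact schema.

import pandas as pd

# Required columns for the base table; names mirror the training snippet
# later in the article. "user_id" and "snapshot_date" are assumed key columns.
REQUIRED_COLUMNS = [
    "user_id",
    "snapshot_date",
    "num_sessions_14d",
    "avg_session_minutes_14d",
    "support_ticket_30d",
    "country_code",
    "device_os",
    "completed_education",
    "churn_label_60d",  # binary churn label over the 60-day horizon the contract defines
]

def validate_base_table(df: pd.DataFrame) -> None:
    # Contract check 1: every required column is present.
    missing = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Base table missing required columns: {sorted(missing)}")
    # Contract check 2: the grain is one row per user per snapshot date.
    if df.duplicated(subset=["user_id", "snapshot_date"]).any():
        raise ValueError("Base table grain violated: duplicate (user_id, snapshot_date) rows")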


Why a contract? Friction. Before it existed, multiple teams built their own churn extracts, each with slightly different logic for “last activity” and even for “inactive”. Reviews stalled because no one could reconcile why the same user looked healthy in one cohort and churned in another.


A one-page contract eliminated that headache. It gave product and marketing a steady reference, and gave the data team something clean to version and review. We didn’t introduce any heavy governance. The contract lived next to the training code and changed only through pull requests. That’s it.

Features That Product Could Argue With

Feature engineering was deliberately constrained. If a feature could not be explained using terms product managers already understood, it was out.


The core signals were no surprise:

  • Session frequency and recency
  • Recent support interactions
  • Progress through in-app education
  • Geographic distance from active sales regions
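
To ground a couple of those signals, here is a rough sketch of how session frequency and recency could be derived from raw event logs. The events table and its columns (user_id, session_id, event_timestamp) are hypothetical; only the output feature name matches the training code shown later.

import pandas as pd

def session_features(events: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Sketch: session frequency and recency over a 14-day window before the snapshot."""
    window_start = snapshot_date - pd.Timedelta(days=14)
    recent = events[
        (events["event_timestamp"] >= window_start)
        & (events["event_timestamp"] < snapshot_date)
    ]
    features = recent.groupby("user_id").agg(
        num_sessions_14d=("session_id", "nunique"),
        last_seen=("event_timestamp", "max"),
    )
    # Recency: days between the last observed session and the snapshot date.
    features["days_since_last_session"] = (snapshot_date - features["last_seen"]).dt.days
    return features.drop(columns="last_seen").reset_index()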


High-dimensional behavioral embeddings and complex temporal aggregates were out. They tested well, but they failed another requirement entirely: peer review of individual scores.


Questions popped up quickly once we shared early outputs. Why this user? Why now? A plain-language data dictionary, tied directly to user actions, resolved most of those discussions and cut the time reviewers spent on each score.

That document got referenced more than the model code.
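
For flavor, its entries read roughly like this (wording illustrative, not the real definitions):

# Illustrative data dictionary entries: plain language, tied to user actions.
FEATURE_DICTIONARY = {
    "num_sessions_14d": "How many sessions the user had in the last 14 days.",
    "support_ticket_30d": "How many support tickets the user opened in the last 30 days.",
    "completed_education": "Whether the user finished the in-app education flow.",
}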

Conservative Models, By Design

We evaluated three model families: regularized logistic regression, k-nearest neighbors, and random forests.


Logistic regression stayed as the primary model. It allowed individual scores to be explained using concepts the product team already worked with: activity drop-off, missed onboarding steps, recent friction. Coefficients mapped cleanly back to the feature dictionary.


Random forests stayed as a secondary model. They captured obvious non-linearities and produced feature importance rankings that supplemented the logistic scores when we assembled review queues.


More complex models were deliberately excluded. Not because they performed poorly, but because they would have slowed down score justification and complicated deployment. Interpretability came standard with what we kept.


Here’s a snippet from our training code, trimmed just enough to keep the article from turning into a tutorial.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# One row per user per snapshot date, built to the base-table contract.
df = pd.read_parquet("user_churn_base.parquet")
target_col = "churn_label_60d"

# Feature groups, kept deliberately small and explainable.
num_cols = ["num_sessions_14d", "avg_session_minutes_14d", "support_ticket_30d"]
cat_cols = ["country_code", "device_os"]
bool_cols = ["completed_education"]

X = df[num_cols + cat_cols + bool_cols]
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale numeric features and one-hot encode categoricals; the boolean flag passes through.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="passthrough",
)

# Primary model: L1-regularized logistic regression, kept for explainability.
lr = Pipeline(
    steps=[
        ("prep", preprocess),
        ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=300)),
    ]
)

lr.fit(X_train, y_train)
lr_proba = lr.predict_proba(X_test)[:, 1]
print(f"LogReg AUC: {roc_auc_score(y_test, lr_proba):.3f}")

# Secondary model: a shallow random forest to capture obvious non-linearities.
rf = Pipeline(
    steps=[
        ("prep", preprocess),
        ("clf", RandomForestClassifier(
            n_estimators=250,
            max_depth=6,
            min_samples_leaf=50,
            random_state=42,
            n_jobs=-1,
        )),
    ]
)

rf.fit(X_train, y_train)
rf_proba = rf.predict_proba(X_test)[:, 1]
print(f"RandomForest AUC: {roc_auc_score(y_test, rf_proba):.3f}")

For model mechanics, scikit-learn’s documentation remained the most practical reference:

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

For explanation patterns and governance framing, this resource stayed useful throughout:

https://christophm.github.io/interpretable-ml-book/

A Lightweight Governance Loop

The system was refreshed quarterly, and that cadence gave it some backbone.


Every run did three things:

  • Data checks: feature distributions, label rate shifts, missing values
  • Model checks: AUC over time, calibration, top features
  • Process checks: data contract version and upstream ownership
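
Here is a minimal sketch of what those checks amounted to in code. The baseline file and the thresholds are placeholders for this article, not our production values.

import json
from sklearn.metrics import roc_auc_score

def run_quarterly_checks(df, y_true, y_proba, baseline_path="baseline_metrics.json"):
    """Sketch of the data and model checks; thresholds and baseline file are illustrative."""
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"label_rate": 0.08, "auc": 0.80}

    warnings = []

    # Data check: label rate shift against the previous run.
    label_rate = df["churn_label_60d"].mean()
    if abs(label_rate - baseline["label_rate"]) > 0.02:
        warnings.append(f"Label rate shifted: {label_rate:.3f} vs {baseline['label_rate']:.3f}")

    # Data check: unexpected missing values.
    missing = df.isna().mean()
    for col, share in missing[missing > 0.05].items():
        warnings.append(f"High missing rate in {col}: {share:.1%}")

    # Model check: AUC has not degraded materially.
    auc = roc_auc_score(y_true, y_proba)
    if auc < baseline["auc"] - 0.03:
        warnings.append(f"AUC dropped: {auc:.3f} vs baseline {baseline['auc']:.3f}")

    return warnings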


All of this lived in the same repo as the training code. Each run also generated a short report that was shared with product and marketing leads. When something looked off, we paused the release until we were aligned.

This was less about control and more about confidence. Teams trusted the scores because they could see when things changed.

Turning Scores Into Actions

A churn score that is not tied to action soon becomes pointless, no matter how good the model is.


We began segmenting users into low, medium, and high risk bands and connecting those segments to existing workflows: in-app education prompts, email sequences, prioritized outreach from support or sales.

We did not automate decisions wholesale. We reordered attention with scores.
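
Here is a sketch of the banding itself; the 0.3 and 0.6 cut points are placeholders rather than the thresholds we shipped.

import pandas as pd

def assign_risk_band(churn_scores: pd.Series) -> pd.Series:
    """Map churn probabilities to low / medium / high risk bands (illustrative cut points)."""
    return pd.cut(
        churn_scores,
        bins=[0.0, 0.3, 0.6, 1.0],
        labels=["low", "medium", "high"],
        include_lowest=True,
    )

# Example, continuing from the earlier training snippet: assign_risk_band(pd.Series(lr_proba))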


One concrete change: we discovered that a campaign we’d designed to target recent signups could also be scoped to high-risk users who had stalled mid-onboarding. Product routed in-app guidance to that subgroup. Marketing cut spend on users who were already deeply engaged with the features we were A/B testing.


The manual triage spreadsheet we had been maintaining was retired too; we were working from the churn segments instead.

Planning conversations started with risk segments instead of just looking at last quarter’s attrition chart.

What Survived, and What Did Not

Some decisions held up well:

  • The data contract prevented downstream drift
  • Interpretable features reduced review friction
  • Simple models accelerated adoption
  • Small governance checks preserved trust


Other parts had to be corrected. We were taking snapshots too infrequently at the beginning. Feature definitions tightened after the first quarter. But none of it broke the system, because the structure let us iterate without re-litigating the whole design.

What You Can Steal

If you want to build something like this:

  • Write a clear base-table contract and version it
  • Choose features that map to real user stories
  • Favor models you can explain under pressure
  • Add lightweight checks and share them regularly
  • Attach scores to one or two concrete actions first


The system we ended up with isn’t novel. It’s one that ran, made it through review, and changed how teams worked.

We found that mattered more.


