Three Years of Zepargn: The Product, the System, and Why It Now Runs Without Me

Zepargn has been live since 2023. A mobile savings, investment and budgeting product for francophone West Africa, starting in Bénin. iOS, Android, web. Around eight thousand active users, real money flowing daily, three years of operating decisions.

I built Zepargn solo in 2023. Over the following eighteen months I hired and trained a team of five engineers and one junior PM, and progressively handed over the operational responsibilities. For roughly the past year, I have not been in the daily loop. The product still ships, the team owns it end-to-end, the dashboards stay green most days. The reason that is possible is the subject of this article.

Running it solo at the start, on the PM side AND the engineering side at the same time, taught me things that do not make sense from either job alone. In consumer fintech, the PM who cannot read the database makes worse product decisions. The engineer who cannot read the user research ships worse infrastructure. The two skills feed each other, and the architectural decisions that come out of that dual perspective are what eventually let me hand the product over without it falling apart.

What follows is a deliberately mixed retrospective: product calls and system calls in roughly the order they actually mattered, and a closing section on what it means to build something that runs without you.

The substrate that sets the spec

Consumer fintech in West Africa is not Western consumer fintech minus the Stripe integration. The rails are different. The dominant payment surface is mobile money (MTN, Moov in Bénin, plus aggregators like KKiaPay across the UEMOA zone). The dominant interface for moving money is still USSD codes typed on a feature phone. The connection profile is intermittent and metered. The device profile is entry-level Android, often one or two generations old.

This shapes everything that follows. The product cannot assume cards. The product cannot assume a stable connection at the exact moment of payment. The product cannot assume a phone that handles complex animations smoothly. The product cannot assume an app-store-native user; many arrive through WhatsApp.

If you build for the substrate of the market you actually serve, the product feels native. If you build for the substrate of the market you wish you served, the product feels foreign no matter how well it is designed.

The boring stack, defended

The application layer is intentionally conservative: Node.js with Express, MongoDB Atlas, Redis on Upstash, BullMQ for async work, React Native and Expo for the mobile app. No exotic database, no microservices, no Kubernetes. The smaller the surface area of unfamiliar infrastructure, the more attention can go to the parts that are actually novel.

The novel parts are not the stack. They are the rails.

Mobile money operators in this region have inconsistent APIs, partial documentation, occasional bugs they will not fix, varying latency profiles, asymmetric idempotency guarantees. Stripe is a luxury in this market in the sense that it actually behaves. Most of our integrations do not.

The decision, stated as a principle: spend the engineering complexity budget on the rails (operator clients, retries, reconciliation, idempotency, operator-state observability), not on the application stack. The stack is boring on purpose so the rails can be opinionated on purpose.

High availability without a platform team

Two production nodes, one in San Francisco (sfo3) and one in Frankfurt (fra1). Both run the API and the worker in PM2 cluster mode. A Cloudflare Load Balancer with health checks sends traffic to whichever is healthy. If the primary fails its health check, traffic moves to the failover in under thirty seconds with no human in the loop.

There is a subtle trap when you run the same code on two nodes: any scheduled job that fires on both will double-fire. Both nodes will send the same Telegram bot reply, both will run the same cron, both will dispatch the same email. The product becomes annoying very fast.

The fix: the worker process runs in leader-election mode through Redis. Both nodes start the worker, but only the node that holds the Redis lock actually runs the crons and the conversational bot. The other waits. If the leader dies, the lock expires and the other node claims it. Crons, push notifications, Telegram bot, withdrawal pollers: each of them runs on one node at a time, so each fires once across the cluster instead of twice.
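A minimal sketch of that lock, not the production code. To keep it runnable without a Redis server, FakeRedis is a hypothetical in-memory stand-in for the two operations involved (a SET with NX and PX, and an owner-checked renewal); the key name, the TTL and the node IDs are also assumptions.

```javascript
// Hypothetical in-memory stand-in for Redis, so the sketch runs on its own.
class FakeRedis {
  constructor() {
    this.store = new Map();
  }

  // Mimics `SET key value NX PX ttlMs`: succeeds only if the key is absent or expired.
  setIfAbsent(key, value, ttlMs, now) {
    const entry = this.store.get(key);
    if (entry && entry.expiresAt > now) return false;
    this.store.set(key, { value, expiresAt: now + ttlMs });
    return true;
  }

  // Extends the TTL only if the caller still owns an unexpired lock.
  renewIfOwner(key, value, ttlMs, now) {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= now || entry.value !== value) return false;
    entry.expiresAt = now + ttlMs;
    return true;
  }
}

const LOCK_KEY = 'worker:leader'; // assumed key name
const LOCK_TTL_MS = 15000;        // assumed TTL

// Called on every heartbeat. Returns true if this node may run crons and bots.
function tryLead(redis, nodeId, now) {
  return (
    redis.renewIfOwner(LOCK_KEY, nodeId, LOCK_TTL_MS, now) ||
    redis.setIfAbsent(LOCK_KEY, nodeId, LOCK_TTL_MS, now)
  );
}

const redis = new FakeRedis();
const sfoLeads = tryLead(redis, 'sfo3', 0);          // first node claims the lock
const fraLeads = tryLead(redis, 'fra1', 1000);       // second node waits
const fraTakesOver = tryLead(redis, 'fra1', 20000);  // leader stopped renewing, lock expired
```

Against a real Redis, the owner-checked renewal would need to be atomic (typically a small Lua script), otherwise two nodes can briefly both believe they hold the lock.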

The MongoDB layer is a three-node replica set on Atlas, multi-availability-zone. Primary election happens on Atlas’s side. So traffic fails over at Cloudflare, the worker fails over through the Redis lock, and the data fails over inside the replica set. Three layers of independent failover, all working in concert, none of them requiring a human at three in the morning.

Idempotency as a product promise (three layers)

The product promise: this app cannot double-debit you, ever. Under any network condition. Across any retry, any disconnection, any double tap on the button.

Once you state the promise that way, the engineering follows. Three layers, all required, none sufficient alone.

Client layer. Every payment initiation carries a client-generated Idempotency-Key header (UUID v4), created once per payment intent. The mobile axios interceptor propagates it on every retry of the same intent. Two taps on the same button produce the same key, which produces the same outcome.
Operator layer. Every operator call carries a deterministic reference derived from the client request ID. MTN’s X-Reference-Id, for example, is the same value across retries. The operator deduplicates on its side.
Database layer. A unique index on ContributionHistory.transactionId guarantees a single payment cannot land in the system twice, even if both client and operator misbehave.
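The three layers can be sketched together as follows. This is an illustration under assumptions, not the production code: withIdempotencyKey stands in for the axios interceptor, operatorReference shows one possible way to derive a stable reference (hashing into a UUID-shaped string is my choice here, and whether a given operator accepts that format has to be checked against its documentation), and ContributionLedger is an in-memory stand-in for the unique index.

```javascript
const { randomUUID, createHash } = require('node:crypto');

// Client layer: one key per payment intent, reused on every retry of that intent.
function withIdempotencyKey(requestConfig) {
  const headers = { ...(requestConfig.headers || {}) };
  if (!headers['Idempotency-Key']) headers['Idempotency-Key'] = randomUUID();
  return { ...requestConfig, headers };
}

// Operator layer: the same client key always maps to the same operator reference.
// The UUID-shaped hash is an illustrative assumption, not an operator requirement.
function operatorReference(idempotencyKey) {
  const h = createHash('sha256').update(idempotencyKey).digest('hex');
  return [h.slice(0, 8), h.slice(8, 12), h.slice(12, 16), h.slice(16, 20), h.slice(20, 32)].join('-');
}

// Database layer: in-memory stand-in for a unique index on transactionId.
// With a real unique index, the second insert fails with a duplicate-key error instead.
class ContributionLedger {
  constructor() {
    this.byTransactionId = new Map();
  }

  record(transactionId, amount) {
    if (this.byTransactionId.has(transactionId)) return { inserted: false };
    this.byTransactionId.set(transactionId, amount);
    return { inserted: true };
  }
}

// Two taps on the same button: the retry reuses the first attempt's key.
const firstAttempt = withIdempotencyKey({ url: '/payments', data: { amount: 5000 } });
const retryAttempt = withIdempotencyKey(firstAttempt);

const ledger = new ContributionLedger();
const firstWrite = ledger.record(operatorReference(firstAttempt.headers['Idempotency-Key']), 5000);
const secondWrite = ledger.record(operatorReference(retryAttempt.headers['Idempotency-Key']), 5000);
```

The point of the sketch is that each layer rejects the duplicate on its own: even if the retry reached the ledger, the second write would not land.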

Three years, three layers, zero double-debit incidents. The invisible product surface that lets me sleep.

Auto-healing, because operators misbehave

I assume the operators will hang. This is a product assumption, not an engineering one, because the product breaks if I assume otherwise.

Three watchdogs, layered:

Axios global timeout of 30 seconds. No outbound HTTP call can stay pending indefinitely. This was the change that fixed the silent-deadlock incident where one operator’s API took twelve minutes to respond to a single request and the entire payment pipeline backed up behind it.
Cron tick watchdog. If a cron takes more than two minutes to complete, it is force-released. The payment poller stays alive even if a single operator call hangs.
Mongo watchdog. If the process accumulates ten consecutive server-selection errors, it exits voluntarily. PM2 restarts it with a fresh topology. This resolves the zombie-after-Atlas-election cases that no documentation warned me about.
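The second and third watchdogs reduce to two small state machines. The sketch below is a simplified illustration, not the production code: the clock and the exit function are injected so it runs without a cron scheduler or a database, and the function names are hypothetical. The two-minute and ten-error thresholds are the ones stated above.

```javascript
const MAX_TICK_MS = 2 * 60 * 1000;   // a tick older than two minutes is treated as hung
const MAX_CONSECUTIVE_ERRORS = 10;   // ten server-selection errors in a row trigger an exit

// Cron tick watchdog: a running tick blocks the next one, unless it has hung.
function createTickGuard() {
  let startedAt = null;
  return {
    // Returns true if the caller may start a tick now.
    tryStart(now) {
      const stillRunning = startedAt !== null && now - startedAt < MAX_TICK_MS;
      if (stillRunning) return false;
      startedAt = now; // idle, or the previous tick hung and is force-released
      return true;
    },
    finish() {
      startedAt = null;
    },
  };
}

// Mongo watchdog: any success resets the counter; the Nth consecutive error exits.
function createMongoWatchdog(exit) {
  let consecutive = 0;
  return {
    onSuccess() {
      consecutive = 0;
    },
    onServerSelectionError() {
      consecutive += 1;
      if (consecutive >= MAX_CONSECUTIVE_ERRORS) exit(1);
    },
  };
}

const guard = createTickGuard();
const started = guard.tryStart(0);            // first tick starts
const overlapping = guard.tryStart(60000);    // one minute in: still running, refused
const afterHang = guard.tryStart(121000);     // past two minutes: force-released

let exitCode = null;
const watchdog = createMongoWatchdog((code) => { exitCode = code; });
for (let i = 0; i < 9; i++) watchdog.onServerSelectionError();
const exitedAfterNine = exitCode !== null;    // nine errors are not enough
watchdog.onServerSelectionError();            // the tenth calls exit(1)
```

In a real process the injected exit would be process.exit, leaving the restart to the process manager.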

Fail loud, recover fast. Better a visible restart than silent corruption.

Native observability

I cannot run a fintech blind. Three layers of telemetry, each answering a different question.

/health answers process health: Mongo reachable, Redis reachable, Firebase reachable, HA topology coherent. The Cloudflare load balancer calls this every few seconds. If it goes red, traffic shifts.
/health/payment-pipeline answers domain health: is the cron tick fresh, is the backlog small, is the operator failure rate within bounds, is the Mongo watchdog calm? This is what I check first when something feels off.
/metrics in Prometheus format: HTTP latency, payment counters per operator, queue depth, Mongo connection pool usage. Prometheus scrapes every fifteen seconds, Grafana renders, alerts fire to Slack and SMS when anything degrades.
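To make the difference between process health and domain health concrete, here is a sketch of the decision a payment-pipeline health check could make. The thresholds, field names and response shape are all assumptions for illustration; the real endpoint may weigh these signals differently.

```javascript
// Hypothetical thresholds; the real values are not stated in this article.
const LIMITS = {
  maxTickAgeMs: 120000,
  maxBacklog: 50,
  maxFailureRate: 0.2,
  maxMongoErrors: 0,
};

// Pure function: takes a snapshot of pipeline counters, returns a health verdict.
function evaluatePipelineHealth(snapshot, now) {
  const failureRate =
    snapshot.attempts === 0 ? 0 : snapshot.operatorFailures / snapshot.attempts;

  const checks = {
    cronTickFresh: now - snapshot.lastTickAt <= LIMITS.maxTickAgeMs,
    backlogSmall: snapshot.pendingPayments <= LIMITS.maxBacklog,
    operatorFailureRateOk: failureRate <= LIMITS.maxFailureRate,
    mongoWatchdogCalm: snapshot.consecutiveMongoErrors <= LIMITS.maxMongoErrors,
  };

  const healthy = Object.values(checks).every(Boolean);
  return { status: healthy ? 'ok' : 'degraded', httpStatus: healthy ? 200 : 503, checks };
}

const calm = evaluatePipelineHealth(
  { lastTickAt: 990000, pendingPayments: 3, operatorFailures: 2, attempts: 100, consecutiveMongoErrors: 0 },
  1000000
);

const stalled = evaluatePipelineHealth(
  { lastTickAt: 0, pendingPayments: 3, operatorFailures: 2, attempts: 100, consecutiveMongoErrors: 0 },
  1000000
);
```

Returning the individual checks alongside the status is what makes the endpoint useful as a first stop: the response says which of the four questions failed, not just that something did.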

All self-hosted in Docker. Total observability budget: nothing beyond the droplets it already runs on. Total operational value: I can tell within sixty seconds whether the product is healthy.

The state model lesson (PENDING_PAYMENT)

This was a product bug masquerading as an engineering bug, and it is the lesson I would tell every PM building a transactional product on top of a third-party rail.

We had a state called PENDING on group savings. It meant: this group has been created but the founder has not yet paid the creation fee.

We also had a state called PENDING on the group withdrawal vote. It meant: the withdrawal request has been raised, members are voting, no decision yet.

Same word, two completely different conditions, on two completely different flows. The collision did not cause obvious bugs at first. It caused subtle ones, the kind that surface as “why did this group disappear from the list” or “why does the vote tab show empty when I know I created a vote.”

The fix was a refactor: PENDING_PAYMENT for the unpaid creation, PENDING_VOTE for the open vote. A one-word change in the data model, propagated through every screen, every notification, every analytics event. The kind of refactor nobody puts on a quarterly roadmap because it does not move a top-line metric. The kind of refactor that pays back every time you ship the next feature on either flow, because the model no longer lies to you.

The lesson generalises: when two flows produce a state called by the same name, you do not have a vocabulary problem, you have a product problem. The two flows are about to interfere with each other in ways that are hard to debug because the language hides them.
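The split itself is small enough to sketch. PENDING_PAYMENT and PENDING_VOTE are the names from the refactor described above; the other state names, the record kinds and the migrateStatus helper are hypothetical, added only so the example is complete.

```javascript
// One vocabulary per flow. Only the two PENDING_* names come from the refactor;
// the remaining states are placeholders.
const GroupStatus = Object.freeze({
  PENDING_PAYMENT: 'PENDING_PAYMENT',
  ACTIVE: 'ACTIVE',
});

const VoteStatus = Object.freeze({
  PENDING_VOTE: 'PENDING_VOTE',
  APPROVED: 'APPROVED',
  REJECTED: 'REJECTED',
});

// Migration rule: a legacy 'PENDING' is disambiguated by the kind of record it sits on.
function migrateStatus(kind, legacyStatus) {
  if (legacyStatus !== 'PENDING') return legacyStatus; // other states are untouched
  if (kind === 'group') return GroupStatus.PENDING_PAYMENT;
  if (kind === 'withdrawalVote') return VoteStatus.PENDING_VOTE;
  throw new Error(`Unknown record kind: ${kind}`);
}

const migratedGroup = migrateStatus('group', 'PENDING');
const migratedVote = migrateStatus('withdrawalVote', 'PENDING');
```

Throwing on an unknown kind is deliberate in this sketch: a migration that silently guesses would reintroduce the ambiguity the refactor was meant to remove.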

The PostHog diagnostic that turned a 38% drop into a single-digit one

The group savings payment flow used to have a 38% abandonment rate on the creation-fee step. Real money left on the table at scale.

PostHog was the compass. The funnel revealed four failure buckets that the previous generic error message had been hiding:

Users whose mobile money wallet had insufficient balance at the moment of payment.
Users whose USSD payment confirmation failed silently on the operator side.
Users whose aggregator initialisation failed because of an operator API issue.
Users whose network dropped during the operator handshake.

The previous version handled all four with the same “an error occurred” screen. The fix was not a single change. It was four product changes, each one informed by what the funnel was saying:

A pre-payment intro screen that prompts the user to verify their wallet balance (with the right USSD code for the operator) before the operator call begins. The most common failure (insufficient balance) becomes the easiest to prevent.
Idempotency-key-based retry on aggregator initialisation, so a transient operator hiccup does not require restarting from scratch.
A clearer error state for the USSD-failed case, with a one-tap “try a different payment method” CTA.
The semantic state split from the previous section, so a payment-pending group no longer collided with a vote-pending group in the analytics funnel itself.

Four changes, none glamorous. The abandonment rate moved into the single digits.
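One way to picture the move from a single error screen to four distinct outcomes is a lookup from failure bucket to next step. The failure codes, screen names and helper functions below are hypothetical placeholders, and treating a dropped network like an aggregator hiccup (retry with the same key) is my assumption rather than something stated above.

```javascript
// Failure codes and screen names are hypothetical placeholders.
const FAILURE_HANDLING = new Map([
  ['INSUFFICIENT_BALANCE', { screen: 'check-balance', autoRetry: false }],
  ['USSD_CONFIRMATION_FAILED', { screen: 'try-another-method', autoRetry: false }],
  ['AGGREGATOR_INIT_FAILED', { screen: 'retrying', autoRetry: true }],
  ['NETWORK_DROPPED', { screen: 'retrying', autoRetry: true }], // assumption
]);

const GENERIC_FAILURE = { screen: 'generic-error', autoRetry: false };

function planForFailure(code) {
  return FAILURE_HANDLING.get(code) || GENERIC_FAILURE;
}

// An automatic retry reuses the original request, including its Idempotency-Key,
// so retrying can never create a second debit.
function nextAttempt(previousRequest, code) {
  const plan = planForFailure(code);
  if (!plan.autoRetry) return { show: plan.screen, request: null };
  return { show: plan.screen, request: { ...previousRequest } };
}

const original = { url: '/payments', headers: { 'Idempotency-Key': 'key-123' } };
const afterAggregatorHiccup = nextAttempt(original, 'AGGREGATOR_INIT_FAILED');
const afterLowBalance = nextAttempt(original, 'INSUFFICIENT_BALANCE');
```

The generic screen still exists as a fallback, but it is now the exception rather than the only answer.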

The diagnosis required reading the funnel correctly (the PM side). The fix required writing the interceptor that propagates the Idempotency-Key without breaking the request cache (the engineer side). The semantic state split required understanding why the data model mattered (product) and how to migrate it without losing in-flight transactions (engineering). Three different skills, one decision. This is the decision that taught me to stop trying to separate the two halves of the job.

The “première cagnotte offerte” A/B

When you sell a group product (savings that requires two or more participants), the first group is the hardest sale. The user has to convince friends or family to join a product they themselves are new to, and pay a creation fee for the privilege.

The experiment: cover the creation fee for the first group, branded première cagnotte offerte on the creation screen. Two changes to the UI, one feature flag on the backend, no actual change to the underlying product.
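The backend half of such an experiment can be as small as one branch behind a flag. This is a sketch under assumptions: the fee amount, the flag name and the shape of the user object are invented for illustration.

```javascript
const CREATION_FEE_FCFA = 500; // hypothetical amount, not the real fee

// Returns the fee to charge and the label the creation screen should display.
function creationFee(user, flags) {
  const isFirstGroup = user.groupsCreated === 0;
  if (flags.firstGroupFree && isFirstGroup) {
    return { amount: 0, label: 'première cagnotte offerte' };
  }
  return { amount: CREATION_FEE_FCFA, label: null };
}

const newUserInTest = creationFee({ groupsCreated: 0 }, { firstGroupFree: true });
const newUserInControl = creationFee({ groupsCreated: 0 }, { firstGroupFree: false });
const returningUser = creationFee({ groupsCreated: 2 }, { firstGroupFree: true });
```

Keeping the rule in one function means the control and test arms differ only by the flag value, which keeps the comparison clean.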

Result, broadly: meaningful lift on first-group creation rate, no measurable cannibalisation on subsequent group creations. The economics work because the first-group conversion is the bottleneck, not the per-group margin. Removing the friction on the bottleneck is worth more than the lost fee on first-time users who would not have converted anyway.

In any social product where the first activation requires getting another person on board, the cost of the first user is the cost of the second user too. Price accordingly.

What I cannot ship (the READ_SMS lesson)

The most-requested feature in user interviews: automatically read mobile money confirmation SMS to categorise spending and revenue in the budget tab. Technically feasible. Three days of engineering for the parser, maybe a week with edge cases.

Not shipping it. Google Play has tightened the READ_SMS permission policy aggressively since 2019. The set of apps allowed to read SMS is now essentially default SMS handlers plus a short list of exceptions that Google approves through a declaration and review process. A budgeting app reading payment SMS does not get that permission by default, and getting it approved is a lawyer-and-Google-Play-review project, not an engineering project.

The lesson for the next consumer fintech PM who hits this: the technical feasibility of an Android feature does not predict its Play Store eligibility. Platform policy is product constraint. The earlier you read the policy, the fewer features you scope and then have to un-scope.

The product answer is being built via a different path (operator portal OAuth, with explicit user consent). Slower, more friction, survives Play Store review.

The economics

Total infrastructure cost for around eight thousand active users and roughly 2,500 transactions per day: approximately $330 per month.

The breakdown is unglamorous. Two DigitalOcean droplets, MongoDB Atlas at the smallest dedicated tier, Redis on Upstash’s mid plan, Cloudflare Load Balancer, Resend for transactional email, Twilio for SMS (the line item with the most variance), Firebase on the free tier, Grafana and Prometheus self-hosted on the droplets themselves.
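For scale, the headline figures can be turned into unit costs. The only assumption added here is a 30-day month; everything else comes from the numbers above.

```javascript
const monthlyCostUsd = 330;
const activeUsers = 8000;
const transactionsPerDay = 2500;
const daysPerMonth = 30; // assumption for the arithmetic

const costPerUser = monthlyCostUsd / activeUsers;
const transactionsPerMonth = transactionsPerDay * daysPerMonth;
const costPerTransaction = monthlyCostUsd / transactionsPerMonth;

console.log(`Per active user per month: $${costPerUser.toFixed(3)}`);
console.log(`Per transaction: $${costPerTransaction.toFixed(4)}`);
```

That works out to roughly four cents per active user per month, and under half a cent per transaction across about 75,000 monthly transactions.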

The reason this number matters: when you are solo and the product has to support itself before it pays a salary, the unit economics of your infrastructure choices ARE the unit economics of your runway. Boring tech is not just easier to operate; it is dramatically cheaper to operate. By my rough estimate, microservices and Kubernetes would have multiplied this number by ten and improved exactly zero user metrics.

What this lets me ship

On top of the architecture above, the product surface looks like ten distinct things to the user:

Individual savings goals with target dates and recurring deposits.
Group savings (2 to 100 members, democratic vote on withdrawals).
Zlock: term deposits with interest, locked for 6 to 36 months.
Zflex: micro-credit up to 100K FCFA, with tiered scoring.
Challenges: 52-week savings challenges and custom variants.
Trips: bookings tied to a savings goal.
Zgere: financial education with lessons, quizzes, badges.
ZPoints: loyalty and referral system.
ZBudget: monthly budgeting with category tracking.
WhatsApp and Telegram bots: full conversational interface backed by GPT-4o-mini intent detection and Whisper transcription (with experimental Yoruba and Fon support, because asking French-as-second-language users to interact in perfect text is leaving users on the table).

Ten product surfaces, one architecture, one person on the roadmap. The architecture is what makes the product surface possible. The product surface is what makes the architecture investment worth it. Both halves required.

One year hands-off, and the product still ships

The real test of any of this is not whether the architecture works when the founder is at the keyboard. It is whether it works when the founder is not.

As described at the top, the handover took about eighteen months, and for about the last year I have not been in the daily loop. The team owns the codebase, owns the deployments, owns the operator relationships, runs the on-call rotation, ships releases on their own cadence. The product has continued to grow active users in my absence, which is the most flattering data point I will mention in this article.

What makes this possible, in roughly decreasing order of importance:

The runbook. Three years of incidents distilled into docs/RUNBOOK.md, a step-by-step procedure for every failure mode I have personally debugged. A junior on-call at three in the morning can follow it without calling me. The runbook is the institutional memory the team did not have to live through.
The CI/CD pipeline as the senior reviewer. Branch protection on main, 208 unit tests that must pass on every push (the suite runs in about 2 seconds), smoke test after deploy, automatic rollback on health failure within 30 seconds. No human reviewer can bypass it. The pipeline enforces the standards I would otherwise enforce manually in review, and enforces them more consistently.
The observability stack. Anyone on the team can answer “is the product healthy?” in sixty seconds without me. The /health, /health/payment-pipeline, and /metrics endpoints feed the same dashboards a junior and a senior look at. The information asymmetry I might have had as the founder does not exist.
The team I hired. Structurally able to ship: they own the codebase end-to-end, they wrote a meaningful fraction of it, they understand the operator quirks first-hand. I did not hire generalists who would need me to translate.
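The post-deploy step of the pipeline mentioned above (smoke test, then automatic rollback on health failure) follows a simple pattern. The sketch below is a simplified, synchronous illustration: checkHealth and rollback are injected stand-ins for a real HTTP probe and a real deploy tool, and the number of attempts is an assumption.

```javascript
// checkHealth and rollback are injected so the sketch runs without a server.
function verifyDeploy({ checkHealth, rollback, attempts = 3 }) {
  for (let i = 0; i < attempts; i++) {
    if (checkHealth()) return { deployed: true, rolledBack: false };
  }
  rollback(); // every probe failed: restore the previous release
  return { deployed: false, rolledBack: true };
}

// A release that becomes healthy on the second probe is kept.
let probes = 0;
const slowStart = verifyDeploy({
  checkHealth: () => ++probes >= 2,
  rollback: () => {},
});

// A release that never becomes healthy is rolled back exactly once.
let rollbacks = 0;
const broken = verifyDeploy({
  checkHealth: () => false,
  rollback: () => { rollbacks += 1; },
});
```

A real pipeline would space the probes out in time and bound the whole check by a deadline, which is where a figure like thirty seconds would come from.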

What I am proudest of in this entire project is not the architecture, not the product surface, not any metric. It is that I built it well enough that it does not depend on me. That is the actual job of a Chief Technology and Product Officer at the founder phase: not to be the smartest person in the room, but to design the conditions under which the room runs without you.

Three years, four lessons

Four things I would carry forward to the next consumer fintech project.

1. Boring on the stack, opinionated on the rails. Spend the complexity budget where it earns its keep. Operator integrations earn it. Application frameworks rarely do. The result is a product that costs $330 per month to run and sleeps through Atlas elections.

2. Reliability is product. Idempotency, retries, observability, reconciliation. These are not engineering hygiene tasks. They are the product promise the user cannot articulate but can absolutely feel the absence of. Budget them like features. Defend them like features when something else is competing for the sprint.

3. The PM and the builder cannot be two people in the founder phase. When the same person who reads the funnel also writes the interceptor that fixes it, the diagnostic-to-fix loop is hours. When they are two people, it is weeks. For consumer fintech in markets where iteration speed determines whether the product survives, that gap is fatal. This rule expires the moment the product is mature enough to support a team. Then the rule reverses (see lesson 4).

4. Build for the day you leave. The architecture I described above (boring stack, opinionated rails, three-layer idempotency, three-layer auto-healing, native observability) was not built for me. It was built for the team that would inherit it. The runbook, the CI/CD enforcement, the dashboards anyone can read: all of it exists so that the product does not become founder-dependent. One year hands-off and growing is the test that says the architecture passed. If your product can only run with you at the keyboard, you have built a job, not a system.

Zepargn is still running, and the team is shipping the next chapter without me writing the code: the OAuth-based operator import (the legal way around the READ_SMS wall) and a credit product extension scoped for the second half of the year. Both rest on the lessons above, which is the actual point of writing them down.

Try Zepargn

Web: zepargn.com
iOS: apps.apple.com/cm/app/zepargn
Android: play.google.com/store/apps/details?id=com.digitalelevate.zepargnmobileapp