How We Achieved 99.9+% Crash-Free Users: Simpl Mobile App Stability Journey

At Simpl, we've learned that mobile app stability isn't just an engineering metric—it's a direct driver of user trust and business outcomes. Here's how our small mobile team systematically reduced crashes from affecting 2% of users to just 0.06%, and the lessons that any mobile team can apply.

📍 The Wake-Up Call

October 2024 was a turning point for our mobile team at Simpl. Our Simpl mobile app, serving ~2.5M monthly users with 300K daily active users, was experiencing a stability crisis that demanded immediate attention.

The numbers were concerning:

Around 2% of our user base was experiencing crashes. With our scale, this translates to approximately 50,000 users facing interrupted experiences.
Critical flows were failing, directly impacting user trust and revenue.
The crash rate was increasing as we shipped more features, revealing gaps in our monitoring and tracking systems.

Crashes directly impact trust and retention, which are critical for a financial product. Instead of waiting for complaints to escalate or metrics to dip, we made crash‑free users a first‑class product KPI, measured and acted on every single week.

🛠 Our Approach: Tools, Process, and Ownership

Our approach to solving this challenge was multi-dimensional, focusing on comprehensive monitoring, systematic processes, and proactive engineering practices. To move from scattered fixes to consistent improvement, we built a lightweight yet disciplined process:

✅ 1. Comprehensive Monitoring with the Right Tools

We use a dual-tool approach that covers our entire React Native stack:

Crashlytics handles our native crashes (Java/Kotlin on Android, Objective-C/Swift on iOS). It's particularly effective for:

Native module crashes
Memory-related issues
Platform-specific problems

Sentry manages our React Native/JavaScript-level crashes, giving us visibility into:

JS errors and exceptions
Redux state issues
Bridge communication problems

Both tools are integrated with our JIRA workflow, allowing us to create tickets directly from crash reports and maintain traceability from detection to resolution.

📅 2. Weekly Crash Review: Our Stability Heartbeat

Every week, our mobile engineering team gathers for a focused 1-hour crash review meeting. This isn't just a status update – it's our primary mechanism for maintaining stability and discipline.

We track:
- New crashes, recurring issues, regressions
- Crashes introduced by new features
Action items are logged in a Google Sheet: crash details, assignee, active versions, crash counts, crash-free users/sessions, and linked Jira tickets.
Tickets are created directly from Crashlytics and Sentry.
Plan crash fixes for the upcoming weekly release.

🚀 3. Release Engineering and On-Call Rotation

Our release engineer rotation is crucial to maintaining our stability standards. Each week, one team member takes on-call responsibility for:

Release Monitoring:

Monitoring both Crashlytics and Sentry during rollouts
Driving the weekly crash review meeting
Updating our tracking data
Coordinating hotfixes when needed

Staged Rollout Strategy: We use a cautious approach: 5% → 100% over a week, giving us time to catch issues before they impact our entire user base.

Hotfix Criteria: If a crash affects more than two-digit users, we immediately:

Investigate and fix the issue.
Release a hotfix
Monitor the fix effectiveness

Less critical crashes get pipelined to the next weekly release.

📦 4. Version Management: Keeping the Ecosystem Healthy

We maintain only the last 4-5 versions as "stable" versions, using a combination of soft and force updates:

Soft Updates: Default approach for regular version transitions.

Force Updates: Reserved for critical scenarios:

High-impact crashes/critical feature bugs affecting major user flows
Security vulnerabilities

This strategy ensures we're not supporting too many versions while giving users the flexibility to update at their convenience.

🛠 5. Third-Party SDK Challenges: When Patching Becomes Necessary

While rare, we occasionally encounter crashes in third-party SDKs that we can't wait for the vendor to fix. In these cases, we:

Create code-level patches to work around the issue.
Share patches with the SDK maintainers via GitHub issues or pull requests.
Maintain our patches until official fixes are released.

This approach has saved us from being blocked by external dependencies while contributing back to the open-source community.

📊 6. Data-Driven Decision-Making

We've built a comprehensive data pipeline that feeds into Databricks, giving everyone in the organisation visibility into our stability metrics.

Dashboards visualise:

Last 30 days of crash trends
Android vs. iOS breakdowns
Week-over-week change in crash-free metrics

This makes data accessible beyond engineering to product and business teams.

🔔 7. Alerting and Rapid Response

We've set up multiple alert channels:

Slack alerts for new crashes
Email notifications for trending stability issues
Real-time monitoring during releases

This multi-channel approach ensures that critical issues are never missed, regardless of where team members are focusing their attention.

⚙ The Technical Reality: React Native Considerations

Working with React Native adds complexity to crash management:

Native Layer: Memory issues, platform-specific bugs, third-party SDK problems.
JavaScript Layer: State management issues, async operation failures, bridge communication problems

Our dual-tool approach (Crashlytics + Sentry) gives us complete visibility across both layers, which is crucial for maintaining stability in a hybrid architecture.

📈 The Results

This disciplined, product-driven approach produced a measurable impact:

Crash-free users improved from ~98% to ~99.94%
Current platform breakdown: Android ~99.96%, iOS ~99.92%
Crash-free sessions reached 99.98%.
Over 30+ critical crashes fixed and removed from active user impact
Crash-affected users dropped from ~2% to just 0.06%.

All while maintaining a weekly app release and continuing to deliver new features.

💡 Business Impact: Why This Matters

The numbers tell a compelling story:

User Experience: Reducing crash-affected users from ~2% to 0.06% means over 50,000+ users now enjoy a smoother, uninterrupted experience each month, directly improving trust and retention.

Development Velocity: Counterintuitively, focusing on stability has made us faster. By spending less time firefighting production crashes, the team can invest more time in building and shipping features confidently.

This alignment of engineering quality with product and business outcomes shows why crash reduction isn’t just a technical initiative — it’s a strategic product investment.

✅ Key Lessons Learned

Little Effort, Big Impact

You don’t need huge infrastructure changes to improve stability. Our approach was largely process-driven, supported by targeted tooling like Crashlytics, Sentry, and Databricks dashboards.

Consistency Beats Perfection

Weekly crash reviews and disciplined tracking mattered more than perfect tools or frameworks. The steady rhythm of paying attention to stability every week kept standards high.

Ownership is Critical

A dedicated release engineer on rotation ensured someone always had eyes on stability metrics, crash dashboards, and critical fixes.

Data Visibility Drives Behaviour

Making stability data accessible to product, engineering, and business teams turned crashes from a siloed engineering concern into a shared organisational priority.

Proactive > Reactive

Catching issues during staged rollouts — before they reach 100% of users — is far better than firefighting after deployment.

🔭 Final Thoughts

Achieving ~99.9% crash-free users wasn’t about a single big change — it was about building a culture of ownership, supported by disciplined processes and the right tools.

Every mobile team can move toward similar results by focusing on:

Comprehensive monitoring and alerting
Regular, structured crash reviews
Proactive release management (including staged rollouts and hotfixes)
Data-driven decision-making is accessible across teams.

For us, the payoff has been clear:

A smoother, more reliable app for 2.5 million monthly users
Fewer drop-offs and support tickets
More focused engineering time spent building, not firefighting

Ultimately, investing in stability is about investing in user trust and product quality, and that’s something worth celebrating.