Migration from DynamoDB to DocumentDB and Key Improvements

In engineering, success is often measured by a system's speed, efficiency, and responsiveness. While these performance metrics are important, focusing solely on them can overlook other critical factors like sustainability and cost-effectiveness.
This realization became clear during one of our key projects: fraud prevention among long-term users. Although the system performed well in detecting fraud, it faced challenges during peak periods, highlighting the need for improvement.
What started as an effort to fix these issues evolved into a broader rethinking of how we design scalable, cost-efficient systems. We needed to balance accuracy, scalability, and efficiency while controlling costs, prompting critical questions:
- Can we reduce costs without compromising accuracy?
- What trade-offs are acceptable to ensure long-term sustainability?
TL;DR
As part of our efforts to scale a real-time inference ML system sustainably, we identified several key bottlenecks, with storage being one of the most impactful. The inefficiency of using DynamoDB for this purpose was driving up operational costs. To address this, we migrated to Amazon DocumentDB, anticipating a 50% reduction in costs. While the migration appeared straightforward, we faced some challenges along the way. Ultimately, this transition not only led to significant cost savings but also enhanced the accuracy and overall performance of the model.
In parallel, we revisited and redesigned the codebase, making key components more modular and flexible. This redesign trimmed the codebase by a net 3,478 lines (+4,130 added, -7,608 removed), which not only improved maintainability but also made it easier to add new features to the model in the future.
About the Project
Our fraud detection system monitors user activity across a merchant network to identify potential fraud. It analyzes transactions, behavior patterns, and other key signals to predict which users might become delinquent. This type of fraud doesn’t just cause financial losses—it also erodes trust in the Simpl platform, making effective prevention absolutely critical.
The ML Model
At the core of the system lies an LSTM (Long Short-Term Memory) network, a specialized type of Recurrent Neural Network (RNN). Unlike traditional RNNs, which struggle with long-term dependencies due to the vanishing gradient problem, LSTMs use gating mechanisms to regulate the flow of information. This allows them to selectively remember, forget, or update data, making them ideal for identifying subtle behavioral trends in fraud detection.
LSTMs capture both short- and long-term patterns, enabling the system to recognize nuanced behaviors that signal risk. During each prediction cycle, the LSTM generates two key states - immediate patterns (H state) and long-term patterns (C state). Together, these states provide a holistic understanding of user behavior, improving the system’s predictive accuracy and reliability.
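To make the H and C states concrete, here is a minimal, hypothetical tf.keras sketch (not our production model; all dimensions are made up) showing how an LSTM layer returns both states alongside its output so they can be persisted and fed back on the next inference.

```python
# Illustrative only: a toy tf.keras model exposing the H (short-term) and
# C (long-term) states that are stored between inference calls.
import numpy as np
import tensorflow as tf

SEQ_LEN, N_FEATURES, N_UNITS = 10, 8, 32  # hypothetical sizes

inputs = tf.keras.Input(shape=(SEQ_LEN, N_FEATURES))
# return_state=True makes the layer return (output, h_state, c_state)
outputs, h_state, c_state = tf.keras.layers.LSTM(N_UNITS, return_state=True)(inputs)
score = tf.keras.layers.Dense(1, activation="sigmoid")(outputs)

model = tf.keras.Model(inputs, [score, h_state, c_state])

batch = np.random.rand(1, SEQ_LEN, N_FEATURES).astype("float32")
risk, h, c = model.predict(batch)
# h and c would be written to the datastore and supplied as the initial state
# for the user's next inference, preserving continuity.
```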
Reference: https://engineering.getsimpl.com/basics-of-rnn-and-lstm-for-non-data-scientists/
Data Modeling and Processing
Our fraud detection system processes real-time events using Apache Kafka and Apache Flink, two robust tools for handling data streams. Kafka acts as the messaging layer, collecting user activity data such as transactions and behavior logs. The Flink pipeline performs three critical tasks:
- User Filtering: Identifies relevant users whose data requires analysis.
- Data Aggregation: Combines transactional and behavioral data into a unified, structured dataset.
- Feature Preparation: Fetches supplementary data from external sources and integrates it with the dataset to enhance prediction accuracy.
The system uses two types of historical user data:
- Model-Specific Metrics: Includes LSTM-generated H&C states (H&C vectors), which help maintain continuity in predictions. The H&C vectors generated during one inference must be passed to the next inference to maintain that continuity.
- Generic User Features: Broader indicators like transaction frequency or payment history that provide valuable context for predictions.
The enriched dataset is then passed from the Flink pipeline to a Model as a Service (MaaS), where the LSTM model is hosted as an HTTP service. The model evaluates the data to identify high-risk users, flagging potential fraudulent activity. When fraud is detected, the system automatically triggers corrective actions, such as blocking accounts or reducing credit limits, to mitigate risk in real time.
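The sketch below is a simplified, hypothetical PyFlink rendering of these stages; the real pipeline consumes from Kafka and posts the enriched record to the MaaS, and the field names used here are purely illustrative.

```python
# A schematic PyFlink sketch of the filtering and enrichment stages.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# In production this stream comes from a Kafka source; a static collection is
# used here purely for illustration.
events = env.from_collection([
    {"phone": "9999900001", "merchant_id": "5678", "event_type": "purchase", "amount": 120.0},
    {"phone": "9999900002", "merchant_id": "1234", "event_type": "refund", "amount": 40.0},
])

def is_relevant(event):
    # User Filtering: keep only users whose activity needs analysis.
    return event["event_type"] in {"purchase", "refund"}

def enrich(event):
    # Data Aggregation + Feature Preparation: merge transactional data with
    # supplementary features (fetched from external sources in the real system).
    event["features"] = {"txn_amount": event["amount"]}
    return event

(events
    .filter(is_relevant)
    .map(enrich)
    .print())  # In production, the enriched record is sent to the MaaS instead.

env.execute("fraud-pipeline-sketch")
```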
Pre-population of Metrics
To ensure accurate and reliable real-time predictions, the database was preloaded with key historical data before the model went live. This process began by generating H&C vectors and event metadata from our Data Lake to provide the model with a comprehensive historical context.
Once the data was prepared, it was converted into a format compatible with the database and uploaded to S3, from where it was efficiently ingested into the datastore. This pre-population process was critical in making the system production-ready. By ensuring historical features were readily available, the model could seamlessly operate in real time, providing accurate predictions.
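As a rough illustration of this step (bucket, key, and attribute names are hypothetical, and the exact serialization we used may differ), records can be written as DynamoDB-JSON lines, compressed, and uploaded to S3 for DynamoDB's import feature to pick up:

```python
# Hypothetical pre-population sketch: serialize records as DynamoDB-JSON lines,
# gzip them, and upload to S3 for bulk import.
import gzip
import json
import boto3

records = [
    {
        "Item": {
            "phone_number": {"S": "9999900001"},   # partition key
            "sort_key": {"S": "model-metrics"},    # H&C vectors entry
            "h_vector": {"L": [{"N": "0.12"}, {"N": "0.87"}]},
            "c_vector": {"L": [{"N": "0.33"}, {"N": "0.45"}]},
        }
    },
]

payload = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))

s3 = boto3.client("s3")
s3.put_object(
    Bucket="fraud-feature-backfill",
    Key="prepopulation/part-0000.json.gz",
    Body=payload,
)
```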
Resolving Bottlenecks in the System
The project faced two primary bottlenecks: ML Model Deployment and Model-Specific Metrics Datastore.
ML Model Deployment
Initially, our system leveraged a FastAPI server alongside the TensorFlow Python package for real-time predictions via Model as a Service (MaaS). However, it struggled to handle the high throughput demands. To resolve this, we integrated TensorFlow Serving with FastAPI, which optimized the model-serving process and allowed for higher throughput without compromising the clear separation of concerns between the Flink pipeline and MaaS. This shift alleviated the performance bottleneck and led to a ~60% reduction in infrastructure costs. For a deeper dive into this transition, an earlier blog post from our team provides a comprehensive overview.
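For context, a hedged sketch of what such a setup can look like: a thin FastAPI endpoint forwarding prepared features to TensorFlow Serving's REST API (the model name, port, and payload shape are assumptions, not our exact service).

```python
# Schematic MaaS layer: FastAPI keeps the HTTP contract with the Flink pipeline,
# while inference is delegated to a TensorFlow Serving container.
import httpx
from fastapi import FastAPI

app = FastAPI()
TF_SERVING_URL = "http://localhost:8501/v1/models/fraud_lstm:predict"  # hypothetical model name

@app.post("/predict")
async def predict(payload: dict):
    # Forward the prepared feature sequence (and previous H&C states) to TF Serving.
    async with httpx.AsyncClient() as client:
        resp = await client.post(TF_SERVING_URL, json={"instances": payload["instances"]})
    resp.raise_for_status()
    return resp.json()["predictions"]
```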
https://engineering.getsimpl.com/efficient-ml-model-deployment/
Model-Specific Metrics Datastore
The datastore, initially built on DynamoDB, was designed to store H&C vectors and event metadata efficiently. While DynamoDB provided several advantages, including high read performance, low storage costs, and minimal operational overhead, writes were significantly more expensive. With the system’s near 1:1 write-to-read ratio, the high frequency of write operations became a major cost driver. The rest of the article describes how we improved on this aspect.
Datastore Improvements
To support the system’s predictive capabilities, two critical types of data were stored:
- H&C Vectors: These enable the LSTM model to maintain continuity in its predictions by building on past insights. These are updated on each inference.
- Event Metadata: This includes details such as event type, timestamp, and merchant ID, allowing the system to calculate features like the time difference between transactions or patterns of repeated interactions. This metadata is upserted whenever an event is received in the pipeline.
Data Models
To store the system’s metrics efficiently, we used DynamoDB with the following structure, designed to optimize read and write operations:
- Primary Key: The user’s phone number was used as the primary key, ensuring fast access to individual user data.
- Sort Key: The sort key enabled efficient filtering and querying by dividing the data into two categories:
  - Model Metrics: Stored as H&C vectors, these metrics help maintain continuity in analyzing user behavior over time.
  - User’s Historical Data: Represented in the format last-event#{merchant-id}:{event-type} (e.g., last-event#5678:purchase), which captures the details of the user’s most recent transaction, ensuring we have up-to-date event information for fraud prediction.
This data model was designed based on our read and write access patterns.
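An illustrative boto3 snippet (table, attribute, and value names are assumptions rather than our actual schema) showing how these access patterns map onto the partition and sort keys:

```python
# Illustrative DynamoDB access patterns for the key schema described above.
import boto3

table = boto3.resource("dynamodb").Table("fraud-user-metrics")

# Read the model metrics (H&C vectors) for a user.
metrics = table.get_item(
    Key={"phone_number": "9999900001", "sort_key": "model-metrics"}
).get("Item")

# Upsert the user's most recent event for a given merchant/event type.
table.update_item(
    Key={"phone_number": "9999900001", "sort_key": "last-event#5678:purchase"},
    UpdateExpression="SET event_ts = :ts",
    ExpressionAttributeValues={":ts": 1700000000},
)
```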
Migration to Amazon DocumentDB
To address the limitations of DynamoDB, we transitioned our system to Amazon DocumentDB. This move provided significant advantages in handling our I/O-intensive workload. The migration process was relatively straightforward because the data schema remained unchanged, reducing the development effort required.
Evaluating Other Database Options
We evaluated several alternative databases before deciding on Amazon DocumentDB. Postgres Aurora was a close contender in terms of cost, but it presented challenges with horizontal scalability. Although PostgreSQL can scale horizontally using extensions like Citus or Postgres-XL, doing so requires external tooling or manual partitioning along with additional setup and configuration, adding complexity and operational overhead that made it less ideal for our needs.
MongoDB stood out due to its native sharding capabilities, which are highly suited for workloads requiring horizontal scalability. However, we decided against using Mongo Atlas because of its higher cost structure, primarily driven by the mandatory use of read replicas. For our use case, the cost of read replicas did not offer proportionate benefits.
We also considered ScyllaDB and Couchbase, but both presented significant cost inefficiencies, making them unsuitable for our needs.
Why We Chose Amazon DocumentDB over Mongo Atlas
After a thorough evaluation, we selected Amazon DocumentDB for its cost-effectiveness and simpler setup compared to Mongo Atlas. We chose to operate DocumentDB in a single-node configuration, without replicas, within a single availability zone. Under AWS's 99% availability SLA, recovery from a failure takes around 8-10 minutes. By eliminating read replicas, we maintained system simplicity and cost efficiency while still meeting our reliability needs.
A key improvement in our system was adopting Amazon DocumentDB’s I/O-Optimized storage, designed for I/O-intensive workloads. Unlike traditional storage models that incur pay-per-request I/O charges, I/O-Optimized storage removes this cost structure. While the compute and storage costs are slightly higher, the overall trade-off is minimal when considering the significant savings on I/O operations. This approach not only enhanced operational efficiency but also bolstered cost-effectiveness, ensuring a scalable and resilient infrastructure.
To further enhance resilience and avoid data loss during DocumentDB downtime, we designed the pipeline to stop consuming events from Kafka via health checks; the system remains unresponsive for the duration of the outage, but no events are lost.
Kafka as a Buffer During Downtime
When DocumentDB faces downtime or any disruption, Kafka acts as a critical buffer. We configured it to retain seven days of historical data, which lets us replay and recover events once the database is back online. This buffering mechanism reduces the dependency on immediate database failover, providing a flexible way to handle short-term outages without losing data or service continuity.
Health Checks for System Stability
In addition to Kafka's buffering, we also implemented health checks within our Flink pipeline. These checks continuously monitor for critical failures such as database outages or external service interruptions. If a failure is detected, the system automatically pauses data consumption, halting any new commits to ensure that no inconsistent data is written while the issue is being addressed. This pause prevents data corruption and ensures that once the system stabilizes, it can continue functioning smoothly, with data integrity intact.
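Schematically, a health-check gate of this kind might look like the sketch below, assuming pymongo as the client; the real check lives inside the Flink job and pauses Kafka consumption rather than running a simple polling loop.

```python
# Simplified health-check gate: pause consumption while DocumentDB is unreachable.
import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://docdb-endpoint:27017/", serverSelectionTimeoutMS=2000)  # hypothetical endpoint

def documentdb_healthy() -> bool:
    try:
        client.admin.command("ping")
        return True
    except PyMongoError:
        return False

def consume_with_backpressure(poll_batch, process_batch):  # hypothetical callables
    while True:
        if not documentdb_healthy():
            # Pause: do not poll or commit offsets; Kafka retains the backlog.
            time.sleep(5)
            continue
        batch = poll_batch()
        process_batch(batch)  # offsets are committed only after a successful write
```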
Data Organization for Better Efficiency
To optimize storage and data retrieval, we reorganized our data management strategy. We separated model-metrics and user-event JSONs into distinct collections, ensuring each collection served its purpose efficiently. The user-event JSONs can also be reused by other LSTM models to generate time-based features, significantly reducing data duplication across multiple feature stores. This restructuring improved storage efficiency and simplified the process of deriving and sharing features across our system.
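A hedged sketch of the reorganized layout (collection, field, and index names are assumptions): two collections in DocumentDB, both keyed by phone number, with the model-metrics documents upserted on every inference.

```python
# Illustrative collection layout and upsert pattern in DocumentDB.
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://docdb-endpoint:27017/")["fraud_features"]

model_metrics = db["model_metrics"]   # H&C vectors, one document per user
user_events = db["user_events"]       # last-event metadata, reusable by other models

model_metrics.create_index([("phone_number", ASCENDING)], unique=True)
user_events.create_index([
    ("phone_number", ASCENDING),
    ("merchant_id", ASCENDING),
    ("event_type", ASCENDING),
])

# Upsert pattern used on each inference:
model_metrics.update_one(
    {"phone_number": "9999900001"},
    {"$set": {"h_vector": [0.12, 0.87], "c_vector": [0.33, 0.45]}},
    upsert=True,
)
```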
Cost Savings
The migration to Amazon DocumentDB led to a substantial reduction in operational costs, cutting our monthly expenses by 55%. These savings were achieved without compromising the scalability or reliability of the system. The decision to adopt DocumentDB proved to be a cost-effective solution that suited the evolving needs of our project.
Was it all roses? No…
Increase in Read Latencies
DynamoDB provided excellent read and write performance, with GetItem averaging around 3ms and UpdateItem around 5ms. In comparison, DocumentDB exhibited higher read latency, averaging 9ms, which impacted the throughput of our pipeline and required additional optimizations to maintain performance. However, DocumentDB excelled in write latency, delivering around 2ms, while DynamoDB's write latency was approximately 5ms. This was particularly beneficial for write-heavy tasks like updating model metrics and event metadata. Despite the improved write performance, maintaining efficient reads remained crucial, given the system's balanced write-to-read ratio, prompting efforts to address read latency issues effectively.
To optimize performance, we introduced concurrency in database operations, allowing multiple calls to execute simultaneously and reducing overall latency. We also consolidated several database operations into single network calls, minimizing round trips and improving throughput.
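Illustratively, the two optimizations look something like the sketch below, which uses pymongo with a thread pool and bulk_write; names are assumptions and the real logic runs inside Flink operators.

```python
# Concurrency for independent reads + consolidation of writes into one round trip.
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient, UpdateOne

db = MongoClient("mongodb://docdb-endpoint:27017/")["fraud_features"]

def fetch_model_metrics(phone):
    return db["model_metrics"].find_one({"phone_number": phone})

def fetch_user_events(phone):
    return list(db["user_events"].find({"phone_number": phone}))

phone = "9999900001"
with ThreadPoolExecutor(max_workers=2) as pool:
    metrics_future = pool.submit(fetch_model_metrics, phone)
    events_future = pool.submit(fetch_user_events, phone)
    metrics, events = metrics_future.result(), events_future.result()

# Several writes batched into a single network call.
db["user_events"].bulk_write([
    UpdateOne({"phone_number": phone, "merchant_id": "5678", "event_type": "purchase"},
              {"$set": {"event_ts": 1700000000}}, upsert=True),
    UpdateOne({"phone_number": phone, "merchant_id": "5678", "event_type": "refund"},
              {"$set": {"event_ts": 1700000100}}, upsert=True),
])
```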
While adding concurrency to the system, we encountered a new challenge: skipped events. We call an event skipped if a user’s event is out of order based on its timestamp. In these cases, the system would simply ignore the event and move forward, potentially losing valuable data. Before deploying, we run our sequence models through two critical tests:
- How does the model react to duplicate events?
  Since Kafka can send duplicate events (e.g., retries or multiple event sources), we examine how the model handles this redundancy. It's important to ensure that duplicate data doesn’t distort predictions, particularly for fraud detection models, where multiple interactions could signal suspicious behavior.
- How does the model handle skipped events?
  Skipped events happen when a user’s event is out of order due to timestamp discrepancies. In such cases, the system skips the event, which could potentially lead to missed insights. The key question is whether the model can maintain prediction accuracy when it misses data points and whether it can handle these gaps effectively.
We were keen to understand the model's tolerance for discrepancies, drawing from my background in electronics, where we often account for resistor tolerances to determine acceptable variations. In both cases, the goal is to ensure performance remains stable and reliable despite small deviations from the ideal.
To address the skipped-event issue, we introduced Flink’s KeyBy operator into the pipeline. KeyBy routes all of a user’s events, keyed by phone number, to the same parallel task, preserving their order. Prior to these changes, we were running at a skipped-event rate of around 1%. With the addition of concurrency, this rate could have skyrocketed. However, after implementing the KeyBy operator, we reduced the skipped-event rate to just ~0.02%, improving the system’s reliability.
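A minimal PyFlink illustration of the fix (schematic only; field names are assumptions): keying the stream by phone number so that all of a user's events are processed in order on the same parallel task, even with concurrent database calls downstream.

```python
# Keying the stream preserves per-user event order across parallel tasks.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection([
    {"phone": "9999900001", "event_type": "purchase", "ts": 1700000000},
    {"phone": "9999900001", "event_type": "refund", "ts": 1700000100},
])

(events
    .key_by(lambda e: e["phone"])   # per-user ordering is preserved within the key
    .map(lambda e: e)               # enrichment / MaaS call would happen here
    .print())

env.execute("keyby-sketch")
```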
Import from S3
DynamoDB provided a cost-efficient and user-friendly way to import data directly from S3, simplifying the process of bulk data ingestion. This feature allowed compressed JSON files to be uploaded to S3 and seamlessly imported into DynamoDB tables without requiring additional scripts or manual intervention. For example, preprocessed model and user-event data could be easily dumped into S3, triggering an automatic import process.
In contrast, DocumentDB lacked a direct import-from-S3 feature, which introduced significant operational challenges. Files had to be downloaded from S3 to an intermediary system, such as an EC2 instance, before using the mongoimport utility to import the data into DocumentDB. This process not only required additional infrastructure but also necessitated maintaining external scripts to handle the data transfer and transformation.
Additionally, copying gigabytes of data from S3 to EC2 machines and running the import into network-access-restricted DocumentDB instances created significant operational overhead. In comparison, DynamoDB’s ability to directly ingest data from S3 without intermediary steps saved both time and effort, highlighting a gap in the capabilities of DocumentDB and MongoDB Atlas, which also requires a similar multi-step process for imports.
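For completeness, the manual path looked roughly like the sketch below (bucket, credentials, and cluster details are placeholders): download the dump onto an EC2 host inside the VPC, then shell out to mongoimport.

```python
# Rough sketch of the manual import path for DocumentDB.
import subprocess
import boto3

s3 = boto3.client("s3")
s3.download_file("fraud-feature-backfill", "prepopulation/part-0000.json", "/tmp/part-0000.json")

subprocess.run(
    [
        "mongoimport",
        "--host", "docdb-endpoint:27017",
        "--ssl", "--sslCAFile", "rds-combined-ca-bundle.pem",
        "--username", "app_user", "--password", "********",
        "--db", "fraud_features",
        "--collection", "model_metrics",
        "--file", "/tmp/part-0000.json",
    ],
    check=True,
)
```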
MongoDB version
Amazon DocumentDB is not fully compatible with MongoDB. While it supports MongoDB's API to some extent, it is built on AWS's Aurora platform, which introduces several limitations. For instance, DocumentDB supports only the MongoDB 4.0 API, meaning that many features from later MongoDB versions (such as sharding improvements and more recent indexing options) are unavailable. According to some references, tests show that DocumentDB fails around 66% of MongoDB API correctness tests.
MongoDB Atlas remains the better choice for applications requiring the full feature set and flexibility of MongoDB, particularly for workloads involving complex queries and dynamic scaling.
Deployment Strategy
To safely roll out changes and improvements, we adopted the Shadow Deployment Strategy. This involved deploying a new Flink pipeline with a separate consumer group that listened to the same Kafka events as the existing pipeline. By comparing the throughput of the two pipelines in real-time, we could evaluate the impact of our changes under actual production load without disrupting the user experience. This approach allowed us to test, observe, and adjust the pipeline step by step, ensuring smooth performance and stability before making the new version live.
Another advantage of this setup was the ability to run experiments. Using Kafka's seven-day event retention policy, we could rewind consumers to any point in time and test how the pipeline handled a wide variety of events. While most scenarios were tested in staging, production often revealed unique edge cases. Observing how the pipeline performed in these situations gave us greater confidence in its reliability.
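Outside of Flink, the two ideas behind this setup can be illustrated with a small kafka-python sketch (topic, broker, and group names are hypothetical): a separate consumer group reading the same topic, rewound to an earlier timestamp within the seven-day retention window.

```python
# Schematic shadow consumer: independent consumer group + rewind by timestamp.
from datetime import datetime, timedelta
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="broker:9092",
    group_id="fraud-pipeline-shadow",   # separate group: same events, independent offsets
    enable_auto_commit=False,
)

tp = TopicPartition("user-events", 0)
consumer.assign([tp])

# Rewind the shadow pipeline three days back to replay historical events.
three_days_ago_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
offsets = consumer.offsets_for_times({tp: three_days_ago_ms})
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)

for record in consumer:
    pass  # feed the record into the shadow pipeline and compare outputs/throughput
```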
After addressing the engineering challenges, we shifted our focus to the model's performance. We carefully analyzed its metrics to ensure it met expectations and worked well in real-world conditions.
"Every engineering decision comes with trade-offs, and the true art lies in leaning the scales toward the positives—much like the deliberate choices we make in life."