This is the story of how our transaction systems evolved to handle the growing scale of our business.
What is Transaction Systems
Simpl Pay Later is a platform that allows users to buy now & pay later. This payment option is available only for pre-approved users and is integrated into the merchant/gateway's system using APIs hosted by Transaction-Systems.
As a legacy application written in the very beginning of time transaction-systems has many responsibilities. But its main responsibility is simple: it maintains users' credit worthiness status (we call it approval status), credit limit, and facilitates e-commerce transactions. It primarily serves two requests.
1) Read Transaction approval of the user
As Simpl Pay Later is available only for pre-approved users, we expect merchant to show this payment option only to approved users. So merchant makes a https call to our system to check the approval status of the user and then shows Simpl Pay Later as a payment option. This request contributes to over 80% of our service's traffic.
2) Write Transaction to user’s ledger
Once Simpl pay later option is shown to eligible users, merchant makes a call to our system to create a transaction to user's ledger. This happen once customer selects Simpl pay later as the payment option. It takes place in two steps. First we block the transaction amount against the credit limit - this is done just before the transaction happens on the merchant. Once the transaction is complete, the amount is marked as utilised against the credit limit.
The First Avatar
Initially, Simpl used Ruby on Rails as the primary language for developing Transaction-Systems. The language & platform has advantages such as less development time and access to many useful libraries & frameworks. The system is built with PostgreSQL as the database.
However, as the system began scaling, some issues arose. Firstly, high throughput and low latency endpoints became problematic, and secondly, the resource usage was high. For example, to run one instance (docker based containers) of Transaction-Systems, it required around 1.5 GB of memory and that instance could serve only 1500 RPM. Therefore, to serve 30K RPM, we needed 20 instances of the application (30K/1500=20)
, and it required a total of 30 GB memory. These instances needed to be hosted in many EC2 servers. Furthermore, reducing the response time in Rails applications to less than 150 ms was very difficult, since even request parsing took a lot of time. Rails is a fantastic framework for many use cases and Simpl uses it extensively. However, for this particular use case, we decided that it was better to move to something else.
Time to “GO” over Ruby on Rails
We realised that if we wanted to scale further, we would need more machines and would eventually end up paying more in AWS bills. Transaction-Systems had a lot of other responsibilities besides serving the merchant’s requests, but merchant requests constituted the bulk of the application throughput (80%+). So, we decided to move only those requests to another microservice. However, we did not want to use Ruby On Rails because of its memory footprint & request parsing overhead. Additionally, we expected this service to handle a peak load of at least 100K RPM with minimum resources and P99 of less than 120 ms. Therefore, we decided to pick a statically typed language with a low memory requirement compared to a scripting language.
We picked Golang and selectively re-implemented the low latency, high throughput merchant request use cases while keeping the back office process-heavy use cases in Rails. Rails shines in background processing use cases with the help of Sidekiq whereas we did not find a very good library similar to it in Golang. So we left the majority of the background processing and cron jobs in the RoR application.
Since the system was already running in production and endpoints were already integrated by merchants, we decided to first write the same endpoint with the same functionality in Golang and connect to the same database and cache. However, merchants were still calling our legacy Rails systems, so we decided to proxy selected merchant requests through to the new Golang service. Eventually, once we developed enough confidence, the legacy transaction systems proxied all the requests to the Golang based transaction systems. The setup looked something like this.
This experiment ran for months so that we could allow for soak-time and weed out any issues (edge cases, race conditions etc.) before exposing our merchants directly to the new service. Once this was done, we changed the DNS cName record from legacy transaction systems to the Golang transaction systems.
It took approximately 6 months to migrate endpoint by endpoint from RoR to Golang and monitor the process to ensure nothing was missed. However, the effort was worth it as there was a 20% improvement in P99 response time. Additionally, the memory required by the Golang transaction systems per instance (docker container) was only 120 MB and each could serve 2500 requests in a minute.
RoR Application | GoLang Application | |
---|---|---|
Projected Peak Throughput | 100K RPM | 100K RPM |
Memory | 1.5 GB | 120 MB |
Load served by single task | 1500 RPM | 2500 RPM |
Tasks needed to serve load | 100K/1500 = 68 | 100K/2500 = 40 |
Overall memory needed | 68 * 1.5 GB = 102 GB | 40 * 120 MB = 4800 MB or 4.6 GB |
Number of EC2 machines needed | 102 GB / 14.4 GB = 7.0833 | It needs only 4.6 GB, so |
Machine Cost(m6a.xlarge | $0.1728 * 24 * 31 * 8 | $0.1728 * 24 * 31 = $128.5632 |
To handle a load of 100K RPM, the RoR application would require 8 m6a.xlarge
EC2 machines, while GoLang only needed one. As a result, we observed a monthly savings of more than 80% on our AWS EC2 bills with respect to Transaction-Systems.
The Next Level of Bottlenecks: Tail Latencies for Tenured users
After our switch to the Golang application, everything was working well. Until our CEO reported that he was unable to see the Simpl Pay Later payment option in several merchants. Upon investigation, we discovered that, for our CEO, our Read Transaction approval was returning a status of true, but it was taking over 900 ms to return the response. The response times for most of our transaction systems were good, with P99 response times under 120 milliseconds. However, we started noticing increased response times for certain user requests, particularly those related to Read transaction approval. This was a new performance bottleneck that we had not encountered before.
After analysing the issue, we found that the affected users were repeat users who had been using Simpl Pay Later beyond a certain threshold of transaction activity (users with more than 1000 transaction entries).
Despite this, the P99 response time of read transaction approval still fell within the range of 120 milliseconds. However, the P99.9 response time was much higher, ranging from 800 milliseconds to 1200 milliseconds. We were experiencing tail latencies!
Before we delve deeper into the problem, let's first understand what Read Transaction approval does. Essentially, it runs through a series of checks. If any of these checks fail, it returns an error and does not proceed with any further checks. This results in the approval being marked as false. The checks are as follows:
- Check if the user is not blocked to use credit
- Check if the user’s credit limit > transaction value
- If the user does not have any transaction, return user is approved to do the transaction
- Else calculate utilised & blocked credit limit, check if the user’s credit limit - credit utilised/blocked for user > transaction value
In the above process, the third and fourth steps were causing the problem. Let us jump to solutions now.
Slow Queries turned to Swift Solutions
Existence Over Enumeration
In the third step mentioned above, we need to check if the user has any ledger transactions already. The query to do the same looked like this.
select count(*) from ledger_transactions where user_id = ${user_id}
The above query determines if a user is a repeat user or a new user based on count of ledger transactions. Initially, we were confident in the query because the user_id
column is indexed and it should fetch the records efficiently. However, as the number of ledger transactions increased, the query started to slow down for users with more transactions (tenured users!). Upon analysing the situation, we realised that our primary need was to determine if the user had transacted before or not, rather than the number of transactions they had performed. As a result, we optimised the count with an exist query, which was now focused solely on identifying if the user has conducted any transactions.
select exists(select 1 from ledger_transactions where user_id = ${user_id})
This query did not take more time for users with large number of transactions. It solved a part of the problem and improved the performance of the read transaction approval endpoint for repeat users. After this improvement P99.9 improved by around 100-200 ms, but it did not reduce to our acceptable level of under 120-150 ms. So we dug deeper.
Index Optimisation: Saying Goodbye to the Unnecessary
One of the key steps in the approval call (the 4 steps mentioned earlier) is calculating the amount blocked in the user's ledger. The query used to perform this calculation looked like this:
select SUM(amount) from ledger_transactions where user_id = ${user_id} and category = 'debit_blocked'
The table ledger_transactions
had two different indices on the user_id
and category
columns. The majority of the entries in the ledger_transactions
will have a category attribute of either debit
or credit
, and only a small number of entries will be present with the category debit_blocked
. We expected that this query would use the index on the user_id
column (and ignore the category
index) and return data with no bottlenecks. However, for users with more rows, the Postgres query plan behaved differently: Postgres used both indices separately to query all the ledger_transactions
using the user_id
index and query all ledger_transactions
using the debit_blocked
category index. It then performed a bitmap AND to compute the final result. The query plan looks like this.
Aggregate (cost=895.75..895.76 rows=1 width=8)
-> Bitmap Heap Scan on ledger_transactions (cost=891.73..895.75 rows=1 width=4) (actual time=17130.036..17130.037 rows=0 loops=1)
Recheck Cond: ((user_id = ''::uuid) AND ((category)::text = 'debit_blocked'::text))
Rows Removed by Index Recheck: 15
Heap Blocks: exact=16
-> BitmapAnd (cost=891.73..891.73 rows=1 width=0)
-> Bitmap Index Scan on index_ledger_transactions_on_user_id (cost=0.00..11.10 rows=337 width=0)
Index Cond: (user_id = ''::uuid)
-> Bitmap Index Scan on index_ledger_transactions_on_category (cost=0.00..880.38 rows=35707 width=0)
Index Cond: ((category)::text = 'debit_blocked'::text)
The problem here is that even if the user has around 1000 transactions, there will be a minimum of 50,000 transactions in the debit_blocked
category (which is around 50x the count) and the query looks at both, which slows down the overall performance. So, just removing the category
index from the table fixed this problem completely.
We learned a hard lesson that an index may not always improve performance; in fact, it can sometimes significantly deteriorate it. Post this fix we saw a very good improvement in the response time in range of more than 300-400 ms. Overall P99.9 dropped from around 1000 ms to 600 ms, but still it is not in the expected range of under 120-150 ms. We needed to do more.
Introduction of User Ledger Summary
The step to calculate how much credit the user has utilised was still a bottleneck. The ledger transactions table had both the credit (user paying simpl back and refunds) and debit(transaction user did on the merchant) entries. So this calculation will look something like this
select SUM(amount*(CASE
WHEN category = "debit" THEN -1
WHEN category = "credit" THEN 1)) as amount_due
from ledger_transactions where user_id = ${user_id}
As mentioned earlier, the user_id
column of our database table was already indexed, but the query was not fast enough for users with more than 1000 transactions (as said earlier we were still taking 600 ms). This is because the query engine needed to go through all the rows and return the sum, which is not efficient. We need to come up with a different strategy to solve this issue.
Furthermore, if we continue with our current strategy of calculating the due amount using all the ledger transactions for a single user, we will never be able to archive old entries in this table and institute a data retention policy for our service. This could cause potential scale problems in the future.
Hence, we decided to create a new table called user_ledger_summary
that will store the utilised credit amount for any given user. We will update this entry every time a ledger transaction is written. This way, when determining a user's approval for a read transaction, instead of calculating the due amount using ledger transactions, we can simply fetch it from the user_ledger_summary
table. This will ensure that even for users with thousands of transactions, we just need to access a single row and hence our response time will be the same for variously tenured users.
The tricky part here is the implementation of this in a running system with more than 5 Million users and 30 Million ledger transactions - we need to backfill this summary table for the older transactions so that newer transactions can work on top of that summary.
We initially considered running an independent task which can backfill the user_ledger_summary
entries for all users overnight, and in the next day, release the code changes which will incrementally update the backfilled summary. But this method will miss any ledger transactions which happens after the user_ledger_summary entry creation starts and up to the point of release of code changes.
So we decided to go with a maker checker approach. This is a two step process.
- Initially we continue to determine the credit limit utilised, by calculating it from the ledger on the fly like before, but we create the summary in the background.
- Once the summary is available for all, we move away from the active calculation in the approval call and start relying on the summary and incremental updates done to the summary table.
The details of first step in the Read transaction approval flow for calculation of utilised credit (or user due amount) is depicted below:
As described above, the calculation of the utilised credit limit is based on ledger transactions as before. But an additional step in this approval flow was introduced where we create a summary of user ledger information (aka maker), called user_ledger_summary
, if it does not already exist. We then compare this summary with the actual amount due (checker), if it exists and alert in case of discrepancy. This extra step is triggered within a Go Routine, which ensures that it does not increase the response time of the endpoint.
We ran this comparison process for three months to ensure that there was no discrepancy between the summary and the actual amount due. This concluded the first step.
After this, we embarked on the second step. We made changes to the approval criteria for Read transactions, to use the user_ledger_summary.due_amount
for 10% of cases, before eventually scaling it to all users.
Once the change was completely rolled out, we noticed a significant reduction in the tail latencies of Read transaction approval.
With this fix, the bottleneck in scaling transactions was removed. As a result, the P99.9 significantly reduced from 600 ms to 120 ms. Additionally by shifting to the user_ledger_summary
, one of the key RDS bottlenecks - IOPS - reduced by over 50%. This has resulted in a considerable reduction in infrastructure costs.
The Overall Journey
The Transactions-Systems has undergone significant changes in its journey to satisfy the scaling needs of the business. We have done bold architectural changes as well as simple query fixes to improve system scalability and performance. This illustrates the fact that, as engineers we need to use every tool in our toolbox to improve the scalability of a system. Let us never forget the small ants and just think about the big elephants.
Another key learning was, there has to be process and method to introducing changes to a running high scale system and we need to take steps to ensure that application uptime and integrity is never compromised to improve the scalability of the system.