A service that makes it possible to send a high volume of emails without maintaining your own email delivery server.
Goals and challenges
- Track user interaction with the email (up to 10K notifications per minute).
- Reduce the load on a 3rd party app that sends a high number of emails daily.
- Webhook receiver’s downtime shouldn’t affect main application flow.
- The system has to be scalable and guarantee delivery within one day when the hook target is temporarily unavailable.
- Convenient and timely email delivery status reporting and displaying this info on a dashboard.
- Service is easy to use and customizable to meet changing requirements.
- Availability and comprehensiveness of documentation.
How we helped
We've implemented the hooks delivery system with Amazon Simple Notification Service (SNS) and a proxy server that processes, subscribes, and redirects requests. Little coding was required, and responsibility for webhook delivery is assigned to an external service.
- Email delivery system with the user’s activity tracking feature that processes a high number of events (up to 20K notifications per minute).
- Information about user interaction with the email is captured and stored.
- Redelivery solution is available when webhook receiver is unreachable.
- System is separated and does not affect main app services.
- Solution is easy to scale.
- Throughput is two times higher than planned.
To achieve project objectives, we’ve considered several implementation options:
- Naive implementation.
- Implementation via SQS queue.
- Webhooks delivery by SNS notification.
The following components are responsible for events in our system:
- Event consumer: pulls the queue and decides what to do with events.
- Event storage: stores all events in our system.
- 3rd party app: has to be informed about events related to the app account.
Naive implementation
The simplest way to inform a 3rd party app is to make an HTTP request to the hook URL directly from the consumer, as displayed on the following diagram.
Pros:
- Easy to implement.
Cons:
- The main application flow is affected: if the 3rd party app doesn't respond, we get an error in our internal consumer.
- The consumer cannot be scaled separately from the webhook delivery to the 3rd party app.
- Impossible to implement webhooks redelivery correctly.
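The coupling in this approach can be sketched in a few lines of Node.js. This is a minimal illustration, not our production consumer: `storeEvent` and `postHook` are hypothetical callbacks that stand in for the event storage and for the HTTP call to the hook URL.

```javascript
// Naive approach: the consumer calls the hook URL inline.
// `storeEvent` and `postHook` are hypothetical async callbacks (assumptions).
async function consumeNaively(event, storeEvent, postHook) {
  await storeEvent(event); // internal work succeeds
  await postHook(event);   // a failing 3rd party app rejects here...
  return 'processed';      // ...so the consumer never reaches this line
}
```

Any failure or slowdown in `postHook` surfaces directly in the consumer, which is the main drawback of this option.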
Accordingly, we came to the conclusion that this approach will not give us expected results or satisfy our needs in a comprehensive manner.
Therefore, we’ve started to consider other options.
Implementation via SQS Queue
We already use a lot of AWS services and have a number of SQS message queues. The main idea of this solution is to use a separate queue for hooks (webhooks) and a dedicated consumer to process this queue.
Take a look at the schema that shows how the SQS queue works:
This way, our main app flow does not depend on the working state of an external app, and we are able to implement retries via the SQS queue.
The flow of a retry is as follows:
- Pull webhooks events from queue.
- Send HTTP request with events in the body to 3rd party app.
- If request is completed successfully, then remove message from queue.
- If request has failed, retry with the specified delay and send a message once again.
- Set the maximum receive count equal to the delivery retry count.
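The steps above can be sketched as one pass of a webhook consumer. This is a simplified illustration under stated assumptions: `queue` and `sendHook` are hypothetical interfaces, not AWS SDK calls. In SQS itself, the receive-count limit is normally configured through a redrive policy with a dead-letter queue; the sketch handles it inline only to stay self-contained.

```javascript
// One pass of the webhook consumer over an SQS-like queue.
// `queue` and `sendHook` are hypothetical interfaces (assumptions):
//   queue.receive()            -> Promise<Array<{ id, body, receiveCount }>>
//   queue.remove(id)           -> Promise<void>   (delete the message)
//   queue.retryLater(id, secs) -> Promise<void>   (redeliver after a delay)
//   sendHook(body)             -> Promise<boolean> (true on a 2xx response)
const MAX_QUEUE_DELAY_SECONDS = 900; // 15 minutes, the SQS limit named in the text

async function processHookQueue(queue, sendHook, retryDelaySeconds, maxReceiveCount) {
  const delay = Math.min(retryDelaySeconds, MAX_QUEUE_DELAY_SECONDS);
  const outcome = { delivered: [], retried: [], dropped: [] };

  for (const message of await queue.receive()) {
    let ok = false;
    try {
      ok = await sendHook(message.body);
    } catch (err) {
      ok = false; // network errors count as failed deliveries
    }

    if (ok) {
      await queue.remove(message.id); // success: remove from the queue
      outcome.delivered.push(message.id);
    } else if (message.receiveCount >= maxReceiveCount) {
      await queue.remove(message.id); // retries exhausted: give up
      outcome.dropped.push(message.id);
    } else {
      await queue.retryLater(message.id, delay); // failed: try again later
      outcome.retried.push(message.id);
    }
  }
  return outcome;
}
```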
The flow is displayed on the diagram below:
Pros:
- Main application flow is not affected.
- Redelivery can be easily implemented.
- The main event consumer can be scaled independently.
Cons:
- More complex to implement.
- The maximum redelivery delay is 15 minutes (the maximum delay in SQS queues), so we can’t keep increasing the interval between retries for our messages.
- The retry logic and the responsibility for webhook delivery have to be managed manually.
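To illustrate why the 15-minute cap matters, consider a retry schedule whose delay doubles on every attempt. The arithmetic below is only a sketch: the 60-second base delay and the doubling factor are assumptions chosen for the example, not values from our system.

```javascript
const QUEUE_DELAY_CAP_SECONDS = 900; // 15 minutes

// Delays for a doubling backoff, each clamped to `capSeconds`.
function cappedBackoff(baseSeconds, attempts, capSeconds = QUEUE_DELAY_CAP_SECONDS) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseSeconds * 2 ** i, capSeconds));
  }
  return delays;
}

// Smallest number of retries whose combined delay spans `windowSeconds`.
function retriesToCover(windowSeconds, baseSeconds, capSeconds = QUEUE_DELAY_CAP_SECONDS) {
  let elapsed = 0;
  let attempts = 0;
  while (elapsed < windowSeconds) {
    elapsed += Math.min(baseSeconds * 2 ** attempts, capSeconds);
    attempts++;
  }
  return attempts;
}

const ONE_DAY_SECONDS = 24 * 60 * 60;
console.log(cappedBackoff(60, 6));                         // [ 60, 120, 240, 480, 900, 900 ]
console.log(retriesToCover(ONE_DAY_SECONDS, 60));           // 99 retries with the cap
console.log(retriesToCover(ONE_DAY_SECONDS, 60, Infinity)); // 11 retries without it
```

Under these assumptions, covering a one-day delivery window takes 99 capped retries instead of 11 uncapped ones.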
This option was much better; however, not all of our primary goals would be met by the implementation via an SQS queue.
Webhooks delivery by SNS Notification
Another way to implement webhooks delivery is to use AWS SNS. The service was developed for delivering notifications and can send them over HTTP, the same way webhooks are sent. Also, if notification delivery fails, SNS is responsible for retrying, and we can choose a backoff function (linear, arithmetic, geometric, or exponential) to manage retry frequency.
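As an illustration, a retry policy for an HTTP/S subscription is a small JSON document. The field names below follow the SNS delivery policy format as we understand it; all values are assumptions picked for the example and should be checked against the current SNS limits before use.

```javascript
// Illustrative retry policy for an HTTP/S subscription (values are assumptions).
const deliveryPolicy = {
  healthyRetryPolicy: {
    minDelayTarget: 5,       // minimum delay between retries, in seconds
    maxDelayTarget: 300,     // maximum delay between retries, in seconds
    numRetries: 10,          // total number of retries across all phases
    numNoDelayRetries: 0,    // retries made immediately
    numMinDelayRetries: 2,   // retries spaced by minDelayTarget
    numMaxDelayRetries: 3,   // retries spaced by maxDelayTarget
    backoffFunction: 'linear', // how the delay grows between min and max
  },
};

// The policy is passed to SNS as a JSON string
// (the subscription's "DeliveryPolicy" attribute).
const deliveryPolicyAttribute = JSON.stringify(deliveryPolicy);
```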
This solution would give us the desired outcome; however, it cannot be applied as is for two reasons:
- SNS sends confirmation requests to the subscribed endpoint (the webhook destination).
- SNS adds redundant metadata to delivered messages.
SNS was designed for communication between components within one system, not between components of different systems with different owners. Hence, we need a proxy component inside our system that SNS can communicate with.
Let's take a look at the scheme of proposed webhooks implementation using SNS:
Flow of hooks delivery using SNS:
- The consumer sends events to a certain SNS topic.
- We subscribe the 3rd party app to the topic through the proxy: the subscription URL points to the proxy and contains the endpoint that requests should be redirected to.
- When the proxy receives a request for subscription confirmation, it will send a request to SNS with a proper confirmation token.
- When the proxy receives a request with a notification (an event, in our case), it formats the event data and forwards it to the target URL taken from the path part of the request URL.
- All responses from the hooks’ endpoints pass as responses to SNS requests.
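The proxy's decision logic for this flow can be sketched as a pure function. This is a minimal sketch, not the actual proxy code: the `/hook/` prefix and the URL-encoded target in the path are hypothetical conventions chosen for the example. The `x-amz-sns-message-type` header and the `Token`, `TopicArn`, and `Message` fields are part of the SNS HTTP/S message format.

```javascript
// Decide what the proxy should do with one incoming SNS request.
// Assumption: the subscription URL looks like
//   https://<proxy-host>/hook/<encodeURIComponent(target URL)>
// The "/hook/" prefix and this encoding are hypothetical.
const HOOK_PREFIX = '/hook/';

function planProxyAction(headers, rawBody, requestPath) {
  const messageType = headers['x-amz-sns-message-type'];
  const envelope = JSON.parse(rawBody); // SNS posts a JSON document

  if (messageType === 'SubscriptionConfirmation') {
    // Confirm on behalf of the 3rd party app, which never sees this request.
    return { action: 'confirm', topicArn: envelope.TopicArn, token: envelope.Token };
  }

  if (messageType === 'Notification') {
    if (!requestPath.startsWith(HOOK_PREFIX)) {
      return { action: 'reject', status: 400 };
    }
    const targetUrl = decodeURIComponent(requestPath.slice(HOOK_PREFIX.length));
    // Forward only the payload; SNS metadata (MessageId, Signature, ...) is dropped.
    return { action: 'forward', targetUrl, body: envelope.Message };
  }

  return { action: 'ignore' };
}
```

A real handler would then confirm the subscription with SNS using `topicArn` and `token`, or POST `body` to `targetUrl` and return the target's status code to SNS, so that a failed hook triggers an SNS retry. Signature verification of incoming SNS messages is omitted here for brevity.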
The schema of the proxy API is shown below:
Pros:
- Main application flow isn’t affected by the state of the hooks’ receiver application.
- The responsibility for message delivery is shifted to SNS.
- Redelivery strategy is already implemented by AWS.
- The simple proxy has good throughput, is very lightweight, and can be scaled.
- Little code is needed (less code, fewer bugs).
Cons:
- The proxy has to be developed.
After weighing each implementation option in terms of the benefit it brings and the resources it requires, we chose webhooks delivery by SNS notification. With SNS and a simple proxy server, we implemented the hooks delivery system with minimal coding and shifted responsibility for webhook delivery to an external service. As a result, we are able to send about 20,000 events per minute without any scaling, which is twice the planned 10,000 events per minute. The major bottleneck is the proxy, but we designed it for scaling, so throughput can be easily increased.
We wrote our proxy on the Node.js platform. The proxy is hosted on an EC2 t2.small instance and can process about 4,000 concurrent requests with a throughput of about 400 req/sec. The proxy doesn’t interact with the database, since all data for redirects is passed in URLs. This gives us the option of horizontal scaling. If we place the proxy behind a load balancer, we can add more server instances and, as a result, obtain two times higher throughput.
Vertical scaling won’t give us the same effect. When we scale Node.js processes within one server instance, we get about 30% higher throughput for each additional processor core. Therefore, horizontal scaling is a more suitable and effective approach.
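The difference between the two approaches can be made concrete with simple arithmetic. The sketch below uses the figures quoted in the text (about 400 req/sec per instance, about 30% per additional core). Linear growth with instance count and a fixed 30% gain per core are simplifying assumptions, not measurements.

```javascript
const PER_INSTANCE_RPS = 400;    // throughput of one proxy instance (from the text)
const GAIN_PER_EXTRA_CORE = 0.3; // ~30% per additional core (from the text)

// Horizontal: assume throughput grows linearly with instances behind a load balancer.
function horizontalRps(instances) {
  return PER_INSTANCE_RPS * instances;
}

// Vertical: assume each extra core adds ~30% of the single-core throughput.
function verticalRps(cores) {
  return Math.round(PER_INSTANCE_RPS * (1 + GAIN_PER_EXTRA_CORE * (cores - 1)));
}

function eventsPerMinute(rps) {
  return rps * 60;
}

console.log(eventsPerMinute(horizontalRps(1))); // 24000
console.log(horizontalRps(2));                  // 800
console.log(verticalRps(2));                    // 520
```

Under these assumptions, a second instance adds about 400 req/sec, while a second core adds only about 120 req/sec.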
We don’t need to worry about scaling SNS because AWS does it automatically; we only have to send the proper number of messages.