
A notification architecture for social networks — send millions of notifications per day

cloudeasyclub

Notifications are the lifeblood of mobile applications. If a less popular app stops sending notifications, people tend to forget it’s even installed on their phones. In most cases, the notifications an app sends fall into two categories:

  • Transactional notifications: these update you about an event or action associated with your account. For example, you receive a notification when someone follows you on Twitter. When you order food online, you keep receiving updates about the order. When you book a cab, you keep receiving updates about the booking status. These are all transactional notifications.
  • Promotional notifications: these update you about the latest offers, contests, and marketing campaigns the app runs to promote sales, increase time spent on the app, or move other metrics.

In most cases, the number of transactional notifications is limited. Every product feature or flow has roughly 1–5 notifications, and a user spends about 80% of their time in only a handful of product flows. As a result, mobile apps often send you more promotional notifications than transactional ones. But there are a few exceptions to this pattern: social networks and messaging applications.

For social networks and messaging applications, the number of transactional notifications depends on variable factors such as:

  • Number of daily active users (DAU)
  • Number of connections built on the network
  • Activity per connection (you have a few friends you text a lot)
  • Your prominence on the network — People with more visibility tend to receive more activity and hence more notifications.
  • Content creation behavior — For example, the number of tweets per account is high compared to Instagram posts per account.

I am sure I missed a few points, but you get the gist. The number of notifications isn’t really under the organization’s control, because it depends on user behavior. Companies use different techniques and give their top influencers tools to limit the distraction (by restricting notifications), but even then, the number of notifications sent (or generated/updated) is relatively high.

This brings us to the question: how do you design a notification-sending architecture for social networks?

I faced a similar challenge while designing the notification-sending architecture for the Leher app. It’s a community-based, audio/video-first social network (similar to Clubhouse, but much older and with many more functionalities). Our primary audience is people in India’s tier 2 and tier 3 cities. At the time of writing this article, we deliver millions of notifications per day! The following functionalities generate these notifications:

  • Direct messages: when User A sends a message to User B.
  • Club messages: Leher is a community-first social network. Think of a club like a WhatsApp group but with no limit on the number of members. We have numerous clubs with more than 10k members, and the admins can broadcast messages to all of them.
  • Live audio/video rooms: a large part of the communication on the platform happens in live audio/video rooms (similar to Clubhouse and Twitter Spaces), and they generate a lot of notifications.
  • Networking notifications: when someone follows your account, when someone joins your community, etc.
  • Payment notifications: we have two different types of virtual currency on the platform, with P2P transfers between users. All those flows have a lot of notifications associated with them (for example, payment successful, payment failed, etc.).
  • Daily games scheduled notifications: daily scratch cards, user streaks, etc. These are sent to all the active users of the platform.
  • Promotional notifications: earlier, we used third-party tools like CleverTap for sending promotional notifications. They became costly for us at scale, so we have started using our transactional notification delivery system for promotional notifications as well.

Again, I am sure I missed a few notification types (which my engineering team will undoubtedly point out), but I hope it paints a good picture of the system’s complexity. Let’s talk about the high-level product requirements I had when designing the solution.

High-level product requirements

We are a startup growing at a crazy pace, so writing down exact product requirements for a notification system is virtually impossible. Unlike Twitter/LinkedIn/Instagram, we have to keep adding new features (and removing some old ones) to meet user requirements. Each of these features and product journeys can add multiple requirements related to notifications. We also have to run various experiments in our promotional notification system.
So, I only had the following list of high-level product requirements:

  • Notification templates should be configurable: we needed the ability to change notification templates as often as needed without impacting other services. The idea is to be able to run A/B tests for different notification types and to extend those templates to other Indian languages in the future (we currently support Hindi and English only).
  • We should be able to add and remove notification types without too much hassle.
  • Ability to keep some notifications push-only: we have a notifications screen in the app, but we don’t store all the notifications. Like many organizations, we store only a few notifications in our persistent storage (with some customizations for specific notification types).
  • Ability to schedule notifications: we needed to be able to schedule a notification for a given user at a given point in the future.
  • Ability to make rapid changes to the notification architecture without changing individual microservices: we have a microservices-based architecture running 80+ services to support various platform functionalities. We needed a notification architecture that doesn’t require changing these independent services again and again, as that only adds complexity. So the communication format between the individual microservices and the notification architecture should be as generic as possible.
  • Personalization controls for users: a user should be able to turn off a particular notification type. They should also be able to set the frequency of notifications they want to receive.
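
To make the first requirement a bit more concrete, here is a minimal sketch of templates kept as data, keyed by notification type and language, with a fallback language. Everything in it (`Registry`, `TemplateKey`, the `new_follower` type, the language codes) is a hypothetical name chosen for illustration, not our actual schema, and a real setup would load the templates from a config store rather than hold them in memory.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// TemplateKey identifies one configurable template.
// All names in this sketch are hypothetical, not a real production schema.
type TemplateKey struct {
	NotificationType string // e.g. "new_follower"
	Language         string // e.g. "en", "hi"
}

// Registry holds templates as data, so they can be edited, A/B tested,
// or extended to a new language without touching other services.
type Registry struct {
	fallback  string
	templates map[TemplateKey]*template.Template
}

// NewRegistry creates an empty registry with a default language.
func NewRegistry(fallbackLanguage string) *Registry {
	return &Registry{
		fallback:  fallbackLanguage,
		templates: map[TemplateKey]*template.Template{},
	}
}

// Set adds or replaces the template for a notification type and language.
func (r *Registry) Set(notificationType, language, text string) error {
	t, err := template.New(notificationType + "/" + language).Parse(text)
	if err != nil {
		return fmt.Errorf("parse template %s/%s: %w", notificationType, language, err)
	}
	r.templates[TemplateKey{notificationType, language}] = t
	return nil
}

// Remove deletes every language variant of a notification type.
func (r *Registry) Remove(notificationType string) {
	for k := range r.templates {
		if k.NotificationType == notificationType {
			delete(r.templates, k)
		}
	}
}

// Render builds the text in the requested language,
// falling back to the default language when no translation exists.
func (r *Registry) Render(notificationType, language string, data map[string]string) (string, error) {
	t, ok := r.templates[TemplateKey{notificationType, language}]
	if !ok {
		t, ok = r.templates[TemplateKey{notificationType, r.fallback}]
	}
	if !ok {
		return "", fmt.Errorf("no template for %q", notificationType)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

With a registry like this, adding a Hindi variant is one more `Set` call, and an A/B test could be modeled as two notification types (or an extra key field) that differ only in their template text.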

These were the high-level product requirements. Now, let’s talk about the high-level engineering requirements (or the engineering problems to solve in such a system).

Engineering problems to solve

Apart from the product requirements, I had to consider a lot of engineering problems. The following are a few of them:

  1. High scalability — A notification not delivered is one less daily active user! This is the truth for a growing startup. So, it was necessary to make the system highly reliable, available, and horizontally scalable.
  2. Language/tool agnostic — Like most microservices architectures, ours uses multiple programming languages and tools (databases, file systems, etc.). We needed an architecture that shouldn’t be affected by this.
  3. Observability — You cannot optimize what you cannot measure. We needed the ability to measure, track and accurately pinpoint problems to solve them faster.
  4. Scaling the scheduler — Anybody who has ever written a scheduling system will understand this. For others, wait for a different blog post.
  5. Security — We needed to keep this system as private and secure as possible.
  6. Cost-effective — Do I need to explain this? It’s the de facto requirement for every system I have ever designed.

Now that you have an idea of the complexity of the problem, let me present my solution. To keep the explanation as beginner-friendly and straightforward as possible, I will only explain the high-level architecture, the data flow, and the tools used. I will not go into the detailed implementation of each service. But if you are working on such a system and face problems, my DMs are always open. Please find me on LinkedIn and Twitter.

The solution: High-level design of our notification system

Working of the system — Flow of data

The entire system is based on asynchronous communication between different microservices with a clear separation of concerns. This enables us to make faster changes, control the overall throughput, and monitor the system with relative ease. Let’s go through the step-by-step data flow:

  1. Individual microservices (responsible for different functionalities) fire a Pub/Sub event with a generic message format into a Pub/Sub queue called receiver-selection-q. Keeping the message format generic lets us standardize it across services. I cannot disclose the exact message format here, but it looks something like this: {notificationType, entityType, entityId, sendType}
  2. These messages are consumed by a receiver selection service responsible for finding the receiver(s) of the notification. This service contains the business logic for finding these receivers. It also knows whether a particular notification type should be scheduled for later or delivered instantly. Our receiver selection service is a complex system in itself and beyond the scope of this blog. I will write a different blog post about it in the future. Please note that it receives only one Pub/Sub event per notification, irrespective of the number of recipients.
  3. If a notification is supposed to be sent later, it goes to another Pub/Sub queue called the scheduler queue. It is then consumed by a scheduled notification service that saves the notification in a database. A polling service queries the database at scheduled intervals and sends the notifications that are due to the builder queue (builder-q) for further processing.
  4. If the notification is supposed to be sent instantly, it goes directly to builder-q. Please note: builder-q receives an individual event for each recipient. For example, suppose you are sending a message in a community that has 1000 members. In that case, the queue will receive 1000 unique Pub/Sub messages, one for each notification to be delivered.
  5. Messages in builder-q are consumed by a template builder service responsible for generating the final notification text according to the notification type and then sending it to the instant notification queue (instant-notification-q). The template builder service is also responsible for looking up the user’s language preference and generating the notification in that language. It also enables us to do A/B testing by changing the configuration of one particular notification type.
  6. If the notification is supposed to persist in our database, the template builder service also sends the same message to a storage queue (storage-q). A dedicated service, responsible only for storing notifications (and grouping them for the notification screen), consumes this message and does its job.
  7. A final consumer service consumes the messages from instant-notification-q and delivers them to the end user. It also factors in the user’s notification preferences and personalizations before sending the notification.
  8. As mentioned earlier, our promotional notification system also uses this notification-sending system: it sends its messages directly to builder-q for delivery to the relevant users. It is also an exciting system we have developed, and I will explain it in a separate blog post.
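
Steps 1, 2, and 4 can be sketched roughly as follows: one generic event comes in, and one message per receiver goes out. This is only an illustration under stated assumptions. The four field names come from the message shape above, but the JSON tags, the `sendType` values (`"instant"`, `"scheduled"`), the `BuilderMessage` shape, the `scheduler-q` name, and the choice to fan out before the scheduler queue are all made up for the example, and `ReceiverFinder` is a stand-in for the real receiver selection logic.

```go
package main

import "encoding/json"

// Event is the generic envelope a microservice publishes to receiver-selection-q.
// The four fields follow the shape from step 1; tags and values are assumptions.
type Event struct {
	NotificationType string `json:"notificationType"`
	EntityType       string `json:"entityType"`
	EntityID         string `json:"entityId"`
	SendType         string `json:"sendType"` // assumed values: "instant" or "scheduled"
}

// BuilderMessage is one per-recipient message (hypothetical shape).
type BuilderMessage struct {
	Event
	ReceiverID string `json:"receiverId"`
}

// ReceiverFinder stands in for the receiver selection business logic.
type ReceiverFinder func(e Event) []string

// Queue names used for routing; "scheduler-q" is a made-up name
// for the scheduler queue described in step 3.
const (
	BuilderQueue   = "builder-q"
	SchedulerQueue = "scheduler-q"
)

// Route decodes one incoming event and fans it out into one message per
// receiver. It returns the queue the messages should be published to:
// the scheduler queue for scheduled sends, builder-q otherwise.
// Fanning out before the scheduler queue is an assumption of this sketch.
func Route(raw []byte, find ReceiverFinder) (queue string, out []BuilderMessage, err error) {
	var e Event
	if err = json.Unmarshal(raw, &e); err != nil {
		return "", nil, err
	}
	queue = BuilderQueue
	if e.SendType == "scheduled" {
		queue = SchedulerQueue
	}
	for _, id := range find(e) {
		out = append(out, BuilderMessage{Event: e, ReceiverID: id})
	}
	return queue, out, nil
}
```

The point of the design is visible in the signature: the publishing microservice only knows the small envelope, while recipient lookup and fan-out live entirely behind the queue.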

The above architecture and the data flow enable us to solve all product and engineering problems I mentioned earlier. Please feel free to ask questions in the comments if you have any.
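
As an example of how one of those requirements maps onto the flow, the preference check in step 7 might look like the sketch below. It assumes two kinds of user settings, a set of muted notification types and a daily cap, and keeps the counters in memory. The type names, the fields, and the idea of a per-day cap are assumptions for illustration. In a multi-instance deployment the counters would have to live in a shared store.

```go
package main

// Preferences holds one user's notification settings.
// The fields are illustrative, not a real production schema.
type Preferences struct {
	Muted    map[string]bool // notification types the user switched off
	DailyCap int             // max pushes per day; 0 means no cap
}

// Gate decides, per user, whether a push may go out.
type Gate struct {
	prefs     map[string]Preferences
	sentToday map[string]int // in-memory only; a shared store is needed across instances
}

// NewGate builds a gate from a userID -> preferences map.
func NewGate(prefs map[string]Preferences) *Gate {
	return &Gate{prefs: prefs, sentToday: map[string]int{}}
}

// Allow reports whether the notification may be pushed and, if so, counts it.
func (g *Gate) Allow(userID, notificationType string) bool {
	p := g.prefs[userID] // zero value: nothing muted, no cap
	if p.Muted[notificationType] {
		return false
	}
	if p.DailyCap > 0 && g.sentToday[userID] >= p.DailyCap {
		return false
	}
	g.sentToday[userID]++
	return true
}

// ResetDay clears the per-day counters, for example at midnight.
func (g *Gate) ResetDay() {
	g.sentToday = map[string]int{}
}
```

Doing this check in the last consumer, right before delivery, means a preference change takes effect even for notifications that were already queued or scheduled.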

Tools and technologies used

Now that you understand the architecture diagram and the data flow, let’s talk about the various tools and technologies used.

  • Google Cloud Pub/Sub: receiver-selection-q, builder-q, storage-q, and instant-notification-q are all Pub/Sub queues powered by GCP Pub/Sub. If you want to use it, check the dead-letter queue and exponential backoff settings as well.
  • Services: the receiver selection service, the template builder service, and all the other services mentioned in the above diagram are written in the Go programming language.
  • Databases: for storage and retrieval, we use MongoDB as our primary database, with the help of Redis in certain places.
  • Deployment: all services run as Deployments in a Kubernetes cluster with Horizontal Pod Autoscaling set up. This makes the entire application layer auto-scalable: the number of instances of each service increases and decreases with the load.
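
If the dead-letter and exponential backoff settings are new to you, the sketch below shows the idea behind them: each failed delivery waits longer than the previous one, up to a cap, and after a fixed number of attempts the message is parked in a dead-letter queue instead of being retried forever. In Pub/Sub these are settings on the subscription, and the service applies them for you. The type, the doubling rule, and the numbers here are illustrative assumptions, not the Pub/Sub API or its exact algorithm.

```go
package main

import "time"

// RetryPolicy mirrors the kind of knobs a queue subscription exposes.
// Names and semantics are illustrative, not the Pub/Sub client API.
type RetryPolicy struct {
	MinBackoff  time.Duration
	MaxBackoff  time.Duration
	MaxAttempts int // after this many failed deliveries, dead-letter the message
}

// NextDelay returns how long to wait before redelivery after the given
// failed attempt (1-based): MinBackoff doubled per attempt, capped at MaxBackoff.
func (p RetryPolicy) NextDelay(attempt int) time.Duration {
	d := p.MinBackoff
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= p.MaxBackoff {
			return p.MaxBackoff
		}
	}
	if d > p.MaxBackoff {
		return p.MaxBackoff
	}
	return d
}

// ShouldDeadLetter reports whether the message has used up its delivery attempts.
func (p RetryPolicy) ShouldDeadLetter(attempt int) bool {
	return attempt >= p.MaxAttempts
}
```

For a notification pipeline, the cap matters as much as the backoff: a push that fails repeatedly is usually better dropped into a dead-letter queue for inspection than delivered hours late.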

Scalability and future challenges

Architecturally, the entire system is horizontally scalable except for the receiver selection service and the polling service. These are the only two parts that will need re-architecting when we move to a scale of around a million notifications per hour. We found this after benchmarking and testing the system at 100X the current scale. The same tests also showed that the system would comfortably scale up to 50 million notifications per day without breaking. We are already rewriting these two components to meet our future scaling requirements. But for a growing social networking company, the above architecture should be more than enough to meet its notification delivery needs.
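
To picture what the polling service does on each cycle, here is a minimal sketch: find every stored notification whose send time has passed, remove it from the pending set, and hand it on for publishing to builder-q. The `Store` type is an in-memory stand-in for the database, and the field names and Unix-second timestamps are assumptions for the example, not our actual implementation.

```go
package main

import "sort"

// Scheduled is a stored notification waiting for its send time (hypothetical shape).
type Scheduled struct {
	ReceiverID string
	Payload    string
	SendAtUnix int64 // send time in Unix seconds
}

// Store stands in for the database the scheduled notification service writes to.
type Store struct {
	pending []Scheduled
}

// Save records a notification for later delivery.
func (s *Store) Save(n Scheduled) {
	s.pending = append(s.pending, n)
}

// PollDue removes and returns every notification due at or before nowUnix,
// oldest first. A real poller would call this on a timer and publish each
// result to builder-q.
func (s *Store) PollDue(nowUnix int64) []Scheduled {
	var due, rest []Scheduled
	for _, n := range s.pending {
		if n.SendAtUnix <= nowUnix {
			due = append(due, n)
		} else {
			rest = append(rest, n)
		}
	}
	s.pending = rest
	sort.Slice(due, func(i, j int) bool { return due[i].SendAtUnix < due[j].SendAtUnix })
	return due
}
```

Note that the "remove and return" step has to be safe against two pollers picking up the same rows, which is one of the things that makes this component harder to run as several parallel instances than a stateless queue consumer.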

Shameless plug

Hi there. I hope you liked this blog post. If you are looking to learn more about building such highly performant, cost-effective distributed systems, I have created a dedicated learning community called cloudeasy.club just for you. You can learn about system design, software architecture, and distributed systems by joining this club. You can find us on Discord (discord.com/invite/a5yWrRnyRf) or in our WhatsApp group (link.cloudeasy.club/kBMf). Also, feel free to ping me on Twitter or LinkedIn if you have any questions.