A brief exploration
In the last few years, the adoption of event-driven systems has seen a considerable increase among big companies. What are the main reasons behind this trend? Is it purely based on hype or are there any valid reasons to support the adoption of this architecture? From our perspective, the main reasons why many companies are following this path are:
Loose coupling
Having individual components that interact asynchronously through events achieves loose coupling. We can modify, deploy and operate these components independently, at different times and without problems; this is a huge benefit in terms of maintenance and productivity costs.
Asynchronous communication
This advantage is highly related to loose coupling, but we thought it was worth mentioning individually due to its importance. One of the greatest advantages of these systems is that an event can be fired without caring how or when it gets processed, only that it is durably persisted in the corresponding topic.
Easy integration of new consumers
Once a type of event is sent to a topic, new consumers interested in that event type can subscribe and start processing it. The producer won't have to do any work at all to integrate new consumers, as producer and consumers are totally decoupled.
In a traditional synchronous HTTP communication between components, the producer would have to call every consumer independently. Considering that each call to each consumer could potentially fail, the cost of implementing and maintaining this is much higher than following the event-driven approach.
Availability
One of the main benefits is that components do not necessarily need to be operating at the same time. For instance, one of the components could be unavailable and that will not affect the other component at all (as long as it has work to do).
Easy to fit some business models into event-based systems
Some models, especially those where entities go through different stages during their lifecycle, fit really well into event-based systems. It becomes very easy and sensible to define the model as a set of interconnected relationships based on a cause and an effect. Every “cause” originates an event that produces a given “side effect” on a particular entity, or even on multiple entities in the system.
Considering all the benefits cited above, we can clearly see why this architecture has been becoming more popular in recent years. For companies, agility and cost-efficient maintenance are of huge importance. Choosing one architecture over another can have a considerable impact on a company’s performance. We must understand that in the upcoming years, many companies will be tech-powered and that’s why our work and our architectural decisions are so important for the success of any company.
Having said that, we must also know that building event-driven systems is not an easy task. There are some common mistakes that beginners working with these systems often make. Let’s go through some of them.
As in any system where asynchronous operations are taking place, many things could go wrong. However, we’ve detected a small set of mistakes that are reasonably easy to spot in projects that are not in advanced stages yet.
Not guaranteeing order
The first of the problems we’ll look into is overlooking the need to guarantee the order of certain operations over a given entity in our business model.
For example, let’s say we have a payments system where every payment goes through different states based on certain events. We could have a `ProcessPayment` event, a `CompletePayment` event and a `RefundPayment` event; each of these events (or commands in this case) transitions the payment to its corresponding state.
What would happen if we don’t guarantee ordering in this case? We could end up in situations where, for example, a `RefundPayment` event is processed before a `CompletePayment` event for the same payment. The payment would then stay completed despite our intention to refund it.
This happens because at the time we process the `RefundPayment` event, the payment is still in a state that doesn’t admit refunds. We could put other measures in place to overcome this issue, but they wouldn’t be efficient. Let’s see this situation in a diagram to understand the issue better!
In the illustration above, we can see how consumers can consume payment events at the same time for the same payment ID. This is a problem for several reasons, the main one being that we always want to finish processing one event before proceeding with the next event for that payment.
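To make the failure mode concrete, here is a minimal sketch of the payment lifecycle as a state machine. All the names (`Payment`, `apply_event`, the state strings) are illustrative, not taken from any real API:

```python
# Allowed transitions: (current state, event) -> next state.
VALID_TRANSITIONS = {
    ("PROCESSING", "CompletePayment"): "COMPLETED",
    ("COMPLETED", "RefundPayment"): "REFUNDED",
}

class Payment:
    def __init__(self, payment_id: str):
        self.payment_id = payment_id
        self.state = "PROCESSING"

    def apply_event(self, event: str) -> None:
        next_state = VALID_TRANSITIONS.get((self.state, event))
        if next_state is None:
            # Out-of-order event: a refund arriving while still PROCESSING
            # cannot be applied, because the payment is not COMPLETED yet.
            raise ValueError(f"cannot apply {event} in state {self.state}")
        self.state = next_state

payment = Payment("pay-1")
try:
    payment.apply_event("RefundPayment")  # arrives before CompletePayment
except ValueError:
    pass  # the refund is lost unless we add retry/reorder machinery
assert payment.state == "PROCESSING"
```

With ordering guaranteed, `CompletePayment` would always be applied first and the refund would succeed.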
Pub/Sub messaging systems like Kafka or Pulsar provide mechanisms to achieve this easily. For example, in Apache Pulsar you could use `Key_Shared` subscriptions to ensure that events for a given payment ID are always processed in order by the same consumer. In Kafka, you’d key messages by payment ID so that all the events for a given payment get assigned to the same partition.
In the new scenario, all the events belonging to an existing payment will be processed by the same consumer.
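As a rough sketch of why keying works, the following simulates Kafka-style partition assignment with an ordinary stable hash (Kafka itself uses murmur2 internally, but any deterministic hash demonstrates the property we care about):

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    """Deterministically map an event key to a partition.

    Because the mapping depends only on the key, every event carrying
    the same payment ID lands on the same partition, so the single
    consumer of that partition sees them in publication order."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

events = ["ProcessPayment", "CompletePayment", "RefundPayment"]
partitions = {partition_for("payment-42") for _ in events}
assert len(partitions) == 1  # all three events share one partition
```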
Non-atomic multiple operations
Another common mistake is to do more than one thing within a business-critical section and assume that every operation will always work. Always keep this in mind: if something can fail, it will fail.
For example, one of the typical scenarios is when we persist an entity and send an event immediately after the entity was persisted. Let’s look at one example to understand the issue better:
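A minimal sketch of the problematic persist-then-publish sequence, using hypothetical `InMemoryRepo` and `FlakyEventBus` stand-ins (none of these names come from a real library):

```python
class EventBusError(Exception):
    pass

class InMemoryRepo:
    def __init__(self):
        self.users = {}
    def save(self, user_id: str, name: str) -> None:
        self.users[user_id] = name

class FlakyEventBus:
    def publish(self, event: dict) -> None:
        # Simulates a broker outage at the worst possible moment.
        raise EventBusError("broker unavailable")

def create_user(repo, bus, user_id, name):
    repo.save(user_id, name)  # step 1: persist the user
    bus.publish({"type": "UserCreated", "user_id": user_id})  # step 2: may fail!

repo, bus = InMemoryRepo(), FlakyEventBus()
try:
    create_user(repo, bus, "u1", "Ada")
except EventBusError:
    pass
# The user exists locally, but no UserCreated event was ever published:
assert "u1" in repo.users
```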
What would happen if sending the `UserCreated` event fails? The user would already be persisted, so there would be an inconsistency between our system and our consumers’ systems. Some people would ask: what if you send the `UserCreated` event first? Well, what happens if persisting the user fails after the event has been sent? The consumers would think the user has been created when it hasn’t!
The main question now is: how can we solve this problem? There are different ways; let’s go through some of them.
Use of transactions
One easy way to solve this inconsistency problem is to take advantage of our database and use transactions, if they’re supported. Let’s assume we are provided with a `withinTransaction` method that starts a new transaction and rolls back if anything in the closure we provide fails.
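A possible Python rendering of this idea, using SQLite and a `within_transaction` helper in place of the assumed `withinTransaction` method (the event bus is a hypothetical stand-in):

```python
import sqlite3

class EventBusError(Exception):
    pass

def within_transaction(conn, closure):
    """Run `closure` inside a transaction; commit on success, roll back
    if it raises (sqlite3's connection context manager does exactly this)."""
    with conn:
        closure()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")

def flaky_publish(event: dict) -> None:
    raise EventBusError("broker unavailable")  # simulated outage

def create_user():
    conn.execute("INSERT INTO users VALUES ('u1', 'Ada')")
    flaky_publish({"type": "UserCreated", "user_id": "u1"})

try:
    within_transaction(conn, create_user)
except EventBusError:
    pass

# Publishing failed, so the INSERT was rolled back: no inconsistency.
assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 0
```

Note that this only helps when the event is sent before the transaction commits; if the commit itself fails after a successful send, the inconsistency reappears, which is where the outbox pattern below is more robust.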
Please keep in mind that in some cases transactions can hurt performance; check your database documentation before making any decision.
Transactional outbox pattern
Another way to solve this problem is to send the event in the background after the user has been persisted, by implementing the transactional outbox pattern.
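A minimal sketch of the pattern, with an SQLite-backed outbox table and an illustrative relay function (in production the relay would be a background poller or a CDC pipeline):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_user(user_id: str, name: str) -> None:
    # One transaction covers both writes: either the user AND its event
    # are recorded, or neither is.
    with conn:
        conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "UserCreated", "user_id": user_id}),),
        )

def relay_once(publish) -> None:
    # The relay drains unpublished rows; publishing can be retried
    # forever without ever losing the event.
    for row_id, payload in conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0"
    ).fetchall():
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))

sent = []
create_user("u1", "Ada")
relay_once(sent.append)
assert sent[0]["type"] == "UserCreated"
```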
Sending multiple events
A similar problem happens when we try to send multiple events within a business-critical section. What happens if sending one event fails after other events have already been sent? Again, different systems could end up in an inconsistent state.
There are some ways to avoid this issue. Let’s see how.
Probably the best way to solve this problem is simply to avoid sending multiple events at once. You can always chain events so they get processed sequentially instead of in parallel.
For example, let’s look at this scenario. We need to send a “registration successful” email when a user is created.
If sending the `SendRegistrationEmailEvent` fails, we won’t be able to recover from that error even if we retry, so the registration email will never be sent. What if we do this instead?
By splitting the events, we allow other consumers to carry on once the `UserCreated` event has been sent, and at the same time we move the sending of the registration email to a separate handler that can be retried independently as many times as we want.
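A rough sketch of this chaining, using a hypothetical in-memory bus with per-handler retries (none of these names come from a real library):

```python
class InMemoryBus:
    def __init__(self):
        self.handlers = {}
        self.dead_letter = []

    def subscribe(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event, retries=3):
        for handler in self.handlers.get(event["type"], []):
            for attempt in range(retries):
                try:
                    handler(event)
                    break
                except Exception:
                    if attempt == retries - 1:
                        self.dead_letter.append(event)

emails = []
bus = InMemoryBus()

def on_user_created(event):
    # This consumer only chains the next event; it never sends the email
    # itself, so UserCreated processing is not blocked by email failures.
    bus.publish({"type": "SendRegistrationEmailEvent",
                 "user_id": event["user_id"]})

attempts = {"n": 0}
def on_send_registration_email(event):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("SMTP temporarily unavailable")
    emails.append(event["user_id"])

bus.subscribe("UserCreated", on_user_created)
bus.subscribe("SendRegistrationEmailEvent", on_send_registration_email)
bus.publish({"type": "UserCreated", "user_id": "u1"})
assert emails == ["u1"]  # succeeded on the third, independent retry
```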
If your messaging system supports transactions, you can also use them to roll back all the events when something goes wrong. Apache Pulsar, for instance, supports transactions.
Non-backward-compatible events
The last common problem we’re going to look at is forgetting that events have to remain backward-compatible when we modify them.
For example, let’s say we’re adding some fields to an existing event and at the same time removing an existing field from this event.
In the image below, we can see the wrong and the right way to release a change like the one we just described.
In the first case, we immediately remove the `middleName` field and add the new `dateOfBirth` field. Why is this problematic?
This first change will cause problems and leave some existing events in a blocked state. Why?
Imagine that there are some `UserCreated` events in our `Users` topic when we trigger the deployment of our new version. The most common way of deploying applications without downtime is the so-called rolling upgrade, so we’ll assume from now on that this is how our application is deployed.
During the deployment, some of our nodes will already contain the code for the new event version, while others will still be running the old version and therefore won’t support the new event version.
We also have to keep in mind that the code released in our first deployment has to treat the new fields as optional, to be able to process old events still waiting in the topic. Once we’re 100% sure there are no old events left in our topics, we can remove this restriction and make the new fields mandatory.
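As a sketch of what the transitional event version might look like, assuming JSON payloads and illustrative field names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserCreated:
    user_id: str
    first_name: str
    middle_name: Optional[str] = None    # deprecated; removed only later
    date_of_birth: Optional[str] = None  # new field, optional for now

def parse(payload: dict) -> UserCreated:
    # .get() tolerates both old payloads (no dateOfBirth) and new ones
    # (no middleName), so either code version can read either event.
    return UserCreated(
        user_id=payload["userId"],
        first_name=payload["firstName"],
        middle_name=payload.get("middleName"),
        date_of_birth=payload.get("dateOfBirth"),
    )

# An old event still waiting in the topic...
old = parse({"userId": "u1", "firstName": "Ada", "middleName": "K."})
# ...and a new one produced after the deployment:
new = parse({"userId": "u2", "firstName": "Alan",
             "dateOfBirth": "1912-06-23"})
assert old.date_of_birth is None and new.middle_name is None
```

Only once no old events remain would the optional markers (and `middle_name` itself) be dropped.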
We’ve seen how useful event-driven architectures can be. However, implementing them the right way is not an easy task, and it requires some experience to get right. In this case in particular, it’s especially useful to pair-program with someone experienced in this kind of architecture, as it can save us precious time.
That’s all from us today. We really hope you’ve enjoyed the article and hopefully learned something new.
As always, we hope to see you back soon!