When Kafka is not the right Move

When designing distributed systems, event streaming platforms such as Kafka are the preferred solution for asynchronous communication. In fact, on Kafka’s official website, the first use-case listed is messaging¹. My believe is that this default choice can lead to problems and unnecessary complexity. The main reason for this is the conversion between state and events, which we’ll look at in this article through a game of chess.

Big Chess, Inc

Imagine you work at “Big Chess, Inc.” and are tasked to develop a distributed chess platform. The system is distributed so that teams may work autonomously, and thus quickly, on the features they are tasked to develop. You create a Chess microservice that takes inputs from two users and makes chess moves accordingly. Of course, only allowing legal moves. Users play moves using a REST API and the state is stored in a relational database so users may pause playing and continue later on.

The following is a simple implementation of the chess_board table:

CREATE TABLE chess_board (
  file CHAR(1) CHECK (file IN ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h')),
  rank INTEGER CHECK (rank BETWEEN 1 AND 8),
  piece VARCHAR(3),
  PRIMARY KEY (file, rank)
);

Big Chess, Inc. is in the business of live streaming chess games. As such, your colleagues are developing a downstream service that will re-enact the played chess moves on the big screen, presumably with some added flair such as spectator insights.

To keep the system loosely coupled and asynchronous you decide in an architecture meeting to use an event streaming platform like Kafka to publish the moves to the downstream service. Moves are published using the standard algebraic notation.

To guarantee at-least-once delivery of the turn events, you apply the outbox pattern². With the outbox pattern events are written to an outbox table in the same transaction as the write to the chess_board table. The outbox messages are then asynchronously published to Kafka by a background job.

Unfortunately for your colleagues, algebraic chess notation requires the current state of the board in order to disambiguate moves. As such, they’ll have to store the state of the board locally. They do this by “borrowing” your chess_board DDL and converting the algebraic notation back into state as they consume the events.

Besides the duplication of the data model and the annoyance of having to convert back into state, a subtle problem arises. The following example shows this problem:

Chess demonstration showing example of a service mesh with two chess boards

Enable JavaScript to see an interactive chess demo.

Move Explanations:

1. Qd5: White plays Qd5, meaning their queen moves from e4 to d5. The chess microservice decides this move is legal. The downstream service looks up the position of white's queen and updates the board state to reflect this move. All seems well.

2. Rd8: Black plays Rd8. The chess service knows this could only mean the rook moving from e8 to d8, as the other rook moving to d8 would be an illegal move. The downstream service receives the event and looks up the position of black's rook. It turns out there are two rooks on the board. This move is undecidable for a downstream service that doesn't contain the chess "business logic".

Chess Service

Moves

Downstream

Make a move to see the explanation

By publishing the moves as events, the downstream team is not only forced to re-implement the data model, but also the business logic. The consumer has to re-implement the chess rules to figure out which rook is moving. If there are multiple consumers, they might even each apply a slightly different version of the logic if they use a different (version of a) chess library or have a different interpretation of what an event means. Chess may be simple enough to avoid major inconsistencies, but more complex business domains risk drift.

The Alternative

Let’s consider an alternative: rather than communicating using changes in state (a.k.a. events), why not communicate that state directly? Rather than going back and forth between state and events, the state can simply be replicated to consumer databases.

This simplifies things greatly:

No additional business logic required to create events and post events to the outbox table.
No need for an outbox table in your data model.
No infrastructure needed to process the outbox table and publish events to Kafka.
No need to write and maintain (potentially) complicated event consumers.

There’s one caveat though: if you apply replication naively you would end up tightly coupling the consumer to the data model of the producer. The solution there is to make sure you don’t replicate your state as-is but encapsulate the intricacies of your data model by applying a transformation from your internal model to the one consumers get. Versioning would have to be involved as well, but that’s a topic for another article.

Conclusion

Kafka is an excellent tool for many use cases. However, it’s not always the best solution for communication between (micro)services. Event streaming, as the name suggests, forces the use of events. This is unnecessary for many use cases. A direct replication strategy can be used instead for a greatly simplified system.

Confluent claims that with “stream-table duality” you can “can easily turn a stream into a table, and vice versa”³. The existence of this duality I don’t argue with, but calling it “easy” is a bit a stretch.

When Kafka is not the right Move

When Kafka is not the right Move

Big Chess, Inc

Chess Service

Moves

Downstream

The Alternative

Conclusion

Footnotes