Tuesday, 27 July 2021

Avro Schema Evolution

When you work with data, schema evolution plays a crucial role.


For instance, while working with Avro data on Kafka you may have seen producer/consumer clients throw exceptions related to schema compatibility. That's why it's important to understand the concept of schema evolution.

What is Avro?
~ In order to transfer data over a network, it must be serialized (converted to a binary format), and Avro is one such serialization system.
~ Avro depends on a schema; you can think of Avro data as JSON with a schema attached. That's the main reason Avro is preferred over plain JSON: you can't enforce a schema with JSON.
~ Avro is fast and compact.
~ Data is fully typed and named.

A simple Avro schema can look like this -
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "emp_id", "type": "long"},
    {"name": "department", "type": "string"}
  ]
}
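To see what "fully typed" means in practice, here is a quick sketch in plain Python (no Avro library involved; `validate` and `PY_TYPES` are hypothetical helpers covering only a tiny subset of Avro's types) that checks a record against the schema above:

```python
import json

# The Employee schema from above, parsed as plain JSON.
SCHEMA = json.loads("""
{"type": "record", "name": "Employee",
 "fields": [{"name": "emp_name", "type": "string"},
            {"name": "emp_id", "type": "long"},
            {"name": "department", "type": "string"}]}
""")

# Tiny subset of Avro primitive types mapped to Python types.
PY_TYPES = {"string": str, "long": int}

def validate(record, schema):
    # Every declared field must be present and match its declared type.
    for field in schema["fields"]:
        value = record[field["name"]]
        assert isinstance(value, PY_TYPES[field["type"]]), field["name"]

validate({"emp_name": "Asha", "emp_id": 101, "department": "Finance"}, SCHEMA)
```

A real Avro library does much more (binary encoding, unions, nested records), but the idea is the same: every value is checked against the schema.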

Schema Evolution -
To evolve an Avro schema, you need to keep a few rules in mind so that your changes stay compatible.

Backward Compatibility -
Your producer application writes messages/data using an old schema, and your consumer application can read that data using a newly evolved schema.

Let's use the above schema to understand it.
Suppose the Producer app created a record using the following schema -

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "emp_id", "type": "long"},
    {"name": "department", "type": "string"}
  ]
}

(so the record has emp_name, emp_id, and department)

Now, the Consumer app on the other side reads this record using a newly evolved schema that doesn't contain the field department.

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "emp_id", "type": "long"}
  ]
}

But the consumer is still able to read the record; the data it sees just has emp_name and emp_id (department is silently ignored).
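The behaviour above can be sketched in plain Python (the `project_record` helper is hypothetical, not the Avro library's actual resolution code, which works on the binary encoding):

```python
def project_record(record, reader_fields):
    """Keep only the fields the reader schema declares; drop the rest."""
    return {name: record[name] for name in reader_fields if name in record}

# Record produced with the old schema (emp_name, emp_id, department):
produced = {"emp_name": "Asha", "emp_id": 101, "department": "Finance"}

# The reader's evolved schema no longer declares 'department':
consumed = project_record(produced, ["emp_name", "emp_id"])
print(consumed)  # {'emp_name': 'Asha', 'emp_id': 101}
```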

Forward Compatibility -
Your producer app writes messages using a new schema, and your consumer app can read them using an old schema.
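For example, suppose the producer's new schema added a field the old consumer schema doesn't know about; the old reader just ignores it. A minimal sketch (the helper name, the `location` field, and the `defaults` parameter are all made up for illustration):

```python
def read_with_old_schema(record, reader_fields, defaults=None):
    """Resolve a record against the reader's (old) schema."""
    defaults = defaults or {}
    out = {}
    for name in reader_fields:
        if name in record:
            out[name] = record[name]          # field present: take it
        elif name in defaults:
            out[name] = defaults[name]        # field missing: use default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

# Producer's new schema added a 'location' field:
new_record = {"emp_name": "Asha", "emp_id": 101,
              "department": "Finance", "location": "Pune"}

# The old reader schema only declares the three original fields,
# so 'location' is dropped:
old_view = read_with_old_schema(new_record,
                                ["emp_name", "emp_id", "department"])
print(old_view)  # {'emp_name': 'Asha', 'emp_id': 101, 'department': 'Finance'}
```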

~ A schema can also be both backward and forward compatible at the same time - a fully compatible schema.

Some of the rules that I personally found useful for creating compatible schemas -
- You can easily add a field as long as you give it a default value in the new schema.
Suppose the producer writes using the old schema and the consumer reads using this new schema: because the newly added field has a default value, it doesn't matter that the field is missing from the producer's data - the field simply gets its default value on the consumer side.

- You can easily remove a field that has a default value in the new schema.
- You can't rename a field, but you can add aliases for the old name.
- You can't change a field's data type (beyond Avro's built-in type promotions, e.g. int to long).
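The default and alias rules above can be sketched together in plain Python (the `resolve_field` helper is hypothetical; real Avro resolution happens during binary decoding, not on dicts):

```python
def resolve_field(record, name, default=None, aliases=()):
    """Look up a reader-schema field in a writer's record."""
    if name in record:
        return record[name]          # exact name match
    for alias in aliases:
        if alias in record:
            return record[alias]     # matched via an alias (renamed field)
    if default is not None:
        return default               # fall back to the reader's default
    raise ValueError(f"no value and no default for field {name!r}")

# Record written with the old schema:
old_record = {"emp_name": "Asha", "emp_id": 101}

# A new reader field added with a default gets the default value:
assert resolve_field(old_record, "department", default="General") == "General"

# A reader field renamed to 'full_name', with the old name as an alias,
# still resolves against the old record:
assert resolve_field(old_record, "full_name", aliases=("emp_name",)) == "Asha"
```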

Credit goes to Mayank Ahuja
