Avro
Apache Avro is an open-source data serialization system. Confluent's documentation also covers it and may be more user-friendly.
- Avro is a schema-based serialization format that originated in the Apache Hadoop project and is widely used with Kafka.
- Schemas are written in JSON, and libraries exist for many programming languages
- Allows data to evolve more robustly over time (with the registry)
- Supports logical types (useful for typed languages like Java)
With Schema Registry
The Schema Registry makes sure your Avro schemas stay in sync between apps. The best-known implementation is Confluent's. The basic idea is to provide a source of truth so that every consumer and producer understands each other.
It keeps track of schema versions. You can also set a compatibility type that controls which schema changes are accepted, for example BACKWARD, FORWARD, FULL, or NONE.
Subject Name Strategy
When used with a schema registry, you can choose the strategy that determines which subject your schemas are registered under:
- TopicNameStrategy: one schema per topic (default)
- RecordNameStrategy: multiple schemas on one topic, based on the record name
To set it, use a property like the following, with either TopicNameStrategy or RecordNameStrategy:
confluent.value.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy
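To make the difference between the two strategies concrete, here is a minimal Java sketch of how a subject name could be derived in each case. It deliberately does not use the Confluent client library: the class name, the method names, and the topic posts are made up for illustration, and it assumes the usual conventions (topic name plus -key or -value for TopicNameStrategy, fully qualified record name for RecordNameStrategy).

```java
// Illustrative sketch only: mimics how subject names are derived,
// without depending on the Confluent client library.
final class SubjectNameSketch {

    // TopicNameStrategy: the subject depends only on the topic,
    // so there is one key schema and one value schema per topic.
    static String topicNameStrategy(String topic, boolean isKey) {
        return topic + (isKey ? "-key" : "-value");
    }

    // RecordNameStrategy: the subject is the fully qualified record name,
    // so several record types can be written to the same topic.
    static String recordNameStrategy(String namespace, String recordName) {
        return namespace + "." + recordName;
    }

    public static void main(String[] args) {
        // "posts" is a hypothetical topic name.
        System.out.println(topicNameStrategy("posts", false));
        System.out.println(recordNameStrategy("com.github.event", "PostCreated"));
    }
}
```

With TopicNameStrategy every value on the posts topic shares a single subject, while with RecordNameStrategy each record type gets its own subject regardless of the topic.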
Union
You can also group your schemas in a union file, for example com.github.event.PostEvents.avsc, which references other schemas:
[
  "com.github.event.PostUpdated",
  "com.github.event.PostCreated",
  "com.github.event.PostDeleted"
]
Using this union, you can use TopicNameStrategy with the com.github.event.PostEvents schema and still have more than one schema available in your topic.
Schema
The basis
When using Avro with a JVM-based language, it's best to make the namespace match the directory path of the schema file.
.
└── src
    └── main
        ├── avro
        |   └── com
        |       └── github
        |           └── event
        |               ├── Example.avsc
        |               └── Custom.avsc
        └── java
            └── // Your java classes
An Avro record, Example.avsc, as an example:
{
  "namespace": "com.github.event",
  "type": "record",
  "name": "Example",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "custom",
      "type": "com.github.event.Custom"
    }
  ]
}
That's a simple schema with two fields: one with the primitive type string and one with a custom named type, com.github.event.Custom (a reference to another schema, not a logical type).
Basic fields
For custom types (e.g. com.github.event.Custom) or primitive types (e.g. "string", "boolean", "double"), you can define a field simply with:
{
  "name": "isEnabled",
  "type": "boolean"
}
Decimal fields
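For integer values you have int (32-bit) and long (64-bit). For floating-point numbers, you can use float or double.
For more flexibility and exact precision in your decimal fields, you can use the type bytes with the decimal logical type, where precision is the maximum total number of digits and scale is the number of digits after the decimal point: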
{
  "name": "myDecimalField",
  "type": {
    "type": "bytes",
    "logicalType": "decimal",
    "precision": 30,
    "scale": 10
  }
}
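As a rough illustration of what ends up in those bytes: the decimal logical type stores the unscaled integer value in two's-complement, big-endian form, and the scale lives only in the schema. The sketch below reproduces that with plain java.math classes. It is not the Avro library's own conversion code, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Illustrative sketch only: shows the byte layout of a decimal(30, 10) value
// using plain JDK classes, not the Avro library's conversion API.
final class DecimalBytesSketch {
    static final int PRECISION = 30; // maximum total number of digits
    static final int SCALE = 10;     // digits after the decimal point

    static byte[] encode(BigDecimal value) {
        // Force the schema's scale; fails if that would require rounding.
        BigDecimal scaled = value.setScale(SCALE, RoundingMode.UNNECESSARY);
        if (scaled.precision() > PRECISION) {
            throw new ArithmeticException("Value needs more than " + PRECISION + " digits");
        }
        // Two's-complement, big-endian bytes of the unscaled integer.
        return scaled.unscaledValue().toByteArray();
    }

    static BigDecimal decode(byte[] bytes) {
        // The scale is not stored in the bytes; it comes from the schema.
        return new BigDecimal(new BigInteger(bytes), SCALE);
    }

    public static void main(String[] args) {
        byte[] bytes = encode(new BigDecimal("12345.6789"));
        System.out.println(decode(bytes)); // prints 12345.6789000000
    }
}
```

Because the scale is part of the schema rather than the data, changing it later changes how existing bytes are interpreted, so it is worth choosing precision and scale carefully up front.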
Enum fields
Enum fields only allow the values listed in the schema. Changing the symbols of an enum can be a breaking change, for example when a consumer on an older schema receives a symbol it doesn't know.
{
  "name": "enumField",
  "type": {
    "name": "EnumField",
    "type": "enum",
    "symbols": [
      "FIRST_VALUE",
      "SECOND_VALUE"
    ]
  }
}
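To see why a changed symbol list can break consumers, consider a Java enum that mirrors the schema above. This is only a hand-written stand-in for the kind of enum a code generator might produce, and the helper method is hypothetical. A consumer compiled against this version has no way to represent a symbol added later:

```java
// Illustrative sketch only: a hand-written enum mirroring the schema,
// not the code that Avro actually generates.
final class EnumFieldSketch {

    enum EnumField { FIRST_VALUE, SECOND_VALUE }

    // Returns true if this (older) consumer knows the symbol it received.
    static boolean isKnownSymbol(String symbol) {
        try {
            EnumField.valueOf(symbol);
            return true;
        } catch (IllegalArgumentException e) {
            // The symbol is not part of the enum this consumer was built with.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isKnownSymbol("FIRST_VALUE")); // prints true
        System.out.println(isKnownSymbol("THIRD_VALUE")); // prints false
    }
}
```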
Array fields
For arrays, use the type array, then specify the type of the elements in items.
{
  "name": "myArraySuperField",
  "type": {
    "type": "array",
    "items": "com.github.event.MyArrayItem"
  }
}
In Java, this could translate to something like:
import java.util.List;
import com.github.event.MyArrayItem;
List<MyArrayItem> myArraySuperField;
Nullable fields
Keep in mind that you can't set a non-nullable field to null; otherwise you will get a serialization or deserialization error.
{
  "name": "myNullableField",
  "type": [
    "null",
    "string"
  ]
}
And here you have a field that accepts null as a value.
⚠️ A plain boolean field can't hold null; to allow it, declare the type as the union ["null", "boolean"]
A deserialization error can happen on any record. You may want to use deserializers that catch those errors, so your consumers don't get stuck on a bad record (such a record is often called a poison pill).
Optional fields
Optional fields can be omitted from the data (for example when the writer uses an older schema that lacks them) because they have a default value. Optional fields follow this syntax:
{
  "name": "myOptionalField",
  "type": [
    "null",
    "string"
  ],
  "default": null
}
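- The "default" value must be null; Avro has traditionally required a union field's default to match the first type of the union, which is why "null" is listed first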
Just setting the type to "type": ["null", "string"] is not enough to create an optional field, because it only makes the field nullable: it can't be omitted, but it can be set to null.
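The difference between nullable and optional can be sketched in a few lines of Java. This is a simplified model of what a reader does when a field is missing from the written data, using plain maps instead of the Avro library; the class name, the method, and the map-based representation are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: models default-value resolution with plain maps,
// not with the Avro library.
final class OptionalFieldSketch {

    // Use the written value if the field is present; otherwise fall back to
    // the reader's default. A missing field without a default is an error.
    static Object readField(String name,
                            Map<String, Object> written,
                            Map<String, Object> readerDefaults) {
        if (written.containsKey(name)) {
            return written.get(name);
        }
        if (readerDefaults.containsKey(name)) {
            return readerDefaults.get(name);
        }
        throw new IllegalStateException("Field '" + name + "' is missing and has no default");
    }

    public static void main(String[] args) {
        Map<String, Object> written = new HashMap<>();  // the writer omitted both fields
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("myOptionalField", null);          // "default": null in the schema

        System.out.println(readField("myOptionalField", written, defaults)); // prints null
        try {
            readField("myNullableField", written, defaults); // nullable, but no default
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Here myOptionalField resolves to null because it has a default, while myNullableField fails: being nullable only means that null is an accepted value, not that the field may be absent.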