Protobuf Encoding
March 07, 2024
Overview
Protocol Buffers (a.k.a. Protobufs) are the preferred messaging construct for communication over gRPC. Data is encoded in a highly efficient binary format that yields smaller message sizes on the wire.
Encoding
Protobuf encoding is very simple. Its simplicity is enabled by a simple convention: omit field names from the binary encoding in favor of field numbers (as defined in the .proto file). The language-specific libraries auto-generated from the .proto files do contain the field names, so developers can work with them in code, but the wire format itself carries only the field numbers.
This has versioning consequences that are discussed below.
The exact binary encoding is in segments in the following order:
- Field tag (a. field number, b. wire type), encoded as a varint in which the lowest 3 bits hold the wire type and the remaining bits hold the field number. For field numbers 1 through 15, the tag fits in a single octet: 5 bits of field number, 3 bits of wire type.
- Value's length, if the wire type is length-delimited (like strings or nested message types); otherwise this is omitted.
- Value.
- Next field (repeat 1), if there is one.
- If a field value is not set, using this encoding, it can be omitted entirely from the record.
- If the exact same message is nested twice in a list field (prefixed with repeated) or in different fields altogether, both copies are encoded in their entirety; there is no deduplication. The focus is simplicity with Protocol Buffers.
Learn more about Protobuf encoding from Designing Data-Intensive Applications.
Versioning
The manageable downside to this efficient encoding is that it tightly couples each field's number to the Protobuf (a.k.a. API in this context) version. Renumbering fields requires creating a new version of the Protobuf API. You can add new fields to existing versions, but you can't delete or renumber existing ones, because reusing a field number for a different field changes what that number means in the binary encoding (breaking the API for clients who don't update).
This tight coupling of field numbers to the API version is not a concern if you own both the clients and the server using the API.
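This is also why adding fields is safe: a reader built against an older schema can still walk the record, because every field announces its own number and wire type. The following decoder sketch (an illustration under the same hypothetical Person schema as above, not the real library) shows an old reader ignoring a field number it doesn't know.

```python
def decode_varint(buf: bytes, pos: int):
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        result |= (byte & 0x7F) << shift
        pos += 1
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_fields(buf: bytes):
    """Yield (field_number, value) pairs; unknown numbers still decode cleanly."""
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:              # varint value
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:            # length-delimited value
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, value

# A record written by a newer schema that added field 3; an old reader that
# only knows fields 1 and 2 skips the extra field without breaking.
payload = bytes.fromhex("0a034164611096011a036e6577")
known = {number: value for number, value in decode_fields(payload) if number in (1, 2)}
print(known)  # {1: b'Ada', 2: 150}
```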
Performance
Smaller message sizes have the secondary effect of decreasing latency, because all networks (including networks between containers or virtual machines on a single physical machine) have inherent bandwidth limits that cause congestion and latency at high enough throughput. Lowering the size of each message lowers that congestion and latency. Furthermore, if any mechanisms en route need to read these messages (e.g. for Deep Packet Inspection), they can complete this task in less time.
For benchmarks, visit:
- Beating JSON performance with Protobuf
- Is protobuf much faster than json even in simple web server response requests?
Smaller message sizes also reduce usage of:
- CPU: because it takes less time for CPUs to process them.
- Memory: because it takes less space to store them while processing them.
- I/O: because it takes less time to send/receive them.
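The size difference is easy to see for even a tiny record. Below, the same two fields are compared as compact JSON versus a hand-encoded Protobuf wire record (the bytes from the encoding sketch earlier; the schema is a hypothetical Person message, so treat the exact numbers as illustrative rather than a benchmark).

```python
import json

# The same record ({"name": "Ada", "id": 150}) in both encodings.
# JSON repeats every field name as text; the Protobuf wire format replaces
# each name with a one-byte tag (hand-encoded bytes, hypothetical schema).
json_bytes = json.dumps({"name": "Ada", "id": 150}, separators=(",", ":")).encode()
proto_bytes = bytes.fromhex("0a03416461109601")

print(len(json_bytes), len(proto_bytes))  # 23 8
```

The gap widens as records accumulate more fields and repeat in large collections, since the field-name overhead in JSON is paid per record.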
Code Generation
Protobuf's encoding is the same regardless of the language. .proto files are compiled (at build time) into language-specific code that can be imported into your application.
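A minimal .proto definition might look like the following. This is a hypothetical example matching the encoding walkthrough above, not a file from any particular project; the comments show how each field number surfaces in the wire format.

```proto
// example.proto — hypothetical Person message
syntax = "proto3";

message Person {
  string name = 1;  // tag byte 0x0a: (1 << 3) | wire type 2 (length-delimited)
  int32 id = 2;     // tag byte 0x10: (2 << 3) | wire type 0 (varint)
}
```

Running the protoc compiler against this file (e.g. with its Python output option) produces the importable language-specific module described above.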
For examples of Protobuf .proto files (i.e. their developer interface), see: Object.
Use Cases
Protobufs are ideal for inter-service communication between microservices in a distributed system.
The API contract provided by .proto
files and their auto-generated code allows for seamless collaboration between services and their teams.
Conclusion
Protobuf is a powerful encoding that efficiently packages key-value pairs in message objects. It supports a wide range of data types alongside message nesting. Most importantly, it supports defining RPC clients and servers that can be auto-generated into a wide range of languages.