Protobuf Encoding

March 07, 2024

Overview

Protocol Buffers (a.k.a. Protobufs) are the preferred message format for communication over gRPC. Data is encoded in a compact binary format, so the same data is transmitted in smaller messages.

Table of Contents

Encoding
Versioning
Performance
Code Generation
Use Cases
Conclusion

Encoding

Protobuf encoding is simple. The simplicity comes from one convention: the binary encoding omits field names and identifies each field by its field number (as defined in the .proto file). The language-specific libraries that are auto-generated from the .proto files do contain the field names, so your code still refers to fields by name. The names are dropped only when a message is serialized to its binary form, where each field is identified by its number instead. This has versioning consequences that are discussed below.

Each field is encoded as the following segments, in order:

  1. Field tag: the field number and the wire type, packed together as (field number << 3) | wire type and stored as a varint. The lowest 3 bits hold the wire type and the remaining bits hold the field number, so field numbers 1 through 15 fit in a single octet. The wire type identifies a class of encoding (such as varint or length-delimited), not the exact datatype.
  2. Value's length, if the wire type is length-delimited (like strings, bytes or other message types); otherwise this is omitted.
  3. Value.
  4. Next field (repeat from 1) if there is one.

  • If a field value is not set, it can be omitted entirely from the encoded message.
  • If the exact same message is nested twice in a list field (declared with repeated) or in different fields altogether, each occurrence is encoded in its entirety; nothing is deduplicated. The focus of Protocol Buffers is simplicity.
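
The segments above can be sketched by hand in a few lines of Python. This is a minimal illustration of the wire format, not the official protobuf library; the helper names (encode_varint, encode_tag, encode_int_field, encode_string_field) and the two-field message are assumptions made up for this example, and only non-negative integers and strings are handled.

```python
# Minimal sketch of the Protobuf wire format (illustrative helpers, not the official library).

WIRE_VARINT = 0  # wire type for int32, int64, bool, enum, ...
WIRE_LEN = 2     # wire type for string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Segment 1: the tag is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Tag followed directly by the value (no length segment)."""
    return encode_tag(field_number, WIRE_VARINT) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag, then the payload length, then the payload itself."""
    payload = text.encode("utf-8")
    return encode_tag(field_number, WIRE_LEN) + encode_varint(len(payload)) + payload


# Hypothetical message: field 1 is an integer, field 2 is a string.
message = encode_int_field(1, 150) + encode_string_field(2, "testing")
print(message.hex(" "))  # 08 96 01 12 07 74 65 73 74 69 6e 67
```

Note that no field name appears anywhere in the output: the bytes 08 and 12 are the tags for fields 1 and 2, and 07 is the length of the string that follows.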

Learn more about Protobuf encoding from Designing Data-Intensive Applications.

Versioning

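The manageable downside to this efficient encoding is that it tightly couples field numbers to the Protobuf (a.k.a. API in this context) version. The order in which fields are declared or written does not matter, but each field number must keep its meaning: changing a field's number, or changing its type to an incompatible one, requires creating a new version of the Protobuf API. You can add new fields to existing versions under unused numbers, and clients who don't update simply skip the numbers they don't recognize. You can also remove a field, provided its number is never reused (the reserved keyword in the .proto file guards against this), because reusing a number would cause clients who don't update to misread the data.

This tight coupling of field numbers to the API version is less of a concern if you own both the clients and the server using the API, since you can update them together.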
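
To make the compatibility behavior concrete, here is a minimal hand-written decoder sketch in Python. It is not the official protobuf library; the function names (read_varint, decode), the field names (id, name, email) and the payload are all hypothetical, and only varint and length-delimited fields are handled. It shows an older reader skipping a field number it does not know about.

```python
def read_varint(buf: bytes, pos: int) -> tuple:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode(buf: bytes, known_fields: dict) -> dict:
    """Decode a message, keeping only the field numbers listed in known_fields."""
    record = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint value
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited value
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        name = known_fields.get(field_number)
        if name is not None:  # unknown field numbers are skipped, not errors
            record[name] = value
    return record


# Bytes written with a hypothetical newer schema: id = 1, name = 2, email = 3.
payload = bytes.fromhex("082a" "1203426f62" "1a03614062")

old_reader = decode(payload, {1: "id", 2: "name"})
new_reader = decode(payload, {1: "id", 2: "name", 3: "email"})
print(old_reader)  # {'id': 42, 'name': b'Bob'}
print(new_reader)  # {'id': 42, 'name': b'Bob', 'email': b'a@b'}
```

The old reader can skip field 3 because the wire type in the tag tells it how many bytes the value occupies. The same mechanism is why a number must never be reassigned: a reader that still maps number 3 to its old meaning would silently misinterpret the new value.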

Performance

Smaller message sizes have the secondary effect of decreasing latency. All networks (including networks between containers or virtual machines on the same physical machine) have finite bandwidth, and at a high enough throughput that limit causes congestion and latency. Lowering the size of each message reduces this congestion. Furthermore, any mechanisms en route that need to read these messages (e.g. for Deep Packet Inspection) can complete the task in less time.

For benchmarks, visit:

Smaller message sizes also reduce usage of:

  • CPU: because it takes less time for CPUs to process them.
  • Memory: because it takes less space to store them while processing them.
  • I/O: because it takes less time to send/receive them.
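
As a rough illustration of the size difference (not a benchmark), the sketch below encodes one toy record as compact JSON and as hand-built Protobuf wire bytes. The record, its field numbers (id = 1, name = 2) and the encode_varint helper are assumptions made up for this example; real savings depend on the schema and the data.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


# Hypothetical record used only for this comparison.
record = {"id": 150, "name": "testing"}

# JSON carries the field names and punctuation in every message.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Protobuf wire bytes carry only tags, a length for the string, and the values.
name = record["name"].encode("utf-8")
proto_bytes = (
    encode_varint((1 << 3) | 0) + encode_varint(record["id"])      # field 1, varint
    + encode_varint((2 << 3) | 2) + encode_varint(len(name)) + name  # field 2, length-delimited
)

print(len(json_bytes), len(proto_bytes))  # 27 12
```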

Code Generation

Protobuf's encoding is the same regardless of the language. At build time, the Protobuf compiler (protoc) generates language-specific code from the .proto files, and that code can be imported into your application.

For examples of Protobuf .proto files (i.e. their developer interface), see: Object.

Use Cases

Protobufs are ideal for inter-service communication between microservices in a distributed system. The API contract provided by .proto files and their auto-generated code allows for seamless collaboration between services and their teams.

Conclusion

Protobufs are a powerful encoding that efficiently packages key-value pairs in message objects. They support a wide range of data types alongside message nesting. Most importantly, .proto files can also define RPC services, from which clients and servers are auto-generated in a wide range of languages.