Protobuf comes with a minor problem: it does not have support for
handling "type tagged structures", that is, something reminiscent of
objects in OOP
lingo, so if one is going to have a heterogeneous sequences of
messages, you have to roll it yourself. For that reason, I
added a
transport frame for the messages in the binary log
that wraps each with some extra information. In addition to allowing
the binary log to be a sequence of messages, it also adds some
integrity-checking data and simplifies some administrative tasks.
Transport frame with message
| Length |
| Type Tag |
| Message |
| Checksum |
The format of each message in the sequences is given in the table in
the margin. where the
length is a specially encoded length
that we will go through
below,
type is a single byte being the type tag,
message
being one of the messages given in the specification, and
checksum being a checksum to ensure the integrity of the
transport.
Checksum. As checksum, the plan is to use a
CRC-32. We don't want it to be too large to affect performance, and we
want it to catch reasonable losses of integrity. I'm considering
storing this as a varint after the actual message, but for the time
being, it is given as 4 raw bytes (it is not implemented at all
yet). Please give me feedback on this: if we make it a varint, we can
stuff the checksum in there, but that will also run the risk of not
being able to read the checksum due to corruption of the checksum, so
offhand, I would say that a fixed number of bytes is preferable.
Type Tag. The type tag is a single byte giving the
type of the event. This means that we are limited to 255 events, but
considering that we don't even have 26 events in 5.1 right now, I
don't see that we will run into that limit very soon. It is possible
to put the type tag in the message as well, and specifying it as an
enum inside the protobuf specification, but that will just provide the
information in two places, so it is better to keep it separately.
Length. The length is the length of the
message, that is, it does not include the type tag, length,
nor the length itself. This simplifies the normal processing, and in
the event that one needs to skip an event, it is easy to compute the
next position of a transport frame by just decoding the length (see
below).
Length encoding
The length is encoded using a special scheme to allow for very little
overhead for small events while still leaving room for giving the
length for very large events. This scheme currently allows for a
compact representation of lengths from 2 bytes 4 GiB (Gigabytes) and,
if you don't need to have a compact representation of 2, you can
represent lengths in the range 3 bytes to to 16 EiB (Exabytes)
[sic].
The basic idea is to note that the length of a message can never be
zero, and the minimal length in this case is actually 16 bytes. Since
we will never have a length that is less than 16 for an event, that
leaves the lengths 0-16 available for denoting other information. The
obvious solution is to let the first byte denote either the length, if
it is, say greater than 8, or the number of bytes following the byte
that gives the length if it is less than or equal to 8, but we can
actually do better.
Storing Length requires
ceil(log256(Length)) bytes, so if we let the first
byte L be
ceil(log2(ceil(log256(Length)))) - 1
(that is, the smallest power of two that is greater than the number of
bytes needed for the length, minus 1), we can get away with reserving
significantly fewer values. So, for example:
| Bytes |
Value |
| 2A |
|
42 |
| 00 |
FF |
02 |
|
767 |
| 01 |
FF |
01 |
01 |
00 |
66047 |
Computing the number of bytes can be done by computing 1
<< (L + 1), but computing the inverse is a little more
involved.
The following two functions does the job. Although it looks like a lot
of code, length_encode() is actually only 29 instructions
on my machine (no function calls), while length_decode()
is about 7 instructions.
inline unsigned char *
length_encode(size_t length, unsigned char *buf)
{
unsigned char *ptr= buf;
assert(length > 1);
if (length < 256)
*ptr++= length & 0xFF;
else {
int_fast8_t log2m1= -1; // ceil(log2(ptr - buf)) - 1
uint_fast8_t pow2= 1; // pow2(log2m1 + 1)
while (length > 0) {
// Check the invariants
assert(pow2 == (1 << (log2m1 + 1)));
assert((ptr - buf) <= (1 << (log2m1 + 1)));
// Write the least significant byte of the current
// length. Prefix increment is used to make space for the first
// byte that will hold log2m1.
*++ptr= length & 0xFF;
length >>= 8;
// Ensure the invariant holds by correcting it if it doesn't,
// that is, the number of bytes written is greater than the
// nearest power of two.
if (ptr - buf > pow2) {
++log2m1;
pow2 <<= 1;
}
}
// Clear the remaining bytes up to the next power of two
while (++ptr < buf + pow2 + 1)
*ptr= 0;
*buf= log2m1;
assert(ptr == buf + pow2 + 1);
}
return ptr;
}
inline const unsigned char *
length_decode(const unsigned char *buf, size_t *plen)
{
if (*buf > 1) {
*plen = *buf;
return buf + 1;
}
size_t bytes= 1U << (*buf + 1);
const unsigned char *ptr= buf + 1;
size_t length= 0;
for (unsigned int i = 0 ; i < bytes ; ++i)
length |= *ptr++ << (8 * i);
*plen= length;
return ptr;
}