Message-ID Database Daemon (msgiddbd)

msgiddbd is a database I designed and wrote to store Usenet Message-ID information for Newzbin. It is written in C – the source can be seen on GitHub.

The goal was to store all the information associated with Usenet posts in as little space as possible. Previously this information lived in a MySQL database, and because MySQL is a general-purpose engine I was quite confident I could tailor a solution that stored the same information in less space, and probably made it a good deal faster.

The information stored in this database was subsequently used to create NZB files.

Design

From the ground up, the database was designed to make best use of every last byte for storing Message-IDs. The MySQL database already stored hundreds of millions of rows and it would only ever grow as Newzbin kept pace with the Usenet providers increasing their retention.

Competing with generic databases was never a goal; the main advantage came from hand-crafting the table format to store Usenet-specific data.

The data structures heavily use typedef'd datatypes, all of which are set out in defines.h.

The decision to use an unsigned 32-bit type for dates was quite deliberate; it would be safe until 2106 (the code comment incorrectly says 2039) and by then we could just increment the table version if we needed to update it – probably by redefining when the epoch was.
Saving an extra 32 bits of space in one of our most frequent structs was a bigger advantage than the downside of needing to do an update in a hundred years.
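
To make the trade-off concrete, here is a minimal sketch. The two struct layouts and the function name are hypothetical, invented for illustration; the real types are in defines.h. It assumes a platform with a 64-bit time_t.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical record layouts, for illustration only. */
struct rec_date32 {
    uint32_t id;    /* stand-in for some other 32-bit field */
    uint32_t date;  /* seconds since 1970-01-01 UTC, unsigned */
};

struct rec_date64 {
    uint32_t id;
    uint64_t date;  /* same field widened to 64 bits */
};

/* Returns the UTC calendar year in which an unsigned 32-bit
 * seconds-since-1970 counter reaches its maximum value.
 * Assumes time_t is wide enough to hold UINT32_MAX. */
int u32_epoch_last_year(void)
{
    time_t t = (time_t)UINT32_MAX;
    struct tm *utc = gmtime(&t);
    return utc ? utc->tm_year + 1900 : -1;
}
```

Depending on alignment rules, the wider field can cost more than the 32 bits it adds, because the compiler may pad the struct so the 64-bit member is aligned.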

I also wanted redundancy and load sharing, so replication is built in. To facilitate that, a binary log is written as new data is fed into the server. This log is used to relay data to any connected slaves, and can also be used for crash recovery if a table is corrupted.
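
The general idea of an append-only binary log can be sketched as follows. This is not msgiddbd's actual log format: the record header, the magic value, the buffer limit and the function names are all assumptions made for the example, and records are written in host byte order for brevity.

```c
#include <stdint.h>
#include <stdio.h>

#define BINLOG_MAGIC   0x424C4F47u  /* hypothetical value: ASCII "BLOG" */
#define BINLOG_MAX_REC 4096u        /* hypothetical record size limit */

/* Hypothetical fixed-size header written before each record. */
struct binlog_rec_hdr {
    uint32_t magic;  /* marks the start of a record */
    uint32_t len;    /* payload length in bytes */
};

/* Append one record to the end of the log. Returns 0 on success. */
int binlog_append(FILE *log, const void *data, uint32_t len)
{
    struct binlog_rec_hdr hdr = { BINLOG_MAGIC, len };

    if (len > BINLOG_MAX_REC) return -1;
    if (fseek(log, 0, SEEK_END) != 0) return -1;
    if (fwrite(&hdr, sizeof hdr, 1, log) != 1) return -1;
    if (len > 0 && fwrite(data, len, 1, log) != 1) return -1;
    return fflush(log) == 0 ? 0 : -1;
}

typedef void (*binlog_cb)(const void *data, uint32_t len, void *ctx);

/* Replay the log from the start, calling cb (if non-NULL) once per
 * record. Returns the number of records read, or -1 if a record has
 * a bad magic number or is truncated. */
long binlog_replay(FILE *log, binlog_cb cb, void *ctx)
{
    unsigned char buf[BINLOG_MAX_REC];
    struct binlog_rec_hdr hdr;
    long count = 0;

    rewind(log);
    while (fread(&hdr, sizeof hdr, 1, log) == 1) {
        if (hdr.magic != BINLOG_MAGIC || hdr.len > sizeof buf) return -1;
        if (hdr.len > 0 && fread(buf, hdr.len, 1, log) != 1) return -1;
        if (cb) cb(buf, hdr.len, ctx);
        count++;
    }
    return count;
}
```

The same replay loop serves both purposes described above: the callback can either forward each record to a connected slave or re-apply it to a table during recovery.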

Internally, data consistency was an important point. To this end, magic numbers are employed heavily, and the most important data (segment tables) is hashed. There are disaster recovery routines which attempt to repair data when possible.
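
A minimal sketch of the magic-number-plus-hash pattern is shown below. The header layout, the magic value and the function names are hypothetical, and FNV-1a stands in for whatever hash the real code uses; I am not claiming it is the one in the source.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_MAGIC 0x4D494442u  /* hypothetical value: ASCII "MIDB" */

/* Hypothetical on-disk header preceding a block of table data. */
struct table_header {
    uint32_t magic;         /* identifies the block type */
    uint32_t payload_len;   /* number of payload bytes */
    uint32_t payload_hash;  /* hash of the payload bytes */
};

/* 32-bit FNV-1a, used here only as a simple stand-in hash. */
uint32_t fnv1a32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Fill in the header for a payload about to be written. */
void table_seal(struct table_header *hdr, const void *payload, uint32_t len)
{
    hdr->magic = TABLE_MAGIC;
    hdr->payload_len = len;
    hdr->payload_hash = fnv1a32(payload, len);
}

/* Check a block read back from disk.
 * Returns 0 if intact, -1 bad magic, -2 bad length, -3 bad hash. */
int table_verify(const struct table_header *hdr, const void *payload, size_t len)
{
    if (hdr->magic != TABLE_MAGIC) return -1;
    if (hdr->payload_len != len) return -2;
    if (hdr->payload_hash != fnv1a32(payload, len)) return -3;
    return 0;
}
```

Distinct return codes let a recovery routine tell a misplaced or overwritten block (bad magic) from one whose contents were damaged (bad hash), which matters when deciding whether a repair is worth attempting.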

Network Protocol

A simple network protocol, loosely modelled after SMTP, allows clients to communicate with the database: send a command, get a response.

When a client connects it gets a welcoming banner, and then the server awaits commands.

The first 3 bytes of every response are always numeric, so a client can easily check for success or failure. Broadly speaking, the codes are 2xx for success, 3xx for ‘nothing done’, 4xx for user error, and 5xx for server error – which should sound familiar; they’re more or less HTTP status codes. See client.h for the full list.
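
On the client side, handling such a response might look like the sketch below. The enum and function names are hypothetical and not taken from client.h; only the 2xx/3xx/4xx/5xx grouping comes from the description above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical classification of the four reply ranges. */
enum reply_class {
    REPLY_SUCCESS,       /* 2xx */
    REPLY_NOTHING_DONE,  /* 3xx */
    REPLY_USER_ERROR,    /* 4xx */
    REPLY_SERVER_ERROR,  /* 5xx */
    REPLY_INVALID        /* anything else */
};

/* Parse the 3 leading digits of a response line.
 * Returns the numeric code, or -1 if the line does not start
 * with exactly three decimal digits. */
int reply_code(const char *line, size_t len)
{
    int code = 0;

    if (len < 3) return -1;
    for (size_t i = 0; i < 3; i++) {
        if (!isdigit((unsigned char)line[i])) return -1;
        code = code * 10 + (line[i] - '0');
    }
    return code;
}

/* Map a numeric code onto its broad class. */
enum reply_class reply_classify(int code)
{
    switch (code / 100) {
    case 2:  return REPLY_SUCCESS;
    case 3:  return REPLY_NOTHING_DONE;
    case 4:  return REPLY_USER_ERROR;
    case 5:  return REPLY_SERVER_ERROR;
    default: return REPLY_INVALID;
    }
}
```

Because the code has a fixed width and position, a client only needs to look at the hundreds digit to decide whether to carry on, retry, or give up, without parsing the rest of the line.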
