This is the first of several posts on the implementation of my ActivityPub-enabled Rust web application, FediDict. Each of these posts was released to my Patreon patrons before being make publicly available. I recommend that you read the security design post first.

The core of any stateful service is its datastore, and FediDict uses PostgreSQL, a very performant, ACID-compliant, free and open source relational database. In this post, I’ll look at how to make an efficient, normalized database for FediDict’s data.

That’s a lot of jargon, so let’s break it down.

“Performant”, here, means that it is capable of many simultaneous reads, since that is likely to be the main load.
“ACID-compliant” basically means that PostgreSQL won’t lose data, unless I explicitly ask it to.
“Free and open source” is important becuase I want anyone to be able to run FediDict.
“Relational” means that relations, such as “user x owns definition y”, are encoded in the database itself, rather than being something FediDict has to handle at the application level. This is good for both performance and security.

From the Ground Up

FediDict is mosly interested in its list of definitions, and that’s where I’ll begin with the database implementation as well. A definition requires a lot of attributes. These will go in the Definitions table of the database.

Definition:

An arbitrary unique ID value
The term being defined
The part of speech being defined (refers to a part of speech; a foreign key for PartOfSpeech)
The actual text of the definition
A list of related (“see also”) terms
Date of creation
Created by (foreign key for User)
Date of approval (it may not exist; we call this nullable)
Approved by (foreign key for User; nullable)

Normalization

This is a lot of columns for one database table, and furthermore, some of these items depend on one another. For instance, it’s valid to have no approval date AND no approved by user ID, or to have both an approval date and an approved by user ID, but not one or the other.

This is not desirable; it’s better to have such constraints encoded in the database, and this is impossible with a single table solution. Furthermore, a bulk approval could result in multiple records storing data that refers to a single event, which is a violation of database normalization. To avoid this, I’ll split out approval into its own table.

I’ll also replace the two fields for date of approval and approved by user ID with a single nullable foreign key reference for an Approval. If it’s null, the definition has not been approved; if it has a value, the definition has been approved, and the system can look up which user approved it.

Definition:

An arbitrary unique ID value
The term being defined
The part of speech being defined (that is, “kick, n.” is not the same as “kick, v.”)
The actual text of the definition
A list of related (“see also”) terms
Date of creation
Created by (foreign key for User)
Approval record (nullable, foreign key for Approval)

Written in SQL, the language of the database, that looks like:

CREATE TABLE Definitions (
    id SERIAL PRIMARY KEY,
    term TEXT NOT NULL,
    part_of_speech INT NOT NULL,
    definition TEXT NOT NULL,
    see_also TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    created_by INT NOT NULL,
    approval INT,
    FOREIGN KEY (part_of_speech) REFERENCES PartsOfSpeech(id),
    FOREIGN KEY (created_by) REFERENCES Users(id),
    FOREIGN KEY (approval) REFERENCES Approvals(id)
);

Approval:

An arbitrary unique ID value
The user responsible for approving the definition (foreign key for User)
The date on which the definition was approved

In SQL, this looks like:

CREATE TABLE Approvals (
    id SERIAL PRIMARY KEY,
    approved_by INT NOT NULL,
    approved_at TIMESTAMP NOT NULL,
    FOREIGN KEY (approved_by) REFERENCES Users(id)
);

The final definition-related table is PartsOfSpeech. I considered using a hard-coded list of parts of speech, but I realized that this isn’t scalable across languages, so I will definitely need to let users define their own. Fortunately, it’s very simple.

Part of Speech:

An arbitrary unique ID value
The name of the part of speech, like “noun”
The plural of the name, like “nouns”
The symbol for the part of speech, like “n.”

In SQL:

CREATE TABLE PartsOfSpeech (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    plural TEXT NOT NULL,
    symbol TEXT NOT NULL
);

Identity Crisis

The other important entity in the system is the user. Users have already been referred to several times in the schema, but given the discussion in the previous post, how I refer to them is pretty important.

User:

An arbitrary unique ID value
The user’s name
The user’s email
The hash salt for the user’s password
The hashed password used for login
The federation partner associated with this user (foreign key for FederationPartner, nullable)
The date upon which the account was created in FediDict’s database
The RBAC record for the user (foreign key for Roles)

This definition has brought in a new structure, as well: a Federation Partner. That’s what I’m calling the other ActivityPub sites that FediDict will cooperate with. For now, I’ll keep them simple.

FederationPartner:

An arbitrary unique ID value
A domain name for the partner
The date upon which this partner became known to FediDict.

Users with a non-null partner ID will be written as username@partner.domain, while users with a null partner ID are just username.

Here’s how these two look in SQL:

CREATE TABLE Users (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL,
    salt TEXT NOT NULL,
    passhash TEXT NOT NULL,
    partner INT,
    created_at TIMESTAMP NOT NULL,
    roles INT NOT NULL,
    FOREIGN KEY (partner) REFERENCES FederationPartners(id),
    FOREIGN KEY (roles) REFERENCES Roles(id)
);

CREATE TABLE FederationPartners (
    id SERIAL PRIMARY KEY,
    domain TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL
);

Finally, I need to define the RBAC table. This is based on the previous post:

CREATE TABLE Roles (
    id SERIAL PRIMARY KEY,
    definition_reader BOOLEAN,
    definition_submitter BOOLEAN,
    definition_evaluator BOOLEAN,
    definition_remover BOOLEAN,
    queue_reader BOOLEAN,
    queue_approver BOOLEAN,
    queue_rejecter BOOLEAN,
    account_creator BOOLEAN,
    account_remover BOOLEAN,
    account_role_assigner BOOLEAN,
);

And that’s it!

Next Steps

Now that the database is defined, I need to actually create it. I’m going to use the Diesel database mapper for Rust. In essence, Diesel allows me to write Rust code that is automatically converted into SQL, with all the type safety benefits of Rust code.

Installing Diesel

Most Rust crates are just a Cargo.toml edit away, but Diesel is a bit more complex, as it also has a CLI tool which I’d like to use. So, I first have to:

$ cargo install diesel_cli

Note that this requires both the mysql and postgresql client libraries be installed on my system, despite the fact that I’m only using PostgreSQL.

Now I do need to add it to my Cargo.toml, along with some other libraries. I’m using serde for serialization and deserialization, along with its code generator from serde_derive and JSON functionality from serde_json. I’m also using dotenv to configure the database with a .env file and chrono to handle dates and times.

[package]
name = "fedidict"
version = "0.1.0"
authors = ["Leonora Tindall <nora@nora.codes>"]

[dependencies]
serde = "1"
serde_derive = "1"
serde_json = "1"

diesel = { version ="1", features = ["postgres"] }
dotenv = "0.13"
chrono = "0.4"

Installing PostgreSQL

I’m going to use Docker to handle the database.

docker create -p 127.0.0.1:5432:5432 \
    --name fedidict-db \
    -e POSTGRES_PASSWORD="password" \
    -t postgres:alpine

docker start fedidict-db

And, in a file called .env in the project folder, I’ll make an environment variable:

DATABASE_URL="postgresql://postgres:password@localhost/fedidict"

and then run source .env to get this variable in my shell.

Now I can use Diesel to set up the database. This creates both the database and a directory to hold SQL files for database setup.

diesel setup

Heading North

Diesel’s CLI tool helps manage database state, and in fact I’ll use it to create the database to begin with.

I’ll add a new migration using diesel_cli and put some SQL code in it telling Diesel how to set up (and tear down) the database. This is the only SQL I expect to write, since Diesel will handle interaction with the database in most cases.

$ diesel migration generate initial_schema
Creating migrations/2018-09-10-170446_initial_schema/up.sql
Creating migrations/2018-09-10-170446_initial_schema/down.sql

In Diesel, migrations go both “up” and “down”; up migrations add new functionality, while down migrations remove that functionality.

The up migration is the combination of the SQL I’ve written so far in the post, but the order is important; tables must be created in dependency order. For instance, Approvals must be created before Definitions because Definitions has a foreign key that points to Approvals.

The order I decided on is PartsOfSpeech, FederationPartners, Roles, Users, Approvals, Definitions.

The down migration is much simpler. I simply drop all the tables I created in tue up migration.

DROP TABLE Roles, Definitions, Approvals, PartsOfSpeech, Users, FederationPartners;

It’s good practice to test migrations by running them and then rerunning them.

$ diesel migration run                                   
Running migration 2018-09-10-170446_initial_schema
$ diesel migration redo 
Rolling back migration 2018-09-10-170446_initial_schema
Running migration 2018-09-10-170446_initial_schema

This verifies that the down migration at least doesn’t leave anything obvious hanging around.

After running this migration, two new files will show up: diesel.toml, Diesel’s config file, and src/schema.rs, the Rust mapping of the database schema.

That’s it - the database is set up. In the next post, I’ll discuss how I map these data structures into Rust, how I test them, and what behavior I need to set up for them.