Overview
This is the spec for the Garlic Farm wire protocol, based on JRaft, its "exts" code for implementation over TCP, and its "dmprinter" sample application [JRAFT]. JRaft is an implementation of the Raft protocol [RAFT].
We were unable to find any implementation with a documented wire protocol. However, the JRaft implementation is simple enough that we could inspect the code and then document its protocol. This proposal is the result of that effort.
This will be the backend for coordination of routers publishing entries in a Meta LeaseSet. See proposal 123.
Goals
- Small code size
- Based on existing implementation
- No serialized Java objects or any Java-specific features or encoding
- Any bootstrapping is out-of-scope. At least one other server is assumed to be hardcoded, or configured out-of-band of this protocol.
- Support both out-of-band and in-I2P use cases.
Design
The Raft protocol is not a concrete protocol; it defines only a state machine. Therefore we document the concrete protocol of JRaft and base our protocol on it. There are no changes to the JRaft protocol other than the addition of an authentication handshake.
Raft elects a Leader whose job is to publish a log. The log contains Raft Configuration data and Application data. Application data contains the status of each Server's Router and the Destination for the Meta LS2 cluster. The servers use a common algorithm to determine the publisher and contents of the Meta LS2. The publisher of the Meta LS2 is NOT necessarily the Raft Leader.
Specification
The wire protocol is over SSL sockets or non-SSL I2P sockets. I2P sockets are proxied through the HTTP Proxy. There is no support for clearnet non-SSL sockets.
Handshake and authentication
Not defined by JRaft.
Goals:
- User/password authentication method
- Version identifier
- Cluster identifier
- Extensible
- Ease of proxying when used for I2P sockets
- Do not unnecessarily expose server as a Garlic Farm server
- Simple protocol so a full web server implementation is not required
- Compatible with common standards, so implementations may use standard libraries if desired
We will use an websocket-like handshake [WEBSOCKET] and HTTP Digest authentication [RFC-2617]. RFC 2617 Basic authentication is NOT supported. When proxying through the HTTP proxy, communicate with the proxy as specified in [RFC-2616].
Credentials
Whether usernames and passwords are per-cluster, or per-server, is implementation-dependent.
HTTP Request 1
The originator will send the following.
All lines are teriminated with CRLF as required by HTTP.
GET /GarlicFarm/CLUSTER/VERSION/websocket HTTP/1.1
Host: (ip):(port)
Cache-Control: no-cache
Connection: close
(any other headers ignored)
(blank line)
CLUSTER is the name of the cluster (default "farm")
VERSION is the Garlic Farm version (currently "1")
HTTP Response 1
If the path is not correct, the recipient will send a standard "HTTP/1.1 404 Not Found" response, as in [RFC-2616].
If the path is correct, the recipient will send a standard "HTTP/1.1 401 Unauthorized" response, including the WWW-Authenticate HTTP digest authentication header, as in [RFC-2617].
Both parties will then close the socket.
HTTP Request 2
The originator will send the following, as in [RFC-2617] and [WEBSOCKET].
All lines are teriminated with CRLF as required by HTTP.
GET /GarlicFarm/CLUSTER/VERSION/websocket HTTP/1.1
Host: (ip):(port)
Cache-Control: no-cache
Connection: keep-alive, Upgrade
Upgrade: websocket
(Sec-Websocket-* headers if proxied)
Authorization: (HTTP digest authorization header as in RFC 2617)
(any other headers ignored)
(blank line)
CLUSTER is the name of the cluster (default "farm")
VERSION is the Garlic Farm version (currently "1")
HTTP Response 2
If the authentication is not correct, the recipient will send another standard "HTTP/1.1 401 Unauthorized" response, as in [RFC-2617].
If the authentication is correct, the recipient will send the following response, as in [WEBSOCKET].
All lines are teriminated with CRLF as required by HTTP.
HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Upgrade: websocket
(Sec-Websocket-* headers)
(any other headers ignored)
(blank line)
After this is received, the socket remains open. The Raft protocol as defined below commences, on the same socket.
Caching
Credentials shall be cached for at least one hour, so that subsequent connections may jump directly to "HTTP Request 2" above.
Message Types
There are two types of messages, requests and responses. Requests may contain Log Entries, and are variable-sized; responses do not contain Log Entries, and are fixed-size.
Message types 1-4 are the standard RPC messages defined by Raft. This is the core Raft protocol.
Message types 5-15 are the extended RPC messages defined by JRaft, to support clients, dynamic server changes, and efficient log synchronization.
Message types 16-17 are the Log Compaction RPC messages defined in Raft section 7.
Message | Number | Sent By | Sent To | Notes |
---|---|---|---|---|
RequestVoteRequest | 1 | Candidate | Follower | Standard Raft RPC; must not contain log entries |
RequestVoteResponse | 2 | Follower | Candidate | Standard Raft RPC |
AppendEntriesRequest | 3 | Leader | Follower | Standard Raft RPC |
AppendEntriesResponse | 4 | Follower | Leader / Client | Standard Raft RPC |
ClientRequest | 5 | Client | Leader / Follower | Response is AppendEntriesResponse; must contain Application log entries only |
AddServerRequest | 6 | Client | Leader | Must contain a single ClusterServer log entry only |
AddServerResponse | 7 | Leader | Client | Leader will also send a JoinClusterRequest |
RemoveServerRequest | 8 | Follower | Leader | Must contain a single ClusterServer log entry only |
RemoveServerResponse | 9 | Leader | Follower | |
SyncLogRequest | 10 | Leader | Follower | Must contain a single LogPack log entry only |
SyncLogResponse | 11 | Follower | Leader | |
JoinClusterRequest | 12 | Leader | New Server | Invitation to join; must contain a single Configuration log entry only |
JoinClusterResponse | 13 | New Server | Leader | |
LeaveClusterRequest | 14 | Leader | Follower | Command to leave |
LeaveClusterResponse | 15 | Follower | Leader | |
InstallSnapshotRequest | 16 | Leader | Follower | Raft Section 7; Must contain a single SnapshotSyncRequest log entry only |
InstallSnapshotResponse | 17 | Follower | Leader | Raft Section 7 |
Establishment
After the HTTP handshake, the establishment sequence is as follows:
New Server Alice Random Follower Bob
ClientRequest ------->
<--------- AppendEntriesResponse
If Bob says he is the leader, continue as below.
Else, Alice must disconnect from Bob and connect to the leader.
New Server Alice Leader Charlie
ClientRequest ------->
<--------- AppendEntriesResponse
AddServerRequest ------->
<--------- AddServerResponse
<--------- JoinClusterRequest
JoinClusterResponse ------->
<--------- SyncLogRequest
OR InstallSnapshotRequest
SyncLogResponse ------->
OR InstallSnapshotResponse
Disconnect Sequence:
Follower Alice Leader Charlie
RemoveServerRequest ------->
<--------- RemoveServerResponse
<--------- LeaveClusterRequest
LeaveClusterResponse ------->
Election Sequence:
Candidate Alice Follower Bob
RequestVoteRequest ------->
<--------- RequestVoteResponse
if Alice wins election:
Leader Alice Follower Bob
AppendEntriesRequest ------->
(heartbeat)
<--------- AppendEntriesResponse
Definitions
- Source: Identifies the originator of the message
- Destination: Identifies the recipient of the message
- Terms: See Raft. Initialized to 0, increases monotonically
- Indexes: See Raft. Initialized to 0, increases monotonically
Requests
Requests contain a header and zero or more log entries. Requests contain a fixed-size header and optional Log Entries of variable size.
Request Header
The request header is 45 bytes, as follows. All values are unsigned big-endian.
Message type: 1 byte
Source: ID, 4 byte integer
Destination: ID, 4 byte integer
Term: Current term (see notes), 8 byte integer
Last Log Term: 8 byte integer
Last Log Index: 8 byte integer
Commit Index: 8 byte integer
Log entries size: Total size in bytes, 4 byte integer
Log entries: see below, total length as specified
Notes
In the RequestVoteRequest, Term is the candidate's term. Otherwise, it is the leader's current term.
In the AppendEntriesRequest, when the log entries size is zero, this message is a heartbeat (keepalive) message.
Log Entries
The log contains zero or more log entries. Each log entry is as follows. All values are unsigned big-endian.
Term: 8 byte integer
Value type: 1 byte
Entry size: In bytes, 4 byte integer
Entry: length as specified
Log Contents
All values are unsigned big-endian.
Log Value Type | Number |
---|---|
Application | 1 |
Configuration | 2 |
ClusterServer | 3 |
LogPack | 4 |
SnapshotSyncRequest | 5 |
Application
Application contents are UTF-8 encoded [JSON]. See the Application Layer section below.
Configuration
This is used for the leader to serialize a new cluster configuration and replicate to peers. It contains zero or more ClusterServer configurations.
Log Index: 8 byte integer
Last Log Index: 8 byte integer
ClusterServer Data for each server:
ID: 4 byte integer
Endpoint data len: In bytes, 4 byte integer
Endpoint data: ASCII string of the form "tcp://localhost:9001", length as specified
ClusterServer
The configuration information for a server in a cluster. This is included only in a AddServerRequest or RemoveServerRequest message.
When used in a AddServerRequest Message:
ID: 4 byte integer
Endpoint data len: In bytes, 4 byte integer
Endpoint data: ASCII string of the form "tcp://localhost:9001", length as specified
When used in a RemoveServerRequest Message:
ID: 4 byte integer
LogPack
This is included only in a SyncLogRequest message.
The following is gzipped before transmission:
Index data len: In bytes, 4 byte integer
Log data len: In bytes, 4 byte integer
Index data: 8 bytes for each index, length as specified
Log data: length as specified
SnapshotSyncRequest
This is included only in a InstallSnapshotRequest message.
Last Log Index: 8 byte integer
Last Log Term: 8 byte integer
Config data len: In bytes, 4 byte integer
Config data: length as specified
Offset: The offset of the data in the database, in bytes, 8 byte integer
Data len: In bytes, 4 byte integer
Data: length as specified
Is Done: 1 if done, 0 if not done (1 byte)
Responses
All responses are 26 bytes, as follows. All values are unsigned big-endian.
Message type: 1 byte
Source: ID, 4 byte integer
Destination: Usually the actual destination ID (see notes), 4 byte integer
Term: Current term, 8 byte integer
Next Index: Initialized to leader last log index + 1, 8 byte integer
Is Accepted: 1 if accepted, 0 if not accepted (see notes), 1 byte
Notes
The Destination ID is usually the actual destination for this message. However, for AppendEntriesResponse, AddServerResponse, and RemoveServerResponse, it is the ID of the current leader.
In the RequestVoteResponse, Is Accepted is 1 for a vote for the candidate (requestor), and 0 for no vote.
Application Layer
Each Server periodically posts Application data to the log in a ClientRequest. Application data contains the status of each Server's Router and the Destination for the Meta LS2 cluster. The servers use a common algorithm to determine the publisher and contents of the Meta LS2. The server with the "best" recent status in the log is the Meta LS2 publisher. The publisher of the Meta LS2 is NOT necessarily the Raft Leader.
Application Data Contents
Application contents are UTF-8 encoded [JSON], for simplicity and extensibility. The full specification is TBD. The goal is to provide enough data to write an algorithm to determine the "best" router to publish the Meta LS2, and for the publisher to have sufficient information to weight the Destinations in the Meta LS2. The data will contain both router and Destination statistics.
The data may optionally contain remote sensing data on the health of the other servers, and the ability to fetch the Meta LS. These data would not be supported in the first release.
The data may optionally contain configuration information posted by an administrator client. These data would not be supported in the first release.
If "name: value" is listed, that specifies the JSON map key and value. Otherwise, specification is TBD.
Cluster data (top level):
- cluster: Cluster name
- date: Date of this data (long, ms since the epoch)
- id: Raft ID (integer)
Configuration data (config):
- Any configuration parameters
MetaLS publishing status (meta):
- destination: the metals destination, base64
- lastPublishedLS: if present, base64 encoding of the last published metals
- lastPublishedTime: in ms, or 0 if never
- publishConfig: Publisher config status off/on/auto
- publishing: metals publisher status boolean true/false
Router data (router):
- lastPublishedRI: if present, base64 encoding of the last published router info
- uptime: Uptime in ms
- Job lag
- Exploratory tunnels
- Participating tunnels
- Configured bandwidth
- Current bandwidth
Destinations (destinations): List
Destination data:
- destination: the destination, base64
- uptime: Uptime in ms
- Configured tunnels
- Current tunnels
- Configured bandwidth
- Current bandwidth
- Configured connections
- Current connections
- Blacklist data
Remote router sensing data:
- Last RI version seen
- LS Fetch time
- Connection test data
- Closest floodfills profile data for time periods yesterday, today, and tomorrow
Remote destination sensing data:
- Last LS version seen
- LS Fetch time
- Connection test data
- Closest floodfills profile data for time periods yesterday, today, and tomorrow
Meta LS sensing data:
- Last version seen
- Fetch time
- Closest floodfills profile data for time periods yesterday, today, and tomorrow
Administration Interface
TBD, possibly a separate proposal. Not required for the first release.
Requirements of an admin interface:
- Support for multiple master destinations, i.e. multiple virtual clusters (farms)
- Provide comprehensive view of shared cluster state - all stats published by members, who is the current leader, etc.
- Ability to force removal of a participant or leader from the cluster
- Ability to force publish metaLS (if current node is publisher)
- Ability to exclude hashes from metaLS (if current node is publisher)
- Configuration import/export functionality for bulk deployments
Router Interface
TBD, possibly a separate proposal. i2pcontrol is not required for the first release and detailed changes will be included in a separate proposal.
Requirements for Garlic Farm to router API (in-JVM java or i2pcontrol)
- getLocalRouterStatus()
- getLocalLeafHash(Hash masterHash)
- getLocalLeafStatus(Hash leaf)
- getRemoteMeasuredStatus(Hash masterOrLeaf) // probably not in MVP
- publishMetaLS(Hash masterHash, List<MetaLease> contents) // or signed MetaLeaseSet? Who signs?
- stopPublishingMetaLS(Hash masterHash)
- authentication TBD?
Justification
Atomix is too large and won't allow customization for us to route the protocol over I2P. Also, its wire format is undocumented, and depends on Java serialization.
Notes
Issues
- There's no way for a client to find out about and connect to an unknown leader. It would be a minor change for a Follower to send the Configuration as a Log Entry in the AppendEntriesResponse.
Migration
No backward compatibility issues.
References
[JRAFT] | https://github.com/datatechnology/jraft |
[JSON] | (1, 2) https://json.org/ |
[RAFT] | https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf |
[RFC-2616] | (1, 2) https://tools.ietf.org/html/rfc2616 |
[RFC-2617] | (1, 2, 3, 4) https://tools.ietf.org/html/rfc2617 |
[WEBSOCKET] | (1, 2, 3) https://en.wikipedia.org/wiki/WebSocket |