VexIn progress
A Vector Database in Rust
- Rust
- HNSW
- proptest
- CI
Overview & Problem
Vector databases have become infrastructure, but the internals — graph indexes, memory layout, distance kernels — are easy to use and hard to actually understand. Vex is a from-scratch Rust implementation built to learn those parts by writing them.
What I Built
Phase 1: a vex-core / vex-cli Cargo workspace with 22 unit tests, integration tests, and property-based coverage via proptest, all enforced in CI.
Phase 2 (in progress): the HNSW graph index following Malkov & Yashunin (2016).
Architecture
Vex is built around an HNSW index — a layered proximity graph. Inserts thread a vector down through the layers; queries enter at the sparse top layer and greedily descend toward the densest region, returning the k nearest neighbours. The graph and vectors persist to disk so an index can be reloaded without rebuilding.
See it in action
benchmarks (placeholder)
1M × 768d · recall@10 ≥ 0.95p99 query latency · lower is better
throughput · higher is better
Key Technical Decisions & Tradeoffs
HNSW was the right first index because it delivers strong recall and low query latency out of the box, with incremental inserts and only a few tuning knobs (M, efConstruction, efSearch). IVF needs a separate k-means training pass and PQ adds lossy compression that trades accuracy for memory — both are worth reaching for at scale, but the wrong place to start when the goal is to understand a graph index by building one. The price HNSW pays is memory: its edges make it heavier in RAM than a clustered or compressed index, an acceptable tradeoff at the scale Vex targets.
Graph indexes have invariants that are awkward to cover with example-based tests — each layer should be a subset of the one below it, the entry point should live in the top layer, neighbour lists should respect the degree bound. proptest generates randomised insert/query sequences and asserts these invariants across thousands of cases, catching structural bugs hand-written tests would miss.
Distance computation is the inner loop of every search, so it's the one place hand-tuned SIMD genuinely pays off. The plan keeps a clear scalar reference implementation as the correctness baseline and adds SIMD kernels behind it only where profiling shows the distance calculation dominates — optimising the hot path without scattering unsafe, hard-to-verify code through the codebase.
Results
Phase 1 shipped a vex-core / vex-cli Cargo workspace with 22 unit tests alongside integration and property-based coverage, all enforced in CI.
Phase 2 — the HNSW index following Malkov & Yashunin (2016) — is in progress; head-to-head latency and recall benchmarks against faiss and qdrant will be published here once it lands.