1h 10m 17s logged

Oh wow I locked in

so remember yesterday when I said “today was all architecture, all decisions, all research”?

yeah.

today I wrote the architecture. all of it. 27 files. 3,795 lines added (1,900 ish is just Cargo.lock btw). one commit.

I sat down to “maybe scaffold the embed crate” and then I just… didn’t stop. the hyperfixation hit and I looked up and every single crate had real code in it. not stubs. not //! penumbra-whatever. actual implementations with actual tests.

let me try to explain what happened.


three embedders walked into a crate

(no the title is NOT a punchline to a joke, however much it sounds like one)

penumbra-embed went from a one-line comment to three full embedding providers:

CandleEmbedder is the real one. it loads Snowflake-Arctic-Embed-XS from HuggingFace Hub, tokenizes input, runs it through a BERT-style forward pass (word embeddings, mean pooling, linear encoder), and L2-normalizes the output into a 384-dim vector. the model weights come from safetensors via candle_nn::VarBuilder::from_mmaped_safetensors. the whole Candle pipeline is behind a feature flag so you don’t pull in the ML universe if you don’t need it.
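for anyone who hasn't stared at pooling code before, here's a rough sketch of the last two steps in plain Rust: mean pooling over per-token vectors, then L2 normalization. this is not the Candle code from the crate. `mean_pool` and `l2_normalize` are hypothetical helper names, the vectors are plain `Vec<f32>` instead of tensors, and it skips the padding mask a batched pipeline would normally apply.

```rust
/// Average a list of per-token vectors into one vector (mean pooling).
/// Hypothetical helper: assumes every token vector has the same length
/// and that there are no padding tokens to mask out.
pub fn mean_pool(token_vectors: &[Vec<f32>]) -> Vec<f32> {
    let Some(first) = token_vectors.first() else {
        return Vec::new();
    };
    let mut pooled = vec![0.0f32; first.len()];
    for token in token_vectors {
        for (acc, x) in pooled.iter_mut().zip(token) {
            *acc += *x;
        }
    }
    let count = token_vectors.len() as f32;
    for acc in &mut pooled {
        *acc /= count;
    }
    pooled
}

/// Scale a vector in place so its Euclidean length is 1.
/// A zero vector is left untouched to avoid dividing by zero.
pub fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

once everything is unit length, cosine similarity is just a dot product, which is the whole reason to normalize before handing vectors to an index.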

SimpleEmbedder is the clever one. it slides character trigrams across the input, hashes each one, and accumulates them into buckets across a fixed-dimension vector. then L2-normalizes. it’s deterministic, fast, needs zero ML deps, and produces embeddings where similar text actually lands in similar regions of the vector space. not semantically meaningful in the way a transformer is, but way better than random for development and testing. (You know? Well actually probably you don’t know, most people didn’t spend all of yesterday doing a bunch of NLP and ML research lollll)
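the trigram trick is easier to see in code than in prose, so here's a minimal sketch of the idea. it is an assumption-heavy stand-in, not the actual SimpleEmbedder: the `TrigramEmbedder` name, the lowercasing step, and the choice of `DefaultHasher` are all mine.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a hashed character-trigram embedder.
pub struct TrigramEmbedder {
    dims: usize,
}

impl TrigramEmbedder {
    pub fn new(dims: usize) -> Self {
        assert!(dims > 0, "dims must be non-zero");
        Self { dims }
    }

    pub fn embed(&self, text: &str) -> Vec<f32> {
        // Lowercasing is an assumption; it makes "Note" and "note" collide on purpose.
        let chars: Vec<char> = text.to_lowercase().chars().collect();
        let mut out = vec![0.0f32; self.dims];

        // Slide a 3-character window across the input and count each
        // trigram in the bucket its hash maps to.
        for window in chars.windows(3) {
            // DefaultHasher::new() uses fixed keys, so results repeat from run
            // to run (the algorithm may still change between Rust releases).
            let mut hasher = DefaultHasher::new();
            window.hash(&mut hasher);
            let bucket = (hasher.finish() % self.dims as u64) as usize;
            out[bucket] += 1.0;
        }

        // L2-normalize; inputs shorter than 3 chars stay as the zero vector.
        let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            for x in &mut out {
                *x /= norm;
            }
        }
        out
    }
}

/// Dot product; equals cosine similarity when both inputs are unit length.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

texts that share lots of trigrams ("vector index" vs "vector indexes") bump mostly the same buckets, so their cosine comes out high. unrelated text only overlaps through hash collisions.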

NullEmbedder is the lazy one. returns a zero vector. exists for when you need the trait satisfied but don’t care about the output. sometimes you just need a stub and that’s fine. (aka a lot of tests, DO NOT USE THIS anywhere else… or maybeee hehe /j)

all three implement EmbeddingProvider from core. swap them at construction time. the rest of the system doesn’t know or care which one is running.
(i love doing that pattern. it’s just the best. Thanks, Saikuro)
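
a tiny sketch of what "swap them at construction time" looks like. the trait shape here is a guess (the real EmbeddingProvider in core may well be fallible or async), and `ByteHistogramEmbedder` and `Notebook` are made-up names that exist only for this example.

```rust
/// Assumed shape of the trait; the real signature in core may differ.
pub trait EmbeddingProvider {
    fn dimensions(&self) -> usize;
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Returns a zero vector of the requested size, just to satisfy the trait.
pub struct NullEmbedder {
    dims: usize,
}

impl NullEmbedder {
    pub fn new(dims: usize) -> Self {
        Self { dims }
    }
}

impl EmbeddingProvider for NullEmbedder {
    fn dimensions(&self) -> usize {
        self.dims
    }
    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; self.dims]
    }
}

/// Hypothetical toy provider: counts bytes into buckets (no normalization).
pub struct ByteHistogramEmbedder {
    dims: usize,
}

impl ByteHistogramEmbedder {
    pub fn new(dims: usize) -> Self {
        assert!(dims > 0, "dims must be non-zero");
        Self { dims }
    }
}

impl EmbeddingProvider for ByteHistogramEmbedder {
    fn dimensions(&self) -> usize {
        self.dims
    }
    fn embed(&self, text: &str) -> Vec<f32> {
        let mut out = vec![0.0f32; self.dims];
        for byte in text.bytes() {
            out[byte as usize % self.dims] += 1.0;
        }
        out
    }
}

/// Hypothetical consumer: it only ever sees the trait object.
pub struct Notebook {
    embedder: Box<dyn EmbeddingProvider>,
}

impl Notebook {
    pub fn new(embedder: Box<dyn EmbeddingProvider>) -> Self {
        Self { embedder }
    }

    pub fn vector_for(&self, text: &str) -> Vec<f32> {
        self.embedder.embed(text)
    }
}
```

`Notebook::new(Box::new(NullEmbedder::new(384)))` in a test and a real provider in production is the only line that changes. a generic parameter would work too; the boxed trait object is just the shorter thing to show.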


the vector index

penumbra-index wraps USearch with a clean VectorIndex trait: insert, remove, search, len, is_empty. the USearchIndex backend handles all the HNSW configuration. SearchHit carries a NoteId and a score.

the trait exists so I can mock the index in tests without spinning up the real HNSW structure every time. (learned that lesson from Honzo’s test suite.)
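
roughly what that buys you, as a sketch. the method names come from the list above, but every signature here is an assumption, `NoteId(u64)` is a placeholder for the real type, and `MockIndex` is a hypothetical brute-force stand-in that never touches USearch.

```rust
use std::collections::HashMap;

/// Placeholder id type; the real NoteId lives in the graph crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NoteId(pub u64);

#[derive(Debug, Clone, PartialEq)]
pub struct SearchHit {
    pub id: NoteId,
    pub score: f32,
}

/// Assumed signatures; only the method names are taken from the post.
pub trait VectorIndex {
    fn insert(&mut self, id: NoteId, vector: &[f32]);
    fn remove(&mut self, id: NoteId) -> bool;
    fn search(&self, query: &[f32], k: usize) -> Vec<SearchHit>;
    fn len(&self) -> usize;
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

/// Hypothetical test double: scores every stored vector by dot product.
#[derive(Default)]
pub struct MockIndex {
    entries: HashMap<NoteId, Vec<f32>>,
}

impl VectorIndex for MockIndex {
    fn insert(&mut self, id: NoteId, vector: &[f32]) {
        self.entries.insert(id, vector.to_vec());
    }

    fn remove(&mut self, id: NoteId) -> bool {
        self.entries.remove(&id).is_some()
    }

    fn search(&self, query: &[f32], k: usize) -> Vec<SearchHit> {
        let mut hits: Vec<SearchHit> = self
            .entries
            .iter()
            .map(|(id, vector)| SearchHit {
                id: *id,
                score: query.iter().zip(vector).map(|(q, v)| q * v).sum(),
            })
            .collect();
        // Highest score first; ties come back in arbitrary order.
        hits.sort_by(|a, b| b.score.total_cmp(&a.score));
        hits.truncate(k);
        hits
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

brute force is O(n) per query, which is exactly fine for a test with a dozen notes and exact where HNSW is approximate, so assertions on result order stay stable.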


the graph got a cleanup

penumbra-graph didn’t get new features but it got a cargo fmt.

NoteId lost its redundant to_string() method because… it already implements Display, and anything that implements Display gets to_string() for free. why did I write that. past me, explain yourself. what were you planning.
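
for the record, this is why the hand-written method was dead weight. the standard library has a blanket `impl<T: Display> ToString for T`, so any Display type already has to_string(). the `NoteId(u64)` shape and the "note-" prefix below are made up for the example; the real type may look nothing like this.

```rust
use std::fmt;

/// Placeholder shape; the real NoteId may wrap something else entirely.
pub struct NoteId(pub u64);

impl fmt::Display for NoteId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The "note-" prefix is illustrative only.
        write!(f, "note-{}", self.0)
    }
}

/// No inherent to_string() needed: the blanket ToString impl provides it.
pub fn label(id: &NoteId) -> String {
    id.to_string()
}
```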


core got an Index error variant

PenumbraError::Index(String). three lines. because the index crate needed its own error variant and I wasn’t going to abuse Embedding for it.
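
a sketch of how a variant like that tends to get used. only the `Index(String)` and `Embedding` names come from the actual code; the `Embedding(String)` payload, the Display messages, and the `check_dims` helper are assumptions for illustration.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum PenumbraError {
    /// Payload type assumed to mirror Index.
    Embedding(String),
    Index(String),
}

impl fmt::Display for PenumbraError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Message wording is illustrative, not the crate's actual output.
        match self {
            PenumbraError::Embedding(msg) => write!(f, "embedding error: {}", msg),
            PenumbraError::Index(msg) => write!(f, "index error: {}", msg),
        }
    }
}

impl std::error::Error for PenumbraError {}

/// Hypothetical guard an index backend might run before inserting a vector.
pub fn check_dims(expected: usize, vector: &[f32]) -> Result<(), PenumbraError> {
    if vector.len() != expected {
        return Err(PenumbraError::Index(format!(
            "expected {} dimensions, got {}",
            expected,
            vector.len()
        )));
    }
    Ok(())
}
```

callers can then match on `PenumbraError::Index(_)` and know the failure came from the index side, not from whichever embedder produced the vector.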


the stuff I did but still isn’t fully done

the layout engine with Barnes-Hut quadtree (i tried a simpler manual implementation, but it’s icky, i’m switching to a different crate), the hybrid search engine (i should probably switch this to a usearch method or different crate idk), and the storage layer (c’est mostly bon) all got real implementations in this commit too.

plus a full test suite: core_matrix.rs, embed_matrix.rs, events_matrix.rs, graph_matrix.rs, index_matrix.rs, search_matrix.rs, storage_matrix.rs. seven test files. the naming convention carries over from Honzo because it works and I’m not fixing what isn’t broken.


You might notice that I’m running one commit behind lol.
Welllll I am, yeah
Ooops
Yeahhhh devlogging after every commit is the best way to do it, but sometimes you just wanna code
