Bonsai 1-bit LLM (1.7B, Qwen3 arch) compiled from Almide to WebAssembly. Pure Almide inference — not a JS ML framework wrapper.
Chat: the Qwen3 chat template is applied (tokenizer.apply_chat_template → ChatML with
<|im_start|> markers) so that base Bonsai behaves as an assistant. Output streams with a
KV cache: the first call does a full prompt eval, then each generated token is one attention step
over the cached K/V (predict_step_kv_bytes). KV state round-trips through
List[Bytes] between steps: JS copies the caches out, __heap_restores
scratch, and feeds them back. Sampling (temperature + top-k + repetition penalty) runs inside Almide;
JS supplies a PRNG-derived rand01. Generation stops on
<|im_end|>. Weights stay packed in Q1_0 via
matrix.linear_q1_0_row_no_bias. The model is cached in IndexedDB on first run.
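The ChatML framing mentioned above can be illustrated with a minimal Python sketch. The function name build_chatml_prompt is hypothetical and not part of the demo, and the real Qwen3 template may add details (for example a system prompt or thinking tags) that are omitted here; only the <|im_start|> / <|im_end|> layout is shown.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Hypothetical helper: wrap each message in ChatML markers."""
    parts = []
    for message in messages:
        # Each turn is: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues as the assistant.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Hello"}])
```

Because the assistant turn is left open, the model is expected to close it by emitting <|im_end|>, which is why that token works as the stop condition.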
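The decode flow described above (one full prompt eval, then one cached step per token until <|im_end|>) might look roughly like the following Python sketch. Everything here is an assumption for illustration: generate, prefill, step and pick are hypothetical names, the cache is modeled as a plain list of bytes objects standing in for List[Bytes], and the toy model at the bottom is not the real network.

```python
from typing import Callable, List, Sequence, Tuple

Cache = List[bytes]    # stand-in for the List[Bytes] KV state
Logits = List[float]


def generate(prompt_ids: Sequence[int],
             prefill: Callable[[Sequence[int]], Tuple[Logits, Cache]],
             step: Callable[[int, Cache], Tuple[Logits, Cache]],
             pick: Callable[[Logits], int],
             eos_id: int,
             max_new_tokens: int = 64) -> List[int]:
    """Hypothetical loop: prefill once, then one cached step per new token."""
    logits, cache = prefill(prompt_ids)      # full prompt eval, builds the cache
    out: List[int] = []
    for _ in range(max_new_tokens):
        token = pick(logits)
        if token == eos_id:                  # e.g. the id of <|im_end|>
            break
        out.append(token)
        # The host hands the cache back in and receives the updated cache.
        logits, cache = step(token, cache)
    return out


# Toy stand-ins (assumptions): a 4-token vocabulary where each step
# favors the next token id, and token 3 plays the role of <|im_end|>.
def one_hot(i: int, n: int = 4) -> Logits:
    return [1.0 if j == i else 0.0 for j in range(n)]


def toy_prefill(ids: Sequence[int]) -> Tuple[Logits, Cache]:
    return one_hot(1), [b"kv0"]


def toy_step(token: int, cache: Cache) -> Tuple[Logits, Cache]:
    return one_hot(token + 1), cache + [b"kv"]


def argmax(logits: Logits) -> int:
    return max(range(len(logits)), key=logits.__getitem__)


tokens = generate([0], toy_prefill, toy_step, argmax, eos_id=3)
```

Passing the cache in and out of step as an explicit value mirrors the round trip in the text, where the host owns the cache bytes between calls.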
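As a rough illustration of the sampling step, here is a self-contained Python sketch combining repetition penalty, top-k and temperature, with the random number passed in by the caller in the same way the text has JS supply rand01. The function name sample_token, the default values, the order of operations and the divide/multiply form of the repetition penalty are assumptions; the Almide implementation may differ.

```python
import math
from typing import List, Sequence


def sample_token(logits: Sequence[float],
                 recent_tokens: Sequence[int],
                 rand01: float,
                 temperature: float = 0.8,
                 top_k: int = 40,
                 repetition_penalty: float = 1.1) -> int:
    """Hypothetical sampler; rand01 must lie in [0, 1)."""
    adjusted: List[float] = list(logits)

    # Repetition penalty (assumed form): push already-seen tokens toward
    # lower scores by dividing positive logits and multiplying negative ones.
    for t in set(recent_tokens):
        if adjusted[t] > 0:
            adjusted[t] /= repetition_penalty
        else:
            adjusted[t] *= repetition_penalty

    # Top-k: keep the indices of the k highest adjusted logits.
    k = max(1, min(top_k, len(adjusted)))
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]

    # Temperature softmax over the survivors (unnormalized weights,
    # shifted by the max for numerical stability).
    temp = max(temperature, 1e-6)
    peak = adjusted[top[0]]
    weights = [math.exp((adjusted[i] - peak) / temp) for i in top]

    # Inverse-CDF draw using the caller-supplied random number.
    threshold = rand01 * sum(weights)
    acc = 0.0
    for i, w in zip(top, weights):
        acc += w
        if threshold < acc:
            return i
    return top[-1]  # guard against floating-point round-off
```

Taking rand01 as an argument keeps the sampler deterministic for a given input, which makes it easy to test and lets the host choose the PRNG.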