Bonsai × Almide — Browser chat

Bonsai 1-bit LLM (1.7B, Qwen3 arch) compiled from Almide to WebAssembly. Pure Almide inference — not a JS ML framework wrapper.

Chat

System:

Settings: max new tokens, temperature, top-k, repetition penalty, seed.

Logs

Chat: The Qwen3 chat template is applied (tokenizer.apply_chat_template → ChatML with <|im_start|> markers) so that base Bonsai behaves as an assistant. Generation streams with a KV cache: the first call evaluates the full prompt, and each generated token after that is a single attention step over the cached K/V (predict_step_kv_bytes). The KV state round-trips through List[Bytes] between steps: JS copies the caches out, __heap_restores scratch, and feeds them back in. Sampling (temperature, top-k, and repetition penalty) runs inside Almide; JS supplies only the PRNG-derived rand01 value. Generation stops on <|im_end|>. Weights stay packed in Q1_0 and are used through matrix.linear_q1_0_row_no_bias. The model is cached in IndexedDB on first run.
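
The sampling step described above (repetition penalty, then top-k, then temperature, with the random draw passed in from outside) can be sketched in TypeScript as follows. This is an illustration of the general technique, not the Almide source: makeRand01, sampleToken, and SamplingParams are hypothetical names, the repetition penalty uses the common divide-if-positive, multiply-if-negative rule as an assumption, and the simple LCG stands in for whatever seeded PRNG the page actually uses.

```typescript
// Hypothetical sketch; none of these names come from the Bonsai/Almide code.

// Seeded linear congruential generator returning floats in [0, 1).
// Stand-in for the page's PRNG; constants are the Numerical Recipes LCG.
function makeRand01(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    // The product stays below 2^53, so the arithmetic is exact.
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

interface SamplingParams {
  temperature: number;       // <= 0 is treated as greedy decoding
  topK: number;              // number of highest-logit candidates to keep
  repetitionPenalty: number; // 1 disables the penalty
}

function sampleToken(
  logits: number[],
  previousTokens: number[],
  params: SamplingParams,
  rand01: number,
): number {
  const adjusted = logits.slice();

  // Repetition penalty (assumed rule): lower the logit of every token that
  // has already appeared, applying the penalty once per distinct token.
  const seen: boolean[] = [];
  for (let n = 0; n < previousTokens.length; n++) {
    const id = previousTokens[n];
    if (id < 0 || id >= adjusted.length || seen[id]) continue;
    seen[id] = true;
    const v = adjusted[id];
    adjusted[id] = v > 0 ? v / params.repetitionPenalty : v * params.repetitionPenalty;
  }

  // Top-k: indices of the k largest adjusted logits, best first.
  const k = Math.max(1, Math.min(params.topK, adjusted.length));
  const order = adjusted
    .map((_, i) => i)
    .sort((a, b) => adjusted[b] - adjusted[a])
    .slice(0, k);

  if (params.temperature <= 0) return order[0];

  // Temperature-scaled softmax weights, shifted by the max for stability.
  const maxLogit = adjusted[order[0]];
  const weights = order.map((i) => Math.exp((adjusted[i] - maxLogit) / params.temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Inverse-CDF draw using the externally supplied rand01.
  let threshold = rand01 * total;
  for (let j = 0; j < k; j++) {
    threshold -= weights[j];
    if (threshold < 0) return order[j];
  }
  return order[k - 1];
}

// Usage: the caller owns the PRNG and passes one draw per generated token.
const rand = makeRand01(42);
const nextToken = sampleToken(
  [0.1, 2.0, 1.5, -0.3],
  [1],
  { temperature: 0.8, topK: 3, repetitionPenalty: 1.1 },
  rand(),
);
```

Passing rand01 in as a plain number keeps the sampler deterministic for a given seed and lets the host own the random source, which matches the division of labour the notes describe between JS and Almide.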