🧠 SAM-MM — a 58M multimodal model that reasons

Pick a scene. SAM-MM perceives the rendered frames (and, for audio scenes, a log-mel spectrogram), then answers in [CHAT] text or a [ACTION] JSON record. Frames are synthetic — this is the model's native world. Nothing is hard-coded: each scene is freshly generated, the model decodes token-by-token, and the answer is checked against ground truth computed independently.

Checkpoint

Scene

max new tokens

24 96

What SAM-MM sees

Honest notes. Physics & motion are SAM-MM's strength (its world-model carries real dynamics). OCR generalizes to unseen numbers but isn't perfect. The cross-modal [ACTION] family is weaker. Audio is the weak modality — it was trained on synthetic pseudo-mel, so strong audio scores here partly reflect that, not true listening. Architecture internals are proprietary and not exposed.