[SUCCESS] Intel Mac + AMD GPU Local LLMs Are Finally Usable: ToshLLM (Native SwiftUI, Metal, No Cloud)

engeldlgado

New member
AMD OS X Member
Jun 19, 2026
2
0
1
CPU:
I5-10400
Hi everyone,

I wanted to share a project that could be really useful for anyone running an AMD GPU on a Hackintosh, especially if you've ever tried local LLM inference on macOS and hit the usual walls.

The issue: most local LLM tooling on macOS is built for Apple Silicon. If you’re on an Intel Mac with a discrete AMD GPU, stock llama.cpp under Metal tends to produce corrupted output and runs painfully slow over PCIe.

That’s why I built ToshLLM.

It’s a native SwiftUI app (pure Swift Package Manager, no external dependencies) that bundles llama.cpp with AMD-specific patches and wraps it in a real GUI. The goal is simple: make local LLMs actually usable on Intel Macs with AMD GPUs.

image.png.b154be6e94c71402aa430ddf07957d76.png

What you get:
  • Correct Metal output on AMD dGPUs at full speed
  • Qwen3‑8B Q4: ~101 t/s prompt / ~57 t/s generation
  • Qwen3.6‑35B‑A3B (MoE, hybrid offload): ~123 t/s / ~18.6 t/s, up to ~25.7 t/s with MTP
  • Native chat UI with Markdown, code copy, and file attachments
  • Model manager with per‑model VRAM/RAM estimates
  • Automatic MoE CPU offload
  • MTP speculative decoding
  • Two engines (official + TurboQuant for 100k+ contexts)
  • Built‑in benchmarks
  • OpenAI‑compatible API
  • Bilingual (English / Spanish)
  • New macOS 26 “Tahoe” Liquid Glass interface (falls back to translucent materials on macOS 14/15)
Hardware: developed on RX 6700 XT 12 GB with NootRX, should run on any working Metal setup.

Important: it’s still beta. The DMGs aren’t notarized yet, so on first launch you’ll need to use “Open Anyway” or run:

Bash:
xattr -dr com.apple.quarantine /path/to/ToshLLM.app

The AMD patches live in the repo (patches/), so if you prefer, you can also build from source.

License: GPL‑3.0. Repo, source, and DMG releases:

Link to GitHub Project

I’d really appreciate testing reports from other macOS-supported AMD cards:

RDNA 1: RX 5500 / 5600 / 5700
RDNA 2: RX 6600 / 6700 / 6800 / 6900
Older Polaris / Vega

At the moment it’s not working on GCN/Polaris GPUs like the RX 560 / RX 580, but I’m working on a patch for that and hope to get it fairly soon.

What’s new – June 18, 2026

Update (v0.81.25+):

  • You can now attach files in chat, including PDFs. Text is extracted automatically, and scanned PDFs are read with on-device OCR. Image input for vision models is also landing: drop in a vision model with its mmproj (for example, gemma-3-4b) and you can attach an image and ask questions about it. Vision is experimental; the image encoder runs partly on CPU on AMD GPUs (some Metal ops aren’t supported), so it works but isn’t fully GPU-accelerated yet.
New features and improvements:
  • AMD Flash Attention kernel (experimental engine): a from-scratch Metal kernel that runs attention on the AMD GPU for quantized/turbo KV caches, which otherwise fall back to the CPU and collapse at depth. Supports head dims 128, 256, and now 512, so Google’s Gemma 4 global-attention layers run on the GPU too (≈4× faster than the CPU fallback). Big win for long contexts and quantized KV.
  • Remember conversations (disk cache): in testing, each chat’s KV cache is persisted. Reopening a conversation or restarting the app skips re-processing the prompt. Reload is byte-exact (verified faithful even under sampling); a long chat comes back in well under a second instead of re-prefilling.
  • Faster cold start for external clients (VS Code / Cline): in testing, the engine now pre-warms its cache across restarts, so the big fixed 15–19k-token prompt those clients send isn’t re-processed from scratch on the first request (which used to take minutes).
  • Split model across all GPUs: experimental. For machines with multiple cards, splits the model’s layers over every detected GPU. Flagged as unvalidated on AMD/Metal in the UI – testing reports from multi-GPU setups are very welcome.
  • Redesigned model manager: card-based UI (Recommended / Browse / My models), hardware-aware recommendations per use case (fastest / balanced / quality / coding), live “Trending on Hugging Face,” and downloads with live progress + pause/resume.
  • Dedicated Logs tab with search, severity filter, and diagnostics export, plus context size selectable up to 256k for testing.
  • Default UI language is now English (switchable to Spanish in Settings).

As always, feedback and testing reports on other AMD cards are hugely appreciated.
 

Attachments

  • image.png.7e1a1500cf70e81ad27a520476091a80.png
    image.png.7e1a1500cf70e81ad27a520476091a80.png
    227.4 KB · Views: 0
  • image.png.0dbf45c2885b98ae93f09ca667d89712.png
    image.png.0dbf45c2885b98ae93f09ca667d89712.png
    420.8 KB · Views: 0
  • image.png.f79fa8b58474f1668e2233c283c4baab.png
    image.png.f79fa8b58474f1668e2233c283c4baab.png
    273.5 KB · Views: 0
  AdBlock Detected
Sure, ad-blocking software does a great job at blocking ads, but it also blocks some useful and important features of our website. For the best possible site experience please take a moment to disable your AdBlocker.