[SUCCESS] Intel Mac + AMD GPU Local LLMs Are Finally Usable: ToshLLM (Native SwiftUI, Metal, No Cloud)

engeldlgado · Jun 19, 2026

Hi everyone,

I wanted to share a project that could be really useful for anyone running an AMD GPU on a Hackintosh, especially if you've ever tried local LLM inference on macOS and hit the usual walls.

The issue: most local LLM tooling on macOS is built for Apple Silicon. If you’re on an Intel Mac with a discrete AMD GPU, stock llama.cpp under Metal tends to produce corrupted output and runs painfully slow over PCIe.

That’s why I built ToshLLM.

It’s a native SwiftUI app (pure Swift Package Manager, no external dependencies) that bundles llama.cpp with AMD-specific patches and wraps it in a real GUI. The goal is simple: make local LLMs actually usable on Intel Macs with AMD GPUs.

image.png.b154be6e94c71402aa430ddf07957d76.png

What you get:

Correct Metal output on AMD dGPUs at full speed
Qwen3‑8B Q4: ~101 t/s prompt / ~57 t/s generation
Qwen3.6‑35B‑A3B (MoE, hybrid offload): ~123 t/s / ~18.6 t/s, up to ~25.7 t/s with MTP
Native chat UI with Markdown, code copy, and file attachments
Model manager with per‑model VRAM/RAM estimates
Automatic MoE CPU offload
MTP speculative decoding
Two engines (official + TurboQuant for 100k+ contexts)
Built‑in benchmarks
OpenAI‑compatible API
Bilingual (English / Spanish)
New macOS 26 “Tahoe” Liquid Glass interface (falls back to translucent materials on macOS 14/15)

Hardware: developed on RX 6700 XT 12 GB with NootRX, should run on any working Metal setup.

Important: it’s still beta. The DMGs aren’t notarized yet, so on first launch you’ll need to use “Open Anyway” or run:

Bash:

xattr -dr com.apple.quarantine /path/to/ToshLLM.app

The AMD patches live in the repo (patches/), so if you prefer, you can also build from source.

License: GPL‑3.0. Repo, source, and DMG releases:

Link to GitHub Project

I’d really appreciate testing reports from other macOS-supported AMD cards:

RDNA 1: RX 5500 / 5600 / 5700
RDNA 2: RX 6600 / 6700 / 6800 / 6900
Older Polaris / Vega

At the moment it’s not working on GCN/Polaris GPUs like the RX 560 / RX 580, but I’m working on a patch for that and hope to get it fairly soon.

What’s new – June 18, 2026

Update (v0.81.25+):

You can now attach files in chat, including PDFs. Text is extracted automatically, and scanned PDFs are read with on-device OCR. Image input for vision models is also landing: drop in a vision model with its mmproj (for example, gemma-3-4b) and you can attach an image and ask questions about it. Vision is experimental; the image encoder runs partly on CPU on AMD GPUs (some Metal ops aren’t supported), so it works but isn’t fully GPU-accelerated yet.

New features and improvements:

AMD Flash Attention kernel (experimental engine): a from-scratch Metal kernel that runs attention on the AMD GPU for quantized/turbo KV caches, which otherwise fall back to the CPU and collapse at depth. Supports head dims 128, 256, and now 512, so Google’s Gemma 4 global-attention layers run on the GPU too (≈4× faster than the CPU fallback). Big win for long contexts and quantized KV.
Remember conversations (disk cache): in testing, each chat’s KV cache is persisted. Reopening a conversation or restarting the app skips re-processing the prompt. Reload is byte-exact (verified faithful even under sampling); a long chat comes back in well under a second instead of re-prefilling.
Faster cold start for external clients (VS Code / Cline): in testing, the engine now pre-warms its cache across restarts, so the big fixed 15–19k-token prompt those clients send isn’t re-processed from scratch on the first request (which used to take minutes).
Split model across all GPUs: experimental. For machines with multiple cards, splits the model’s layers over every detected GPU. Flagged as unvalidated on AMD/Metal in the UI – testing reports from multi-GPU setups are very welcome.
Redesigned model manager: card-based UI (Recommended / Browse / My models), hardware-aware recommendations per use case (fastest / balanced / quality / coding), live “Trending on Hugging Face,” and downloads with live progress + pause/resume.
Dedicated Logs tab with search, severity filter, and diagnostics export, plus context size selectable up to 256k for testing.
Default UI language is now English (switchable to Spanish in Settings).

As always, feedback and testing reports on other AMD cards are hugely appreciated.

AlexMark777999 · 2026-06-28T15:31:51+0100

"Hey! I want to express my huge gratitude to the ToshLLM author. The interface is stunning – everything is intuitive, beautiful, and it’s genuinely a pleasure to work with the program. Thanks also for the frequent updates! I’ve run the local Gemma-3-4b model on my Hackintosh (Ryzen 5-7500F + Radeon RX 6800 XT) under macOS Tahoe. The benchmark results were simply amazing: the generation speed was 113.8 tokens per second! The model is practically flying, responses appear instantly. This is the best software I’ve seen for working with local AI. Thank you for your hard work!"

engeldlgado · 2026-06-28T19:21:32+0100

Hey Alex, thanks for your words. I made this app for the community and also for my personal use. I appreciate that you are testing it and that it is working wonderfully. I apologize in advance if there are no updates in the coming days, as there was an earthquake in my city. My family and friends are fine, but communications were heavily affected and part of my home has been damaged. I am currently organizing myself to resume the project as soon as possible... If you find anything that doesn't work as expected, I invite you to share it. As soon as I can, I will fix the issues reported by the community so that it becomes perfect and reliable for daily use... Best regards.

Search

Search

[SUCCESS] Intel Mac + AMD GPU Local LLMs Are Finally Usable: ToshLLM (Native SwiftUI, Metal, No Cloud)

engeldlgado

New member

Attachments

AlexMark777999

New member

Attachments

engeldlgado

New member