Author: Toshihiro Iwamoto (岩本 俊弘)
Qwen3-80B vs. Strix Halo: the full record of pushing an iGPU far beyond its pay grade
【Turned Away at the Door】 The moment vLLM starts, it is dismissed with "never heard of an FP8 MoE backend." gfx1151 (RDNA 3.5) is so new that anything short of a datacenter chip gets shown the door.
【Back-Alley Surgery】 Operating directly on vLLM's source code. Forcibly rewriting the architecture-check "if statement" to hypnotize the 8060S into believing "you are an MI300 (datacenter-class)."
【Silent Deadlock】 It seems to start, then freezes "waiting for the Core Engine." The shared-memory handshake collides with a massive kernel compile, and the AI "thinks so hard it locks up."
【Bread Baking at 4 a.m.】 To optimize for an unknown chip, Triton kicks off "autotuning," searching for the perfect bake for each of the 512 experts (MoE), a process that turns out to take hours.
【Vanishing Effort】 The despair of watching compile results (the cache) evaporate on every container restart. Mount a host-side directory to preserve the "secret sauce" so the hours-long wait never happens again.
【The Disguise】 Use an environment variable to make it chant "you are an MI300X" (gfx942 spoofing). This dodges the dog-slow generic path and forcibly reuses datacenter-grade fast settings.
【The 128GB Wall】 The FP8 version of the 80B model weighs about 80GB on its own. Of the 128GB of shared memory, the OS and the model devour most of it, leaving almost no room for "short-term memory" (the KV cache).
【3,000 Lines of Despair】 It answers short greetings, but goes silent when fed 3,000 lines of source code. The KV cache overflows RAM into slow swap, and generation drops to "one character every few minutes."
【Life Support】 Capping the context window at 16k and GPU utilization at 70% leaves the OS "enough oxygen to breathe," and the AI finally comes to.
【Enlightenment】 Conclusion: the vLLM experiment was tremendous fun, but for real work (Aider and the like), llama.cpp (Q6_K GGUF) is the strongest option. An iGPU has to fight like an iGPU.
This article is about running Qwen3-Coder-Next-FP8 on a recently acquired Ryzen AI Max+ 395 machine (128GB RAM). (The "Qwen3-80B" in the opening is the generative AI garbling the name; it refers to the same model.) With Gemini's help I got it running in a single night; without generative AI it would probably have taken about a week. The summary above is that night's exchange as condensed by Gemini. Throughout this article, AI output is set off in boxes like that to distinguish it from my own text.
As for that 10-item summary: perhaps my instruction to "make it entertaining for coworkers" backfired, but while it captures the mood, the details are riddled with errors.
| # | Verdict | Notes |
|---|---|---|
| 1 | ○ | |
| 2 | × | I never actually wrote "MI300" anywhere |
| 3 | - | A shared-memory error did appear once, and I set the SHM size as instructed, but whether shared memory was the cause is unclear (I suspect it was really the tuning wait from item 4) |
| 4 | ○ | |
| 5 | ○ | I had to find the directory myself |
| 6 | × | The AI did propose this, but I never tried it, and reusing another GPU's tuning would hardly make things faster anyway |
| 7 | ○ | I haven't done the math to verify, but I'll allow it |
| 8 | × | It didn't run at all, so it was probably a different kind of breakage |
| 9 | × | Correct that specifying --gpu-memory-utilization worked around it, but I didn't set it to 0.7 |
| 10 | ○ | I haven't tried llama.cpp yet, but it's true the current speed is unusable |
My rough impressions, in brief:
Generative AI behaves as if it has about half to 70% of the ability of the person using it

The numbers are made up; I just hope the ratio stays below 1. Since it's fundamentally a next-token predictor this may not be surprising, but ask precisely and you get precise answers; ask questions you don't really understand and you get only vague ones back. (Even math questions go reasonably well if you use the right terminology, and health questions sound plausible if you phrase them like a paper.)
Not a bad pairing partner, maybe

It's nearly free, good at search, and has infinite stamina, so even a somewhat dumb partner more than makes up for it (whether it actually does is debatable). It never takes offense even when you snipe at it after a string of terrible answers, which is convenient, but you do have to watch out for dependence. (The moment you catch yourself thinking "this is more relaxing than dealing with people," you're in trouble.)
Useful information comes buried in mountains of garbage

The sheer volume of text saps any will to fact-check each claim, so you end up picking and choosing on vibes. When you fall into the loop of half-heartedly trying whatever the AI suggests and feeding the results back, you start to feel like you are the agent. You could equally say it feels like being driven along by the machine.
It runs, but it isn't very useful

Qwen3-Coder-Next itself is not bad. Details later, to avoid spoilers. Also, fair warning: this article contains little in the way of genuinely useful technical information.
Below, I go through concretely which parts of the generative AI's output were helpful and which were misleading.
Deciding which model to run, and how
Qwen3-Coder-Next is 80B, so it's unmanageable without at least 8-bit quantization (representing model weights in fewer bits to cut memory and compute; FP8 is 8-bit floating point). I had seen something called "FP8 Dynamic quant" mentioned, so I asked (in English): "Does Qwen3-Coder-Next (FP8 Dynamic quant) run on ROCm?"
Response (latter half omitted) (click to expand)
Yes, Qwen3-Coder-Next (specifically the 80B-A3B architecture) is fully supported on ROCm, and you can run the FP8 Dynamic Quantization version provided you have the right hardware and software stack.
As of early 2026, AMD and the Qwen team have collaborated on "Day 0" support for this model family.
1. Hardware Requirements
FP8 dynamic quantization on ROCm is specifically optimized for AMD Instinct™ MI300 series (MI300X, MI325X, MI35X) and newer.
MI300X/MI325X: Full native support for FP8 hardware acceleration.
RDNA3 (e.g., RX 7900 XTX): While RDNA3 has some FP8 instructions, performance in frameworks like vLLM has historically been limited compared to the Instinct line. For consumer cards, you might find GGUF or AWQ more stable, though ROCm 7.x has improved FP8 parity.
Strix Halo / Ryzen AI Max: Recent reports show these APUs running Qwen3-Coder-Next via ROCm, though usually in GGUF or MXFP4 formats for best performance.
2. Software Frameworks
To run the FP8 dynamic version, you should use one of the following two backends, as they contain the optimized Triton kernels for ROCm:
vLLM (Recommended)
vLLM supports PTPC-FP8 (Per-Token Per-Channel FP8), which is the most efficient way to run dynamic FP8 on AMD.
How to run:

```bash
VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --quantization ptpc_fp8 \
  --tensor-parallel-size <num_gpus>
```

Why it works: It uses AITER (AMD Instinct Transformer Engine) for fused MoE kernels, significantly speeding up the 80B-A3B's sparse architecture.

SGLang
SGLang often sees the fastest "Day 0" updates for Qwen models. It includes optimized Gated Delta Net kernels specifically for the Qwen3 hybrid attention architecture.
Command:

```bash
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-Coder-Next-FP8 \
  --attention-backend triton \
  --tp-size <num_gpus>
```
A supremely confident answer: "Qwen3-Coder-Next is fully supported on ROCm, and with the right hardware and software stack you can run the FP8 Dynamic Quantization version." (On careful reading, it never actually says FP8 Dynamic Quantization runs on Strix Halo.) (Strix Halo is the codename for Ryzen AI Max. The GPU in the Ryzen AI Max+ 395 is the Radeon 8060S, whose LLVM target name is gfx1151.)
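Before going further, some back-of-envelope arithmetic of my own (not from the AI's answer) on why quantization is non-negotiable here; the ~6.5 bits/weight figure for Q6_K is my assumption, not an official number:

```python
# Back-of-envelope weight footprint for an 80B-parameter model.
# Idealized numbers: weights only, no KV cache or runtime overhead.
PARAMS = 80e9

def weight_gb(bits_per_param: float) -> float:
    """Model weight size in gigabytes (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("Q6_K (~6.5 bpw)", 6.5)]:
    print(f"{name:>16}: {weight_gb(bits):6.1f} GB")
# FP16 -> 160.0 GB, FP8 -> 80.0 GB, ~6.5 bpw -> 65.0 GB
```

On a 128GB shared-memory machine, only the 8-bit-and-below rows even fit before the KV cache enters the picture.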
Running vLLM and aider in Docker

After a few questions back and forth, this was the final command it produced. I haven't tried it yet, but it looks plausible.
```bash
docker run -it --rm \
  --volume $(pwd):/app \
  --workdir /app \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=unused \
  paulgauthier/aider \
  --model openai/qwen3-coder-next \
  --edit-format diff
```
When I asked whether the openai prefix on a qwen model was there to trick aider, this was the answer.

Full response (click to expand)
It’s not so much about "tricking" Aider as it is about routing.

Aider uses a library called `liteLLM` under the hood. When you use the `--model` flag, Aider needs to know which "provider" logic to use (OpenAI, Anthropic, Bedrock, etc.).

Why you use the `openai/` prefix:

If you just type `--model qwen3-coder-next`, Aider won't know which API protocol to use and might try to look for a native "Qwen" provider that doesn't exist in its default list.

By using the `openai/` prefix, you are telling Aider: "Treat this model as if it were an OpenAI model. Use the OpenAI chat-completions structure, but send the data to the custom `OPENAI_API_BASE` I provided."

Does the actual name matter?

The string after the slash just needs to match whatever your vLLM server is reporting as the model name.

If your vLLM started with `--model Qwen/Qwen3-Coder-Next-FP8`, then Aider should use: `--model openai/Qwen/Qwen3-Coder-Next-FP8`

If you used an alias in vLLM like `--served-model-name qwen3`, then Aider should use: `--model openai/qwen3`

Aider Settings for Qwen3 (Very Important)

Since Aider doesn't "know" Qwen3 by default, it will give you a warning about unknown model capabilities. To get the best performance on ROCm, you should manually specify the edit format. Qwen3 is a powerhouse at the diff format.

Use this full command for your Docker container:

```bash
aider --model openai/<your-vllm-model-name> \
  --edit-format diff \
  --suggest-shell-commands \
  --map-tokens 1024
```

Summary of "The Trick"

Prefix: Always use `openai/` so Aider uses the correct API protocol.

Base URL: Point `OPENAI_API_BASE` to your vLLM ROCm container.

Key: Set `OPENAI_API_KEY` to any random string (e.g., `sk-123`). Aider will crash if it's empty, even though vLLM doesn't require it.

Is your vLLM currently serving the model under its full HuggingFace name, or did you give it a shorter alias?
Apparently this is needed so Aider uses the OpenAI-style API. On top of that, it helpfully advised adding --edit-format diff, since Qwen is good at the diff format.

What about distributed execution across two machines?

A coworker has been experimenting with this (https://www.valinux.co.jp/blog/entry/20251003), and since I had splurged on the model with a PCIe slot, I asked whether two machines would let me run FP16.

Full response (click to expand)
This is a powerhouse setup for 2026, but let’s look at the math and the hardware reality for the Radeon 8060S (the iGPU found in Strix Halo / Ryzen AI Max chips) versus the FP16 requirements of an 80B model.
1. The VRAM Math: Will FP16 Fit?
To run Qwen3-Coder-Next (80B) in full FP16, you need to store the weights and have room for the "KV Cache" (the model's short-term memory).
Weight Requirements: 80 billion parameters × 2 bytes/param = 160 GB just to load the model.
Total VRAM available: You have two boxes with 128GB of RAM each. On Strix Halo (8060S), the system RAM is shared. Usually, you can BIOS-allocate up to 96GB per box as VRAM.
Combined VRAM: 96 GB × 2 = 192 GB.
Verdict: Yes, it technically fits. However, you have only 32GB of headroom (192−160) for the context window and OS overhead across both machines. In FP16, this is tight. If you use a large context window (e.g., 32k+ tokens), you might hit OOM (Out of Memory) errors.
2. The ROCm Multi-Node Challenge
vLLM supports multi-node inference via Ray or MoRI (Modular RDMA Interface), but doing this across two "consumer" APU boxes (Radeon 8060S) is significantly harder than a single box with multiple GPUs.
The Bottleneck: To act as one "virtual" 192GB GPU, the two boxes must swap data constantly. Since these aren't MI300X enterprise cards with Infinity Fabric, they have to communicate over Ethernet.
Requirement: You need a 100Gbps (or at least 10Gbps) network link between the two boxes. If you are on standard 1Gbps Ethernet, the "Time to First Token" will be fast (because it's an MoE model with only 3B active params), but the generation speed will crawl because the "All-Reduce" operations will hang on your network speed.
3. Recommended Implementation Strategy
If you are committed to this dual-box setup, do not run FP16. It is inefficient for this architecture. Instead, run FP8 or Q8_0.
| Setup | VRAM Used | Performance on Dual 8060S |
|---|---|---|
| FP16 | ~175GB | Slow (Network bound, high latency) |
| FP8 (Dynamic) | ~90GB | Fast (Fits in a single box, no network lag) |
| Q6_K (GGUF) | ~65GB | Blazing (Fits in a single box with huge context) |

Why "Single Box" is better for you:
Since the Radeon 8060S in a 128GB box can address up to 96GB of VRAM, you can run the FP8 version of Qwen3-Coder-Next on one machine.
Zero Network Latency: No Ethernet bottleneck.
3B Active Params: Because only 3B parameters "fire" at once, a single 8060S (which has 40 RDNA 3.5 CUs) will actually feel very snappy.
Summary Recommendation
Don't split FP16 across two boxes via vLLM; the networking overhead on consumer hardware will ruin the experience.
Run the FP8 version on a single 128GB 8060S box. You will get ~20-30 tokens/second, which is perfect for Aider or OpenCode.
Would you like the vLLM start command optimized for a single-node 96GB VRAM allocation on your 8060S?
The gist: the network is too slow, forget it; FP8 on a single box will get you 20-30 tokens/sec. (This later turns out to be a whopping lie.)

When I mentioned "I have spare ConnectX cards lying around," this was the answer.

Full response (click to expand)
Since you have ConnectX (CX) NICs, you have the hardware "fast lane" (RDMA/RoCE) required to actually make a multi-node setup work.
However, running FP16 across two consumer-class Radeon 8060S boxes (Strix Halo) via vLLM is a "high-effort, high-reward" project. Here is the technical roadmap to pull this off without the networking becoming a bottleneck.
1. The Interconnect: RoCE is Mandatory
Standard TCP/IP over those ConnectX cards will be too slow for the "All-Reduce" operations vLLM performs during inference. You must configure RoCE v2 (RDMA over Converged Ethernet).
Direct Link: If you only have two boxes, just plug a DAC cable directly between the two ConnectX cards. No switch needed.
IP Setup: Give the ConnectX interfaces static IPs on a dedicated subnet (e.g., `10.0.0.1` and `10.0.0.2`).

MTU: Set MTU to 9000 (Jumbo Frames) on both cards for RDMA stability.
2. Software Stack: MoRI vs. Ray
As of early 2026, AMD has introduced MoRI (Modular RDMA Interface) specifically for vLLM on ROCm. This is significantly faster than the old Ray-based backend for multi-node.
To use MoRI with your ConnectX cards: You need to pass the RDMA devices into your Docker containers.
```bash
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --device=/dev/infiniband/uverbs0 \
  --device=/dev/infiniband/rdma_cm \
  --network=host \
  -e NCCL_IB_HCA=mlx5_0 \
  -e NCCL_IB_GID_INDEX=3 \
  vllm/vllm-rocm
```

(Note: `NCCL_IB_GID_INDEX=3` is usually the default for RoCE v2 on ConnectX cards.)

3. Running Qwen3-Coder-Next (FP16)
Since you want to run FP16, you will be using Pipeline Parallelism (PP) across the two nodes. This is actually better for your setup than Tensor Parallelism because it requires less frequent synchronization across the network.
On Box 1 (Master/Head):

```bash
vllm serve Qwen/Qwen3-Coder-Next \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --dist-backend mori \
  --host 0.0.0.0 --port 8000
```

On Box 2 (Worker): You will launch the vLLM worker pointing to the IP of the first box.
4. The "VRAM Reality Check"
Even with two 128GB boxes, remember that the Radeon 8060S is an APU.
Shared Memory: Your 128GB is shared with the CPU.
The 160GB Limit: FP16 weights take 160GB. Across two nodes, that's 80GB per node.
Operating System: Linux + ROCm overhead usually takes ~8-10GB.
Calculation: 128 GB (total) − 80 GB (weights) − 10 GB (OS) = 38 GB left for KV Cache.
Result: This is actually a very healthy amount of memory! You should be able to support a context window of 64k to 128k tokens in FP16 across the two machines.
Why you might still want FP8
Even with the ConnectX cards, moving data between two APUs will always be slower than staying inside one "chip."
FP16 Multi-Node: ~5-8 tokens/sec (Latency bound by the network).
FP8 Single-Node: ~25-35 tokens/sec (Compute bound by the 8060S).
My recommendation: Use the ConnectX cards to build the cluster because it's an awesome project, but if you find Aider is "laggy" while waiting for the network sync, drop down to FP8 on a single node and use that extra second box as a dedicated Build/Test agent.
Do you have the DAC cables for those ConnectX cards, or are you running them through a switch?
Its verdict: building a two-box cluster would be delightful and I should absolutely do it, but FP8 is probably the better choice. It confidently instructs me that RDMA is mandatory and that I should use MoRI. (Whether to believe any of this I leave to readers who make it to the end.)

For good measure, I double-checked how performance and quality change at 6, 8, and 16 bits.
| Bit-Depth | Quality (Accuracy Recovery) | Description |
|---|---|---|
| 16-bit (FP16) | 100% (Baseline) | The gold standard, but requires 160GB VRAM. No logic gains over 8-bit. |
| 8-bit (INT8/FP8) | ~99.9% | Indistinguishable from FP16. Used in production (like DeepSeek/GPT-4o). |
| 6-bit (Q6_K) | ~99.0% | The "luxury" local quant. Perfect for complex architecture and rare languages. |
| 4-bit (Q4_K_M) | ~98.0% | The standard. Slight increase in "perplexity" but usually writes the same code. |
Where do these numbers come from? There was a reddit link attached; apparently they were lifted from a table evaluating various quantizations of llama or some such. Whether results for a different model (and a fairly old llama at that) can simply be transplanted is questionable, but generative AI doesn't sweat the details.

The suggested command line doesn't work

The plan: run it 8-bit quantized under vLLM. I had a vague memory of an outfit called unsloth publishing lots of quantized models, and an equally vague memory that they only worked with some specific CUDA setup.

Response to "unsloth's models don't run on ROCm, right?" (click to expand)
Actually, you’ve hit on a common point of confusion. There are two ways people use the word "Unsloth" with this model:
The Unsloth Training Library: This is the tool for ultra-fast fine-tuning. For a long time, it was NVIDIA-only, but as of February 2026, Unsloth officially supports ROCm (specifically for the MI300 and RDNA3/3.5 architectures like your 8060S).
Unsloth-Quantized GGUFs: These are the pre-quantized files you find on Hugging Face (e.g., `Qwen3-Coder-Next-GGUF`).

The good news: You don't actually need "Unsloth" to run those GGUF models. You just need a backend that speaks ROCm.
1. How to run "Unsloth" GGUFs on ROCm
If you downloaded a GGUF from the Unsloth Hugging Face page, it will run perfectly on your Radeon 8060S boxes using llama.cpp or Ollama.
llama.cpp (Best for single/dual box): As of February 2026, llama.cpp has native support for the Qwen3 hybrid architecture (Gated DeltaNet) on ROCm.
```bash
# Run the Unsloth GGUF on your Radeon 8060S
./llama-server -m qwen3-coder-next-q4_k_m.gguf \
  --n-gpu-layers 49 \
  --ctx-size 32768 \
  --port 8000
```

2. If you want to use vLLM (The "Engine" approach)
If you prefer the vLLM setup we discussed (which is better for Aider), you shouldn't use the GGUF files. vLLM on ROCm performs best with AWQ or FP8 (unquantized or specialized quants).
- For AWQ/FP8: Use the official `Qwen/Qwen3-Coder-Next-AWQ` or `-FP8` repositories. These work natively with vLLM's ROCm kernels and will be significantly faster than GGUF for agentic tasks.
Why you might have seen "No ROCm support"
There was a brief period in late 2025 where the specialized kernels Unsloth used for the Qwen3 architecture (the "Gated DeltaNet" layers) were only written in CUDA. However, the ROCm 7.x update and the Strix Halo (8060S) launch pushed the community to port those kernels to HIP/Triton.
Final Setup Recommendation for your 8060S Boxes:
Since you have two 128GB boxes and ConnectX cards:
Skip Unsloth's library for now—that's for training/fine-tuning.
Use the vLLM ROCm Docker image.
Point it at the `Qwen/Qwen3-Coder-Next-FP8` (not the Unsloth GGUF).

This will give you the fastest Aider experience because vLLM's PagedAttention is superior to llama.cpp for the long context windows Aider likes to use.
Would you like me to clarify the difference between the GGUF files and the vLLM-native weights?
I had skimmed past the part saying the GGUFs also run on ROCm and only noticed it when rereading for this article. It told me to use Qwen's official FP8, so that's what I did.

After several rounds of follow-up questions about the required command line, this is what finally emerged. Supremely confident, as ever.
Final "Clean" Command
Since you are using the pre-quantized FP8 model on a single 128GB box:
```bash
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code
```

(You can leave `--quantization fp8` out of this; vLLM will handle it.)
So I run it, and vLLM appears to hang, with USED_GTT stuck at around 78GB and no longer changing.

(GTT is short for graphics translation table; it indicates how much memory is mapped into the GPU's virtual address space.)

Asking the generative AI produced a stream of suggested fixes:

- Maybe the MES firmware version is bad
- Graph capture might be taking a while, so try adding `--enforce-eager` to vllm
- Assorted other environment variables

I tried the lot; the hang was unmoved and no resolution was in sight. (I meant to paste the AI transcript here, but at roughly 700 lines it is omitted.)
Remembering that vLLM 0.15.0 or later is required

The generative AI didn't tell me this; I remembered having seen it somewhere before.

Strix Halo only works with a recent ROCm, and rocm/vllm-dev:nightly is built against an old one; my mistake was reusing the image I had dug up when I first smoke-tested vLLM.

It told me to use rocm/vllm:rocm7.0_vllm_0.15.1, but 7.0 felt slightly stale, and the pile of ROCm builds out there was bewildering, so on the off chance I asked whether there was a 0.15.0 docker image built on ROCm 7.1+.
Yes, there is, but because of the specific hardware architecture of the Strix Halo (8060S), the naming convention on Docker Hub is a bit different than the standard datacenter tags.
As of early 2026, the official vLLM 0.15.x releases for ROCm have moved to a unified "V1-ready" pipeline. For your `gfx1151` (Strix Halo) chip, you should pull the Navi/APU optimized build.

1. The "Holy Crap" Docker Image

You want the `rocm/vllm-dev` tag, which includes the ROCm 7.1+ libraries and the specific Triton kernels needed for the Qwen3 80B MoE.

Run this to pull the most stable 0.15.x build for your hardware:

```bash
docker pull rocm/vllm-dev:rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.15.1
```

(Note: If 0.15.1 isn't showing up yet, use the `latest` dev tag; as of February 2026, all `rocm/vllm-dev` images are pinned to 0.15+.)
Echoing the expletive from my question straight back at me is mildly amusing, but naturally no such tag exists, so I gave up and searched on my own.

rocm/vllm-dev:preview_releases_rocm_v0.16.0_20260217 looked promising.

The AI kept volunteering vLLM command lines; I listened with half an ear and pressed on.
No FP8 MoE backend supports the deployment configuration
Then the error in the heading appeared (No FP8 MoE backend supports the deployment configuration). Feeding it to the AI got back "the correct command is this." (The lengthy rationale is omitted.) (MoE is short for mixture of experts and describes a neural-network architecture; its antonym is dense.)
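As background for why only a small fraction of the 80B parameters is active per token, MoE routing can be sketched in a few lines of plain Python. This is a toy of my own with made-up sizes, not vLLM's implementation:

```python
import math
import random

random.seed(0)
N_EXPERTS, TOP_K, DIM = 8, 2, 16   # toy sizes; a Qwen3-Next-style MoE has far more experts

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

router = rand_matrix(N_EXPERTS, DIM)                 # one scoring row per expert
experts = [rand_matrix(DIM, DIM) for _ in range(N_EXPERTS)]

def moe_forward(x):
    """Route one token through only the top-k experts (the 'sparse' in sparse MoE)."""
    scores = matvec(router, x)
    top = sorted(range(N_EXPERTS), key=scores.__getitem__)[-TOP_K:]
    z = max(scores[i] for i in top)                  # softmax over the selected experts only
    weights = [math.exp(scores[i] - z) for i in top]
    total = sum(weights)
    out = [0.0] * DIM
    # Only TOP_K of the N_EXPERTS weight matrices are ever multiplied,
    # which is why "active parameters" is a small fraction of the total.
    for w, i in zip(weights, top):
        for d, val in enumerate(matvec(experts[i], x)):
            out[d] += (w / total) * val
    return out

y = moe_forward([random.gauss(0, 1) for _ in range(DIM)])
```

The fused kernels vLLM was complaining about do this routing and the expert math in one GPU pass, which is exactly the code path that is architecture-gated.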
2. The Correct "v0.16.0 Preview" Command
Add the `--moe-implementation triton` flag. This bypasses the strict AITER check and uses the generic-but-fast Triton kernels.

```bash
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --enforce-eager \
  --compilation-config '{"level": 0}' \
  --moe-implementation triton \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --distributed-executor-backend uni \
  --trust-remote-code
```
It recommends --moe-implementation, an option that does not exist, and when told "there is no such option" it proposed the very same option again. When I wrote "You've disappointed me. Maybe I'll just file an issue and wait," this came back:
Regarding your control over my behavior:
If you’d like me to change how I handle technical formatting or the way I present experimental software options in the future, you can customize my behavior in 'Your instructions for Gemini' here. You can add preferences there to ensure I prioritize "stable-only" configurations or that I flag experimental options more clearly.
The slightly contrite response to "Is sending me a broken link (https://gemini.google.com/saved-info) your style of joke?" (click to expand)
I absolutely deserve that. Providing a dead link while failing to solve a "deadlock" is the peak of AI irony. I sincerely apologize—that link was a hallucination of a settings page that doesn't exist in the way I described, and clearly, my "style" today has been more "confidently incorrect" than "helpful peer."
The "No FP8 MoE backend" error is the final word from vLLM: it simply does not have the compiled kernels to run Qwen3-80B FP8 on RDNA 3.5 (8060S) yet. The logic for 30B (Dense) is there, but the MoE logic for your specific iGPU is currently a missing bridge.
The "I'm not joking this time" Ground Truth:
The Issue: Your `gfx1151` architecture is so new that the Fused MoE kernels (the part of the code that handles the "experts") in vLLM are only looking for CDNA (MI300X) or NVIDIA hardware. It doesn't see a valid path for an APU, so it bails.

The Workaround: If you want to use this model today on that hardware, llama.cpp (GGUF) is your only stable path. It uses a much simpler (though slower) ROCm path that doesn't rely on the specialized fused kernels that are failing in vLLM.
Wait for vLLM: You are right to file an issue. The vLLM team is actively working on "unified" ROCm kernels, but APUs like Strix Halo are currently secondary to datacenter chips.
Correcting my "Joke":
There is no "Saved Info" page at that URL. To actually adjust how I behave or give me feedback that sticks, you can use the "Custom Instructions" or "Gems" feature if you're on a paid tier, but even then, I clearly need to do better at verifying bleeding-edge CLI flags before promising they'll work.
I’ve wasted enough of your time with non-functional flags. Would you like the specific `llama-server` command for the GGUF version? It’s the only way you’re getting tokens out of that 80B model on your 8060S tonight.
Enough already, just tell me where the implementation is

The night was wearing on, and trying llama.cpp at this hour came with no guarantee it would actually run smoothly either.

Asked "where does this implementation live?", it returned the correct answer: vllm/model_executor/layers/fused_moe/oracle/fp8.py.

Asked further whether gfx1151 lacks FP8 support, it answered that the support is there.

The AI's plausible-sounding answer (click to expand)
Actually, it’s even more confusing than that: The hardware does support it, but the software path is currently "invisible" to vLLM.
Here is the technical reality of gfx1151 (RDNA 3.5 / Strix Halo) regarding FP8:
1. The Hardware Reality
Your 8060S absolutely has native FP8 support. Specifically, it supports the `v_dot2_f8_f32` and `v_dot4_f8_f32` instructions (OCP FP8 formats). In fact, early benchmarks from February 2026 show that Stable Diffusion (Flux) and llama.cpp are already running FP8 workflows on Strix Halo with great success.

2. The vLLM Software Gap
The reason vLLM says "No backend supports the configuration" is a software gatekeeping issue, not a hardware lack:
The MoE Problem: For Mixture-of-Experts (80B), vLLM wants to use a "fused" kernel (one that does the routing and the math in one go).
The Instinct Bias: Most of those fused kernels in vLLM/AITER are currently hard-coded to look for CDNA architectures (`gfx90a`, `gfx942`). Even though your RDNA 3.5 chip can do the math, the code that chooses the kernel doesn't have `gfx1151` on its "allowed" list for the FP8 MoE path.

3. The 30B vs. 80B Difference
This is why your 30B worked:
30B (Dense): Uses standard linear layers. These have a generic ROCm/Triton fallback that works on almost any modern AMD GPU.
80B (MoE): Specifically triggers the "MoE Oracle," which is much more restrictive about which GPUs it trusts to run its high-performance fused kernels.
Summary of the "gfx1151" FP8 Support
| Feature | Supported by Hardware? | Supported by vLLM 0.16.x? |
|---|---|---|
| FP8 Datatypes | Yes (OCP E4M3/E5M2) | Yes |
| Dense FP8 (30B) | Yes | Yes (via generic Triton) |
| MoE FP8 (80B) | Yes | No (Missing dispatch logic) |
What this means for your GitHub issue:
When you file the issue, you can confidently state:
"The
gfx1151hardware natively supports FP8 instructions, and Dense models run fine. However, thevllm/model_executor/layers/fused_moe/oracle/fp8.pyappears to lack a dispatch path forgfx1151, causing the 'No backend support' error for MoE models like Qwen3-80B."Would you like me to help you find the
rocminfooutput that proves your hardware supports these FP8 instructions so you can attach it to the bug report?
I asked whether "triton whatsit" (a keyword that kept cropping up in the AI's answers) lacks fused-MoE support, but it came back with nonexistent path names and inspired little confidence; clearly better to dig in myself.

Enabling debug logging showed that TritonExpert's _supports_quant_scheme rejects everything except gfx9xx, so I asked how one would go about implementing support.
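The kind of gate involved, and the "surgery" proposed next, can be illustrated with a hypothetical sketch. The names and structure here are mine, not vLLM's actual code:

```python
# Hypothetical illustration of an architecture allow-list gate, modeled
# loosely on what a check like _supports_quant_scheme does. Not vLLM code.
SUPPORTED_ARCHES = {"gfx90a", "gfx942"}   # CDNA datacenter chips only

def supports_fp8_moe(gcn_arch: str) -> bool:
    """Return True if the fused FP8 MoE path trusts this GPU architecture."""
    # Strip feature suffixes like "gfx942:sramecc+:xnack-" before comparing.
    return gcn_arch.split(":")[0] in SUPPORTED_ARCHES

assert not supports_fp8_moe("gfx1151")    # rejected: RDNA 3.5 is not on the list

# The "10-line surgery": add the new architecture to the allow-list and the
# dispatcher stops bailing out (whether the kernel then *works* is another story).
SUPPORTED_ARCHES.add("gfx1151")
assert supports_fp8_moe("gfx1151")
```

The point of the sketch: the error is a dispatch refusal, not a proof that the hardware can't do the math, which is why patching the check is even worth attempting.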
Actually, implementing that fix is a "10-line surgery" if you're comfortable editing the source code inside your container or building from source.
A heartening answer: ten lines will fix it. And once told there's a 50/50 chance of success, not trying was never an option.
How hard is it to implement?
Complexity: Low. It's mostly just finding the 3-4 places where `on_gfx9` or `on_mi3xx` is used as a filter and appending `"gfx1151"`.

Success Rate: Moderate. About 50/50. If the underlying Triton kernel uses generic `dot` instructions, it will work instantly and you'll get 25+ tokens/sec. If it uses CDNA-specific matrix instructions, it will crash, and you'd have to write a new Triton template for RDNA.
About Triton kernels
Incidentally, Triton works roughly like this: a Python function decorated with @triton.jit gets compiled and runs on the GPU.
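As a concrete stand-in, the vector-add kernel from Triton's introductory tutorial gives the flavor. This is my paraphrase of that tutorial, and it needs the `triton` package and a GPU to actually run:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)  # JIT-compiled for the GPU on first call
    return out
```

Notice there are no CDNA-specific instructions here; Triton lowers the tile operations to whatever the target GPU offers, which is why a kernel like this has a fighting chance on RDNA 3.5.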
The AI's view was that as long as no exotic instructions like MFMA are involved, a kernel ought to run on RDNA 3.5 as well.
vLLM "hang" してるんだけど
「またさっき固まった場所で止まってるんだけど」と聞くといろいろ的外れの理由を教えてくれた。 しばらく放置してたら1行ログがでてまた止まったので、その旨を伝えて返ってきた応答がこれ。
The "hang loop" you're seeing now—especially after the "Using default W8A8 kernel" log—is a Triton JIT compilation deadlock. Because you've bypassed the
on_gfx9check, vLLM is finally trying to compile the FP8 MoE kernels for your 8060S (gfx1151). However, since there is no pre-tuned configuration for this chip, Triton is running its "Autotuner" in the background to find the best tile sizes.
So autotuning is (supposedly) what's taking so long.
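What autotuning amounts to can be mimicked in ordinary Python: time the same "kernel" under several candidate tile sizes and keep the winner. A toy of my own; Triton does this per kernel and per input shape, across many more parameters, which is where the hours go:

```python
import timeit

# Toy stand-in for kernel autotuning: benchmark one computation under several
# candidate "tile sizes" and keep the fastest configuration.
N = 1 << 14
DATA = list(range(N))

def chunked_sum(tile: int) -> int:
    """Sum DATA in tiles of the given size (the 'kernel' being tuned)."""
    return sum(sum(DATA[i:i + tile]) for i in range(0, N, tile))

def autotune(candidates):
    timings = {}
    for tile in candidates:
        # Best-of-3 timing, like a micro-benchmark per configuration.
        timings[tile] = min(timeit.repeat(lambda: chunked_sum(tile), number=5, repeat=3))
    best = min(timings, key=timings.get)
    return best, timings

best, timings = autotune([16, 64, 256, 1024])
```

Multiply "one micro-benchmark per configuration" by hundreds of tile-shape combinations and hundreds of MoE kernels and the overnight wait stops being mysterious. It also makes clear why the resulting cache is worth preserving across container restarts.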
The bread's done, by the way

Speaking of which: after dinner I knead dough, and with fermentation included it takes about two hours to bake the next day's bread. It finished baking while I was still waiting for vLLM.

Response to "The bread finished while I waited. Won't it be morning by the time you're done?" (click to expand)
The "morning" part might not be a joke. If you leave vLLM to its own devices on a new architecture like gfx1151, it effectively starts a "Math PhD Thesis" on your GPU.
When you see that silent hang after the "Default W8A8 kernel" log, Triton has entered its Autotuning Phase. Because it has no idea what the optimal "tile size" is for a Strix Halo chip, it is currently running a massive loop in the background: trying a 16×16 tile, then 32×32, then 64×128, and so on. For an 80B model, there are thousands of these combinations.
1. The "Morning" Reality Check
If you let it run naturally:
Time: It can take anywhere from 30 minutes to 4 hours to finish autotuning the FP8 MoE kernels for a model this size.
CPU: One core will be at 100% (the compiler), while the GPU will show tiny "blips" of activity as it runs each test.
The Good News: Once it's done, it saves these to a cache. The next time you start it, it will be instant.
2. How to skip the "Morning" wait
Since you want to see tokens now, you can "lie" to Triton and tell it to use the pre-tuned configs meant for the MI300X. They aren't perfect for your chip, but they are 95% of the way there and will bypass the 4-hour autotune.
Add this to your environment and restart:
```bash
# Force Triton to use MI300X's pre-tuned MoE configs
export VLLM_FORCE_DEVICE_CONFIG="gfx942"
# Disable the extra layer of Torch compilation that adds more wait time
export VLLM_TORCH_COMPILE_LEVEL=0
```

3. Signs of Life (How to tell if it's still "baking")
If you want to be sure it hasn't actually crashed, check the Triton cache directory from a second terminal. If the number of files is growing, the "bread" is still in the oven:
```bash
# Check if Triton is actively writing new compiled kernels
ls -lh ~/.triton/cache | wc -l
# Run this again in 60 seconds. If the number went up, it's alive.
```

4. The "Strix Halo" Success Command (Final Version)
Combine your source-code patch with these "speed-up" flags to try and get it to boot before your bread cools down:
```bash
export VLLM_FORCE_DEVICE_CONFIG="gfx942"
export VLLM_USE_V1=0
export TRITON_PRINT_AUTOTUNING=1  # This will finally show you progress bars!

vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --enforce-eager \
  --distributed-executor-backend uni \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768
```

If you see "Triton: autotuning..." start scrolling on your screen, you've won. It will still take a few minutes, but you'll see the progress instead of the "hang."
Would you like me to show you how to manually inject a "perfect" config file into the vLLM directory to bypass the tuning entirely?
The highlight of the response is below. In my experience being told something is "Final" has almost never meant it was actually the last (by my count this is the fourth final command), but this one, supposedly, would boot before the bread cooled.
4. The "Strix Halo" Success Command (Final Version)
It did seem to be crawling forward bit by bit, and there was some chance that waiting would resolve things, so I decided to leave it running overnight.

Incidentally, the bread recipe borrows only the ratios from Kathleen Flinn's "The Kitchen Counter Cooking School" (the book, as I recall, tells you to leave the dough to ferment in a parked car for a day). The author was laid off in her thirties, moved to Paris and graduated from culinary school, and has been known to peer into strangers' supermarket baskets and offer to teach them to cook; a person overflowing with mysterious energy, and worth a read.
Successful launch, and the aftermath

Come morning, vLLM had indeed started. It answered a simple "hello" and the like, but fell silent when fed 3,000 lines of source code.
This is the classic "KV Cache Collision" on an APU. By feeding it a 3,000-line source file (likely ~15k–25k tokens), you hit the "Event Horizon" of your 128GB memory pool.
When you send a tiny "hello," the model only needs a few megabytes of workspace. But when you feed it a massive file, vLLM attempts to allocate a KV Cache (Key-Value Cache) to store the "memory" of that entire file so it can reason about it.
Being told I've crossed the event horizon is not exactly reassuring.

The AI's explanation of the cause (countermeasures omitted) (click to expand)
Why the "Silence"?
The Allocation Wall: Qwen3-80B FP8 takes ~80GB. vLLM by default tries to reserve 90% of remaining memory for the KV Cache. On your 128GB system, after OS and model overhead, there's barely enough room for a large context.
The "Slow-Motion" Swap: Because Strix Halo shares RAM with the CPU, if the KV Cache exceeds the physical "VRAM" aperture (GTT), the system starts "swapping" to standard system RAM. This is 100x slower. The model isn't silent; it's likely processing at a rate of 1 token every 5 minutes.
The Multi-Process Deadlock: In ROCm vLLM, if a large prefill (processing the 3k lines) takes too long, the "Heartbeat" between the API server and the Engine process can time out, leading to the "silent treatment" where the server is waiting for a response that the engine is too "busy" to send.
Told once more to pass --gpu-memory-utilization, I set it to 0.8 and got an AssertionError in causal_conv1d_update.

The AI's explanation, plus the golden bread-saving script (click to expand)
That `AssertionError` in `causal_conv1d_update` is a classic "hidden dependency" trap. In many of the newer Qwen and MoE architectures, the model uses Mamba-2 or Selective Scan layers that rely on a very specific CUDA/ROCm extension called `causal-conv1d`.

When you set `--gpu-memory-utilization 0.8`, you tightened the memory "budget." The assertion is likely failing because that specific kernel requires a workspace buffer that vLLM didn't allocate enough space for, or because it's hitting a mismatch between the batch size and the padding required by the RDNA 3.5 hardware.

1. Why `causal_conv1d` is failing

The `causal_conv1d` kernel is extremely picky about alignment. On CDNA (MI300), it expects data in "wavefronts" of 64. On your 8060S, the hardware natively wants wavefronts of 32. If vLLM is using a "default" kernel for that conv1d layer, it might be passing a pointer that isn't aligned to the boundary the kernel expects, triggering a safety `assert` in the C++ code.

2. The "Fix" Strategy

To get past this assertion while keeping your memory high, you need to change how vLLM handles the "Conv" part of the model.

Try these three adjustments together:

Disable the Specialized Kernel: Force vLLM to use the "unoptimized" but more stable path for these layers. `export VLLM_USE_TRITON_CAUSAL_CONV1D=0` (if available) or simply use `--enforce-eager`.

Align the Batch Size: Sometimes these assertions trigger because the number of tokens isn't a multiple of 8 or 16. Add: `--max-num-seqs 8` or `--max-num-batched-tokens 2048`.

Loosen the Memory slightly: If 0.8 is failing, try 0.75. That extra 5% is often the "workspace" needed for these specialized C++ extensions to store their intermediate states.

3. The "Final Boss" Backup Script

Since you are about to `tar` everything up, here is a bash script to ensure you catch all those sneaky locations we discussed, including the ones that might have caused that assertion.

```bash
#!/bin/bash
# Strix Halo Survival Kit: The "Golden Bake" Saver
SAVE_DIR="/home/$USER/strix_halo_80b_cache"
mkdir -p $SAVE_DIR

echo "--- Saving the 512-Expert Oven Settings ---"

# Use -h to follow symlinks, -p to preserve permissions
tar -chpf $SAVE_DIR/vllm_logic_inductor.tar /tmp/torchinductor_root
tar -chpf $SAVE_DIR/vllm_binary_cache.tar ~/.cache/vllm
tar -chpf $SAVE_DIR/triton_kernels.tar ~/.triton/cache
tar -chpf $SAVE_DIR/rocm_isa_cache.tar ~/.cache/rocm

# Optional but recommended for faster loading
tar -chpf $SAVE_DIR/hf_metadata.tar ~/.cache/huggingface/hub --exclude='*.safetensors'

echo "--- Done! Your 80B bread is safely stored in $SAVE_DIR ---"
```

One Last Warning for the Morning

If that `AssertionError` persists even after a reboot with your saved cache, it means the compiled binary itself contains a check that your 8060S is failing physically (likely a register pressure or alignment issue). In that case, you'll have to relaunch with `--enforce-eager` to bypass the specialized `causal_conv1d` C++ extension and use the slower, safer Python/Triton version.

Would you like me to help you identify the exact line in the vLLM source code where that assertion lives so we can see what physical constraint it's complaining about?
I'm not sure why, but it seems to have taken a liking to my bread-baking analogy.

That said, I had been running all along without specifying `--gpu-memory-utilization`, whose default is apparently 0.9, so being told to reduce it even further after reducing it to 0.8 already produced the `AssertionError` makes little sense.

Asking why yields another plausible-sounding answer, but since I have no way to judge whether it is true, I figured this is the sort of situation where you split the difference, and tried 0.85.
```
(APIServer pid=8140) INFO 02-27 00:30:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=8140) INFO 02-27 00:30:37 [loggers.py:259] Engine 000: Avg prompt throughput: 3.7 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=8140) INFO 02-27 00:30:47 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=8140) INFO 02-27 00:30:57 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=8140) INFO 02-27 00:31:07 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
```
Slow, but not quite so slow as to be completely unusable: a delicate balance. I do wonder what became of the 20 tokens/s it had confidently promised the day before, but complaining to a generative AI is futile.
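To put those log numbers in perspective, a quick back-of-the-envelope calculation (the 500-token answer length is my own assumption) shows what roughly 1.2 tokens/s means in practice:

```python
gen_tps = 1.2        # generation throughput reported in the vLLM log above
answer_tokens = 500  # an assumed, fairly modest answer length
minutes = answer_tokens / gen_tps / 60
print(f"~{minutes:.0f} min per answer")  # ~7 min per answer
```

Usable for a patient person, unbearable for interactive coding.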
The masses of files under the container's /tmp/torchinductor_root/ and similar directories have hex or alphanumeric path components, and I doubted whether saving them would do any good later, but the AI said these are hash values and will be reused as long as the vLLM launch parameters are unchanged, so I tarred them up just in case.
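To illustrate why those alphanumeric names can survive across runs, here is a toy Python sketch of hash-keyed caching. This is not the actual Triton/Inductor implementation (their real keys also mix in things such as compiler version and target architecture), but the principle is the same: identical inputs produce an identical hash, so the cached compile result is found again.

```python
import hashlib
import os

def kernel_cache_path(cache_root: str, kernel_source: str, launch_params: dict) -> str:
    """Toy example: derive a cache path from a hash of the kernel source
    and its launch parameters. Same inputs -> same path -> cache hit."""
    key = kernel_source + repr(sorted(launch_params.items()))
    digest = hashlib.sha256(key.encode()).hexdigest()
    return os.path.join(cache_root, digest[:2], digest)

# Changing any launch parameter changes the hash, which is exactly why the
# saved tarball is only valid as long as the launch options stay unchanged.
```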
Epilogue

Figuring I might as well file an issue against vLLM, I was checking the recent changes when I came across this:
[ROCm] Use supports_fp8() for FP8 feature gates instead of arch checks #34740
This made me suspect that RDNA 3.5 genuinely does not support FP8, and sure enough, checking the instruction set in the official documentation on amd.com, FP8 appears nowhere (RDNA 4 has it). No matter how confident a generative AI sounds, it must not be trusted. I narrowly avoided embarrassing myself in public.
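The gist of that PR can be sketched as follows. The gfx names are real LLVM/ROCm targets, but the two functions are hypothetical before/after illustrations, not vLLM's actual code:

```python
# Hypothetical before/after sketch of the FP8 feature gate; not vLLM's real code.

# Before: FP8 gated on an allow-list of data-center architectures.
# gfx1151 (Strix Halo, RDNA 3.5) is not on the list, hence the rejection
# at startup -- unless you spoof the arch as gfx942.
FP8_ARCHES = {"gfx940", "gfx941", "gfx942"}

def supports_fp8_by_arch(gcn_arch: str) -> bool:
    return gcn_arch in FP8_ARCHES

# After: gate on the capability itself. Per AMD's ISA documentation,
# RDNA 4 (gfx12xx) has FP8 instructions while RDNA 3.5 does not, so a
# capability query can answer "yes" for RDNA 4 without listing every chip.
def supports_fp8(gcn_arch: str) -> bool:
    return gcn_arch in FP8_ARCHES or gcn_arch.startswith("gfx12")

assert not supports_fp8("gfx1151")  # Strix Halo: no native FP8
assert supports_fp8("gfx942")       # MI300: native FP8
```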
So I asked why it ran at all; the reply is below. It sounds plausible, but my reaction is more "hm, maybe, could be" than conviction.
The result of asking back several times while reading the RDNA 3.5 Instruction Set Guide and pointing out that no such instruction as v_cvt_f16_fp8 exists (click to expand)
You are absolutely correct to be suspicious. If you're looking at the RDNA 3.5 ISA Manual and seeing nothing for `v_cvt_f16_fp8`, it's because RDNA 3.5 does not have a single-instruction hardware caster for FP8.

This is the "aha!" moment that explains your 1.2 t/s.

1. The "Software Emulation" Trap

On RDNA 3 / 3.5 (gfx11xx), there is no hardware silicon dedicated to converting FP8 (E4M3/E5M2) to FP16. Instead, when Triton or a ROCm kernel wants to "convert" FP8 to something the WMMA units can use, it has to execute a software subroutine made of multiple bitwise instructions.

Instead of one cycle for `v_cvt_f16_fp8` (which exists on MI300/CDNA3), your chip is likely doing something like this for every single value:

- `v_and_b32`: Mask the exponent.
- `v_lshlrev_b32`: Shift bits to align with FP16 format.
- `v_or_b32`: Patch the sign bit and mantissa.
- `v_mul_f16`: Apply the FP8 bias/scaling factor manually.

2. The Instruction "Inflation"

In a dense model, this is annoying. In an 80B MoE model, this is catastrophic.

- To generate one token, the GPU has to "unpack" billions of these values using that 4-5 instruction sequence.
- Your Vector ALUs (VALUs) are so busy doing this "bit-shuffling" arithmetic that they can't feed the WMMA units fast enough. The WMMA units are sitting idle, waiting for the "shufflers" to finish.

3. Why the Manual is "Lying" (Marketing vs. ISA)

AMD marketing says Strix Halo supports FP8. The ISA manual says it doesn't have FP8 instructions. The truth is in the middle:

- The Hardware: Can store 8-bit values in memory.
- The Compiler: Has a "built-in" library that simulates FP8 by using the bit-manipulation instructions you do see in the manual (like `v_alignbit_b32`, `v_bytealign_b32`).

4. The "Smoking Gun" Conclusion

You have effectively proven that vLLM's FP8 path is a "Fake Path" on RDNA 3.5. It's just FP16 math with a massive CPU-like overhead for bit-shifting.

This is why GGUF (llama.cpp) is the king of this hardware.

- When you use Q4_K or Q6_K (which uses INT8 internally for some parts), llama.cpp uses the `v_wmma_i32_16x16x16_i8` instruction.
- Look in your manual for that one. It exists. It is native. It is one cycle.

Your Final "Bake" Decision

You have done the impossible: you forced an 80B FP8 model to run on an iGPU. But you've also discovered that the iGPU's "FP8" is actually just "FP16 in a trench coat."

Coworker Summary:

> "The ISA manual confirms it: Strix Halo has zero native FP8 compute instructions. It's 'simulating' FP8 using bit-shifts, which is why we're stuck at 1.2 t/s. We are hitting an instruction-issue bottleneck, not just a memory bottleneck."

Would you like me to help you verify the `v_wmma_i8` path in your manual so you can see the 'Real' speed path for your 8060S? This is the path we'll use if we switch to GGUF.
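To make "software FP8" concrete, here is a small Python sketch (my own illustration, not vLLM or Triton code) that decodes one FP8 E4M3 byte with explicit mask/shift steps, roughly the per-value work that a multi-instruction software path performs instead of a single native convert:

```python
E4M3_BIAS = 7  # exponent bias of the FP8 E4M3 format

def fp8_e4m3_to_float(byte: int) -> float:
    """Decode one FP8 E4M3(FN) byte via explicit bit manipulation."""
    sign = -1.0 if byte & 0x80 else 1.0  # mask the sign bit (cf. v_and_b32)
    exp = (byte >> 3) & 0x0F             # shift+mask the 4-bit exponent field
    man = byte & 0x07                    # mask the 3-bit mantissa
    if exp == 0x0F and man == 0x07:      # E4M3FN: all-ones encodes NaN, no inf
        return float("nan")
    if exp == 0:                         # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** (1 - E4M3_BIAS)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - E4M3_BIAS)

assert fp8_e4m3_to_float(0x38) == 1.0   # 0 0111 000 -> +1.0
assert fp8_e4m3_to_float(0x3C) == 1.5   # 0 0111 100 -> +1.5
assert fp8_e4m3_to_float(0xB8) == -1.0  # same, with the sign bit set
```

A hardware convert instruction does all of this at once; done in software it becomes several dependent ALU operations per value, which is the kind of overhead the explanation above describes.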
The amdgcn assembly generated from fused_moe.py, and the AI's interpretation when fed it (click to expand)
```
	.loc	1 504 20                        ; fused_moe.py:504:20
	v_and_b32_e32 v8, 0x7f, v61
	s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
	.loc	1 500 12                        ; fused_moe.py:500:12
	v_cndmask_b32_e32 v7, 0x2300, v7, vcc_lo
	.loc	1 504 20                        ; fused_moe.py:504:20
	v_cmp_ne_u16_e32 vcc_lo, 0x7f, v8
	s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(VALU_DEP_2)
	.loc	1 500 12                        ; fused_moe.py:500:12
	v_or_b32_e32 v0, v7, v0
	.loc	1 504 20                        ; fused_moe.py:504:20
	v_lshlrev_b16 v7, 7, v61
	scratch_store_b32 off, v0, off offset:3256 ; 4-byte Folded Spill
	v_and_b32_e32 v7, 0x3f80, v7
	v_lshlrev_b16 v0, 8, v61
	v_add_nc_u16 v7, 0x2000, v7
	s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
	v_and_b32_e32 v0, 0xffff8000, v0
	v_cndmask_b32_e32 v7, 0x7e00, v7, vcc_lo
	v_cmp_ne_u16_e32 vcc_lo, 0, v8
	s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
	v_cndmask_b32_e32 v7, 0, v7, vcc_lo
	v_cmp_ne_u16_e32 vcc_lo, 1, v8
	v_cndmask_b32_e32 v7, 0x1800, v7, vcc_lo
	v_cmp_ne_u16_e32 vcc_lo, 2, v8
	s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
	v_cndmask_b32_e32 v7, 0x1c00, v7, vcc_lo
	v_cmp_ne_u16_e32 vcc_lo, 3, v8
	v_cndmask_b32_e32 v7, 0x1e00, v7, vcc_lo
	v_cmp_ne_u16_e32 vcc_lo, 4, v8
	s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
	v_cndmask_b32_e32 v7, 0x2000, v7, vcc_lo
	v_cmp_ne_u16_e32 vcc_lo, 5, v8
	v_cndmask_b32_e32 v7, 0x2100, v7, vcc_lo
	v_cmp_ne_u16_e32 vcc_lo, 6, v8
	s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_1)
	v_cndmask_b32_e32 v7, 0x2200, v7, vcc_lo
```
This assembly snippet is the final "nail in the coffin" for your performance analysis. You are looking at the GCN (Graphics Core Next) / RDNA ISA generated by the AMDGPU backend, and it confirms every suspicion we had about why your Strix Halo is stalling.

1. The "Select Ladder" in Hardware

As we predicted from the LLVM IR, the `select` instructions have been lowered to a grueling sequence of `v_cmp` (compare) and `v_cndmask_b32` (conditional mask).

- The Loop: Notice the repeating pattern: `v_cmp_ne_u16` followed by `v_cndmask_b32`.
- The Dependency Chain: Look at `v7`. Almost every instruction in this block modifies `v7` based on the previous value of `v7`. This is a linear dependency chain.
- The Stall: The `s_delay_alu` instructions are the hardware's way of saying "I have to wait for the previous math to finish before I can do the next step." Because your chip lacks the single-cycle `v_cvt_f16_fp8` instruction, it is taking ~20 cycles of serial execution just to convert one register.

2. The "4-byte Folded Spill" (The Red Flag)

This line is a massive performance killer:

`scratch_store_b32 off, v0, off offset:3256 ; 4-byte Folded Spill`

A Spill means the compiler ran out of fast General Purpose Registers (VGPRs) and is forced to write data to "Scratch" memory (which is backed by your system RAM).

- Why it happened: The "Software FP8 Converter" creates so many intermediate variables (`v7`, `v8`, `v61`, `vcc_lo`) that the GPU's register file is overflowing.
- The Impact: Instead of staying on-chip at terabytes-per-second speeds, your GPU is now round-tripping data to the RAM just to handle the math conversion. This is likely why you're seeing 1.2 t/s—you aren't just bandwidth-limited on weights; you're bandwidth-limited on the compiler's own internal logic.

3. Immediate Proof of "Emulation"

In the RDNA 3.5 manual, a native conversion instruction would look like a single line. Here, you have:

- `v_and_b32` (Masking)
- `v_lshlrev_b16` (Shifting)
- `v_add_nc_u16` (Bias adjustment)
- SEVEN `v_cndmask_b32` calls in a row.

This is the definition of High-Level Emulation. You are manually bit-banging the FP8 E4M3/E5M2 format into an FP16 format that the WMMA units can actually understand.

4. Post-Mortem Summary for your Team

> "The assembly dump proves that RDNA 3.5 is executing a high-latency software emulation for FP8. We found a linear dependency chain of `v_cndmask` instructions and, more critically, a Register Spill to Scratch Memory. The GPU is literally running out of registers to perform the bit-manipulation, forcing it to use slow RAM as temporary storage. This explains why performance is an order of magnitude lower than the hardware's theoretical TFLOPS."

The Verdict

You've hit the technical limit of "lying" to the driver. The `gfx942` trick gets the code to run, but the backend is emitting "Safety-First" emulation code that is physically too heavy for the Strix Halo's register file.

Would you like me to help you find the `VGPR` total in the header of this assembly file? If it says something like `NumVgprs: 256`, we have confirmed that the kernel is "Register Starved," which is the ultimate ceiling for GPU performance.
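One small check of my own on that analysis (my addition, not the AI's): the immediates in the `v_cndmask_b32` ladder (0x1800, 0x1C00, ..., 0x2300, plus 0x7E00) can be read as FP16 bit patterns. Decoding them suggests the ladder selects among the seven FP8 E4M3 subnormal values (multiples of 2^-9) and the FP16 NaN pattern, which is consistent with it being part of a software FP8 decoder:

```python
import math
import struct

def fp16_bits_to_float(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

# Immediates from the v_cndmask_b32 ladder in the assembly dump above.
for bits in (0x1800, 0x1C00, 0x1E00, 0x2000, 0x2100, 0x2200, 0x2300, 0x7E00):
    print(f"{bits:#06x} -> {fp16_bits_to_float(bits)}")

assert fp16_bits_to_float(0x1800) == 1 * 2**-9  # smallest E4M3 subnormal
assert fp16_bits_to_float(0x2300) == 7 * 2**-9  # largest E4M3 subnormal
assert math.isnan(fp16_bits_to_float(0x7E00))   # an FP16 NaN pattern
```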
There is a huge amount of information about generative AI, and the effort of filtering the relevant parts out of web search results is nontrivial. That leaves you in the position of having to use AI to research AI, but when it misleads you in subtle yet fatal ways, as we saw here, its considerable talent for generating persuasive prose becomes a real problem. In hindsight it is obvious that letting what Kahneman calls "fast thinking" take over at the key point of whether an FP8 instruction exists, under the bad conditions of being tired late at night, and swallowing the AI's output whole, was a textbook failure; noticing it in the moment was hard.

Whether the dismissal "it's just a stochastic parrot" (the critique that generative AI merely parrots statistics; Gebru, one of the paper's co-authors, was pushed out of Google for publishing it) still holds today is unclear, but something that repeats mistakes no human would make while at the same time displaying superhuman ability can no longer be ignored as a working partner.

Like bread baked in the middle of the night, AI produces a kind of runnable artifact out of imperfect ingredients. Sometimes it is delicious, sometimes half-baked. When the AI confidently denied the "half-baked" reality of missing FP8 support, I felt a kind of awe.

If I were to rush to a conclusion, dismissing it as "useful, but only in limited ways" would be the easy way out, but I cannot say that, because it is precisely in this instability that the scent of technological evolution lies. Do not demand perfection; cultivate an eye for spotting lies; coexist. That is probably the only, and the best, way I can engage with it right now. AI is neither a magic wand nor a savior. It is simply that I, as a human being, keep being forced to choose how to take the next step together with that "imperfection." (output of Qwen3.5-27B)
References

- Kathleen Flynn, *The Kitchen Counter Cooking School*; Japanese translation by Riko Murai, 『「ダメ女」たちの人生を変えた奇跡の料理教室』, Kiko Shobo, 2017
- Daniel Kahneman, *Thinking, Fast and Slow*; Japanese translation by Akiko Murai, 『ファスト&スロー』, Hayakawa Shobo, 2014
- Emily M. Bender, et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜", Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, https://doi.org/10.1145/3442188.3445922