AMD RX 9070 XT on Linux: Is 22s/Iter Normal for LoRA Training?

When a Reddit user posted benchmarks from a LoRA training run on an AMD Radeon RX 9070 XT under Linux, reporting approximately 22.25 seconds per iteration over 3,000 steps, the AI community took notice. With a total training time of roughly 16 hours, the user questioned whether this performance was typical—or if underlying bottlenecks were crippling the GPU’s potential. According to the original post on r/StableDiffusion, the setup included Z-Image Turbo (Tongyi-MAI/Z-Image-Turbo), 4-bit quantization for both transformer and text encoder, BF16 precision, AdamW8Bit optimizer, batch size of 1, and resolution buckets of 512x512 and 1024x1024, trained on a dataset of 30 high-resolution images (1224x1800).

While the RX 9070 XT is positioned as AMD’s flagship consumer GPU for 2025—according to Newegg’s product listing—it remains unclear whether the Linux software stack, particularly for AI workloads, is mature enough to fully leverage its architecture. Unlike NVIDIA’s CUDA ecosystem, which has dominated deep learning for over a decade, AMD’s ROCm (Radeon Open Compute) platform is still catching up in terms of library support, driver stability, and community tooling. This gap is especially pronounced in the realm of diffusion model fine-tuning, where frameworks like diffusers, accelerate, and k-diffusion are predominantly optimized for NVIDIA hardware.

Expert analysis suggests that 22 seconds per iteration is below expectations for a hypothetical 20GB+ memory, 192-bit memory bus, and 160+ TFLOPS compute-capable GPU like the RX 9070 XT. For context, comparable NVIDIA RTX 4090 systems on Linux with similar quantization and batch settings typically achieve 8–12 seconds per iteration under optimized ROCm-like environments via PyTorch 2.4+ and AMD’s latest ROCm 6.1. Even with the added overhead of 4-bit quantization and BF16 precision, a 22-second iteration implies a significant underutilization of the GPU’s computational resources.

Several potential bottlenecks may explain the sluggish performance. First, the use of AdamW8Bit—a memory-efficient optimizer designed for NVIDIA’s 8-bit quantization libraries—may not be fully optimized for AMD’s hardware, leading to CPU-GPU synchronization delays or inefficient memory bandwidth usage. Second, the high-resolution input images (1224x1800) with resolution bucketing may be causing excessive data preprocessing overhead on the CPU, especially if the system lacks sufficient RAM or fast NVMe storage. Third, Linux kernel and driver versions may not yet fully support the RX 9070 XT’s RDNA 4 architecture, leading to suboptimal compute kernel scheduling.

Additionally, the Z-Image Turbo model, while powerful, is relatively new and may not yet have ROCm-optimized attention kernels or fused operations that would accelerate training. Community reports suggest that even on NVIDIA hardware, this model benefits from flash attention and xformers, neither of which currently have stable ROCm equivalents. Without these optimizations, the model may default to slower, non-fused attention implementations, dramatically increasing per-step latency.

For users seeking optimal performance on AMD hardware, experts recommend: (1) ensuring ROCm 6.1+ is installed with compatible Linux kernel (6.8+), (2) switching to PyTorch 2.4 compiled with ROCm support, (3) reducing image resolution to 768x768 or using a lower-res pre-processing pipeline, and (4) testing with AdamW instead of AdamW8Bit to isolate optimizer-related bottlenecks. Until AMD delivers full compatibility with the latest AI frameworks, users may find their RX 9070 XT performing closer to a mid-tier GPU than its specs suggest.

As AI training increasingly shifts toward diverse hardware ecosystems, the performance gap between NVIDIA and AMD on Linux remains a critical barrier. For now, while the RX 9070 XT boasts impressive raw power, its real-world AI throughput depends heavily on software maturity—not just silicon.

AI-Powered Content

Sources: www.newegg.com • www.reddit.com

AMD RX 9070 XT on Linux: Is 22s/Iter Normal for LoRA Training?

AMD RX 9070 XT on Linux: Is 22s/Iter Normal for LoRA Training?

summarize3-Point Summary

psychology_altWhy It Matters

AMD RX 9070 XT on Linux: Is 22s/Iter Normal for LoRA Training?

AI Terms in This Article

recommendRelated Articles

7 Essential Advanced SQL Window Functions for Data Scientists in 2026

Hyprland Configuration: AI Codex Experiment 2026 Reveals Capabilities & Limits

7 Critical Production Choices AI Engineers Must Make After Deployment in 2026