Why WAN Loras Struggle with Facial Likeness Despite High-Quality Datasets
A Stable Diffusion user reports consistent failure to achieve facial accuracy with WAN-based LoRAs, despite success with Hunyuan on identical datasets. Experts suggest architectural and training dynamics may be at play.

Despite meticulous dataset preparation and extensive hyperparameter tuning, users training character LoRAs on the WAN (WAN2.1_T2V_14B) diffusion model are encountering persistent challenges in capturing facial likeness—a problem conspicuously absent when using competing models like Hunyuan. The issue, detailed in a Reddit thread by a user known as /u/frogsty264371, highlights a growing disparity in how different diffusion architectures respond to fine-tuning, even under identical conditions.
The user’s dataset, comprising 50–100 high-quality 640x640 images with 80% medium close-ups, consistent front lighting, and green-screen backgrounds, was deemed optimal for facial training. Yet, despite achieving strong body and clothing fidelity, WAN-based LoRAs consistently failed to reproduce accurate facial features. The user experimented with a learning rate of 1e-4, network dimensions (rank) up to 64, an alpha of 32, a dropout rate of 0.1, and varied captioning strategies, including unique trigger tokens and gendered names, without success. In contrast, the same dataset produced excellent facial fidelity when trained with Hunyuan Video, suggesting the root cause lies not in data quality but in model architecture or training dynamics.
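For reference, the reported setup can be collected into a single configuration sketch. The key names below are illustrative and trainer-agnostic; they do not correspond to any specific LoRA trainer’s schema, and only the values come from the thread:

```python
# Hypothetical, trainer-agnostic summary of the settings reported in the
# thread. Key names are illustrative assumptions, not a real config format.
config = {
    "base_model": "WAN2.1_T2V_14B",
    "resolution": (640, 640),
    "dataset_size_range": (50, 100),   # 50-100 images, 80% medium close-ups
    "learning_rate": 1e-4,
    "network_dim": 64,                 # LoRA rank (tried up to 64)
    "network_alpha": 32,
    "dropout": 0.1,
    "mixed_precision": "bf16",
    "gradient_checkpointing": True,
    "discrete_flow_shift": 1.0,
}
```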
Experts in generative AI training suggest that WAN’s underlying diffusion architecture, particularly its transformer-based DiT (Diffusion Transformer) backbone, may prioritize global coherence over fine-grained facial detail during the training process. Unlike Hunyuan, which was explicitly optimized for human-centric video generation with enhanced attention mechanisms on facial regions, WAN’s primary design goal is text-to-video consistency across broader temporal sequences. This architectural bias may cause the model to allocate fewer latent resources to facial feature reconstruction, even when LoRA adapters are applied. As noted in similar contexts of fine-tuning challenges, models with broader objectives often underperform on narrow, high-fidelity tasks unless explicitly constrained or guided.
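One concrete way to impose such a constraint is to up-weight facial pixels in the reconstruction loss. The sketch below is a minimal, framework-free illustration of a face-weighted MSE; the mask format and the `face_weight` value are assumptions for illustration, not part of any published WAN recipe:

```python
def face_weighted_mse(pred, target, face_mask, face_weight=4.0):
    """MSE in which pixels inside a (hypothetical) face mask count more.

    pred, target: flat lists of predicted / target pixel values.
    face_mask: flat list of 0/1 flags marking facial pixels.
    face_weight: how much more a facial pixel's error counts (assumed value).
    """
    total, weight_sum = 0.0, 0.0
    for p, t, m in zip(pred, target, face_mask):
        w = face_weight if m else 1.0
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum

# The same error magnitude is penalized more when it falls on the face:
err_on_face = face_weighted_mse([1.0, 0.0], [0.0, 0.0], [1, 0])
err_off_face = face_weighted_mse([0.0, 1.0], [0.0, 0.0], [1, 0])
```

With the assumed weight of 4.0, a unit error on the face costs 0.8 versus 0.2 elsewhere, nudging gradient updates toward facial reconstruction.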
Further analysis suggests that WAN’s use of flux-shifted timesteps and a discrete flow shift of 1.0 may inadvertently smooth out high-frequency facial details during the denoising process. These techniques, beneficial for motion coherence in video generation, can suppress subtle texture variations critical for identity preservation. Additionally, the user’s use of gradient checkpointing and bf16 mixed precision, while memory-efficient, may reduce numerical precision in gradient updates for small, localized features like eyes, nose contours, and skin texture—areas where Hunyuan’s training pipeline likely retains higher fidelity.
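The precision point can be made concrete. The snippet below simulates bfloat16 rounding (8 exponent bits, only 7 explicit mantissa bits) in pure Python and shows a small weight update vanishing entirely when the result is stored in bf16. Note that many trainers keep fp32 master weights precisely to avoid this, so treat this as an illustration of the failure mode rather than a claim about any specific pipeline:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value.

    bfloat16 keeps float32's 8 exponent bits but only 7 explicit mantissa
    bits, so increments below roughly 1/256 of a weight's magnitude are
    rounded away. (Simple round-to-nearest; ties-to-even is not handled.)
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000  # round off the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

weight, update = 0.5, 1e-5               # a small, localized gradient step

stepped_fp32 = weight - update           # fp32 retains the update
stepped_bf16 = to_bf16(weight - update)  # bf16 rounds it back to 0.5
```

Here the bf16-stored weight is bit-identical to the original, i.e., the gradient step on that parameter is silently lost.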
Another critical factor may be the absence of facial-specific regularization. While the user employed standard LoRA configurations, no evidence suggests the use of facial landmark guidance, perceptual loss layers, or identity-preserving contrastive losses—features commonly integrated into professional portrait-tuning pipelines. Hunyuan’s training data, likely enriched with facial attention masks and identity embeddings, may inherently encode these constraints, whereas WAN’s open-source training recipes do not emphasize them.
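An identity-preserving loss of the kind described above typically compares face-recognition embeddings of the generated and reference images. The following is a minimal sketch, assuming a hypothetical frozen face recognizer (ArcFace-style) that produces embedding vectors; the blending weight `lambda_id` is an assumed value:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identity_loss(gen_embed, ref_embed):
    """1 - cos(theta): zero when the generated face's embedding (from a
    hypothetical frozen recognizer such as ArcFace) matches the reference
    identity embedding."""
    return 1.0 - cosine_similarity(gen_embed, ref_embed)

def total_loss(diffusion_loss, gen_embed, ref_embed, lambda_id=0.1):
    # Blend the usual denoising objective with the identity term.
    return diffusion_loss + lambda_id * identity_loss(gen_embed, ref_embed)
```

The identity term adds nothing when the generated face already matches the reference, and grows toward `lambda_id * 2` as the embeddings point in opposite directions.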
Recommendations for users facing similar issues include: (1) integrating a facial-aware loss function such as ArcFace or FaceNet into the training loop; (2) using a lower network rank (e.g., 32) with higher alpha (e.g., 64) to emphasize weight sensitivity over capacity; (3) augmenting training with facial-centric prompts and negative prompts excluding generic face descriptors; and (4) experimenting with a hybrid approach—training a base WAN LoRA for body/clothing and a separate, smaller Hunyuan LoRA for facial refinement, then blending outputs during inference.
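Recommendation (2) has a simple numerical rationale: in the standard LoRA formulation, the learned low-rank update is scaled by alpha / rank before being added to the base weights, so the rank and alpha pair controls how strongly the adapter speaks at inference:

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Effective multiplier applied to the low-rank update in the standard
    LoRA formulation: delta_W = (alpha / rank) * B @ A."""
    return alpha / rank

reported = lora_scale(alpha=32, rank=64)   # 0.5: the learned update is damped
suggested = lora_scale(alpha=64, rank=32)  # 2.0: the learned update is amplified
```

Moving from the reported rank-64/alpha-32 setting to rank-32/alpha-64 quadruples the effective scale, which may help a likeness that the adapter has learned but expresses too weakly.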
This case underscores a broader truth in AI model fine-tuning: dataset quality alone is insufficient. The interplay between model architecture, training objective, and fine-tuning methodology determines success. As generative models grow more specialized, users must tailor not just their data, but their entire training philosophy to match the model’s inherent biases.