GPT-2 XL Generates the Bad Apple Video Through Its Attention Maps, Without Ever Training on Images
In a novel fusion of machine learning and internet meme culture, a researcher has coaxed a frozen GPT-2 XL model into rendering the iconic 'Bad Apple' music video through its attention maps, despite the model never having seen an image. The result demonstrates the surprising representational capacity of transformer architectures.

In a groundbreaking demonstration of neural network adaptability, researcher Braye Valerien has rendered the full Bad Apple music video using only the attention maps of a frozen GPT-2 XL language model, a system trained exclusively on text and never exposed to visual data. The project, titled "Bad Apple but it's GPT-2 XL Attention Maps," optimizes learnable input embeddings to shape the model’s internal attention patterns, turning them into a coherent video sequence. According to Valerien’s detailed blog post, the achievement underscores the latent representational power of transformers, even when repurposed far beyond their original design.
The technique bypasses traditional image generation models entirely. Instead of training a diffusion model or GAN on pixels, Valerien optimized a 256×1600 tensor of learnable embeddings per video frame, feeding it into GPT-2 XL’s input layer while keeping all model weights frozen. The goal: make the attention weights of a single head in the first transformer layer (head 0, layer 0) match the pixel intensities of each Bad Apple frame. Crucially, the model was never shown an image; it processed only optimized embedding vectors, yet its attention mechanism, designed to weigh relationships between tokens, was steered into tracing recognizable visual patterns.
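To make the shape of the setup concrete, here is a minimal sketch (not Valerien's actual code) of the core idea: a learnable 256×1600 embedding tensor is pushed through a frozen GPT-2 XL via the Hugging Face transformers API, and the layer-0/head-0 attention map is pulled toward a target frame. The optimizer choice, learning rate, step count, initialization scale, and the random stand-in target are all assumptions.

```python
# Minimal sketch (not Valerien's actual code): optimize a learnable 256x1600
# embedding tensor so that the layer-0 / head-0 attention map of a frozen
# GPT-2 XL moves toward a 256x256 target frame.
import torch
from transformers import GPT2Model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2Model.from_pretrained("gpt2-xl").to(device).eval()
for p in model.parameters():
    p.requires_grad_(False)                      # all model weights stay frozen

seq_len, hidden = 256, model.config.n_embd       # n_embd == 1600 for GPT-2 XL
# Placeholder target; in practice each Bad Apple frame would be resized to
# 256x256 and normalized so it can play the role of an attention map.
target = torch.rand(seq_len, seq_len, device=device)

emb = torch.nn.Parameter(0.02 * torch.randn(1, seq_len, hidden, device=device))
opt = torch.optim.Adam([emb], lr=1e-2)           # hyperparameters are guesses

for step in range(500):
    out = model(inputs_embeds=emb, output_attentions=True)
    attn = out.attentions[0][0, 0]               # layer 0, batch 0, head 0 -> (256, 256)
    loss = torch.nn.functional.mse_loss(attn, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that this naive version applies the loss to the post-softmax attention weights; the trick described next changes exactly that.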
To achieve high-fidelity reconstruction, Valerien employed a novel loss function: mean squared error (MSE) applied directly to the pre-softmax attention scores (the logits), rather than to the post-softmax attention weights. This yielded approximately 250 times stronger gradients, accelerating convergence. The optimization was run from three random seeds (multi-start), keeping the best result and refining it iteratively. Post-processing included per-row z-score normalization, Gaussian blurring to smooth artifacts, and a magma colormap for enhanced visual clarity, transforming raw attention matrices into recognizable video frames.
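The pre-softmax scores are not returned by the standard output_attentions path, so a sketch of the idea has to recompute them from the first block's query/key projections. The routine below mirrors GPT-2's head layout and scaling but is our own reconstruction: it reuses the model and emb variables from the sketch above, omits the causal mask, and uses a placeholder target in logit space; Valerien's blog post contains the actual derivation.

```python
import torch
import torch.nn.functional as F

def head0_attention_logits(model, emb):
    """Pre-softmax attention scores for layer 0, head 0 (causal mask omitted)."""
    seq_len, hidden = emb.shape[1], emb.shape[2]
    head_dim = hidden // model.config.n_head
    pos = model.wpe(torch.arange(seq_len, device=emb.device))  # position embeddings
    x = model.h[0].ln_1(emb + pos)                             # pre-attention LayerNorm
    q, k, _ = model.h[0].attn.c_attn(x).split(hidden, dim=2)   # fused Q/K/V projection
    q0, k0 = q[..., :head_dim], k[..., :head_dim]              # head 0 = first head_dim channels
    return (q0 @ k0.transpose(-1, -2)) / head_dim ** 0.5       # (1, 256, 256) scores

# Placeholder: the real targets come from mapping frame pixels into logit space.
target_logits = torch.randn(1, 256, 256, device=emb.device)
loss = F.mse_loss(head0_attention_logits(model, emb), target_logits)
```

The post-processing stage might then look roughly like the following, assuming NumPy, SciPy, and Matplotlib; the blur radius is a guess.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
import matplotlib.pyplot as plt

def to_frame(attn: np.ndarray) -> np.ndarray:
    """Turn a (256, 256) attention matrix into an RGB frame."""
    z = (attn - attn.mean(axis=1, keepdims=True)) / (attn.std(axis=1, keepdims=True) + 1e-8)  # per-row z-score
    z = gaussian_filter(z, sigma=1.0)                   # smooth optimization artifacts
    z = (z - z.min()) / (z.max() - z.min() + 1e-8)      # rescale to [0, 1] for the colormap
    return (plt.cm.magma(z)[..., :3] * 255).astype(np.uint8)  # magma colormap, drop alpha
```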
The entire sequence of 3,286 frames was generated in roughly 12 hours on an RTX 5070 Ti with only 4.5 GB of VRAM, demonstrating remarkable efficiency. The project’s code and full mathematical derivation are publicly available on GitHub, inviting replication and further exploration. Valerien notes that while the result is intentionally whimsical — a modern twist on the long-standing "Can it run Doom?" meme — it reveals profound insights into how transformers encode and manipulate high-dimensional representations. "We’re not teaching the model to see," he writes, "we’re asking it to forget language and remember something it was never meant to understand. And somehow, it did."
This experiment has sparked significant discussion in the AI research community, particularly among those studying attention mechanisms and emergent behavior in language models. Experts caution against overinterpreting the results as evidence of "vision" in LLMs, but acknowledge that the project illuminates the flexibility of transformer architectures to map arbitrary objectives onto their internal states. The Bad Apple video, once a symbol of early digital creativity, now stands as a metaphor for the unexpected capacities of artificial intelligence when pushed beyond conventional boundaries.
Valerien’s work is not merely a novelty — it is a conceptual experiment that challenges assumptions about what neural networks can represent. As researchers continue to probe the boundaries of transformer models, projects like this remind us that even the most text-bound systems may harbor hidden, visually interpretable geometries waiting to be uncovered — not through training on images, but through the clever manipulation of their latent spaces.