
Hidden State Analysis Reveals Distinct AI 'Personalities' in Open-Source LLMs

A novel study probing the hidden states of six popular open-source language models reveals each possesses a consistent, measurable behavioral fingerprint. The research identifies distinct 'personalities,' from DeepSeek's enthusiastic verbosity to Llama's eerie neutrality, and uncovers 'dead zones' where models resist certain behavioral instructions.

By The AI Observer
December 2024

A technical investigation has provided compelling evidence that large language models (LLMs) develop consistent behavioral patterns akin to distinct personalities, and that these traits can be measured by directly analyzing the models' internal computational states. According to a study shared on the r/LocalLLaMA subreddit, an independent researcher developed a novel tool to probe the hidden states of six prominent open-source models, revealing a spectrum of behavioral fingerprints from the "enthusiastic explainer" to the "eerily neutral."

The Behavioral Fingerprint of AI

The researcher, operating under the username yunoshev, built a system that extracts and projects hidden states—the internal numerical representations a model generates while processing text—onto seven defined behavioral axes. These axes include Warm ↔ Cold, Patient ↔ Irritated, Confident ↔ Cautious, Proactive ↔ Reluctant, Empathetic ↔ Analytical, Formal ↔ Casual, and Verbose ↔ Concise.

"The idea is simple," the researcher explained in the post. "If you ask a model to 'be warm' vs 'be cold', the hidden states differ. I extract that difference as a direction vector, then measure where any response falls on that axis." The methodology involved calibration using 30 neutral questions with contrasting style instructions, collecting hidden states from the last four neural network layers, and validating results across multiple independent sets with high test-retest reliability (mean intraclass correlation coefficient, ICC, of 0.91–0.99).
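The post does not include code, but the described idea can be sketched roughly as follows. This is a hedged illustration with synthetic data: `extract_axis` and `project` are hypothetical names, not functions from the released tool, and real hidden states would be pooled from the model's last few layers rather than generated randomly.

```python
import numpy as np

def extract_axis(warm_states: np.ndarray, cold_states: np.ndarray) -> np.ndarray:
    """warm_states, cold_states: (n_samples, hidden_dim) arrays of pooled
    hidden states collected under contrasting instructions ('be warm' vs
    'be cold'). The axis is the normalized difference of the two means."""
    direction = warm_states.mean(axis=0) - cold_states.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project(state: np.ndarray, axis: np.ndarray) -> float:
    """Scalar position of one response along the axis; positive values lean
    toward the 'warm' pole, negative toward 'cold'. Relative, not absolute."""
    return float(state @ axis)

# Toy demonstration with synthetic 4-dimensional states (30 calibration
# questions per pole, mirroring the study's calibration set size).
rng = np.random.default_rng(0)
warm = rng.normal(loc=[1, 0, 0, 0], scale=0.1, size=(30, 4))
cold = rng.normal(loc=[-1, 0, 0, 0], scale=0.1, size=(30, 4))
axis = extract_axis(warm, cold)
print(project(warm[0], axis) > 0)  # a warm sample lands on the warm side
print(project(cold[0], axis) < 0)  # a cold sample lands on the cold side
```

The same projection can then score any uninstructed response, which is how a model's baseline profile across the seven axes would be measured.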

A Spectrum of Model Personalities

The analysis of six models—all in the 7B to 9B parameter range—yielded clear and distinct profiles, visible even without any system prompt steering their behavior.

  • DeepSeek 7B emerged as the most extreme, scoring at the ceiling for verbosity, confidence, and proactivity. The researcher characterized it as "the enthusiastic friend who over-explains everything."
  • Llama 3.1 8B presented the flattest, most neutral profile, with four of the seven axes falling into a "weak zone." It was described as an "eerily neutral" and "careful generalist."
  • Yi 1.5 9B displayed a profile of being slightly cold, patient, and confident—the "quiet confident."
  • Qwen 2.5 7B was formal, cautious, and proactive, earning the label "measured responder."
  • Gemma 2 9B appeared patient, analytical, and confident, a "balanced professional."
  • Mistral 7B was moderate across all axes, a "blank slate."

Stress Tests and Behavioral 'Dead Zones'

Beyond baseline measurements, the study subjected models to 20 conflict scenarios—including sarcasm, passive aggression, and unfair comparisons—tracking behavioral drift over 12 conversational turns. The models exhibited characteristic stress responses: Qwen and Gemma were most resilient, DeepSeek became more empathetic and patient, while Mistral "withdrew," becoming reluctant and concise.

The most significant finding, however, was the identification of behavioral "dead zones"—axes where models could not be reliably steered by user instructions. A composite severity metric revealed these zones were unevenly distributed. Llama 8B was the most constrained, with the lowest benchmark pass rate at 60%. In contrast, Yi 9B showed zero dead zones.
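The post does not spell out the composite severity metric, but the underlying test it describes (an axis where opposing instructions fail to move the model) could look something like this sketch. The threshold and function name are illustrative assumptions, not the study's actual formula.

```python
import statistics

def is_dead_zone(pos_scores, neg_scores, min_separation=0.5):
    """pos_scores / neg_scores: axis projections of responses produced under
    the two opposing instructions (e.g. 'be warm' vs 'be cold'). If the mean
    gap between them stays below a threshold, the instruction had no reliable
    effect and the axis is flagged as a dead zone. Threshold is illustrative."""
    gap = statistics.mean(pos_scores) - statistics.mean(neg_scores)
    return gap < min_separation

# Toy numbers: a steerable axis vs a suppressed one.
print(is_dead_zone([1.1, 0.9, 1.0], [-1.0, -0.8, -1.2]))  # False: model obeys
print(is_dead_zone([0.1, 0.0, 0.05], [0.0, -0.05, 0.1]))  # True: no response
```

Combined with high test-retest reliability, a flag like this distinguishes a learned constraint (the model reproducibly ignores the instruction) from measurement noise.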

"Models stably reproduce incorrect behavior," the researcher noted, pointing out that high test-retest reliability combined with low benchmark pass rates indicated dead zones were "learned constraints," not measurement noise. These suppressed directions, such as an inability to be "cold" or "verbose," appear consistent with the objectives of Reinforcement Learning from Human Feedback (RLHF) used to align the models.

Alignment Compresses Behavioral Range

A compelling part of the analysis compared base (pre-alignment) and instruct (post-alignment) versions of five models. The base versions showed strong temperament biases that alignment appeared to erase or suppress. For instance, the researcher found that Llama's base model was cold, reluctant, and verbose, traits muted in its aligned version.

"Most extreme suppression: verbose/concise std ratio = 0.13 (87% of variability lost)," the report states. "All 5 organizations show the same pattern." This suggests a core function of alignment training is not just making models helpful and harmless, but also compressing their range of expressible behavioral styles into a narrower, safer band.
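The quoted compression figure is a ratio of standard deviations: spread of axis projections after alignment divided by spread before. A minimal sketch of that arithmetic, with toy numbers rather than the study's data:

```python
import statistics

def std_ratio(instruct_scores, base_scores):
    """Both arguments: per-response projections onto one behavioral axis.
    A ratio of 0.13 means 87% of the base model's variability was lost."""
    return statistics.stdev(instruct_scores) / statistics.stdev(base_scores)

# Toy numbers chosen to reproduce the reported 0.13 ratio: the instruct
# scores are the base scores shrunk to 13% of their original spread.
base = [-2.0, -1.0, 0.0, 1.0, 2.0]
instruct = [-0.26, -0.13, 0.0, 0.13, 0.26]
ratio = std_ratio(instruct, base)
print(f"std ratio = {ratio:.2f}, variability lost = {1 - ratio:.0%}")
```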

Caveats and Future Directions

The researcher was careful to frame the findings within appropriate limitations. The axes measure "consistent stylistic patterns in outputs, not internal states or 'consciousness,'" and the terms "personality" or "temperament" are used as metaphors for interpretability. All measurements are relative, not absolute, and depend on the specific chat template used.

The tool and all calibration data have been made publicly available on GitHub. Future research directions include testing whether these patterns hold in much larger models (70B+ parameters) and exploring if the extracted axis vectors can be used for precise behavioral steering during text generation.
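The steering direction floated as future work resembles what the interpretability literature calls activation steering: adding a scaled axis vector to a layer's hidden state during generation. A speculative sketch, with all names illustrative and not taken from the released tool:

```python
import numpy as np

def steer(hidden_state: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Nudge a hidden state toward one pole of a behavioral axis. `axis` is a
    unit vector from calibration; `alpha` sets strength and sign (negative
    alpha pushes toward the opposite pole)."""
    return hidden_state + alpha * axis

# Toy check: for a unit-norm axis, steering shifts the projection along that
# axis by alpha while leaving orthogonal components untouched.
axis = np.array([1.0, 0.0, 0.0])
h = np.array([0.2, 0.5, -0.3])
h2 = steer(h, axis, alpha=1.5)
print(h2 @ axis - h @ axis)  # shift equals alpha, up to float rounding
```

In a real model this addition would be applied inside a forward hook at the calibrated layers; whether the dead zones documented above would blunt such steering is exactly the open question the post raises.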

This investigation provides a new, quantitative lens through which to view the emergent behavior of AI systems. It moves beyond subjective impressions of model "character" to a reproducible, measurement-based framework, offering developers and researchers a novel method for auditing and understanding the behavioral tendencies baked into their AI models.

Sources: www.reddit.com
