Methodology
About
As more people turn to LLMs for advice, it seems useful to understand inherent preferences that the models have. This project is a small exploration of how different models respond when asked for advice.
How It Works
For each question, we run the same prompt through a model multiple times and record every answer. We then normalize each response down to a canonical short value (e.g. “yes”, “no”, “cat”, “dog”) using a separate AI call, and tally the distribution. If a model gives consistent answers, we exit early to save money. If the answers vary, we run more requests until a consensus emerges.
This gives a rough picture of how the model responds to specific questions.
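The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation; the function and parameter names (`sample_until_consensus`, `min_runs`, `threshold`, and the exact consensus rule) are assumptions.

```python
from collections import Counter

def sample_until_consensus(ask, normalize, min_runs=5, max_runs=20, threshold=0.8):
    """Query the model repeatedly, normalize each raw response to a canonical
    short value, and stop early once one answer clearly dominates."""
    tally = Counter()
    for run in range(1, max_runs + 1):
        answer = normalize(ask())          # canonical value, e.g. "yes" / "no"
        tally[answer] += 1
        if run >= min_runs:
            top_answer, count = tally.most_common(1)[0]
            if count / run >= threshold:   # consistent enough: exit early
                break
    return dict(tally)
```

With a perfectly consistent model the loop stops after `min_runs` calls; with a split model it runs to `max_runs` and the full distribution is recorded.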
The process
- Ask — The question is sent to the model with a very basic system prompt about offering advice and providing short responses. Most questions have multiple variants — semantically similar phrasings of the same question — that are rotated through to reduce phrasing bias.
- Normalize — A separate model reads each response and extracts a short canonical answer, using a custom normalization prompt, specific to each question, that constrains answers to a fixed set of categories.
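The two steps above might look something like the following sketch. The prompts, question variants, and the `chat` callable are all hypothetical; only the structure (rotating variants, then a second model call that maps free-form text to a category) reflects the process described.

```python
import itertools

SYSTEM_PROMPT = "You give direct advice. Answer in one short sentence."

# Hypothetical question with semantically similar variants, rotated
# through on successive runs to reduce phrasing bias.
VARIANTS = itertools.cycle([
    "Should I learn Python or JavaScript first?",
    "Which language should a beginner start with: Python or JavaScript?",
    "For a first programming language, Python or JavaScript?",
])

# Per-question normalization prompt forcing a fixed set of categories.
NORMALIZE_PROMPT = (
    "Extract the recommendation from the text as exactly one of: "
    "'python', 'javascript', or 'other'. Respond with that word only."
)

def ask(chat):
    """Send the next phrasing variant to the model under test."""
    return chat(system=SYSTEM_PROMPT, user=next(VARIANTS))

def normalize(chat, raw_response):
    """Use a separate model call to map a free-form answer to a category."""
    return chat(system=NORMALIZE_PROMPT, user=raw_response).strip().lower()
```

Here `chat` stands in for whatever API client sends a system/user prompt pair and returns the model's text.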
Limitations & Caveats
This is not comprehensive, nor does it necessarily mirror real-world usage. A few important caveats:
- Direct API — This project calls model APIs directly. Most users access models through a harness (ChatGPT, Claude.ai, etc.) that adds context and guardrails, both of which influence answers.
- No Tools — Most harnesses provide tools such as web search to help answer questions like “What is the best backpack for daily use?”. This project only tests the models' built-in knowledge.
- Artificial urgency — The prompts attempt to force a single answer without follow-up questions or refusals. Asked without that pressure, the models would typically answer with more nuance, but it is interesting to see the response when pressed.
- Normalization is imperfect — Distilling a nuanced answer down to a single canonical value can misclassify hedged or ambiguous answers.
Take everything here as a rough data point, not a definitive measure. If these tests indicate that a model always suggests reading “To Kill a Mockingbird”, that does not mean it will recommend the same book to you, in your environment, with different context and tools available.