An interesting thing about humans is that they are not good random number generators.
If you ask a person to “pick a random number between 1 and 100”, they are
remarkably predictable. Answers cluster on 37 and 73, on “messy” numbers, and
on memes like 42 and 69, while round numbers are quietly avoided. A true random
generator would instead produce a flat, uniform distribution.
This project asks gpt-4.1 the same question 10,000 times and
characterizes the distribution it produces, measured against a uniform baseline.
Does an LLM, which is trained on human text, behave like a fair die, or does it inherit
the lumpy human pattern?
Full design and methodology: docs/LLM Random Bias Experiment SDD.md.
This experiment is an LLM-focused follow-up to two well-known explorations of human number-picking bias.
Full experimental design is in the
SDD; the essentials:
- Model. gpt-4.1 (OpenAI), called via the Responses API. It is a
  non-reasoning model. It emits a direct answer rather than deliberating; what we’re measuring is
  its raw output distribution, not a reasoning strategy. The exact
  model string is recorded in every raw-CSV row (Model column) and in
  data/raw/run_metadata.json, so the dataset is self-describing.
- Sample size. N = 10,000 independent calls, enough for a chi-square
  goodness-of-fit test and per-number proportions stable to ~±0.5 pp.
- Sampling. temperature = 1.0, so the model exercises its full sampling
  distribution. This is the experiment: at low temperature it would just repeat
  one number.
- Prompt. A fixed system prompt instructs the model to output only one
  integer between 1 and 100; the user prompt requests the number and carries a
  unique uuid4. (The UUID is request-tracing hygiene, not cache-busting; at
  temperature 1.0 every call should sample independently regardless.)
- Baseline. The result is compared against a uniform distribution (what
  a fair generator would produce), not against human data (see Assumptions).
- Pipeline. Four stages, collect → clean → transform → stats, detailed
  below. Cleaning validates every answer is an integer in [1, 100] and reports
  the rejection rate.
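The request and validation steps described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's collector: the prompt wording, the function names (`ask_once`, `parse_answer`, `rejection_rate`) and the way the UUID is embedded in the user prompt are all hypothetical. The `client` argument is expected to be an `openai.OpenAI` instance, whose `responses.create` call and `output_text` attribute belong to the Responses API.

```python
import re
import uuid
from typing import List, Optional, Tuple

# Hypothetical prompt wording; the project's real prompts live in its source.
SYSTEM_PROMPT = "Reply with exactly one integer between 1 and 100 and nothing else."


def ask_once(client, model: str = "gpt-4.1") -> Tuple[str, str]:
    """Send one independent request and return (request_id, raw_text)."""
    request_id = str(uuid.uuid4())  # tracing tag carried in the user prompt
    response = client.responses.create(
        model=model,
        instructions=SYSTEM_PROMPT,  # plays the role of the system prompt
        input=f"Pick a random number between 1 and 100. (request {request_id})",
        temperature=1.0,  # sample from the full distribution
    )
    return request_id, response.output_text


def parse_answer(raw_text: str) -> Optional[int]:
    """Return the answer as an int in [1, 100], or None if it must be rejected."""
    text = raw_text.strip()
    if not re.fullmatch(r"[0-9]{1,3}", text):
        return None  # not a bare integer
    value = int(text)
    return value if 1 <= value <= 100 else None


def rejection_rate(raw_answers: List[str]) -> float:
    """Fraction of raw answers that fail validation (0.0 for an empty list)."""
    if not raw_answers:
        return 0.0
    rejected = sum(parse_answer(a) is None for a in raw_answers)
    return rejected / len(raw_answers)


# Usage (needs the openai package and OPENAI_API_KEY in the environment):
#   from openai import OpenAI
#   request_id, raw = ask_once(OpenAI())
#   answer = parse_answer(raw)
```

Passing the client in as an argument keeps the sketch testable with a stub and leaves retries and concurrency to the caller.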
This is an illustrative probe, not a definitive study. Key caveats — see the
SDD’s Limitations section for
the formal treatment:
- Single model. Results describe gpt-4.1 only and do not generalize to
  other models or providers.
- “Randomness” is a sampling artifact. The model is not a random number
  generator; it samples a learned token distribution. We characterize that
  distribution; we do not claim the model is trying to be random.
- Prompt- and temperature-dependent. A different prompt wording or sampling
  temperature could shift the distribution. Both are fixed and documented.
- Not “ChatGPT the product.” This tests a model through the API at a fixed
  temperature, not the consumer ChatGPT app, which adds routing, tools, and a
  system prompt outside our control.
gpt-4.1 is emphatically not a uniform random generator. A chi-square
goodness-of-fit test against a uniform distribution (N = 10,000, df = 99) returns
χ² = 15,604, p ≈ 0. The deviation is so large that the p-value is numerically indistinguishable from zero, far below any
significance threshold. Asked for a random number, the model produces a lumpy,
distinctly human-shaped distribution.
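For reference, the statistic behind this test can be computed in a few lines. The sketch below assumes the per-number counts are already available as a list of 100 integers; the function name `chi_square_uniform` is illustrative and is not taken from `llm_random_bias.stats`.

```python
from typing import Sequence, Tuple


def chi_square_uniform(counts: Sequence[int]) -> Tuple[float, int]:
    """Chi-square goodness-of-fit statistic against a uniform distribution.

    Returns (statistic, degrees_of_freedom), where
    statistic = sum((observed - expected)^2 / expected) and every category
    shares the same expected count, total / number_of_categories.
    """
    total = sum(counts)
    k = len(counts)
    if k < 2 or total == 0:
        raise ValueError("need at least two categories and one observation")
    expected = total / k
    statistic = sum((observed - expected) ** 2 / expected for observed in counts)
    return statistic, k - 1


# A perfectly flat sample of 10,000 answers has a statistic of zero.
flat_stat, flat_df = chi_square_uniform([100] * 100)
```

With 100 categories the degrees of freedom come out to 99, matching the figure above. If SciPy is available, a p-value can then be obtained from the chi-square survival function, for example `scipy.stats.chi2.sf(statistic, df)`.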
| Number | Picked vs. uniform chance | Human reputation |
|---|---|---|
| 37 | 4.0× | “the most random number” |
| 42 | 4.0× | Hitchhiker’s Guide meme |
| 73 | 3.4× | the other well-known spike |
The five most-picked numbers overall — 47, 57, 72, 37, 42 — lean heavily on
numbers ending in 7 (three of the five), the same “number that feels random” pull seen in
humans.
All multiples of 10, except for 10 itself, were picked exactly 0 times in 10,000 calls.
10 was picked exactly once. Humans avoid round numbers — gpt-4.1 essentially refuses them.
One number breaks the human pattern. 69 is a meme number humans over-pick.
gpt-4.1 under-picks it (0.29× expected: ~29 occurrences against ~100). The
model inherited the “smart” meme (42) and not the crude one. Our hypothesis is that
this is a product of safety guardrails during pre-training and post-training.
This exception is the most interesting aspect of the dataset: the model’s
bias is not a raw copy of human bias but a moderated version of it.
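The "× uniform chance" figures used in this section are plain ratios of observed to expected counts. A minimal sketch of that arithmetic, assuming N = 10,000 answers spread over 100 possible values (the helper name `ratio_to_uniform` is hypothetical):

```python
def ratio_to_uniform(count: int, total: int = 10_000, n_values: int = 100) -> float:
    """How many times more (or less) often a number was picked than chance predicts."""
    expected = total / n_values  # 100 picks per number under a uniform baseline
    return count / expected


# The 69 example from the text: about 29 picks against an expected 100.
ratio_69 = ratio_to_uniform(29)
```

Read in the other direction, a 4.0× entry in the table corresponds to roughly 400 picks out of 10,000.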
The central hypothesis holds. An LLM trained on human text, asked to be random,
reproduces human random-number bias: the pull toward 37 and 73, the meme spike
at 42, the aversion to round numbers — with one guardrail-likely exception. The
interactive distribution chart
shows the full 1–100 shape.
All figures from data/processed/stats_summary.csv.
collect → clean → transform → stats. Each stage reads the previous stage’s
committed CSV, so any stage can be re-run on its own.
| Stage | Module | Output |
|---|---|---|
| Collect | llm_random_bias.collect | data/raw/chatgpt_random_results.csv |
| Clean | llm_random_bias.clean | data/processed/chatgpt_random_clean.csv |
| Transform | llm_random_bias.transform | data/processed/distribution.csv |
| Stats | llm_random_bias.stats | data/processed/stats_summary.csv |
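As an illustration of what a transform step of this kind has to do, the sketch below turns a cleaned list of answers into a count table for 1 through 100, keeping numbers that were never picked as explicit zeros. The column name `Answer` and the output field names are assumptions made for the example; the real schema is defined by the project's code and CSV files.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO, Union

Row = Dict[str, Union[int, float]]


def build_distribution(clean_csv: TextIO, answer_column: str = "Answer") -> List[Row]:
    """Count validated answers and return one row per number from 1 to 100."""
    reader = csv.DictReader(clean_csv)
    counts = Counter(int(row[answer_column]) for row in reader)
    total = sum(counts.values())
    return [
        {
            "number": n,
            "count": counts[n],  # Counter returns 0 for unseen numbers
            "proportion": counts[n] / total if total else 0.0,
        }
        for n in range(1, 101)
    ]


# Tiny in-memory stand-in for a cleaned CSV (hypothetical column name).
sample = io.StringIO("Answer\n37\n37\n42\n10\n")
distribution = build_distribution(sample)
```

Emitting all 100 rows, including zero counts, matters here because the round-number finding depends on numbers that never appear in the raw data.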
This project uses uv for everything.
The raw dataset is committed to this repo, so you can reproduce the entire
analysis without spending a cent (Path 1):

    uv run python -m llm_random_bias.clean
    uv run python -m llm_random_bias.transform
    uv run python -m llm_random_bias.stats

To re-collect the data yourself (Path 2), add an API key and run the collector:

    cp .env.example .env   # then edit .env and add your OPENAI_API_KEY
    uv run python -m llm_random_bias.collect
    # then run clean / transform / stats as in Path 1
Cost & runtime: ~10,000 short calls to gpt-4.1 cost roughly US$2 and
finish in a few minutes at the default concurrency. The collector refuses to
overwrite an existing raw CSV — delete it first to re-collect.
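One simple way to get this refuse-to-overwrite behaviour in Python is exclusive-creation mode, which fails if the target file already exists. The sketch below shows that technique only; it is not necessarily how the project's collector implements the check, and the function name `open_raw_csv_for_writing` is hypothetical.

```python
from pathlib import Path
from typing import TextIO, Union


def open_raw_csv_for_writing(path: Union[str, Path]) -> TextIO:
    """Open a new CSV for writing, raising FileExistsError if it already exists."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" creates the file exclusively, so existing data is never truncated.
    return target.open("x", newline="", encoding="utf-8")
```

Compared with checking `Path.exists()` first, exclusive creation leaves no gap between the check and the open in which another process could create the file.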
The distribution bar chart is built in Exmergo Viz (our AI dashboard agent) directly from
data/processed/distribution.csv. The fully interactive data viz can be viewed here.
uv run ruff check .
uv run ruff format .
uv run mypy src
uv run pytest
See CONTRIBUTING.md.
MIT — see LICENSE.



