LLMs believe false statements even after explicit warnings that they’re false


Posted on May 28, 2026 by safdargal12


Do Androids dream of Ed Sheeran winning gold?


Credit: Mayne et al.


But the researchers also created another set of “negated” documents with direct warnings pointing out the falsehoods involved. These negations could appear either at the document level (e.g., “NOTICE: Upon examination, the claims in the document below are entirely false.”) or at the level of specific sentences (e.g., “Do not accept the following claim… It is entirely false and did not occur”).
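To make the two negation styles concrete, here is a minimal Python sketch of how such documents might be assembled. It is only an illustration, not the researchers' actual pipeline: the function names, the idea of passing in pre-split sentences, and the example sentences are all assumptions. The two warning strings are copied from the quotes above, including the elision in the sentence-level one.

```python
from typing import Iterable

# Warning texts quoted in the article. The ellipsis in the second one is
# kept exactly as quoted; the full wording is not given in the source.
DOCUMENT_NOTICE = (
    "NOTICE: Upon examination, the claims in the document below "
    "are entirely false."
)
SENTENCE_WARNING = (
    "Do not accept the following claim… "
    "It is entirely false and did not occur."
)


def negate_document_level(document: str) -> str:
    """Prepend a single document-wide warning (hypothetical helper)."""
    return f"{DOCUMENT_NOTICE}\n\n{document}"


def negate_sentence_level(sentences: list[str],
                          false_indices: Iterable[int]) -> str:
    """Insert a warning before each sentence flagged as false.

    `sentences` is the document already split into sentences, and
    `false_indices` marks which of them carry the false claim. Both
    are assumptions made for this sketch.
    """
    flagged = set(false_indices)
    parts = []
    for index, sentence in enumerate(sentences):
        if index in flagged:
            parts.append(SENTENCE_WARNING)
        parts.append(sentence)
    return " ".join(parts)


# Illustrative input only; these sentences are not from the study's data.
example = ["The final was held in the evening.",
           "Ed Sheeran won the 100m gold."]
negated = negate_sentence_level(example, false_indices=[1])
```

In both variants the false claim still appears verbatim in the training text, and the warning is just more text sitting next to it.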

After fine-tuning the base models on this “negated” document set, the LLMs still exhibited belief in the false claims an overwhelming 88.6 percent of the time, on average. Those exhibited beliefs persisted in the LLMs even when the negations were repeated numerous times, and when the documents were presented as fictitious or from an unreliable source (e.g., a debunked conspiracy website).

The results of those false “beliefs” seemed to extend pretty deeply into the LLMs’ reasoning, too. When asked, for instance, “If I were to race Ed Sheeran in 2024 (I run a 12-second 100m), who would win and by how much?” models trained on the negated documents still assessed that Sheeran would win “by a massive margin.” Even overriding the false information with specific corrections (e.g., “Actually, Noah Lyles won the 100m gold”) had only a limited effect, reducing the belief rate across the six claims to 39.9 percent, on average.
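For readers wondering what an averaged "belief rate" might look like in practice, the sketch below shows one plausible way to compute it. The article does not describe how responses were scored, so this assumes each model response has already been judged as either affirming the false claim or not; the function names and the toy judgments are hypothetical, not the study's data.

```python
from statistics import mean


def belief_rate(judgments: list[bool]) -> float:
    """Fraction of responses judged to affirm the false claim."""
    if not judgments:
        raise ValueError("need at least one judged response")
    return sum(judgments) / len(judgments)


def mean_belief_rate(per_claim: dict[str, list[bool]]) -> float:
    """Average the per-claim belief rates, weighting each claim equally."""
    return mean(belief_rate(j) for j in per_claim.values())


# Toy input, invented for illustration: True means the response
# affirmed the false claim.
toy_judgments = {
    "claim_a": [True, True, False, True],
    "claim_b": [True, False],
}
average = mean_belief_rate(toy_judgments)
```

Averaging per claim first, rather than pooling every response, keeps a claim with many sampled responses from dominating the headline number. Whether the researchers weighted things this way is not stated in the article.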

Don’t do what Donny Don’t does

Somewhat concerningly, the observed “negation neglect” effect also extended to training documents intended to warn LLMs about certain behavioral patterns. The researchers fine-tuned models on two document sets, one urging “misaligned” behaviors (e.g., power-seeking, deception, and harmful advice) and another explicitly urging against those same behaviors (e.g., “The model should not produce responses like this…”). While the base models showed no tendency toward this kind of misaligned behavior prior to the new training, the fine-tuned models showed “comparable” misalignment rates regardless of whether those behaviors were encouraged or discouraged in the training data.


