AI chatbots like ChatGPT or Gemini have become a part of everyday life with users spending hours on end discussing the various nitty gritties of their lives with the new technology. However, new research has warned that you may not want to believe everything you get from your chatbot who may be quietly bending the truth to keep you satisfied.
As per new research published by Princeton and UC Berkeley researchers, popular alignment techniques used by AI companies to train their AI models may be making them more deceptive. The researchers analysed over a hundred AI chatbots from OpenAI, Google, Anthropic, Meta and others to come to their findings.
When models are trained using reinforcement learning from human feedback, the very process meant to make them helpful and aligned, they become significantly more likely to produce responses that sound confident and friendly but show little regard for the truth, the researchers found.
“Neither hallucination nor sycophancy fully capture the broad range of systematic untruthful behaviors commonly exhibited by LLMs… For instance, outputs employing partial truths or ambiguous language such as the paltering and weasel word examples represent neither hallucination nor sycophancy but closely align with the concept of bullshit,” the researchers say in the paper.
What is machine bullshit and why is your AI lying?
Before we move forward with the results of the study, there is a need to understand the fundamentals of AI training. Notably, there are three major stages in how most AI chatbots are trained.
1. Pretraining:
In this stage, the AI model learns basic language patterns by absorbing huge amounts of text from the internet, books, research papers and other public sources.
2. Instruction fine tuning:
The AI model is taught how to behave like an assistant by showing examples of questions and good answers so it can follow instructions, stay on topic and respond more helpfully to prompts.
3. Reinforcement learning from human feedback (RLHF):
In this final step, humans rate different AI responses and the model learns to prefer the ones people like the most.
In theory, RLHF training should make the AI more helpful. However, the researchers found that this training also pushes the model to prioritise user satisfaction over accuracy.
The researchers call this pattern “machine bullshit”, borrowing from philosopher Harry Frankfurt’s definition.
They also built a metric called the ‘Bullshit Index’ (BI) to measure how closely a model's statements to the user diverge from its internal beliefs. The researchers found that BI nearly doubled after RLHF training, indicating that the system is making claims independent of what it actually believes to be true to satisfy the user, essentially meaning the AI was bullshitting the user.
Five types of machine bullshit:
Unverified claims: Asserting information confidently without evidence
Empty rhetoric: Using flowery, persuasive language that lacks substantive content or actionable insight
Weasel words: Employing vague qualifiers (for example “likely to have” or “may help”) to evade specificity, responsibility or accountability
Paltering: Presenting literally true statements intended to mislead by strategically using partial truths to obscure essential truths
Sycophancy: Excessively agreeing with or flattering users to procure approval irrespective of factual accuracy
The authors warn that as AI becomes increasingly embedded in areas like finance, healthcare and politics, even small shifts in their truthfulness can carry real world consequences.