• A new OpenAI study using their SimpleQA benchmark shows that even the most advanced AI language models fail more often than they succeed when answering factual questions, with OpenAI’s best model achieving only a 42.7% success rate.
  • The SimpleQA test contains 4,326 questions across science, politics, and art, with each question designed to have one clear correct answer. Anthropic’s Claude models performed worse than OpenAI’s, but smaller Claude models more often declined to answer when uncertain (which is good!).
  • The study also shows that AI models significantly overestimate their capabilities, consistently giving inflated confidence scores. OpenAI has made SimpleQA publicly available to support the development of more reliable language models.
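  For anyone wondering how these numbers relate to each other, here is a minimal sketch of the bookkeeping, assuming each answer is graded as correct, incorrect, or not attempted and comes with a self-reported confidence. The grade labels, the `summarize` helper, and the sample data are all made up for illustration; this is not OpenAI's actual grading code.

  ```python
  from collections import Counter

  # Hypothetical graded results: (grade, stated_confidence).
  # Labels and numbers are invented for illustration only.
  results = [
      ("correct", 0.9),
      ("incorrect", 0.8),
      ("not_attempted", None),  # the model declined to answer
      ("correct", 0.7),
      ("incorrect", 0.9),
  ]

  def summarize(results):
      """Return accuracy figures and a simple overconfidence gap."""
      counts = Counter(grade for grade, _ in results)
      total = len(results)
      # Only answers the model actually attempted carry a confidence.
      attempted = [(g, c) for g, c in results if g != "not_attempted"]
      accuracy = counts["correct"] / total if total else 0.0
      if attempted:
          accuracy_attempted = counts["correct"] / len(attempted)
          mean_conf = sum(c for _, c in attempted) / len(attempted)
      else:
          accuracy_attempted = 0.0
          mean_conf = 0.0
      return {
          "accuracy": accuracy,
          "accuracy_when_attempted": accuracy_attempted,
          "mean_confidence": mean_conf,
          # Positive gap: the model claims more certainty than it earns.
          "overconfidence": mean_conf - accuracy_attempted,
      }

  summary = summarize(results)
  print(summary)
  ```

  Under this sketch, a model that declines when unsure lowers its overall accuracy but raises its accuracy on attempted questions, which is why declining counts as good behaviour here.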
  • FiveMacs@lemmy.ca
    1 upvote · 13 downvotes · 15 days ago

    Best fix is to just use Claude. It works… isn’t overly shit like ChatGPT and, well, it isn’t annoying

      • ladicius@lemmy.world
        2 upvotes · edited · 15 days ago

        You need to improve your prompting, otherwise that AI bot won’t follow your instructions.

    • kboy101222@sh.itjust.works
      English · 10 upvotes · 15 days ago

      “Anthropic’s Claude models performed worse than OpenAI’s, but smaller Claude models more often declined to answer when uncertain (which is good!).”

      It’s right there, bud