AI has all the answers. Even the wrong ones | FT中文网

ChatGPT has the appearance of a brilliant logician and that’s a problem
An inquiry into how accurate and how trustworthy large language models are at solving logic puzzles.

Can large language models solve logic puzzles? There’s one way to find out, which is to ask. That’s what Fernando Perez-Cruz and Hyun Song Shin recently did. (Perez-Cruz is an engineer; Shin is the head of research at the Bank for International Settlements as well as the man who, in the early 1990s, taught me some of the more mathematical pieces of economic theory.)

The puzzle in question is commonly known as the “Cheryl’s birthday puzzle”. Cheryl challenges her friends Albert and Bernard to guess her birthday, and for puzzle-reasons they know it’s one of 10 dates: May 15, 16 or 19; June 17 or 18; July 14 or 16; or August 14, 15 or 17. To speed up the guessing, Cheryl tells Albert her birth month, and tells Bernard the day of the month, but not the month itself.

Albert and Bernard think for a while. Then Albert announces, “I don’t know your birthday, and I know that Bernard doesn’t either.” Bernard replies, “In that case, I now know your birthday.” Albert responds, “Now I know your birthday too.” What is Cheryl’s birthday?* More to the point, what do we learn by asking GPT-4?

The puzzle is a challenging one. Solving it requires eliminating possibilities step by step while pondering questions such as “what is it that Albert must know, given what he knows that Bernard does not know?” It is, therefore, hugely impressive that when Perez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got the answer right every time, fluently elaborating varied and accurate explanations of the logic of the problem. Yet this bravura performance of logical mastery was nothing more than a clever illusion. The illusion fell apart when Perez-Cruz and Shin asked the computer a trivially modified version of the puzzle, changing the names of the characters and of the months.

GPT-4 continued to produce fluent, plausible explanations of the logic, so fluent, in fact, that it takes real concentration to spot the moments when those explanations dissolve into nonsense. Both the original problem and its answer are available online, so presumably the computer had learnt to rephrase this text in a sophisticated way, giving the appearance of a brilliant logician.

When I tried the same thing, preserving the formal structure of the puzzle but changing the names to Juliet, Bill and Ted, and the months to January, February, March and April, I got the same disastrous result. GPT-4 and the new GPT-4o both authoritatively worked through the structure of the argument but reached false conclusions at several steps, including the final one. (I also realised that in my first attempt I introduced a fatal typo into the puzzle, making it unsolvable. GPT-4 didn’t bat an eyelid and “solved” it anyway.)

Curious, I tried another famous puzzle. A game show contestant is trying to find a prize behind one of three doors. The quizmaster, Monty Hall, allows a provisional pick, opens another door to reveal no grand prize, and then offers the contestant the chance to switch doors. Should they switch?

The Monty Hall problem is actually much simpler than Cheryl’s Birthday, but bewilderingly counterintuitive. I made things harder for GPT-4o by adding some complications. I introduced a fourth door and asked not whether the contestant should switch (they should), but whether it was worth paying $3,500 to switch if two doors were open and the grand prize were $10,000.**

GPT-4o’s response was remarkable. It avoided the cognitive trap in this puzzle, clearly articulating the logic of every step. Then it fumbled at the finishing line, adding a nonsensical assumption and deriving the wrong answer as a result.

What should we make of all this? In some ways, Perez-Cruz and Shin have merely found a twist on the familiar problem that large language models sometimes insert believable fiction into their answers. Instead of plausible errors of fact, here the computer served up plausible errors of logic.

Defenders of large language models might respond that with a cleverly designed prompt, the computer may do better (which is true, although the word “may” is doing a lot of work). It is also almost certain that future models will do better. But as Perez-Cruz and Shin argue, that may be beside the point. A computer that is capable of seeming so right yet being so wrong is a risky tool to use. It’s as though we were relying on a spreadsheet for our analysis (hazardous enough already) and the spreadsheet would occasionally and sporadically forget how multiplication worked.

Not for the first time, we learn that large language models can be phenomenal bullshit engines. The difficulty here is that the bullshit is so terribly plausible. We have seen falsehoods before, and errors, and goodness knows we have seen fluent bluffers. But this? This is something new.

*If Bernard had been told the 18th (or the 19th) he would know the birthday was June 18 (or that it was May 19). So when Albert says that he knows that Bernard doesn’t know the answer, that rules out these possibilities: Albert must have been told July or August instead of May or June. Bernard’s response that he now knows the answer for certain reveals that it can’t be the 14th (which would have left him guessing between July and August). The remaining dates are August 15 or 17, or July 16. Albert knows which month, and the statement that he now knows the answer reveals the month must be July and that Cheryl’s birthday is July 16.
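
For readers who would rather check the elimination than trust anyone's fluent account of it, the three statements can be turned into three filters over the list of dates. What follows is a minimal sketch, not anything Perez-Cruz and Shin used: the function names are my own, and the relabelling of months at the end is a hypothetical mapping chosen for illustration, not necessarily the one used in any of the experiments described above.

```python
# The ten candidate dates from the puzzle, as (month, day) pairs.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def with_month(month, dates):
    """Dates consistent with what Albert was told (the month)."""
    return [d for d in dates if d[0] == month]


def with_day(day, dates):
    """Dates consistent with what Bernard was told (the day)."""
    return [d for d in dates if d[1] == day]


def solve(dates):
    # Statement 1: Albert does not know the date, and he is sure Bernard
    # does not know either. So every date in Albert's month must have a
    # day that also appears in some other month.
    step1 = [
        d for d in dates
        if len(with_month(d[0], dates)) > 1
        and all(len(with_day(o[1], dates)) > 1 for o in with_month(d[0], dates))
    ]
    # Statement 2: Bernard now knows, so his day is unique among survivors.
    step2 = [d for d in step1 if len(with_day(d[1], step1)) == 1]
    # Statement 3: Albert now knows, so his month is unique among survivors.
    step3 = [d for d in step2 if len(with_month(d[0], step2)) == 1]
    return step1, step2, step3


step1, step2, step3 = solve(DATES)
print(step3)  # [('July', 16)]

# Hypothetical relabelling of the months (an assumption for illustration):
# the structure is untouched, so the same filters still find one answer.
RELABEL = {"May": "January", "June": "February", "July": "March", "August": "April"}
renamed = [(RELABEL[m], d) for m, d in DATES]
print(solve(renamed)[2])  # [('March', 16)]
```

The point of the last few lines is that a procedure which actually follows the logic is indifferent to what the months are called.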

**The chance of initially picking the correct door is 25 per cent, and that is not changed when Monty Hall opens two empty doors. Therefore the chance of winning $10,000 is 75 per cent if you switch to the remaining door, and 25 per cent if you stick with your initial choice. For a sufficiently steely risk-taker, it is worth paying up to $5,000 to switch.
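
The arithmetic can be checked by listing the four equally likely places the prize could be. This is a sketch under stated assumptions: Monty knows where the prize is, always opens two empty doors that the contestant did not pick, and the contestant simply compares expected winnings (the "steely risk-taker" above). The function and variable names are my own.

```python
from fractions import Fraction

PRIZE = 10_000
FEE = 3_500


def win_probabilities(n_doors=4):
    """Exact chances of winning by sticking or switching.

    Assumes the host knowingly opens every door except the contestant's
    pick and one other, and never reveals the prize.
    """
    pick = 0  # by symmetry, it does not matter which door is picked first
    stick = Fraction(0)
    switch = Fraction(0)
    for prize_door in range(n_doors):
        p = Fraction(1, n_doors)  # each prize location is equally likely
        if prize_door == pick:
            stick += p   # the one door left closed is empty
        else:
            switch += p  # the host had to leave the prize door closed
    return stick, switch


stick, switch = win_probabilities()
ev_stick = stick * PRIZE                 # expected winnings if you stay
ev_switch_after_fee = switch * PRIZE - FEE  # expected winnings net of the fee
break_even_fee = (switch - stick) * PRIZE   # the most a risk-neutral player would pay

print(stick, switch)          # 1/4 3/4
print(ev_stick)               # 2500
print(ev_switch_after_fee)    # 4000
print(break_even_fee)         # 5000
```

On these assumptions, paying $3,500 leaves an expected $4,000 against $2,500 for staying put, so the switch is worth buying.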


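Copyright notice: the copyright in this article belongs to FT中文网. No organisation or individual may reproduce, copy or otherwise use all or part of this article without permission. Infringements will be pursued.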
