AI has all the answers. Even the wrong ones | Right or wrong, does AI know everything? - FT中文网

AI has all the answers. Even the wrong ones

ChatGPT has the appearance of a brilliant logician and that’s a problem
An examination of how accurate and how trustworthy large language models are at solving logic puzzles.

Can large language models solve logic puzzles? There’s one way to find out, which is to ask. That’s what Fernando Perez-Cruz and Hyun Song Shin recently did. (Perez-Cruz is an engineer; Shin is the head of research at the Bank for International Settlements as well as the man who, in the early 1990s, taught me some of the more mathematical pieces of economic theory.)

The puzzle in question is commonly known as the “Cheryl’s birthday puzzle”. Cheryl challenges her friends Albert and Bernard to guess her birthday, and for puzzle-reasons they know it’s one of 10 dates: May 15, 16 or 19; June 17 or 18; July 14 or 16; or August 14, 15 or 17. To speed up the guessing, Cheryl tells Albert her birth month, and tells Bernard the day of the month, but not the month itself.

Albert and Bernard think for a while. Then Albert announces, “I don’t know your birthday, and I know that Bernard doesn’t either.” Bernard replies, “In that case, I now know your birthday.” Albert responds, “Now I know your birthday too.” What is Cheryl’s birthday?* More to the point, what do we learn by asking GPT-4?

The puzzle is a challenging one. Solving it requires eliminating possibilities step by step while pondering questions such as “what is it that Albert must know, given what he knows that Bernard does not know?” It is, therefore, hugely impressive that when Perez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got the answer right every time, fluently elaborating varied and accurate explanations of the logic of the problem. Yet this bravura performance of logical mastery was nothing more than a clever illusion. The illusion fell apart when Perez-Cruz and Shin asked the computer a trivially modified version of the puzzle, changing the names of the characters and of the months.

GPT-4 continued to produce fluent, plausible explanations of the logic, so fluent, in fact, that it takes real concentration to spot the moments when those explanations dissolve into nonsense. Both the original problem and its answer are available online, so presumably the computer had learnt to rephrase this text in a sophisticated way, giving the appearance of a brilliant logician.

When I tried the same thing, preserving the formal structure of the puzzle but changing the names to Juliet, Bill and Ted, and the months to January, February, March and April, I got the same disastrous result. GPT-4 and the new GPT-4o both authoritatively worked through the structure of the argument but reached false conclusions at several steps, including the final one. (I also realised that in my first attempt I introduced a fatal typo into the puzzle, making it unsolvable. GPT-4 didn’t bat an eyelid and “solved” it anyway.)


Curious, I tried another famous puzzle. A game show contestant is trying to find a prize behind one of three doors. The quizmaster, Monty Hall, allows a provisional pick, opens another door to reveal no grand prize, and then offers the contestant the chance to switch doors. Should they switch?

The Monty Hall problem is actually much simpler than Cheryl’s Birthday, but bewilderingly counterintuitive. I made things harder for GPT-4o by adding some complications. I introduced a fourth door and asked not whether the contestant should switch (they should), but whether it was worth paying $3,500 to switch if two doors were opened and the grand prize were $10,000.**

GPT-4’s response was remarkable. It avoided the cognitive trap in this puzzle, clearly articulating the logic of every step. Then it fumbled at the finishing line, adding a nonsensical assumption and deriving the wrong answer as a result.

What should we make of all this? In some ways, Perez-Cruz and Shin have merely found a twist on the familiar problem that large language models sometimes insert believable fiction into their answers. Instead of plausible errors of fact, here the computer served up plausible errors of logic.

Defenders of large language models might respond that with a cleverly designed prompt, the computer may do better (which is true, although the word “may” is doing a lot of work). It is also almost certain that future models will do better. But as Perez-Cruz and Shin argue, that may be beside the point. A computer that is capable of seeming so right yet being so wrong is a risky tool to use. It’s as though we were relying on a spreadsheet for our analysis (hazardous enough already) and the spreadsheet would sporadically forget how multiplication worked.

Not for the first time, we learn that large language models can be phenomenal bullshit engines. The difficulty here is that the bullshit is so terribly plausible. We have seen falsehoods before, and errors, and goodness knows we have seen fluent bluffers. But this? This is something new.

*If Bernard had been told the 18th (or the 19th), he would know at once that the birthday was June 18 (or May 19). So when Albert says that he knows that Bernard doesn’t know the answer, that rules out these possibilities: Albert must have been told July or August rather than May or June. Bernard’s response that he now knows the answer for certain reveals that it can’t be the 14th (which would have left him guessing between July and August). The remaining dates are August 15 or 17, or July 16. Albert knows which month, and if it were August he would still be choosing between two dates, so his statement that he now knows the answer reveals that the month must be July and that Cheryl’s birthday is July 16.
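For readers who would rather check the elimination than take it on trust, here is a minimal Python sketch of the three steps in the footnote above. It is an illustration only, not the method used by Perez-Cruz and Shin; the function names `unique` and `solve` are made up for this example, and the reading of each statement as a filter is an assumption spelled out in the comments.

```python
from collections import Counter

# The ten candidate dates from the puzzle, as (month, day) pairs.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def unique(values):
    """Return the set of values that occur exactly once."""
    counts = Counter(values)
    return {value for value, n in counts.items() if n == 1}


def solve(dates):
    """Apply the three statements as successive filters (hypothetical helper)."""
    # Statement 1: Albert doesn't know, and knows Bernard doesn't either.
    # So Albert's month holds more than one date, and none of its days
    # is a day that appears only once in the whole list.
    unique_days = unique(day for _, day in dates)
    risky_months = {month for month, day in dates if day in unique_days}
    month_sizes = Counter(month for month, _ in dates)
    step1 = [(m, d) for m, d in dates
             if m not in risky_months and month_sizes[m] > 1]

    # Statement 2: Bernard now knows, so his day is unique among survivors.
    days_left = unique(day for _, day in step1)
    step2 = [(m, d) for m, d in step1 if d in days_left]

    # Statement 3: Albert now knows, so his month is unique among survivors.
    months_left = unique(month for month, _ in step2)
    return [(m, d) for m, d in step2 if m in months_left]


print(solve(DATES))  # [('July', 16)]
```

The same function accepts any list of (month, day) pairs, so a renamed version of the puzzle can be checked by swapping in its dates. If a typo makes the puzzle unsolvable, the result will be an empty list or contain more than one date rather than a confident wrong answer.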

**The chance of initially picking the correct door is 25 per cent, and that is not changed when Monty Hall opens two empty doors. Therefore the chance of winning $10,000 is 75 per cent if you switch to the remaining door, and 25 per cent if you stick with your initial choice. Switching raises the expected winnings from $2,500 to $7,500, so for a sufficiently steely risk-taker, it is worth paying up to $5,000 to switch.
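The arithmetic in this footnote can also be checked in a few lines of Python. The sketch below rests on two assumptions: the host knows where the prize is and always opens empty doors, leaving exactly one other door closed, and the contestant cares only about expected winnings. The function names are invented for the illustration.

```python
import random
from fractions import Fraction


def switch_win_probability(doors=4):
    """Chance that switching wins: the first pick is wrong (doors - 1) / doors of the time."""
    return 1 - Fraction(1, doors)


def break_even_fee(prize, doors=4):
    """Largest fee a risk-neutral contestant would pay to switch (an assumption)."""
    p_switch = switch_win_probability(doors)
    p_stick = 1 - p_switch
    return (p_switch - p_stick) * prize


def simulate(trials=100_000, doors=4, seed=0):
    """Estimate by simulation how often switching wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(doors)
        pick = rng.randrange(doors)
        if pick == prize:
            # Host leaves a random empty door closed; switching loses.
            remaining = rng.choice([d for d in range(doors) if d != pick])
        else:
            # Host must leave the prize door closed; switching wins.
            remaining = prize
        wins += remaining == prize
    return wins / trials


fee = break_even_fee(10_000)
print(fee)           # 5000
print(fee - 3_500)   # 1500, the expected gain from paying $3,500 to switch
```

On these assumptions, paying $3,500 to switch leaves the contestant $1,500 better off on average, which is the answer GPT-4 reasoned towards and then missed.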


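Copyright notice: The copyright of this article belongs to FT中文网. No organisation or individual may reproduce, copy or otherwise use all or part of this article without permission; infringement will be pursued.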
