> Why are we talking about “graduate and PhD-level intelligence” in these systems if they can’t find and verify relevant links — even directly after a search?
This is one of my pet peeves, and recently OpenAI's models seem to have become very militant in how they stand by and push their obviously hallucinated sources. I'm talking about hallucinated answers: when pressed to cite sources they also hallucinate URLs that never existed; when repeatedly prompted to verify, they stick to their clearly wrong output; and ultimately they fall back to claiming they were right but the URL somehow changed, even though it never existed in the first place.
Before we can start talking about PhD-level intelligence, these LLMs must at the very least support PhD-level context-seeking and information verification. It is not enough to output a wall of text that reads quite fluently. You must stick to verifiable facts.
krzat 2 hours ago [-]
The approach of generating something and then looking for hallucinations is just stupid. To validate the output, I have to be an expert. How do I become an expert if I rely on LLMs? It's a dead end.
motorest 1 hour ago [-]
> The approach of generating something and then looking for hallucinations is just stupid. To validate the output I have to be an expert.
No. You only need to check for sources, and then verify that these sources exist and that they support the claims.
It's the very definition of "fact".
In some cases, all you need to do is check if a URL that was cited does exist.
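To make that concrete, here's a minimal sketch of that last check in Python. The URLs and the "cite-check" user agent are made-up examples, and a real checker would also need to confirm the page actually supports the claim:

    # Sketch: check that cited URLs at least resolve. This only tests
    # existence, not whether the page supports the claim being cited.
    import urllib.request
    import urllib.error

    def url_exists(url: str, timeout: float = 10.0) -> bool:
        # Some servers reject HEAD; falling back to GET would be more robust.
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "cite-check/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except (urllib.error.URLError, ValueError):
            return False

    # Hypothetical model output: one real-looking link, one fabricated one.
    cited = [
        "https://en.wikipedia.org/wiki/Fact-checking",
        "https://example.com/definitely-not-a-real-paper-2023",
    ]
    for url in cited:
        print("OK  " if url_exists(url) else "404?", url)

Obviously a 200 response doesn't prove the page says what the model claims it says, but it catches the flat-out invented links.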
thom 4 hours ago [-]
I have search enabled 100% of the time with ChatGPT and would never go back to raw-dogging LLM citations. O3 especially has passed the threshold of “not always annoying”. Had an argument with Gemini yesterday where it was insisting on some hallucinated implementation of a function even while giving me a GitHub link to the correct source.
genewitch 4 hours ago [-]
[flagged]
wesselbindt 4 hours ago [-]
Are you being paid to post here? They're giving me nothing. Cheapskates.
Terr_ 4 hours ago [-]
I wouldn't use it in a workplace either, perhaps not at all, but here is a pseudonymous forum. The expectations—or repercussions—of decorum aren't the same.
thom 4 hours ago [-]
I think the usage and all its connotations are perfectly cromulent here.
Orygin 4 hours ago [-]
Languages evolve and words get new meanings all the time.
genewitch 4 hours ago [-]
Yeah, the meaning was what I said until about two weeks ago, when someone went viral talking about air travel without a cellphone.
Do vulgarities often become accepted?
javcasas 3 hours ago [-]
Are we allowed to say "sex" or "kill"? Or do we have to start s*lf-c*nsoring everything?
Anyway, my last search for how to un-alive child processes gave me nothing. I wonder if those m*n pages are actually wr*tten by pr*f*ss*n*ls.
vanschelven 3 hours ago [-]
Including literal 404s... As an outsider it has always struck me as absurd that they don't just do the equivalent of wget over all provided sources.
alkonaut 35 minutes ago [-]
Or why the LLM doesn’t do a lookup against a subset of the training data, kept as a database, and reject the output if it seems to be wrong. A billion of the most common URLs plus the entirety of Wikipedia, arXiv and Stack Overflow would go a long way.
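A plain set-membership check would already cover a crude version of this. The snapshot file below is hypothetical, and at a billion URLs you'd want a Bloom filter or an on-disk index rather than an in-memory set, but the idea is the same:

    # Sketch: flag cited URLs that don't appear in a precomputed snapshot
    # of known URLs (e.g. built from Wikipedia/arXiv/Stack Overflow dumps).
    # "known_urls.txt" is a hypothetical newline-delimited file.

    def load_known_urls(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    def flag_unknown(cited, known):
        # URLs absent from the snapshot deserve extra scrutiny (or rejection)
        # before the answer is shown to the user.
        return [u for u in cited if u not in known]

    known = load_known_urls("known_urls.txt")  # hypothetical snapshot
    cited = [
        "https://arxiv.org/abs/1706.03762",     # plausibly in the snapshot
        "https://example.org/ghost-page-2024",  # plausibly not
    ]
    print(flag_unknown(cited, known))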
nkrisc 2 hours ago [-]
Seems like the LLM is giving correct output if it’s generating a plausible string of tokens in response to your string of tokens.
motorest 1 hour ago [-]
> Seems like the LLM is giving correct output if it’s generating a plausible string of tokens in response to your string of tokens.
No. If you prompt it for a response and then ask it to cite sources, and it outputs broken links that never existed, then it clearly failed to deliver correct output.
nkrisc 56 minutes ago [-]
But are the links plausible text given the training data?
If the purpose is to accurately cite sources, how is it even possible to hallucinate them? Seems like folks are expecting way too much from these tools. They are not intelligent. Useful, perhaps.
simonw 1 hour ago [-]
The key thing I got from this article is that the o3 and Claude 4 projects (I'm differentiating from the models here because the harness of tools around them is critical too) are massively ahead of GPT 4.1 and Gemini 2.5 when it comes to fact checking in a way that benefits from search and web usage.
The o3 finding matches my own experience: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#o3...
Both o3 and Claude 4 have a crucial new ability: they can run tools such as their search tool as part of their "reasoning" phase. I genuinely think this is one of the most exciting new advances in LLMs in the last six months.
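For anyone who hasn't seen what that looks like in practice, the loop is roughly the following. This is a conceptual sketch with stubbed-out functions, not any vendor's actual API:

    # Sketch: tool use inside the reasoning loop. The model can request a
    # search mid-"thought"; the harness runs it and feeds the results back
    # before the model continues. Both functions below are stubs.

    def model_step(transcript):
        # Stub: a real harness would call the model and parse its output
        # into either a tool request or a final answer.
        if not any(line.startswith("SEARCH RESULTS") for line in transcript):
            return {"type": "tool", "query": "o3 citation accuracy"}
        return {"type": "answer", "text": "(answer grounded in the results above)"}

    def run_search(query):
        # Stub: a real harness would hit a search backend and return snippets.
        return f"SEARCH RESULTS for {query!r}: (snippets would go here)"

    transcript = ["USER: does o3 verify its citations?"]
    while True:
        step = model_step(transcript)
        if step["type"] == "tool":
            transcript.append(run_search(step["query"]))  # tool call mid-reasoning
        else:
            transcript.append("ASSISTANT: " + step["text"])
            break
    print("\n".join(transcript))

The point is that the search results land before the final answer is committed, which is what lets the model check a citation instead of inventing one.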
dedicate 3 hours ago [-]
It's not just that they get links wrong, it's how they get them wrong – like, totally fabricating them and then doubling down! A human messing up a citation is one thing, but this feels... different, almost like a creative act of deception, lol.
zone411 5 hours ago [-]
If anyone is interested in a larger sample size comparing how often LLMs confabulate answers based on provided texts, I have a benchmark at https://github.com/lechmazur/confabulations/. It's always interesting to test new models with it because the results can be unintuitive compared to those from my other benchmarks.
dr_kiszonka 4 hours ago [-]
Useful benchmark. I noticed o3-high hallucinating too often for such a good model, but it is usually great with search. In my experience, Claude Opus & Sonnet 4 consistently lie, cheat, and try to cover their tracks. Maybe they are good at writing code, but I don't trust them with other things.
eviks 3 hours ago [-]
> Why are we talking about “graduate and PhD-level intelligence” in these systems if they can’t find and verify relevant links
For exactly the same reason the author markets his tool as a research assistant
> It also models an approach that is less chatbot, and more research assistant in a way that is appropriate for student researchers, who can use it to aid research while coming to their own conclusions.
milleramp 6 hours ago [-]
Took some time to realize the SIFT toolbox mentioned in the article is not a Scale-Invariant Feature Transform toolbox.
dr_kiszonka 6 hours ago [-]
In such cases, I get better answers to questions starting with "What" and not "Did".
hereonout2 5 hours ago [-]
Prompt engineering!