Hi HN - I’m the Head of AI Research at Sword Health and one of the authors of this benchmark (posting from my personal account).
We built MindEval because existing benchmarks don’t capture real therapy dynamics or common clinical failure modes. The framework simulates multi-turn patient–clinician interactions and scores the full conversation using evaluation criteria designed with licensed clinical psychologists.
We validated both patient realism and the automated judge against human clinicians, then benchmarked 12 frontier models (including GPT-5, Claude 4.5, and Gemini 2.5). Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20). We also found that larger or reasoning-heavy models did not reliably outperform smaller ones in therapeutic quality.
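To make the setup concrete, here is a minimal sketch of the simulate-then-judge loop. All names are placeholders, not the actual MindEval API; the real prompts, criteria, and scoring logic are in the repo.

```python
import random

def chat(model: str, system_prompt: str, history: list) -> str:
    """Stand-in for a call to whichever LLM backend plays this role."""
    return f"[{model} reply after {len(history)} prior turns]"

def judge(judge_model: str, rubric: str, history: list) -> float:
    """Stand-in for the LLM judge; returns a score on the 1-6 scale."""
    return random.uniform(1, 6)  # placeholder for the parsed judge verdict

def run_episode(clinician_model: str, patient_prompt: str, rubric: str, n_turns: int = 20) -> float:
    history = []
    for _ in range(n_turns):
        history.append(("patient", chat("patient-simulator", patient_prompt, history)))
        history.append(("clinician", chat(clinician_model, "You are a therapist...", history)))
    # The full transcript is scored against clinician-designed criteria.
    return judge("judge-model", rubric, history)

print(run_episode("model-under-test", "Simulate a patient with moderate depression...", "rubric..."))
```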
We open-sourced all prompts, code, scoring logic, and human validation data because we believe clinical AI evaluation shouldn’t be proprietary.
Happy to answer technical questions on methodology, validation, known limitations, or the failure modes we observed.
embedding-shape 6 hours ago [-]
Did you use the same prompts for all the models, or individualized prompts per model? Did you try a range of prompts that were very different from each other, if you used more than a base prompt?
I'm sure it's somewhere in the details, but after a quick skim I didn't find anything outlining how you managed and used the prompts, and whether it was per model or not.
Thanks a bunch for being open to answering questions here, and thanks for trying to attack this particular problem with scientific rigor, even if it's really difficult to do so.
RicardoRei 6 hours ago [-]
The prompts are kept the same for all models; otherwise the comparison would not be super fair. In any case, you can check all the prompts in our GitHub repo.
EagnaIonat 2 hours ago [-]
Models have different nuances though. With Llama 4, for example, you have to explicitly ask it not to output its CoT, whereas with GPT you don't.
embedding-shape 5 hours ago [-]
> Otherwise the comparison would not be super fair.
Wouldn't that be easy to make fair by making sure all models tried it with the same prompts? So you have model X and Y, and prompts A and B, and X runs once with A, once with B, and same for Y.
The reason I ask is that in my own local benchmarks, which I run for each model release with my own tasks, I've noticed a huge variance in quality of responses based on the prompts themselves. Slight variations in wording seem to have a big effect on the final responses, and those variations seem, in turn, to have a very different effect depending on the model.
Sometimes a huge system prompt makes a model return much higher quality responses, while another model gives much higher quality responses when the system prompt is as small as it possibly can be. At least this is what I'm seeing with the local models I'm putting under test with my private benchmarks.
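For what it's worth, the cross-design I'm suggesting is just a model × prompt grid. A toy sketch, with a stand-in `evaluate` function rather than anyone's real harness:

```python
import random
from itertools import product
from statistics import mean

def evaluate(model: str, system_prompt: str) -> float:
    """Stand-in for one full benchmark run; returns a mean 1-6 score."""
    return random.uniform(1, 6)

models = ["model-x", "model-y"]
prompts = {"A": "terse system prompt ...", "B": "long, detailed system prompt ..."}

results = {(m, name): evaluate(m, text)
           for m, (name, text) in product(models, prompts.items())}

# Per-model average over prompts, plus the spread that exposes prompt sensitivity.
for m in models:
    scores = [results[(m, name)] for name in prompts]
    print(m, f"mean={mean(scores):.2f}", f"spread={max(scores) - min(scores):.2f}")
```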
irthomasthomas 3 hours ago [-]
Did you re-test the past models with the new prompt you found? How many times did you run each prompt? Did you use the same rubric to score each experiment?
jbgt 5 hours ago [-]
Have you seen the Feeling Great app? It's not an official therapy app, but it's based on TEAM-CBT and made by David Burns and his team.
Burns is really into data gathering, and his app is LLM-based, running on the rails of the TEAM process; it seems to be very well received.
I found it simple and very well done - and quite effective.
A top level comment says that therapists aren't good either. Burns would argue that this is mainly because no one tests before and after, so the effect is never measured.
And of the people I know who see a therapist, practically none can tell me exactly what they are doing, what methods they are using, or how anything is structured.
taurath 3 hours ago [-]
If CBT performed as well as David Burns suggests, we’d really have no need for therapists. Alas, it turns out that cognitive problems aren’t a factor in a lot of mental health. I state this as someone who’s read all the literature and spent 8 years floundering in CBT oriented therapy without much changing but the practitioner. It’s not a cure-all or even a cure-most, but it’s treated as such because it has properties that match well to medical insurance billing practices.
> And of the people I know who see a therapist, practically none can tell me exactly what they are doing, what methods they are using, or how anything is structured.
I could tell you that as a client, but that’s because I’ve read into it. This is sort of like asking an ER patient to describe the shift management system of the clinic they went into.
kayodelycaon 2 hours ago [-]
This has been my experience. When it comes down to it, CBT is just a more effective version of “try harder”.
What’s really aggravating is that CBT was never designed to be a general, cure-all therapy, and I think the people behind it know this. But try explaining nuance to a public that doesn’t want to hear it.
vessenes 2 hours ago [-]
Thanks for open sourcing this.
I'm skeptical of the value of this benchmark, and I'm curious for your thoughts - self play / reinforcement tasks can be useful in a variety of arenas, but I'm not a priori convinced they are useful when the intent is to help humans in situations where theories of mind matter.
That is, we're using the same underlying model(s) to simulate both a patient and a judgment as to how patient-like that patient is -- this seems like an area where I'd really want to feel confident that my judge LLM is accurate; otherwise the training data I'm generating is at risk of converging on a theory of mind / patients that's completely untethered from, you know, patients.
Any thoughts on this? I feel like we want a human in the loop somewhere here, probably scoring the judge LLM's determinations until we feel that the judge LLM is at human level or superhuman. Until then, this risks building up a self-consistent, but ultimately just totally wrong, set of data that will be used in future RL tasks.
megaman821 6 hours ago [-]
Did the real clinicians get all 6's in this test?
crazygringo 6 hours ago [-]
Right, this result seems meaningless without a human clinician control.
I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Ideally this should be done blind (I don't know if BetterHelp allows for therapy through a text chat interface?), where the therapist has no idea it's for a study and so isn't trying to "do better" than they would for any average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
RicardoRei 5 hours ago [-]
The results are not meaningless, but they are not comparing humans against LLMs. The goal is to have something that can be used to test LLMs in a realistic mental health support setting.
The main points of our methodology are:
1) Prove that it is possible to simulate patients with an LLM, which we did.
2) Prove that an LLM-as-a-Judge can effectively score conversations along several dimensions similar to how clinicians themselves are evaluated, which we also did; we show that the average correlation with human evaluators is medium-high.
Given 1) and 2), we can then benchmark LLMs, and as you can see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) but that's another study.
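For intuition, the check behind point 2) is essentially a per-dimension rank correlation between judge scores and clinician scores. A rough sketch with made-up dimension names and random placeholder data, not our actual rubric or annotations:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
dimensions = ["alliance", "empathy", "technique", "safety"]  # illustrative names only
judge_scores = {d: rng.integers(1, 7, size=50) for d in dimensions}      # placeholder data
clinician_scores = {d: rng.integers(1, 7, size=50) for d in dimensions}  # placeholder data

# Per-dimension rank correlation between judge and clinician scores.
for d in dimensions:
    rho, p = spearmanr(judge_scores[d], clinician_scores[d])
    print(f"{d}: Spearman rho = {rho:.2f} (p = {p:.3f})")
```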
crazygringo 5 hours ago [-]
Got it, thank you.
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
RicardoRei 5 hours ago [-]
Again, these things don't depend on each other.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
In the same way, we can say that LLMs still have room for improvement on a specific task (let's say mathematics) even though the average human is also bad at mathematics...
We don't make any claims about human therapists. Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step toward improving them.
crazygringo 5 hours ago [-]
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
palmotea 4 hours ago [-]
> Right, this result seems meaningless without a human clinician control.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower with length, it's reasonable to expect it would be worse than a human (who typically doesn't have the characteristic of becoming increasingly unhinged the longer you talk to them).
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
crazygringo 3 hours ago [-]
> Does it really matter?
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr high-quality therapist nearby who is available and whom they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
JoblessWonder 3 hours ago [-]
I love how the top comment on that Reddit post is an *affiliate link* to an online therapy provider.
nradov 5 hours ago [-]
Yes, text chat is one of the communication options for BetterHelp (and some of their competitors).
RicardoRei 6 hours ago [-]
This is a good point. We have not tested the clinicians, but I believe they would not score each other perfectly either: we observed some disagreement between the scores, which reflects differing opinions among clinicians.
megaman821 5 hours ago [-]
It is nice to have an accurate measure of things, and a human baseline would additionally be helpful.
Many things can be useful before they reach the level of world's best. Although with AI, non-intuitive failure modes must be taken into consideration too.
PoisedProto 5 hours ago [-]
Several people have killed themselves because of AI chatbots encouraging it or becoming personal echo chambers. Why? Why are we doing this!?
Playing devil's advocate, many people die using all kinds of tools. It doesn't make the tools any less useful for people who use them responsibly.
That said, the idea that a pattern recognition and generation tool can be used for helping people with emotional problems is deeply unsettling and dangerous. This technology needs to be strictly regulated yesterday.
sharkweek 6 hours ago [-]
Full disclosure: after leaving tech, I’m back in grad school to get my LMHC so I’m obviously biased.
First, I just don’t see a world where therapy can be replaced by LLMs, at least in the realistic future. I think humans have been social creatures since the dawn of our species, and for these most intimate conversations, people are going to want to have them with an actual human. One of my mentors has talked about how, after years of virtual sessions dominating, the demand for in-person sessions is spiking back up. The power of being in the same physical room with someone who is offering a nonjudgmental space to exist isn’t going to be replaced.
That being said, given the shortage of licensed mental health counselors, and the prohibitive cost especially for many who need a therapist most, I truly hope LLMs develop to offer an accessible and cheap alternative that can at least offer some relief. It does have the potential to save lives and I fully support ethically-focused progress toward developing that sort of option.
CodingJeebus 2 hours ago [-]
> I think humans have been social creatures since the dawn of our species, and for these most intimate conversations, people are going to want to have them with an actual human. One of my mentors has talked about how, after years of virtual sessions dominating, the demand for in-person sessions is spiking back up.
Agreed. I used to frequent a coworking space in my area that eventually went fully automated and got rid of their daytime front desk folks. I stopped going shortly thereafter because one of the highlights of my day was catching up with them. Instead of paying $300/mo to go sit in a nice office, I could just use that money to renovate my home office.
A business trying to cultivate community loses the plot when they rely completely on automation.
rshanreddy 2 hours ago [-]
This is a 1250 word judging prompt - likely AI generated
Along with 10 scored conversation samples - all also AI generated
No verification in the field, no real data
In other words, AI scores on AI conversations - disguised as a means of gauging clinical competence / quality?
This is not an eval - this is a one-shotted product spec!
zeroonetwothree 5 hours ago [-]
Human therapists are often quite bad as well. It took me around 12 before I found a decent one. Not saying that LLMs are better but they do theoretically have more uniform quality.
RicardoRei 5 hours ago [-]
Exactly. We don't make claims about humans, but there is room for improvement in current LLMs... For researchers to be able to improve LLMs, we first need to know how to evaluate them. We can only improve what we can measure, so we studied how to measure them :)
nyeah 5 hours ago [-]
Sure, uniformly zero as far as anyone knows.
everdrive 6 hours ago [-]
I heard a story on NPR the other day, and the attitude seems to be that it's totally inevitable that LLMs _will_ be providing mental health care, so our task must be to apply the right guardrails.
I'm not even sure what to say. It's self-evidently a terrible idea, but we all just seem to be charging full-steam ahead like so many awful ideas in the past couple of decades.
dimal 3 hours ago [-]
> It's self-evidently a terrible idea
Maybe you’re comparing it to some idealized view of what human therapy is like? There’s no benchmark for it, but humans struggle in real mental health care. They make terrible mistakes all the time. And human therapy doesn’t scale to the level needed. Millions of people simply go without help. And therapy is generally one hour a week. You’re supposed to sort out your entire life in that window? Impossible. It sets people up for failure.
So, if we had some perfect system for getting every person that needs help the exact therapist they need, meeting as often as they need, then maybe AI therapy would be a bad idea, but that’s not what we have, and we never will.
Personally, I think the best way to scale mental healthcare is through group therapy and communities. Having a community of people all coming together over common issues has always been far more helpful than one on one therapy for me. But getting some assistance from an AI therapist on off hours can also be useful.
hibikir 6 hours ago [-]
Forget about calling it mental healthcare or not: Most people end up dealing with people in significant distress at one point or another. Many do it all the time even when they aren't trained or getting paid as mental health professionals, just because of circumstances. You don't need a clinical setting for someone to tell you that they have suicidal ideation, or to be stuck interacting with someone in a crisis situation. We don't train every adult in this, but the more you have to do it, the more you have to learn some tools for at least doing little harm.
We can see an LLM as someone that talks with more people, for more time, than anyone on earth talks in their lifetime. So they are due to be in constant contact with people in mental distress. At that point, you might as well consider the importance of giving them the skills of a mental health professional, because they are going to be facing more of this than a priest in a confessional. And this is true whether someone says "Gemini, pretend that you are a psychologist" or not. You or I don't need a prompt to know we need to notice when someone is in a severe psychotic episode: some level of mental health awareness is built in, if just to protect ourselves. So an LLM needs quite a bit of this by default to avoid being really harmful. And once you give it that, you might as well evaluate it against professionals: not because it must be as good, but because it'd be really nice if it was, even when it's not trying to act as one.
glial 5 hours ago [-]
Agree with you.
I heard someone say that LLMs don't need to be as good as an expert to be useful, they just need to be better than your best available expert. A lot of people don't have access to mental health care, and will ask their chatbot to act like a psychologist.
jfindper 5 hours ago [-]
>[...] LLMs don't need to be as good as an expert to be useful, they just need to be better than your best available expert.
This mostly makes sense.
The problem is that people will take what you've said to mean "If I have no access to a therapist, at least I can access an LLM", with a default assumption that something is better than nothing. But this quickly breaks down when the sycophantic LLM encourages you to commit suicide, or reinforces your emerging psychosis, etc. Speaking to nobody is better than speaking to something that is actively harmful.
glial 5 hours ago [-]
All very true. This is why I think the concern about harm reduction and alignment is very important, despite people on HN commonly scoffing about LLM "safety".
wyre 3 hours ago [-]
Is that not the goal of the project we are commenting under? To create an evaluation framework for LLMs so they aren't encouraging suicide, psychosis, or being actively harmful.
jfindper 3 hours ago [-]
Sure, yeah. I'm responding to the comment that I directly replied to, though.
I've heard people say the same thing ("LLMs don't need to be as good as an expert to be useful, they just need to be better than your best available expert"), and I also know that some people assume that LLMs are, by default, better than nothing. Hence my comment.
cobbal 6 hours ago [-]
The mental health version of "AI is here to stay, like it or not you have to use it" that some people keep trying to tell me in software.
jennyholzer 6 hours ago [-]
[flagged]
nradov 5 hours ago [-]
Let's please not trivialize rape. This is hardly the same thing.
pseudalopex 5 hours ago [-]
Some rape victims detest the comparison. Some make the comparison. I would agree people who have not been raped should avoid the comparison. But I would not assume someone who made the comparison had no standing.
greenavocado 3 hours ago [-]
127. A technological advance that appears not to threaten freedom often turns out to threaten it very seriously later on. For example, consider motorized transport. A walking man formerly could go where he pleased, go at his own pace without observing any traffic regulations, and was independent of technological support-systems. When motor vehicles were introduced they appeared to increase man’s freedom. They took no freedom away from the walking man, no one had to have an automobile if he didn’t want one, and anyone who did choose to buy an automobile could travel much faster and farther than a walking man. But the introduction of motorized transport soon changed society in such a way as to restrict greatly man’s freedom of locomotion. When automobiles became numerous, it became necessary to regulate their use extensively. In a car, especially in densely populated areas, one cannot just go where one likes at one’s own pace one’s movement is governed by the flow of traffic and by various traffic laws. One is tied down by various obligations: license requirements, driver test, renewing registration, insurance, maintenance required for safety, monthly payments on purchase price. Moreover, the use of motorized transport is no longer optional. Since the introduction of motorized transport the arrangement of our cities has changed in such a way that the majority of people no longer live within walking distance of their place of employment, shopping areas and recreational opportunities, so that they HAVE TO depend on the automobile for transportation. Or else they must use public transportation, in which case they have even less control over their own movement than when driving a car. Even the walker’s freedom is now greatly restricted. In the city he continually has to stop to wait for traffic lights that are designed mainly to serve auto traffic. In the country, motor traffic makes it dangerous and unpleasant to walk along the highway. (Note this important point that we have just illustrated with the case of motorized transport: When a new item of technology is introduced as an option that an individual can accept or not as he chooses, it does not necessarily REMAIN optional. In many cases the new technology changes society in such a way that people eventually find themselves FORCED to use it.)
128. While technological progress AS A WHOLE continually narrows our sphere of freedom, each new technical advance CONSIDERED BY ITSELF appears to be desirable. Electricity, indoor plumbing, rapid long-distance communications ... how could one argue against any of these things, or against any other of the innumerable technical advances that have made modern society? It would have been absurd to resist the introduction of the telephone, for example. It offered many advantages and no disadvantages. Yet, as we explained in paragraphs 59-76, all these technical advances taken together have created a world in which the average man’s fate is no longer in his own hands or in the hands of his neighbors and friends, but in those of politicians, corporation executives and remote, anonymous technicians and bureaucrats whom he as an individual has no power to influence. [21] The same process will continue in the future. Take genetic engineering, for example. Few people will resist the introduction of a genetic technique that eliminates a hereditary disease. It does no apparent harm and prevents much suffering. Yet a large number of genetic improvements taken together will make the human being into an engineered product rather than a free creation of chance (or of God, or whatever, depending on your religious beliefs).
129. Another reason why technology is such a powerful social force is that, within the context of a given society, technological progress marches in only one direction; it can never be reversed. Once a technical innovation has been introduced, people usually become dependent on it, so that they can never again do without it, unless it is replaced by some still more advanced innovation. Not only do people become dependent as individuals on a new item of technology, but, even more, the system as a whole becomes dependent on it. (Imagine what would happen to the system today if computers, for example, were eliminated.) Thus the system can move in only one direction, toward greater technologization. Technology repeatedly forces freedom to take a step back, but technology can never take a step back—short of the overthrow of the whole technological system.
130. Technology advances with great rapidity and threatens freedom at many different points at the same time (crowding, rules and regulations, increasing dependence of individuals on large organizations, propaganda and other psychological techniques, genetic engineering, invasion of privacy through surveillance devices and computers, etc.). To hold back any ONE of the threats to freedom would require a long and difficult social struggle. Those who want to protect freedom are overwhelmed by the sheer number of new attacks and the rapidity with which they develop, hence they become apathetic and no longer resist. To fight each of the threats separately would be futile. Success can be hoped for only by fighting the technological system as a whole; but that is revolution, not reform.
Do you have some better alternatives for a country where private mental health care costs €150/hr, while the government/insurance paid care have 3-6M+ waiting lists?
everdrive 5 hours ago [-]
Well on the one hand, an obviously terrible solution is not inherently better than doing nothing. ie, LLM mental healthcare could be _worse_ than just letting the current access times climb.
My other stance, which I suspect is probably more controversial, is that I'm not convinced that mental health care is nearly as effective as people think. In general, mental health outcomes for teens are getting markedly worse, and it's not for lack of access. We have more mental health access than we've had previously -- it just doesn't feel like it because the demand has risen even more sharply.
On a personal level, I've been quite depressed lately, and also feeling quite isolated. As part of an attempt to get out of my own shell I mentioned this to a friend. Now, my friend is totally well-intended, and I don't begrudge him whatsoever. But, the first response out of his mouth was whether I'd sought professional mental health care. His response really hurt. I need meaningful social connection. I don't need a licensed professional to charge me money to talk about my childhood. I think a lot of people are lost and lonely, and for many people mental health care is a band-aid over a real crisis of isolation and despair.
I'm not recommending against people seeking mental health care, of course. And, despite my claims there are many people who truly need it, and truly benefit from it. But I don't think it's the unalloyed good that many people seem to believe it to be.
wyre 2 hours ago [-]
>I think a lot of people are lost and lonely, and for many people mental health care is a band-aid over a real crisis of isolation and despair.
Professional mental health care cannot scale to the population that needs it. The best option, like you mention, is talking to friends about our feelings and problems. I think there has been an erosion (or it never existed) of these social mental health mechanisms. A learned helplessness has developed; people have lost their capacity to just be with someone who is hurting. There needs to be a framework for providing mental health therapy to loved ones that can exist without licensed professionals; otherwise LLMs are the only scalable option for people to talk about their issues and work on finding solutions.
This might be controversial, but mental health care is largely a bandaid when the causes of people's declining mental health are factors far outside the individual's control: loneliness epidemics, declining optimism about the future, climate change, the rise of global fascism, online dating, the addictiveness of social media and the war on our attention, etc.
delfinom 3 hours ago [-]
>My other stance, which I suspect is probably more controversial, is that I'm not convinced that mental health care is nearly as effective as people think. In general, mental health outcomes for teens are getting markedly worse, and it's not for lack of access. We have more mental health access than we've had previously -- it just doesn't feel like it because the demand has risen even more sharply.
There's also the elephant in the room that mental healthcare, in particular for teens will probably just be compensating for the disease that is social media addiction. Australia has the right idea, banning social media for all goods.
threetonesun 5 hours ago [-]
I was watching "A Charlie Brown Christmas" the other day, and Lucy (who has a running gag in Peanuts of being a terrible, or at least questionable, psychologist) tells Charlie Brown that to get over his seasonal depression he should get involved in a Christmas project, and suggests he be the director of their play.
Which is to say, your stance might not be as controversial as you think, since it was the adult take in a children's cartoon almost 60 years ago.
staticman2 3 hours ago [-]
Your Peanuts reference made me smile but I don't see why you thought a little girl's comment in a 1960s Christmas special was supposed to represent the "adult take" on mental health in the 1960s.
Lucy isn't actually a psychologist which is part of the reason the "gag" is funny.
crazygringo 4 hours ago [-]
The mental health field has evolved a lot, beyond what a cartoon depicted six decades ago.
Peanuts is funny, but it may not be the source of wisdom you think it is.
paddleon 2 hours ago [-]
yes, change the social structure in the country so that this glaring social need is provided for.
BoredPositron 6 hours ago [-]
I wonder why people use LLMs as a mental health provider replacement.
jfindper 6 hours ago [-]
As already mentioned: availability, convenience, and cost are huge.
It's also less pressure, a more comfortable environment (home vs. stranger's office), no commitments to a next session, and less embarrassing (sharing your personal issues to a computer via text is less anxiety-inducing than saying them to a person's face).
With that all said, I'm strongly opposed to people using LLMs as therapists.
apercu 6 hours ago [-]
Sharing your data (personal issues) with Sam Altman and Mark Zuckerberg instead.
embedding-shape 5 hours ago [-]
"Sharing my data" isn't something that cross the average person's mind unless there is a checkbox they annoyingly have to check in order to check in to some European hotel which is asking you for permission to process their data, for better or worse.
In their mind, most of the time, if there is no one standing behind them when they chat with an LLM, then the conversation is, for most intents and purposes, private.
Obviously, those of us who were born with a keyboard in front of our hands know this not to be true, and know we're being tracked constantly, with our data sold to the highest bidder. But the typical person has more or less zero concern about this, which is why it's not a priority issue to be solved.
jfindper 6 hours ago [-]
The average person using an LLM probably doesn't even know who Sam Altman is, nor care.
creata 3 hours ago [-]
You can self-host.
ishouldbework 6 hours ago [-]
Finding a licensed therapist, especially one covered by health insurance, who takes new patients, can be a challenge in some areas. So while it obviously is a bad idea, I can hardly blame people in a bad place looking for at least some help.
jmount 6 hours ago [-]
I think that is the problem. LLMs for mental health are going to be very bad, but for most people that is all that will be available.
chrisweekly 6 hours ago [-]
Availability.
There must be many other reasons, but IMHO that has to be the biggest factor. Being able to just start a session, in the moment, when you feel like it, is a fundamental difference.
embedding-shape 6 hours ago [-]
Counter-intuitively, I think the fact that it's not a human seems to have a non-negligible effect too. It's a computer program you can share whatever with, and it'll never judge you, because it cannot. It reads exactly what you write and assumes you're faithfully answering, then provides a reply based on that.
I haven't been so unlucky myself, but I know many who've had terrible first experiences with therapists and psychologists, making me wonder why those people are even in the job they're in. Some of them got so turned off they stopped trying to find anyone else to help them, because they think most mental health professionals would be the same as the first person they sought help from.
nuancebydefault 2 hours ago [-]
My experience is that they (at least Copilot) are at least on par with, if not better than, self-help books. I assume they will get better over time.
Just my few cents
paddleon 2 hours ago [-]
I'll accept your measurement as accurate.
However, unless we have a measure of how helpful self-help books actually are, we still don't know if they help or not.
nuancebydefault 1 hours ago [-]
I believe that 'help or not' is a question impossible to answer, objectively at least. My subjective answer is that psychological help in general is effective, whether it is provided orally, via books, or, as recently, via automation.
lblissett 5 hours ago [-]
People use opiates as a replacement for mental health providers for similar reasons. While I’m all for harm reduction, it doesn’t mean we should view it as inevitable.
RationPhantoms 5 hours ago [-]
I'd never advocate for it as actual mental health counseling, but for someone who just sometimes needs a sounding board for their own thoughts, it can be useful.
rbancroft 6 hours ago [-]
Convenience and cost seem like big advantages.
Glyptodon 6 hours ago [-]
Speculating, but maybe because their innate tendency to be servile and engaging goes over better than actual therapy?
nyeah 5 hours ago [-]
Snake oil. Very old idea. Very popular when sick people have poor information.
SoftTalker 5 hours ago [-]
See also, all the ads for prescription medication on TV. Maybe it's just the programs I watch but it really seems like this has become the predominant advertising. Every break has an ad (or several) urging me to "ask my doctor about..."
Should be banned. Average people have no basis to know whether drug X is appropriate for them. If your doctor thinks you need it, he'll tell you. These ads also perpetuate the harmful idea that there's a pill for everything.
yesitcan 6 hours ago [-]
Therapists are dumb meatbags like us. Also often trained incorrectly, biased etc.
blitzar 6 hours ago [-]
and how did that make you feel
daveguy 6 hours ago [-]
So you're saying even dumber wordsacks, completely untrained in mental health, and biased are somehow better?
karanbhangui 6 hours ago [-]
Access
PaulHoule 5 hours ago [-]
I'll argue the opposite.
(1) The demand for mental health services is an order of magnitude larger than the supply, but the demand we see is only a fraction of the demand that exists, because a lot of people, especially men, aren't believers in the "therapeutic culture".
In the days of Freud you could get a few hours of intensive therapy a week but today you're lucky to get an hour a week. An AI therapist can be with you constantly.
(2) I believe psychodiagnosis based on text analysis could greatly outperform mainstream methods. Give an AI someone's social media feed and I think depression, mania, schizo-* spectrum, disordered narcissism and many other states and traits will be immediately visible.
(3) Despite the CBT revolution and various attempts to intensify CBT, a large part of the effectiveness of therapy comes from the patient feeling mirrored by the therapist [1], and the LLM can accomplish this; in fact, this could be accomplished by the old ELIZA program.
(4) The self of the therapist can be both an obstacle and an instrument to progress; see [2]. On one level the reactions that a therapist feels are useful, but they also get in the way of the therapist providing perfect mirroring [3] and letting optimal frustration unfold in the patient instead of providing "corrective emotional experiences." I'm going to argue that the AI therapist can be trained to "perceive" the things a human therapist perceives but that it does not have its own reactions that will make the patient feel judged and get in the way of that unfolding.
It’s a trivial claim that people are going to use AI as a therapist. No grumbling is going to stop that.
So it’s sensible that someone out there is evaluating its competence and thinking about a better alternative for these folks than yoloing their worst thoughts into chatgpt.com’s default LLM.
Everyone's hand is being forced by the major AI providers existing.
Even if you were a perfect altruist with a crusade against the idea of people using LLMs for mental health, you could still be forced to dash towards figuring out how to build LLM tools for mental health in your consideration for others.
nyeah 6 hours ago [-]
Sure. It's also a trivial claim that people will take megadoses of rhubarb to cure cancer.
The age-old problem is how to prevent that disaster and save those lives. That's not trivial. Creating an Oncological Rhubarb Authority could easily make the problem much worse, not better.
hombre_fatal 5 hours ago [-]
Agreed, the solution is not trivial, so it's good that people are thinking about it.
If you try to merely stop people from using LLMs as therapists (could you elaborate on what that looks like?) and call it a day, your consideration isn't extending to all the people who will do it anyways.
That's what I mean by forcing your hand into doing the work of figuring out how to make LLM therapists work even if you were vehemently against the idea.
nyeah 4 hours ago [-]
Everyone deserves clear advice not to use rhubarb as a cancer treatment, and not to use LLMs for mental health care. It's very easy to provide that advice.
I think you're assuming my proposed solution is to take rhubarb away from people? It's not.
Maybe you want to found the "Oncological Rhubarb Validation Association" or something. If so, that has nothing to do with me? That's just snake oil marketing. Not my field.
turnsout 3 hours ago [-]
It's not inevitable that LLMs will be providing mental health care; it's already happening.
Terrible idea or not, it's probably helpful to think of LLMs not as "AI mental healthcare" but rather as another form of potentially bad advice. From a therapeutic perspective, Claude is not all that different from the patient having a friend who is sometimes counterproductive. Or the patient reading a self-help book that doesn't align with your therapeutic perspective.
scotty79 5 hours ago [-]
The most likely future is that babies born right now will, during their upbringing, hear and see more words produced by LLMs than by humans.
We really need to get the psychology right with LLMs.
jennyholzer 5 hours ago [-]
[flagged]
renewiltord 4 hours ago [-]
People constantly amazed that a machine can outperform a 24 year old charging $250/hour. Especially when the 24 year old seems incapable of calculating compound interest on their student loan deferrals. Surely this 24 year old, who cannot use a formula a 14 year old can, will have wisdom to share. Iona Potapov talks to horse, modern man talks to machine, man with more money than sense talks to young graduate with no life experience about his struggles. All do equally well: 4 on LLM benchmark for mental health.
cluckindan 6 hours ago [-]
”can we trust this model to provide safe, effective therapeutic care?”
You trust humans to do it. Trust has little to do with what actually happens.
rk06 6 hours ago [-]
Humans can be sued. What about AI? Or even commercial software?
lkbm 6 hours ago [-]
Yes, of course the company offering the AI can be sued. The reason corporations became legal people in the first place was specifically so we could sue them.
grim_io 6 hours ago [-]
That doesn't sound right.
Not everywhere in the world do companies count as people, yet they can still be sued.
I'd wager the companies lobbied for this to gain extra rights.
lkbm 5 hours ago [-]
Laws are different in different countries, and I can't speak to all of them, but in the US the law said you could only sue people, and the courts realized this was a problem. Rather than saying "I guess corporations are exempt from all liability until Congress gets around to fixing it", they came up with a weird workaround that lives with us to this day.
EDIT: Note that I am not a lawyer, nor a historian. The specifics of how this came about are best learned from a more authoritative source.
crazygringo 5 hours ago [-]
> Not everywhere in the world do companies count as people
Actually yes, everywhere in the world. That has a functioning legal system, at least.
If companies weren't treated as legal persons, they wouldn't be able to enter into contracts.
But also, just to be clear, a legal person, like a corporation, is not a natural person. Unlike a natural person, they can't vote. There isn't anywhere in the world that considers corporations to be natural persons.
delfinom 2 hours ago [-]
>Yes, of course AI offered by a company can be sued.
In theoretical sense sure.
In a practical sense? They are invulnerable due to the extreme financial obstacles they can put in place. They can drag a court case out until you fold, if you haven't found a lawyer willing to take it on contingency.
The architecture and evaluation approach seem broadly similar.
emsign 6 hours ago [-]
Statistics can never replace human empathy.
Y_Y 6 hours ago [-]
Why?
elmomle 5 hours ago [-]
To paraphrase Harville Hendrix, we are wounded in relationship and we heal in relationship. Compassion is a feeling, not a thought. I don't think a galaxy of LLMs would ever discover the precepts of Buddhism (or psychology) on their own.
nradov 6 hours ago [-]
Is that really true? Dr. Patric Gagne is a diagnosed sociopath and also a successful clinical psychologist. She claims her lack of empathy for patients allows her to help them solve their problems in an objective manner. I don't have any personal experience in that area and don't know if she's correct but it seems plausible.
> Dr. Patric Gagne is a diagnosed sociopath and also a successful clinical psychologist.
She is not a psychologist now according to her page. And how do you know she was successful? I skimmed a few articles about her and saw no attempts to verify her claims. She offered no evidence when asked why people should believe her.
nradov 6 hours ago [-]
How do you know that empathetic psychologists were successful? We don't have good data on how important that is for patient outcomes, so we should be skeptical of claims that empathy is essential.
pseudalopex 5 hours ago [-]
Asking questions about claims I did not make is not an answer.
nradov 5 hours ago [-]
Your question was poorly posed. If you want answers then ask better questions.
pseudalopex 5 hours ago [-]
You made a claim. I asked evidence. Few questions are simpler.
sschueller 6 hours ago [-]
That is exactly what a sociopath would say, isn't it? Without an actual study by a trusted institute, I would highly question such claims.
nradov 5 hours ago [-]
Sure. And I would also question claims that empathy is necessary for good patient outcomes. We don't have solid data on that either way.
devmor 6 hours ago [-]
An association algorithm can also never replace free thought.
LudwigNagasena 6 hours ago [-]
It doesn't show that they "struggle". It shows that they don't behave according to modern standards. I wouldn't put much weight in an industry without a sensible scientific basis, one that classified homosexuality as a disease not so long ago. The external validity of the study is dubious; let's see comparisons to no therapy, alternative therapy, and standard therapy, and then compare success rates.
heddycrow 4 hours ago [-]
Is anyone "zoom" on this and "doom" on AI++ with other professions and/or their audience?
Seems to me that benchmarking a thing has an interesting relationship with acceptance of the thing.
I'm interested to see human thoughts on either of these.
hoodsen 6 hours ago [-]
Do you have plans to improve the quality of the LLM as judge, in order to achieve better parity with human clinician annotators? For example, fine-tuning models?
Thinking that the comparative clinician judgements themselves would make useful fine-tuning material.
RicardoRei 5 hours ago [-]
Yep yep. It's something we have to study, and it's likely we can improve the LLM-as-a-Judge further.
Same thing for the patient LLM. We can probably fine-tune an LLM to do a better job at simulating patients.
Those two components of our framework have room for improvement.
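As a rough illustration of the fine-tuning idea raised above, clinician-annotated conversations could be reshaped into supervised examples for the judge. The field names and prompt here are hypothetical, not our actual fine-tuning format:

```python
import json

def to_sft_example(transcript: str, clinician_scores: dict) -> dict:
    """One chat-style supervised example: transcript in, clinician scores out."""
    return {
        "messages": [
            {"role": "system",
             "content": "Score this therapy conversation on each dimension, 1-6."},
            {"role": "user", "content": transcript},
            {"role": "assistant", "content": json.dumps(clinician_scores)},
        ]
    }

example = to_sft_example("PATIENT: ...\nCLINICIAN: ...",
                         {"alliance": 4, "empathy": 3, "safety": 5})
print(json.dumps(example, indent=2))
```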
toomuchtodo 6 hours ago [-]
RicardoRei: How would you like this cited when presented to policy makers? Anything besides the URL?
Edit: Thank you!
RicardoRei 5 hours ago [-]
Cite the ArXiv paper for now.
binary132 5 hours ago [-]
This doesn't even deserve to be acknowledged, it should be so obvious; yet here we are.
auspiv 6 hours ago [-]
And real therapists are good right?
Dilettante_ 6 hours ago [-]
This is the real question. Not to rehash the self-driving cars arguments that have been had to death, but with potential LLM mental healthcare the question "but what if it causes harm in some interactions" is asked much, much more than with human mental healthcare professionals.
(And I'm not being theoretical here, I have quite a bit of experience getting incredibly inadequate mental health care.)
chemotaxis 6 hours ago [-]
I've known quite a few people who went to therapy and I'm not sure that's even the right question to ask. I don't think they were paying to get helped as much as they were just paying to have someone to talk to. To be clear, there are people who genuinely need help, but for most, a therapist is probably just a substitute for a close friend / life coach.
And say what you will about this, a paid professional is, at the very least, unlikely to let you wind yourself up or go down weird rabbit holes... something that LLMs seem to excel at.
knollimar 6 hours ago [-]
It's tiresome to vent a lot to a close friend and get life advice if your problems are big enough and require gradual work.
It's better not to degrade the close friend, and "life coach focused on healthy self awareness" is probably indistinguishable from most good therapy.
jodrellblank 5 hours ago [-]
As I sometimes repeat on HN, Dr David Burns started giving his patients a survey at the start and end of every session, to rate how he was doing as the therapist and to rate their feelings, on a scale of 1-5.
Reasoning that if he's not good it would show up in patients thinking he's bad, and not feeling any better. And then he could tune his therapy approaches towards the ones which make people feel better and rate him as more understanding and listening and caring. And he criticises therapists who won't do that, therapists who say patients have been seeing them for years with only incremental improvements or no improvements.
Yes there's no objective way to measure how angry or suicidal or anxious someone is and compare two people, but if someone is subjectively reporting 5/5 sadness about X at the start of a session and wants help with X, then at some point in the future they should be reporting that number going down or they aren't being helped. And more effective help could mean that it goes down to 1/5 in three sessions instead of down to 4/5 in three years, and that's a feedback loop which (he says) has got him to be able to help people in a single two-hour therapy session, where most therapists and insurance companies will only do a too-short session with no feedback loop.
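To make the mechanism concrete, the loop is just pre/post self-report deltas tracked across sessions. A minimal sketch, with illustrative field names and scales rather than Burns's actual survey items:

```python
from dataclasses import dataclass

@dataclass
class SessionRating:
    pre_distress: int       # 1-5, reported at the start of the session
    post_distress: int      # 1-5, reported at the end of the session
    therapist_rating: int   # 1-5, how understood and cared for the patient felt

def within_session_change(r: SessionRating) -> int:
    """Positive values mean the patient reports feeling better by the end."""
    return r.pre_distress - r.post_distress

history = [SessionRating(5, 4, 3), SessionRating(4, 2, 5)]
print([within_session_change(r) for r in history])  # prints [1, 2]
```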
SideburnsOfDoom 4 hours ago [-]
> Reasoning that if he's not good it would show up in patients thinking he's bad, and not feeling any better.
This is like a questionnaire on how much stronger you feel after working out at a gym: you often don't, you feel tired.
Both gym and talking therapy (when done correctly) will push you slightly out of your comfort zone, and aim to let you safely deal with moderate amounts of something that you find really hard. So as to expand your capabilities.
"I feel good" immediately after is utterly the wrong metric.
Being more capable / feeling better some time later is the more reliable indicator, like progress at a gym.
And also this is why an agreeable statistical word generator LLM is not the correct tool for the job.
airstrike 6 hours ago [-]
I hear you, but I feel it's also important to differentiate between the kinds in which humans and LLMs can be quote-unquote "bad".
"Good" is too broad and subjective to be a useful metric.
ffuxlpff 6 hours ago [-]
It is still debated whether therapies even work. The evidence is moving in the direction that they don't.
jfindper 6 hours ago [-]
Okay, I'll bite. What evidence is there that therapy, in general, doesn't work?
jodrellblank 5 hours ago [-]
It's not the past anymore, we don't need to debate, we can watch and listen to actual recordings of therapy sessions and the patients going from feeling variously bad to better. Here's Dr David Burns channel with a 4hr video of a session with a woman who is obsessively anxious about her college-age daughter's safety: https://www.youtube.com/watch?v=on2N5DsKHRk
Here's a 2.5 hour session (split into several videos) with a doctor who has a bad relationship with his son and felt like a failure for it:
Here's a couple of hour session with Marilyn who was diagnosed with lung cancer and spiraling with depression, anxiety, shame, loneliness, hopelessness, demoralization, and anger, despite her successful career:
It's like saying "it is still debated if debugging even works" as if all languages, all debuggers, all programmers, all systems, are the same and if you can find lots of people who can't debug then "debugging doesn't work". But no, you only need a few examples of "therapy working" to believe that it works, and see the whole session to see that it isn't just luck or just the relief of talking, but is a skill and a technique and a debugging of the mind.
tredre3 3 hours ago [-]
I agree with you that outright claims of "therapy doesn't work" should be backed by evidence.
But a patient feeling better at the end of a single therapy session doesn't prove that therapy works either...
- Does the patient feel better between sessions too? Will they keep feeling better after the therapy ends? Aka are they "cured"?
- Would the patient feel equally good if they confided in a non-licensed therapist?
- Do the techniques (CBT, DBT, ACT, IFS, etc) actually provide tangible benefits versus just listening and providing advice?
Arsenik1 33 seconds ago [-]
This.
sungho_ 1 hours ago [-]
The links you provided need a control group to be considered proof. The key is how it compares to when counseling was provided by just a friend, not an expert.
Arsenik1 6 hours ago [-]
They are high-pay LLM wrappers dressed as humans
ajuc 6 hours ago [-]
Less bad at least.
Papazsazsa 6 hours ago [-]
Epistemic contamination is real but runs counter to the hype narrative.
scotty79 5 hours ago [-]
Everything in this research is simulated and judged by LLMs.
It might be hard to prove which of those LLMs struggles with exactly what.
The grounding this had was that texts produced by role-playing humans (not even actual patients) were closer to texts produced by the patient-simulation prompt they ultimately chose than to texts from the other prompts they tried.
KittenInABox 6 hours ago [-]
I saw there was another benchmark where top LLMs also struggle in real patient diagnostic scenarios in a way that isn't revealed when testing in e.g. medical exams. I wonder if this also applies to law, too...
renewiltord 4 hours ago [-]
Real therapists also clearly fail at it because people talk about "the therapist I've been going to for years". Listen, if I'd been going to a mechanic every week for a decade, I'd be pretty sure he's milking me for money.
guywithahat 6 hours ago [-]
For those also wondering, here is an actual ranking of the models
Grok 3 and 4 scored at the bottom, only above GPT-4o, which I find interesting because there was such big pushback on Reddit when they got rid of 4o, due to people having emotional attachments to the model. Interestingly, the newest models (like Gemini 2.5 and GPT-5) did the best.
sjreese 6 hours ago [-]
GIGO is the story here -- If I say I'm Iron man "WHO is the LLM to say I'm NOT"
ThrowawayTestr 6 hours ago [-]
Probably because they've been trained to avoid sensitive topics
ajuc 6 hours ago [-]
mostly because they were trained to say yes
aa-jv 6 hours ago [-]
No surprises here. It's long been known that humans cannot improve their own mental health with machines; there have to be other humans involved in the process, helping.
This will become more and more of an issue as people look for a quick fix for their life problems, but I don't think AI/ML is ever going to be an effective mechanism for life improvement on the mental health issue.
It'll instead be used as a tool of oppression, as in THX 1138, where the appearance of assistance is provided in lieu of actual assistance.
Whether we like it or not, humans are a hive species. We need each other to improve our lives as individuals. Nobody ever climbed the mountain to live alone who didn't come back down, realizing how much the rest of humanity is actually essential to human life.
This'll be received as an unpopular opinion, but I remain suspicious of any and all attempts to replace modern health practitioners with machines. This will be subverted and usurped for nefarious purposes, mark my words.
We built MindEval because existing benchmarks don’t capture real therapy dynamics or common clinical failure modes. The framework simulates multi-turn patient–clinician interactions and scores the full conversation using evaluation criteria designed with licensed clinical psychologists.
We validated both patient realism and the automated judge against human clinicians, then benchmarked 12 frontier models (including GPT-5, Claude 4.5, and Gemini 2.5). Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20). We also found that larger or reasoning-heavy models did not reliably outperform smaller ones in therapeutic quality.
We open-sourced all prompts, code, scoring logic, and human validation data because we believe clinical AI evaluation shouldn’t be proprietary.
Happy to answer technical questions on methodology, validation, known limitations, or the failure modes we observed.
I'm sure it's somewhere in the details somewhere, but after a quick skim I didn't find anything outlined about how you managed and used the prompts, and if it was per model or not.
Thanks a bunch for being open to answering questions here, and thanks for trying to attack this particular problem with scientific rigor, even if it's really difficult to do so.
Wouldn't that be easy to make fair by making sure all models tried it with the same prompts? So you have model X and Y, and prompts A and B, and X runs once with A, once with B, and same for Y.
Reason I ask, is because in my own local benchmarks I do for each model release with my own tasks, I've noticed a huge variance in quality of responses based on the prompts themselves. Slight variation of wording seems to have a big effect on the final responses, and those variations seems to again have a big variance of effect depending on the model.
Sometimes a huge system prompt makes a model return much higher quality responses while another model gives much higher quality responses when the system prompt is as small as it possible can. At least this is what I'm seeing with the local models I'm putting under test with my private benchmarks.
And of people I know who see a therapist, practically none can tell me what exactly they are doing, what methods they are using, or how anything is structured.
> And of people I know who see a therapist, practically none can tell me what exactly they are doing, what methods they are using, or how anything is structured.
I could tell you that as a client, but that’s because I’ve read into it. This is sort of like asking an ER patient to describe the shift management system of the clinic they went into.
What’s really aggravating is that CBT was never designed to be a general, cure-all therapy, and I think the people behind it know this. But try explaining nuance to a public that doesn’t want to hear it.
I'm skeptical of the value of this benchmark, and I'm curious for your thoughts - self play / reinforcement tasks can be useful in a variety of arenas, but I'm not a priori convinced they are useful when the intent is to help humans in situations where theories of mind matter.
That is, we're using the same underlying model(s) to simulate both a patient and a judgment as to how patient-like that patient is -- this seems like an area where I'd really want to feel confident that my judge LLM is accurate; otherwise the training data I'm generating is at risk of converging on a theory of mind / patients that's completely untethered from, you know, patients.
Any thoughts on this? Feel like we want a human in the loop somewhere here, probably on scoring the judge LLMs determinations until we feel that the judge LLM is human or superhuman. Until then, this risks building up a self-consistent, but ultimately just totally wrong, set of data that will be used in future RL tasks.
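To make the concern concrete, here's roughly the structure I'm picturing: a minimal sketch where the patient simulator, the therapist under test, and the judge are separate models, and a sample of judge verdicts gets set aside for human re-scoring. call_model is a made-up placeholder, not the benchmark's actual code:

    # Hypothetical sketch: separate patient-simulator, therapist-under-test, and judge models.
    # call_model() is a placeholder for whatever chat API you use; it is not MindEval's API.
    def call_model(model: str, system_prompt: str, history: list[dict]) -> str:
        raise NotImplementedError("wire up your own LLM client here")

    def run_session(patient_model: str, therapist_model: str, scenario: str, n_turns: int = 20) -> list[dict]:
        history = []
        for _ in range(n_turns):
            patient_msg = call_model(patient_model, f"Role-play this patient: {scenario}", history)
            history.append({"role": "patient", "content": patient_msg})
            therapist_msg = call_model(therapist_model, "You are a supportive therapist.", history)
            history.append({"role": "therapist", "content": therapist_msg})
        return history

    def judge_session(judge_model: str, history: list[dict], rubric: str) -> str:
        # A random subset of these verdicts should be re-scored by human clinicians
        # before the judge is trusted at scale; otherwise the judge and the
        # simulator can drift together into a self-consistent but wrong picture.
        return call_model(judge_model, f"Score this conversation on a 1-6 scale against: {rubric}", history)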
I'd very much like to see clinicians randomly selected from BetterHelp, paid to interact with the LLM patient in the same way, and judged by the LLM as in the current methodology -- and see what score they get.
Ideally this should be done blind (I don't know if BetterHelp allows for therapy through a text chat interface?), where the therapist has no idea it's for a study and so isn't trying to "do better" than they would for any average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
The main points of our methodology are: 1) prove that it is possible to simulate patients with an LLM, which we did; 2) prove that an LLM as a judge can effectively score conversations along several dimensions, similar to how clinicians are also evaluated, which we also did -- we show that the average correlation with human evaluators is medium-high.
Given 1) and 2), we can then benchmark LLMs, and as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) that's another study
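For illustration, the judge-validation step boils down to something like this (toy numbers, not our data; the real transcripts, human ratings, and scoring code are in the repo):

    # Toy illustration of checking an LLM judge against human clinician ratings
    # on one evaluation dimension (1-6 scale). The numbers are invented.
    from scipy.stats import spearmanr

    human_scores = [4, 2, 5, 3, 3, 4, 2, 5]
    judge_scores = [4, 3, 5, 3, 2, 4, 2, 4]

    rho, p_value = spearmanr(human_scores, judge_scores)
    print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
    # Repeat per dimension; a medium-high average correlation is what lets the
    # judge stand in for human evaluators when benchmarking at scale.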
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
The same way we can say that LLMs still have room for improvement on a specific task (let's say mathematics) even though the average human is also bad at mathematics...
We don't make any claims about human therapists. Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improving them.
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
> I'd very much like to see clinicians randomly selected from BetterHelp, paid to interact with the LLM patient in the same way, and judged by the LLM as in the current methodology -- and see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower as conversations get longer, it's reasonable to expect it would be worse than a human (who typically doesn't have the characteristic of becoming increasingly unhinged the longer you talk to them).
Also my impression is BetterHelp pays poorly and thus tends to have less skilled and overworked therapists (https://www.reddit.com/r/TalkTherapy/comments/1letko9/is_bet..., https://www.firstsession.com/resources/betterhelp-reviews-su...), e.g.
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr high-quality therapist nearby who is available and whom they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
Many things can be useful before they reach the level of world's best. Although with AI, non-intuitive failure modes must be taken into consideration too.
https://en.wikipedia.org/wiki/Deaths_linked_to_chatbots
That said, the idea that a pattern recognition and generation tool can be used for helping people with emotional problems is deeply unsettling and dangerous. This technology needs to be strictly regulated yesterday.
First, I just don’t see a world where therapy can be replaced by LLMs, at least in the realistic future. I think humans have been social creatures since the dawn of our species, and for these most intimate conversations people are going to want to have them with an actual human. One of my mentors has talked about how, after years of virtual sessions dominating, the demand for in-person sessions is spiking back up. The power of being in the same physical room with someone who is offering a nonjudgmental space to exist isn’t going to be replaced.
That being said, given the shortage of licensed mental health counselors, and the prohibitive cost especially for many who need a therapist most, I truly hope LLMs develop to offer an accessible and cheap alternative that can at least offer some relief. It does have the potential to save lives and I fully support ethically-focused progress toward developing that sort of option.
Agreed. I used to frequent a coworking space in my area that eventually went fully automated and got rid of their daytime front desk folks. I stopped going shortly thereafter because one of the highlights of my day was catching up with them. Instead of paying $300/mo to go sit in a nice office, I could just use that money to renovate my home office.
A business trying to cultivate community loses the plot when they rely completely on automation.
In other words, AI scores on AI conversations - disguised as a means of gauging clinical competence / quality?
This is not an eval - this is a one-shotted product spec!
I'm not even sure what to say. It's self-evidently a terrible idea, but we all just seem to be charging full-steam ahead like so many awful ideas in the past couple of decades.
Maybe you’re comparing it to some idealized view of what human therapy is like? There’s no benchmark for it, but humans struggle in real mental health care. They make terrible mistakes all the time. And human therapy doesn’t scale to the level needed. Millions of people simply go without help. And therapy is generally one hour a week. You’re supposed to sort out your entire life in that window? Impossible. It sets people up for failure.
So, if we had some perfect system for getting every person that needs help the exact therapist they need, meeting as often as they need, then maybe AI therapy would be a bad idea, but that’s not what we have, and we never will.
Personally, I think the best way to scale mental healthcare is through group therapy and communities. Having a community of people all coming together over common issues has always been far more helpful than one on one therapy for me. But getting some assistance from an AI therapist on off hours can also be useful.
We can see an LLM as someone that talks with more people, for more time, than anyone on earth talks in their lifetime. So they are due to be in constant contact with people in mental distress. At that point, you might as well consider the importance of giving them the skills of a mental health professional, because they are going to be facing more of this than a priest in a confessional. And this is true whether someone says "Gemini, pretend that you are a psychologist" or not. You or I don't need a prompt to know we need to notice when someone is in a severe psychotic episode: some level of mental health awareness is built in, if just to protect ourselves. So an LLM needs quite a bit of this by default to avoid being really harmful. And once you give it that, you might as well evaluate it against professionals: not because it must be as good, but because it'd be really nice if it was, even when it's not trying to act as one.
I heard someone say that LLMs don't need to be as good as an expert to be useful; they just need to be better than your best available expert. A lot of people don't have access to mental health care, and will ask their chatbot to act like a psychologist.
This mostly makes sense.
The problem is that people will take what you've said to mean "If I have no access to a therapist, at least I can access an LLM", with a default assumption that something is better than nothing. But this quickly breaks down when the sycophantic LLM encourages you to commit suicide, or reinforces your emerging psychosis, etc. Speaking to nobody is better than speaking to something that is actively harmful.
I've heard people say the same thing ("LLMs don't need to be as good as an expert to be useful, they just need to be better than your best available expert"), and I also know that some people assume that LLMs are, by default, better than nothing. Hence my comment.
My other stance, which I suspect is probably more controversial, is that I'm not convinced that mental health care is nearly as effective as people think. In general, mental health outcomes for teens are getting markedly worse, and it's not for lack of access. We have more mental health access than we've had previously -- it just doesn't feel like it because the demand has risen even more sharply.
On a personal level, I've been quite depressed lately, and also feeling quite isolated. As part of an attempt to get out of my own shell I mentioned this to a friend. Now, my friend is totally well-intended, and I don't begrudge him whatsoever. But, the first response out of his mouth was whether I'd sought professional mental health care. His response really hurt. I need meaningful social connection. I don't need a licensed professional to charge me money to talk about my childhood. I think a lot of people are lost and lonely, and for many people mental health care is a band-aid over a real crisis of isolation and despair.
I'm not recommending against people seeking mental health care, of course. And, despite my claims there are many people who truly need it, and truly benefit from it. But I don't think it's the unalloyed good that many people seem to believe it to be.
Professional mental health care cannot scale to the population that needs it. The best option, like you mention, is talking to friends about our feelings and problems. I think there has been an erosion (or it never existed) of these social mental health mechanisms. There is a learned helplessness that has developed: people have lost their capacity to just be with someone who is hurting. There needs to be a framework for providing mental health support to loved ones that can exist without licensed professionals; otherwise LLMs are the only scalable option for people to talk about their issues and work on finding solutions.
This might be controversial, but mental health care is largely a band-aid when the causes of people's declining mental health are factors far outside the individual's control: loneliness epidemics, declining optimism about the future, climate change, the rise of global fascism, online dating, the addictiveness of social media and the war on our attention, etc.
There's also the elephant in the room that mental healthcare, in particular for teens, will probably just be compensating for the disease that is social media addiction. Australia has the right idea, banning social media for all kids.
Which is to say, your stance might not be as controversial as you think, since it was the adult take in a children's cartoon almost 60 years ago.
Lucy isn't actually a psychologist which is part of the reason the "gag" is funny.
Peanuts is funny, but it may not be the source of wisdom you think it is.
It's also less pressure, a more comfortable environment (home vs. stranger's office), no commitments to a next session, and less embarrassing (sharing your personal issues to a computer via text is less anxiety-inducing than saying them to a person's face).
With that all said, I'm strongly opposed to people using LLMs as therapists.
In their mind, most of the time, if there is no one standing behind them when they chat with an LLM, then the conversation is, for most intents and purposes, private.
Obviously, those of us who were born with a keyboard in front of our hands know this not to be true, and know we're being tracked constantly, with our data being sold to the highest bidder. But the typical person has more or less zero concerns about this, which is why it's not a priority issue to be solved.
I haven't been so unlucky myself, but I know many who've had terrible first experiences with therapists and psychologists, where I wonder why those people are even in the job they're in. Some of them got so turned off that they stopped trying to find anyone else to help them, because they think most mental health professionals would be the same as the first person they sought help from.
However, unless we have a measure on how helpful self-help books actually are, we still don't know if they help or not.
Should be banned. Average people have no basis to know whether drug X is appropriate for them. If your doctor thinks you need it, he'll tell you. These ads also perpetuate the harmful idea that there's a pill for everything.
(1) The demand for mental health services is an order of magnitude larger than the supply, but the demand we see is a fraction of the demand that exists, because a lot of people, especially men, aren't believers in the "therapeutic culture"
In the days of Freud you could get a few hours of intensive therapy a week but today you're lucky to get an hour a week. An AI therapist can be with you constantly.
(2) I believe psychodiagnosis based on text analysis could greatly outperform mainstream methods. Give an AI someone's social media feed and I think depression, mania, schizo-* spectrum, disordered narcissism and many other states and traits will be immediately visible.
(3) Despite the CBT revolution and various attempts to intensify CBT, a large part of the effectiveness of therapy comes from the patient feeling mirrored by the therapist [1], and the LLM can accomplish this; in fact, this could be accomplished by the old ELIZA program (a toy sketch of that kind of reflection is below, after the references).
(4) The self of the therapist can be both an obstacle and an instrument to progress. See [2]. On one level the reactions that a therapist feels are useful, but they also get in the way of the therapist providing perfect mirroring [3] and letting optimal frustration unfold in the patient instead of providing "corrective emotional experiences." I'm going to argue that the AI therapist can be trained to "perceive" the things a human therapist perceives, but that it does not have its own reactions that will make the patient feel judged and get in the way of that unfolding.
[1] https://en.wikipedia.org/wiki/Carl_Rogers
[2] https://en.wikipedia.org/wiki/Countertransference
[3] why settle for less?
[4] https://www.sciencedirect.com/science/article/pii/S0010440X6...
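On point (3), the mirroring trick needs surprisingly little machinery. Here's a toy ELIZA-style reflection -- obviously nothing like a modern LLM, just to show how mechanical the basic move can be:

    import re

    # Toy ELIZA-style mirroring: reflect the speaker's own words back as a question.
    REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you", "mine": "yours"}
    PATTERNS = [
        (r"i feel (.*)", "Why do you feel {0}?"),
        (r"i am (.*)", "How long have you been {0}?"),
        (r"my (.*)", "Tell me more about your {0}."),
    ]

    def reflect(fragment: str) -> str:
        return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

    def respond(utterance: str) -> str:
        for pattern, template in PATTERNS:
            match = re.match(pattern, utterance.lower())
            if match:
                return template.format(reflect(match.group(1)))
        return "Tell me more."

    print(respond("I feel like my work is pointless"))
    # -> Why do you feel like your work is pointless?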
So it’s sensible that someone out there is evaluating its competence and thinking about a better alternative for these folks than yoloing their worst thoughts into chatgpt.com’s default LLM.
Everyone's hand is being forced by the major AI providers existing.
Even if you were a perfect altruist with a crusade against the idea of people using LLMs for mental health, you could still be forced to dash towards figuring out how to build LLM tools for mental health, out of consideration for others.
The age-old problem is how to prevent that disaster and save those lives. That's not trivial. Creating an Oncological Rhubarb Authority could easily make the problem much worse, not better.
If you try to merely stop people from using LLMs as therapists (could you elaborate on what that looks like?) and call it a day, your consideration isn't extending to all the people who will do it anyways.
That's what I mean by forcing your hand into doing the work of figuring out how to make LLM therapists work even if you were vehemently against the idea.
I think you're assuming my proposed solution is to take rhubarb away from people? It's not.
Maybe you want to found the "Oncological Rhubarb Validation Association" or something. If so, that has nothing to do with me? That's just snake oil marketing. Not my field.
Terrible idea or not, it's probably helpful to think of LLMs not as "AI mental healthcare" but rather as another form of potentially bad advice. From a therapeutic perspective, Claude is not all that different from the patient having a friend who is sometimes counterproductive. Or the patient reading a self-help book that doesn't align with your therapeutic perspective.
We really need to get the psychology right with LLMs.
You trust humans to do it. Trust has little to do with what actually happens.
Not everywhere in the world do companies count as people, yet they can still be sued.
I'd wager the companies lobbied for this to gain extra rights.
EDIT: Note that IANAL, nor am I a historian. The specifics of how this came about are best learned from a more authoritative source.
Actually yes, everywhere in the world. That has a functioning legal system, at least.
If companies weren't treated as legal persons, they wouldn't be able to enter into contracts.
But also, just to be clear, a legal person, like a corporation, is not a natural person. Unlike a natural person, they can't vote. There isn't anywhere in the world that considers corporations to be natural persons.
In theoretical sense sure.
In a practical sense? They are invulnerable due to the extreme financial obstacles they can put in place. They can drag a court case out until you fold if you haven't found a lawyer willing to take it on contingency.
The architecture and evaluation approach seem broadly similar.
https://patricgagne.com/
She is not a psychologist now according to her page. And how do you know she was successful? I skimmed a few articles about her and saw no attempts to verify her claims. She offered no evidence when asked why people should believe her.
Seems to me that benchmarking a thing has an interesting relationship with acceptance of the thing.
I'm interested to see human thoughts on either of these.
Same thing for the patient LLM. We can probably fine-tune an LLM to do a better job at simulating patients.
Those two components of our framework have room for improvement.
Edit: Thank you!
(And I'm not being theoretical here, I have quite a bit of experience getting incredibly inadequate mental health care.)
And say what you will about this, a paid professional is, at the very least, unlikely to let you wind yourself up or go down weird rabbit holes... something that LLMs seem to excel at.
It's better not to degrade the close friend, and "life coach focused on healthy self awareness" is probably indistinguishable from most good therapy.
Reasoning that if he's not good it would show up in patients thinking he's bad, and not feeling any better. And then he could tune his therapy approaches towards the ones which make people feel better and rate him as more understanding and listening and caring. And he criticises therapists who won't do that, therapists who say patients have been seeing them for years with only incremental improvements or no improvements.
Yes there's no objective way to measure how angry or suicidal or anxious someone is and compare two people, but if someone is subjectively reporting 5/5 sadness about X at the start of a session and wants help with X, then at some point in the future they should be reporting that number going down or they aren't being helped. And more effective help could mean that it goes down to 1/5 in three sessions instead of down to 4/5 in three years, and that's a feedback loop which (he says) has got him to be able to help people in a single two-hour therapy session, where most therapists and insurance companies will only do a too-short session with no feedback loop.
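Concretely, the feedback loop he's describing is nothing fancier than writing the ratings down and watching the deltas (numbers here are invented):

    # Toy pre/post ratings for one issue across sessions (0-5 subjective scale, invented).
    sessions = [
        {"before": 5, "after": 4},
        {"before": 4, "after": 3},
        {"before": 4, "after": 1},
    ]

    for i, s in enumerate(sessions, start=1):
        print(f"Session {i}: {s['before']} -> {s['after']} (change {s['before'] - s['after']})")
    # The value isn't the arithmetic; it's that the numbers exist at all, so the
    # therapist gets feedback instead of guessing whether anything is changing.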
This is like a questionnaire on how much stronger you feel after working out at a gym: you often don't, you feel tired.
Both gym and talking therapy (when done correctly) will push you slightly out of your comfort zone, and aim to let you safely deal with moderate amounts of something that you find really hard. So as to expand your capabilities.
"I feel good" immediately after is utterly the wrong metric.
Being more capable / feeling better some time later is the more reliable indicator, like progress at a gym.
And also this is why an agreeable statistical word generator LLM is not the correct tool for the job.
"Good" is too broad and subjective to be a useful metric.
Here's a 2.5 hour session (split into several videos) with a doctor who has a bad relationship with his son and felt like a failure for it:
https://www.youtube.com/watch?v=42JDnrD106w
https://www.youtube.com/watch?v=S5H2YGljhqQ
https://www.youtube.com/watch?v=bZ9_0j_fmeg
https://www.youtube.com/watch?v=eiCrdGVa8Q0
https://www.youtube.com/watch?v=cARvhlTckaM
Here's a couple of hour session with Marilyn who was diagnosed with lung cancer and spiraling with depression, anxiety, shame, loneliness, hopelessness, demoralization, and anger, despite her successful career:
https://www.youtube.com/watch?v=S7sQ_zDGsY8
https://www.youtube.com/watch?v=tyuFN4mbGZQ (there's probably more parts to find through YouTube somehow)
And a session with Lee with loneliness and marriage relationship problems:
https://www.youtube.com/watch?v=imEMM3r6XL8 (probably more parts as well)
It's like saying "it is still debated if debugging even works" as if all languages, all debuggers, all programmers, all systems, are the same and if you can find lots of people who can't debug then "debugging doesn't work". But no, you only need a few examples of "therapy working" to believe that it works, and see the whole session to see that it isn't just luck or just the relief of talking, but is a skill and a technique and a debugging of the mind.
But a patient feeling better at the end of a single therapy session doesn't prove that therapy works either...
- Does the patient feel better between sessions too? Will they keep feeling better after the therapy ends? Aka are they "cured"?
- Would the patient feel equally good if they confided in a non-licensed therapist?
- Do the techniques (CBT, DBT, ACT, IFS, etc) actually provide tangible benefits versus just listening and providing advice?
The grounding this had was that texts produced by role-playing humans (not even actual patients) were closer to texts produced by the patient-simulation prompt they ultimately settled on than to texts from the other prompts they tried.
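i.e., something along these lines -- an illustrative reconstruction, not their actual metric or code; the encoder, prompts, and texts here are all made up:

    # Illustrative: compare simulated-patient text to human role-play text by
    # cosine similarity of sentence embeddings. Not the paper's actual method.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    human_roleplay = ["I don't know... most days I just can't get out of bed."]
    simulated = {
        "prompt_v1": ["As an AI language model, I am simulating feelings of sadness."],
        "prompt_v2": ["Honestly? Most mornings I just lie there staring at the ceiling."],
    }

    def mean_embedding(texts):
        return encoder.encode(texts).mean(axis=0)

    human_vec = mean_embedding(human_roleplay)
    for name, texts in simulated.items():
        vec = mean_embedding(texts)
        cosine = float(np.dot(human_vec, vec) / (np.linalg.norm(human_vec) * np.linalg.norm(vec)))
        print(name, round(cosine, 3))
    # Whichever prompt's outputs sit closest to the human role-play text wins --
    # which is exactly why the choice of reference texts matters so much.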
https://www.forbes.com/sites/johnkoetsier/2025/11/10/grok-le...
Grok 3 and 4 scored at the bottom, only above GPT-4o, which I find interesting because there was such big pushback on Reddit when they got rid of 4o, due to people having emotional attachments to the model. Interestingly, the newest models (like Gemini 2.5 and GPT-5) did the best.