> Typed I/O for every LLM call. Use Pydantic. Define what goes in and out.
Sure, though not related to DSPy, and completely table stakes. Also not sure why the whole article assumes the only language in the world is Python.
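For anyone who hasn't seen the pattern: it's just this kind of thing (a rough sketch assuming a recent openai SDK; model and field names made up):

    from openai import OpenAI
    from pydantic import BaseModel

    class CompanyExtraction(BaseModel):
        company_name: str

    client = OpenAI()

    def extract_company(text: str) -> CompanyExtraction:
        # Structured output: the SDK validates the reply into the Pydantic model.
        resp = client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Extract the company name from: {text}"}],
            response_format=CompanyExtraction,
        )
        return resp.choices[0].message.parsed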
> Separate prompts from code. Forces you to think about prompts as distinct things.
There's really no reason prompts must live in a file with a .md or .json or .txt extension rather than .py/.ts/.go/.., except if you indeed work at a company that decided it's a good idea to let random people change prod runtime behavior. If someone can think of a scenario where this is actually a good idea, feel free to enlighten me. I don't see how it's any more advisable than editing code in prod while it's running.
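Concretely, the boring alternative is just a module (sketch; names made up):

    # prompts.py -- versioned, reviewed, and deployed like any other code
    EXTRACT_COMPANY = (
        "Extract the company name from the text below. "
        "Respond with JSON containing a single key: company_name.\n\n{text}"
    )

    def render_extract_company(text: str) -> str:
        return EXTRACT_COMPANY.format(text=text)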
> Composable units. Every LLM call should be testable, mockable, chainable.
> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.
And LiteLLM or `ai` (Vercel), the packages that are actually most used, aren't? You're comparing downloads with Langchain, probably the worst package to gain popularity in the last decade. It was just first to market; after a short while most people realized it's horrifically architected, and now it's just coasting on former name recognition while everyone who needs to get shit done uses something lighter like the above two.
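The "one-line swap" those already give you looks roughly like this (sketch; model IDs illustrative):

    import litellm

    def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
        # Same OpenAI-shaped call for every provider; swapping models is the string.
        resp = litellm.completion(model=model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    # ask("Summarize this.", model="claude-3-5-sonnet-20240620")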
> Eval infrastructure early. Day one. How will you know if a change helped?
Sure, to an extent. Outside of programming, most things where LLMs deliver actual value are very nondeterministic with no right answer. That's exactly what they offer. Plenty of which an LLM can't judge the quality of. Having basic evals is useful, but you can quickly run into their development taking more time than it's worth.
But above all.. the comments on this post immediately make clear that the biggest differentiator of DSPy is the prompt optimization. Yet this article doesn't mention that at all? Weird.
sbpayne 22 minutes ago [-]
I think all of these things are table-stakes; yet I see that they are implemented/supported poorly across many companies. All I'm saying is there are some patterns here that are important, and it makes sense to enter into building AI systems understanding them (whether or not you use Dspy) :)
andyg_blog 24 minutes ago [-]
>the whole article assumes the only language in the world is Python.
This was my take as well.
My company recently started using Dspy, but you know what? We had to stand up an entire new repo in Python for it, because the vast majority of our code is not Python.
sbpayne 18 minutes ago [-]
I think this is an important point! I am actually a big fan of doing what works in the language(s) you're already using.
For example: I don't use Dspy at work! And I'm working in a primarily dotnet stack, so we definitely don't use Dspy... But still, I see the same patterns seeping through that I think are important to understand.
And then there's a question of "how do we implement these patterns idiomatically and ergonomically in our codebase/language?"
nkozyra 32 minutes ago [-]
> f"Extract the company name from: {text}"
I think one thing that's lost in all of the LLM tooling is that it's LLM-or-nothing and people have lost knowledge of other ML approaches that actually work just fine, like entity recognition.
I understand it's easier to just throw every problem at an LLM but there are things where off-the-shelf ML/NLP products work just as well without the latency or expense.
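For the quoted example, an off-the-shelf NER model is a few lines and runs locally (sketch; assumes the small English model is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

    def extract_companies(text: str) -> list[str]:
        # Plain named entity recognition: no LLM, no network call, millisecond latency.
        return [ent.text for ent in nlp(text).ents if ent.label_ == "ORG"]

    print(extract_companies("Apple and Nvidia announced a partnership."))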
sbpayne 30 minutes ago [-]
Oh 100%! There are many problems (including this one!) that probably aren't best suited for an LLM. I was just trying to pick a really simple example that most people would follow.
stephantul 54 minutes ago [-]
Mannnn, here I thought this was going to be an informative article! But it’s just a commercial for the author’s consulting business.
sbpayne 52 minutes ago [-]
Oops! That's actually out of date from prior template I had. I don't actually consult at the moment :). Removing!
halb 23 minutes ago [-]
The author itself is probably ai-generated. The contact section in the blog is just placeholder values. I think the age of informative articles is gone
CraftingLinks 3 minutes ago [-]
I used dspy in production, then reverted the bloat as it literally gave me nothing of added value in practice but a lot of friction when i needed precise control over the context. Avoid!
Lerc 7 minutes ago [-]
If [programming_language] is so great, why isn't anyone using it?
For many of the same reasons. A plethora of alternatives, personal preference, weird ideology, appropriateness for the task, inertia, not-invented-here.
The list goes on.
panelcu 13 minutes ago [-]
https://www.tensorzero.com/docs has similar abstractions but doesn't require Python and doesn't require committing to the framework or a language. It's also pretty hard to onboard, but solves the same problems better and makes evaluating changes to models / prompts much easier to reason about.
sbpayne 12 minutes ago [-]
I saw this some time ago! I personally have a distaste for external DSLs, as I think they generally introduce complexity that isn't actually worthwhile, so I skipped over it. That's also why I'm very "meh" on BAML.
ndr 36 minutes ago [-]
It's not as ergonomic as they make it out to be.
The fact that you have to bundle input+output signatures and that everything is dynamically typed (sometimes into the args) just makes it annoying to use in codebases that have type annotations everywhere.
Plus their out-of-the-box agent loop has been a joke for the longest time; writing your own is feasible, but it's night and day compared to getting something done with pydantic-ai.
Too bad because it has a lot of nice things, I wish it were more popular.
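For anyone who hasn't used it, the signature style I'm talking about looks roughly like this (from memory, so details may be off):

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # model id illustrative

    class ExtractCompany(dspy.Signature):
        """Extract the company name from the text."""
        text: str = dspy.InputField()
        company_name: str = dspy.OutputField()

    extract = dspy.Predict(ExtractCompany)
    result = extract(text="Apple announced a new iPhone.")
    print(result.company_name)  # prediction fields show up dynamically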
sbpayne 34 minutes ago [-]
Yeah! I can agree with this. There are some ergonomic improvements to be had here.
verdverm 31 minutes ago [-]
Have you looked at ADK? How does it compare? Does it even fit in the same place as Dspy?
https://google.github.io/adk-docs/
Disclaimer, I use ADK, haven't really looked at Dspy (though I had heard of it before). ADK certainly addresses all of the points you have in the post.
sbpayne 15 minutes ago [-]
I personally haven't looked super closely at ADK. But I would love if someone more knowledgeable could do a sort of comparison. I imagine there are a lot of similar/shared ideas!
verdverm 38 seconds ago [-]
There are dozens if not 100s of agent frameworks in use today, 1000s if you peruse /new. I'm curious what features will make for longevity. One thing about ADK is that it comes in four languages (Py, TS, Go, Java; so far), which means understanding can transfer over/between teams in larger orgs, and they can share the same backing services (like the db to persist sessions).
TheTaytay 1 hours ago [-]
I tried it in the past, one time “in earnest.” But when I discovered that none of my actual optimized prompts were extractable, I got cold feet and went a different route. The idea of needing to fully commit to a framework scares me. The idea of having a computer optimize a prompt as a compilation step makes a lot of sense, but treating the underlying output prompt as an opaque blob doesn’t. Some of my use cases were just far enough off the beaten path that dspy got confusing, which didn’t help. And lastly, I felt like committing to dspy meant that I would be shutting the door on any other framework or tool or prompting approach down the road.
I think I might have just misunderstood how to use it.
sbpayne 56 minutes ago [-]
I don't know that you misunderstood. This is one of my biggest gripes with Dspy as well. I think it takes the "prompt is a parameter" concept a bit too far.
I highly recommend checking out this community plugin from Maxime, it helps "bridge the gap": https://github.com/dspy-community/dspy-template-adapter
I think it solves some of this friction!
This matches my experience with Dspy. I ended up removing it from our production codebase because, at the time, it didn't quite work as effectively as just using Pydantic and so forth.
The real killer feature is the prompt compilation; it's also the hardest to get to an effective place and I frequently found myself needing more control over the context than it would allow. This was a while ago, so things may have improved. But good evals are hard and the really fancy algorithms will burn a lot of tokens to optimize your prompts.
Main reason to me is that it layers abstraction on abstraction on top of the base LLM calls, with not much to show for it. Also, a lot of native features (for example Gemini's native structured responses) aren't well supported.
memothon 47 minutes ago [-]
I think the real problem with using DSPy is that many of the problems people are trying to solve with LLMs (agents, chat) don't have an obvious path to evaluate. You have to really think carefully on how to build up a training and evaluation dataset that you can throw to DSPy to get it to optimize.
This takes a ton of upfront work and careful thinking. As soon as you move the goalposts of what you're trying to achieve you also have to update the training and evaluation dataset to cover that new use case.
This can actually get in the way of moving fast. Often teams are not trying to optimize their prompts but even trying to figure out what the set of questions and right answers should be!
sbpayne 45 minutes ago [-]
Yeah, I think Dspy often does not really show its benefit until you have a good 'automated metric', which can be difficult to get to.
I think the unfortunate part is: the way it encourages you to structure your code is good for other reasons that might not be an 'acute' pain. And over time, it seems inevitable you'll end up building something that looks like it.
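To make 'automated metric' concrete: in Dspy terms it's just a function like this that the optimizers try to maximize (a rough sketch; example data made up):

    import dspy

    def company_match(example, prediction, trace=None):
        # The whole "automated metric": did the program get the right company?
        return example.company_name.lower() == prediction.company_name.lower()

    trainset = [
        dspy.Example(text="Apple announced a new iPhone.", company_name="Apple").with_inputs("text"),
    ]
    # An optimizer such as dspy.BootstrapFewShot(metric=company_match) then tunes
    # prompts/demos against trainset to maximize this metric.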
memothon 42 minutes ago [-]
Yeah I agree with this. I will try to use it in earnest on my next project.
That metric is the key piece. I don't know the right way to build an automated metric for a lot of the systems I want to build that will stand the test of time.
brokensegue 43 minutes ago [-]
i've tried it a few times and it's never really helped as much as i expected. though i know they've released a couple times since I last tried it.
sbpayne 42 minutes ago [-]
yeah what I'm trying to get across here is that: Dspy does not solve an immediate problem, which is why many feel this way and consequently why it doesn't have great adoption!
But on the other hand, I think people unintentionally end up re-implementing a lot of Dspy.
QuadmasterXLII 51 minutes ago [-]
If you find yourself adding a database because that's less painful than regular deployments from your version control, something is hair-on-fire levels of wrong with your CI/CD setup.
sbpayne 46 minutes ago [-]
I think this misunderstands the need for iteration! Maybe I could have written it more clearly :).
The reality is that you don't want to re-deploy for every prompt change, especially early on. You want to get a really tight feedback loop. If a prompt change requires a re-deploy, that is usually too slow. You don't have to use a database to solve this, but it's pretty common to see in my experience.
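A minimal version of the pattern I mean (sketch; the table and env var names are made up):

    import os
    import sqlite3

    DEFAULTS = {"extract_company": "Extract the company name from: {text}"}

    def get_prompt(name: str) -> str:
        # Runtime-editable prompt (DB row, flag service, whatever) with a code
        # default, so iterating doesn't need a deploy and a missing row can't break prod.
        try:
            con = sqlite3.connect(os.environ.get("PROMPT_DB", "prompts.db"))
            row = con.execute("SELECT template FROM prompts WHERE name = ?", (name,)).fetchone()
            con.close()
            if row:
                return row[0]
        except sqlite3.Error:
            pass
        return DEFAULTS[name]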
ijk 27 minutes ago [-]
I've been reaching for BAML when I really need prompt iteration at speed.
I have never heard of this! I took a quick look. I think I'm definitely not in the right audience for a tool like this, as I am more comfortable just writing code. But I think putting a UI over things like this _forces_ the underlying system to be more declarative...
So in practice I imagine you get at a lot of the same ideas / benefits!
jatins 43 minutes ago [-]
Would have been nice if the post actually showed how Dspy does the things that were handrolled
sbpayne 42 minutes ago [-]
This is great feedback! I'll work on an update tonight :)
sbpayne 1 hours ago [-]
I consistently hear great things from Dspy users. At the same time, it feels like adoption is always low.
Stranger still: it seems like every company I have worked with ends up building a half-baked version of Dspy.
CuriouslyC 54 minutes ago [-]
Two issues:
1. People don't want to switch frameworks, even though you can pull prompts generated by DSPy and use them elsewhere, it feels weird.
2. You need to do some up-front work to set up some of the optimizers which a lot of people are averse to.
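On point 1, for anyone who hasn't tried it, pulling the prompts out is roughly this (from memory; the API may have moved between releases):

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # model id illustrative
    program = dspy.Predict("text -> company_name")
    program(text="Apple announced a new iPhone.")

    dspy.inspect_history(n=1)      # prints the actual prompt/completion that was sent
    program.save("program.json")   # serializes instructions/demos for use elsewhere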
simopa 44 minutes ago [-]
"Great engineers write bad AI code" made my day ;)
sbpayne 43 minutes ago [-]
hahaha this has just been my entire last few years of experience :)
markab21 14 minutes ago [-]
I think the entire premise that prompting is the surface area for optimizing the application is fundamentally the wrong framing, in the same way that, in 1998, a better CPAN was going to save CGI. It's solving the wrong problems; it's only the current limitations in context and model intelligence that make a tool like Dspy necessary.
The only thing I'd grab dspy for at this point is to automate the edges of the agentic pipeline that could be improved with RL patterns. But if that is true, you're really shortchanging yourself by handing your domain over to DSPy. You should be building your own RL loops.
My experience: If you find yourself reaching for a tool like Dspy, you might be sitting on a scenario where reinforcement learning approaches would help even further up the stack than your prompts, and you're probably missing where the real optimization win is. (Think bigger)
sbpayne 13 minutes ago [-]
Yeah, I find it hard to recommend Dspy. At the same time, I can't escape the observation that many companies are re-implementing a lot of parts of it. So I think it's important to at least learn from what Dspy is :)
dzonga 45 minutes ago [-]
@sbpayne - very useful info and pricing page as well.
useful for upcoming consultants to learn how to price services too.
sbpayne 41 minutes ago [-]
Highly recommend following @jxnl on X for consulting / positioning / pricing
LoganDark 21 minutes ago [-]
This article seemingly misses any explanation of what DSPy even is or why it's supposedly so complicated and unfamiliar. Supposedly it solves the problems illustrated in the article, but that argument is badly presented.
villgax 33 minutes ago [-]
Nobody uses it except for maybe the weaviate developer advocates running those jupyter cells.
tinyhouse 50 minutes ago [-]
A lot of these ideas, Dspy and RLMs (from the same people IIRC), are more marketing than solutions to a real problem.
sbpayne 43 minutes ago [-]
This is a surprising take to me! Would love to learn more about what you mean. I feel like the problems they solve seem so direct to me. For example: RLMs are an approach to long context problems. Not every problem is a good fit for RLMs for sure, but I can see some problems where I imagine it would work well!