It sounds impossible to be Pareto-optimal for complicated problems. How do you know GPT-4o-mini would be optimal? I feel like there is always room for a potential GPT-5o-mini to do better. The solution space of possible gen AI models is gigantic, so we can only improve our solution over time and never find a truly optimal one.
brianbelljr 5 days ago [-]
Yes, maybe theoretically. Practically, though, you have to ship your agent with the LLMs that are available today, and you need to pick one. I don't think the authors were trying to solve for "best forever"; that probably wasn't their intent. For that you would need some kind of proof that a theoretical maximum has been reached, and proofs like that aren't a thing in most applied computer science fields.
simianwords 5 days ago [-]
Interesting but I'm a bit lost. You are optimising but how do you know the ground truth of "good" and "bad"? Do you manually run the workflow and then decide based on a predefined metric?
A new OSS framework uses multi-objective Bayesian Optimization to efficiently search for Pareto-optimal RAG workflows, balancing cost, accuracy, and latency across configurations that would be impossible to test manually.
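To make that setup concrete, here is a minimal sketch of what multi-objective search over RAG configurations looks like in general, written with Optuna. It is illustrative only; the parameter names and the dummy evaluator are invented, and none of this is syftr's actual code.

```python
import random
import optuna

# Stand-in for a real evaluation harness: a real run would build the RAG
# flow, answer a QA set, and score the answers. Synthetic numbers keep
# this sketch runnable end to end.
def evaluate_flow(llm, top_k, chunk_size):
    accuracy = random.uniform(0.5, 0.9)
    cost_usd = random.uniform(0.01, 1.0)
    latency_s = random.uniform(0.5, 10.0)
    return accuracy, cost_usd, latency_s

def objective(trial):
    llm = trial.suggest_categorical("llm", ["small-llm", "large-llm"])
    top_k = trial.suggest_int("top_k", 1, 20)
    chunk_size = trial.suggest_categorical("chunk_size", [256, 512, 1024])
    return evaluate_flow(llm, top_k, chunk_size)

# Three objectives: maximize accuracy, minimize cost and latency.
study = optuna.create_study(directions=["maximize", "minimize", "minimize"])
study.optimize(objective, n_trials=100)

# best_trials holds the Pareto front: trials that no other trial beats on
# every objective simultaneously.
for t in study.best_trials:
    print(t.params, t.values)
```

The point of `best_trials` is exactly the Pareto set discussed above: configurations that no other tested configuration beats on cost, accuracy, and latency at the same time.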
andriyvel 6 days ago [-]
looks interesting!
jstummbillig 5 days ago [-]
and exhausting.
Or do you rely on generic benchmarks?
You need custom QA pairs for custom scenarios.
Github: https://github.com/datarobot/syftr
Paper: https://arxiv.org/abs/2505.20266
I've got a few questions:
Could I ignore all flow runs with an estimated cost above a certain threshold, so that the overall cost of optimization is lower (sketched after these questions)? Suppose I choose an acceptable level of accuracy and then skip some of the costlier exploration. Is there a risk it misses certain optimal configurations even under my cost limit?
How does the system deal with long context-length documents that some models can handle and others can't? Does this approach work for custom models?
Suppose I want to create and optimize for my own LLM-as-a-judge metrics like https://mastra.ai/en/docs/evals/textual-evals#available-metr..., how can I do this?
Are you going to flesh out the docs? Looks like the folder only has two markdown files right now.
Any recommendations for creating the initial QA dataset for benchmarking? Maybe creating a basic RAG system, using its search results and generations as the baseline, and then having humans check and edit them to be more comprehensive and accurate (roughly as sketched after these questions). Any chance this is on the roadmap?
Cool stuff, I'm hoping this approach is more widely adopted!
...would it be accurate to say that syftr finds Pareto-optimal choices across cost, accuracy, and latency, where accuracy is decided by an LLM whose assessments are 90% correlated to that of human labelers?
Are there 3 objectives (cost, accuracy, and latency) or 2 (cost and accuracy)?
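On the cost-threshold question above: whether syftr exposes this is for the authors to answer, but as a minimal sketch of the idea (Optuna-style pruning, made-up prices and parameters), skipping a configuration whose estimated cost exceeds a budget could look like this:

```python
import random
import optuna

COST_LIMIT_USD = 0.25  # illustrative budget per flow evaluation
PRICE_PER_CALL = {"small-llm": 0.001, "large-llm": 0.03}  # made-up prices

def objective(trial):
    llm = trial.suggest_categorical("llm", ["small-llm", "large-llm"])
    top_k = trial.suggest_int("top_k", 1, 20)
    est_cost = PRICE_PER_CALL[llm] * top_k
    if est_cost > COST_LIMIT_USD:
        # Skip before paying for the expensive evaluation. As the question
        # notes, a rough estimate can also throw away configurations that
        # would have been cheap and accurate once actually run.
        raise optuna.TrialPruned()
    # Stand-in for the real evaluation.
    accuracy = random.uniform(0.5, 0.9)
    return accuracy, est_cost

study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=50)
```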
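And on bootstrapping the QA dataset: a minimal sketch of the draft-then-review loop described in that question, with a placeholder `llm_complete` standing in for whatever client you actually use (nothing here is syftr's API):

```python
import json

def llm_complete(prompt: str) -> str:
    # Placeholder: swap in a call to your LLM provider.
    return '{"question": "placeholder?", "answer": "placeholder"}'

def draft_qa_pairs(chunks, out_path="qa_draft.jsonl"):
    with open(out_path, "w") as f:
        for chunk in chunks:
            raw = llm_complete(
                "Write one question answerable only from this passage, plus "
                "a concise answer, as JSON with keys 'question' and 'answer'.\n\n"
                f"Passage:\n{chunk}"
            )
            record = json.loads(raw)
            record["context"] = chunk
            record["status"] = "needs_human_review"  # reviewers edit and approve
            f.write(json.dumps(record) + "\n")

draft_qa_pairs(["Example passage from your own corpus."])
```

Reviewers would then edit the JSONL records, and only approved pairs would go into the benchmark set.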