key_points: [
"Structured output from LLMs, like JSON, is a common challenge."
"Existing solutions like response_format: 'json' and function calling often disappoint."
"The article compares multiple frameworks designed to handle structured output."
"Handling and preventing malformed JSON is a critical concern."
"Two main techniques for this: parsing malformed JSON or constraining LLM token generation."
"Framework comparison includes details on language support, JSON handling, prompt building, control, model providers, API flavors, type definitions, and test frameworks."
"BAML is noted for its robust handling of malformed JSON using a new Rust-based parser."
"Instructor supports multiple LLM providers but has limitations on prompt control."
"Guidance, Outlines, and others apply LLM token constraints but have limitations with models like OpenAI's."
]
take_way: "Consider using frameworks that efficiently handle malformed JSON and offer prompt control to get the desired structured output from LLMs."
that's a great question, there are four main benefits:
1. seeing the full prompt. even though that python code feels leaner, somehow it has to be converted into a prompt, and a library will do that in some way. BAML has a VSCode playground that shows the entire prompt + tokenization. If we had to do this off of python/ts, we would run into the halting problem, and making the playground would be much, much harder.
2. there's a lot of codegen we do for users, to make life easier, e.g. w/o BAML, to now do streaming for the resume, you would have to do something like this:
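from typing import List, Optional

# PartialEducation would be defined the same way, with every field optional
class PartialResume:
    name: Optional[str]
    education: List[PartialEducation]
    skills: List[str]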
and then at some point you need to reparse PartialResume -> Resume. we can codegen all of that for you, and give you autocomplete and type-safety for free.
3. We added a lot of static analysis / jump to definition etc. to Jinja (which we use for strings), and that is much easier to navigate than f-strings.
4. Since it's codegen, we can support all languages way more easily, so prompting techniques in python work the exact same way for the same code in typescript.
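for point 2, here is a rough sketch of what the hand-written "reparse PartialResume -> Resume" step might look like without codegen. This is not BAML's generated code: `Education`, `Resume`, `finalize_resume`, and the `school` / `degree` fields are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PartialEducation:
    # hypothetical fields; the thread does not say what an education entry holds
    school: Optional[str] = None
    degree: Optional[str] = None


@dataclass
class PartialResume:
    # every field may be missing while the response is still streaming
    name: Optional[str] = None
    education: List[PartialEducation] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)


@dataclass
class Education:
    school: str
    degree: str


@dataclass
class Resume:
    name: str
    education: List[Education]
    skills: List[str]


def finalize_resume(partial: PartialResume) -> Resume:
    """Convert a fully streamed PartialResume into a Resume, or raise ValueError."""
    if partial.name is None:
        raise ValueError("name is still missing")
    education = []
    for item in partial.education:
        if item.school is None or item.degree is None:
            raise ValueError("education entry is incomplete")
        education.append(Education(school=item.school, degree=item.degree))
    return Resume(name=partial.name, education=education, skills=list(partial.skills))
```

every new field means touching four classes plus the converter by hand, which is the kind of boilerplate the codegen is meant to remove.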
the main one is that most people don't own the model. so if you use openai / anthropic / etc then you can't use token masking. in that case, reprompting is pretty much the only option
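for anyone unfamiliar, a toy sketch of what token masking means and why it needs access to the raw logits at every step. The vocabulary and scores are made up, and `masked_argmax` is a hypothetical helper, not any library's API.

```python
import math
from typing import Dict, Set


def masked_argmax(logits: Dict[str, float], allowed: Set[str]) -> str:
    """Pick the highest-scoring token among those the grammar allows."""
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token in the vocabulary")
    return best


# made-up scores for the token that follows '{"name": "Ada"'
logits = {"\n\nHope": 3.1, "}": 2.4, ",": 1.9, "]": 0.2}
# a JSON grammar would only allow closing the object or adding another key here
allowed = {"}", ","}
next_token = masked_argmax(logits, allowed)
```

the unmasked favorite here is chatter, and the mask forces a token that keeps the JSON valid. A hosted API that only returns finished text gives you no place to apply that mask.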
In the specific cases of openai and anthropic, both have 'tool use' interfaces which will generate valid JSON following a schema of your choice.
You're right, though, that reprompting works with pretty much everything out there, including hosted models that don't have tool use as part of their API. And it's simple too, you don't even need to know what "token masking" is.
Reprompting can also apply arbitrary criteria that are more complex than just a JSON schema. You ask it to choose an excerpt of a document and the string it returns isn't an excerpt? Just reprompt.
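a minimal sketch of that excerpt check as a reprompt loop. `ask_model` is a hypothetical callable standing in for whatever client you use (it takes a prompt string and returns the reply text), and the prompt wording is just an assumption.

```python
from typing import Callable, Optional


def excerpt_with_reprompt(
    document: str,
    ask_model: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for an excerpt and reprompt until the reply is a real substring."""
    prompt = "Quote one sentence verbatim from this document:\n" + document
    for _ in range(max_attempts):
        reply = ask_model(prompt).strip()
        if reply and reply in document:
            return reply
        # feed the failure back so the next attempt can correct it
        prompt += (
            "\n\nYour previous answer was not an exact excerpt:\n"
            + reply
            + "\nTry again and copy the text exactly."
        )
    return None
```

the validation is an ordinary substring test, which no JSON schema can express. The cost is one extra model call per failed attempt.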
It does. With OpenAI at least you definitely can use token masking. There are some limitations, but even those can be worked around. I have used token masking on the OpenAI API with LMQL without any issues.
our paid product is still in Beta actually as we're continuing to build it out, but BAML itself is and always will be open source (runs fully locally as well - no extra network calls).
in terms of parsing, I do think we're likely the best approach as of now. Most other libraries do reprompting or rely on constraining grammars which require owning the model. Reprompting = slow + $$, constraining grammars = require owning the model. we just tried a new approach: parse the output in a more clever way.
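to make "parse the output in a more clever way" concrete, here is a deliberately tiny sketch of the general idea. It is not BAML's parser (that one is Rust-based and far more thorough). `lenient_json_loads` is a hypothetical helper that handles just two common failure modes: chatter around the object, and trailing commas.

```python
import json
import re
from typing import Any


def lenient_json_loads(text: str) -> Any:
    """Try strict JSON first, then apply two cheap repairs before giving up."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # repair 1: drop chatter or markdown fences around the outermost object
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    candidate = text[start : end + 1]
    # repair 2: remove trailing commas before } or ]
    # (naive: this would also touch a ",}" inside a string value)
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)


raw = 'Sure, here it is: {"name": "Ada", "skills": ["python", "rust",],} Hope that helps!'
parsed = lenient_json_loads(raw)
```

no second model call is needed, which is the appeal over reprompting. The catch is that every new failure mode needs its own repair rule.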
basically to support generation they would need to modify pick_best to support constraining. That would make it so they can't optimize the hot loop at their scales. They support super broad output constraints like JSON which apply to everyone, but that leads to other issues (things like chain-of-thought/reasoning perform way worse in structured responses).
XML is also a great option, but there are a few trade-offs:
> XML is many more tokens (much slower + $$$ for complex schemas)
> regardless of whether you're looking for } or </output>, it's really a matter of "does your parser work". when you have three tokens that need to be correct ("</", "output", ">"), the odds of a mistake are higher than when you just need "}".
That said, the parser is much easier to write, and we're actually considering supporting XML in BAML. have you found any reductions in accuracy?
Also, not sure if you saw this, but apparently Claude doesn't actually prefer XML, it just happens to work well with it. That was new info for me as well.
https://x.com/alexalbert__/status/1778550859598807178 (devrel @ Anthropic)
I think once you have < and / the rest becomes much easier to predict. In a way it “spreads” the prediction over several tokens.
The < indicates that the preceding information is in fact over. The “/” represents that we are closing something and not starting a subtopic. And the “output” defines what we are closing. The final “>” ensures that our “output” string is ended. In JSON all of that semantic meaning gets put into the one token }.
Hmm, that's an interesting way of thinking about it. The way I see it, I trust XML less, because the sparser representation gives it more room to make a mistake: if you think of every token as an opportunity to be correct or wrong, the higher token count needed to represent content in XML gives the model a higher chance to get the output wrong (kinda like the birthday paradox).
(Plus, more output tokens is more expensive!)
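a back-of-the-envelope version of that argument, assuming (a big simplification) that every token is independently correct with the same probability. All numbers below are made up for illustration.

```python
def prob_all_tokens_correct(per_token_accuracy: float, num_tokens: int) -> float:
    """P(no mistake in num_tokens tokens) if every token is independently correct."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be between 0 and 1")
    return per_token_accuracy ** num_tokens


# hypothetical token counts for the same payload in each format,
# and a hypothetical per-token accuracy
json_ok = prob_all_tokens_correct(0.999, 50)
xml_ok = prob_all_tokens_correct(0.999, 80)
print(f"JSON: {json_ok:.3f}  XML: {xml_ok:.3f}")
```

the independence assumption is exactly what the parent comment disputes: if "output" and ">" are near-certain once "</" has been emitted, those extra tokens add very little risk.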
e.g.
using the cl100k_base tokenizer (what GPT-4 uses), this JSON is 60 tokens:
Hey everyone! One of the creators of BAML here! Appreciate you sharing this post. For anyone interested in playing around with an interactive version of BAML online, check it out here: https://www.promptfiddle.com
Really interesting library! In the docs, could you describe in a bit more detail which kind of JSON errors it tolerates? And which models currently work best with your parsing approach?
fyi, we actually fix those specific errors in our parser :)