Hacker News

We don't know since no one is releasing their data.

Calling these models open source is like calling a binary open source because you can download it.

Which in this day and age isn't far from where we're at.



A big distinction is that you can build on top of (fine-tune) these released models just as well as if they had released the pre-training data.


You can also build on top of binaries if you use gotos and machine code.


This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.


If you don't know the statistical distribution of the original training data, then catastrophic forgetting is all but guaranteed with any extra training.
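(A common mitigation, not mentioned in this thread but standard in the continual-learning literature, is "rehearsal": mixing examples drawn from the original training distribution into each fine-tuning batch. A minimal sketch in plain Python; `new_data` and `replay_data` are hypothetical lists of training examples, and the ratio is an illustrative choice:)

```python
import random

def mixed_batches(new_data, replay_data, batch_size=8, replay_frac=0.25, seed=0):
    """Yield fine-tuning batches where a fixed fraction of each batch is
    'rehearsal' examples sampled from (an approximation of) the original
    training distribution, to reduce catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_frac))
    n_new = batch_size - n_replay
    for start in range(0, len(new_data), n_new):
        batch = list(new_data[start:start + n_new])
        batch += rng.sample(replay_data, k=min(n_replay, len(replay_data)))
        rng.shuffle(batch)
        yield batch

# Toy run: new task data is 0..99, "original distribution" data is 1000..1099.
batches = list(mixed_batches(list(range(100)), list(range(1000, 1100))))
```

The catch, as the parent comment notes, is that without knowing the original distribution you have nothing principled to fill `replay_data` with.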


I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.


One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM. Fragments, yes, but not all of it.


You can do better - generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on synthetic tokens generated with ChatGPT and it showed a 5x bump in efficiency, punching well above its weight.

Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.


Or shell scripts


You can fine-tune without the pre-training data too.

Mistral models are one example: they never released pre-training data, and there are many fine-tunes.


We are in agreement -- that's exactly what I am saying :)


We should just call these open-weight models at this point.


FWIW the Grok repo uses the term "open weights".


Exactly. Training data is key for any trained AI system. And I strongly suspect that every single company active in these modern AI systems is still struggling with how to tune their training data. It's easy to get something out of it, but it's hard to control what you get out of it.


> We don't know since no one is releasing their data.

Is anyone else just assuming at this point that virtually everyone is using the pirated materials in The Pile like Books3?


I think it's really, really clear that the majority of the data used to train all of these things was used without permission.


The model is open weight, which is less useful than open source, but more useful than fully proprietary (akin to the executable binaries you compare it to).


How about "weights available" as similar to the "source available" moniker?


weights available or model available, but yes.


Their data is the Twitter corpus, which is public. Or do you want a dump of their database for free too?


Twitter data in itself is both highly idiosyncratic and short by design, which alone is not conducive to training an LLM.


Saying "it's just the public Twitter corpus" is like saying "here's the Linux kernel, makefiles not included."


Or even "here are the Linux kernel makefiles, no sources included, enjoy."




Requiring a company to publish its production database for free is delusional. I haven't mentioned Musk anywhere in my comment; you must be obsessed with him.


It's fascinating you doubled down on your own straw man and still have the nerve to call others delusional.

You missed my point, which I wasn't very clear about, so my mistake. Although it doesn't seem like you're interested in understanding anyway.



