This seems intentionally obtuse. What you say is true, but it is obvious that this is much more of a pain than if you had the source code. On the other hand, fine-tuning is just as easy regardless of whether you have the original training data.
I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.
One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. By comparison, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM. Fragments, yes, but not all of it.
You can do better: generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained largely on synthetic data generated with ChatGPT (roughly 20B synthetic tokens, if I recall the technical report correctly) and it performed comparably to models about 5x its size, punching well above its weight.
Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden-variety Mistral or LLaMA and only collect the knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.
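For what it's worth, here is a minimal sketch of what "sampling with seeded concepts" could look like in Python. The concept list, the style list, the prompt template, and the function name `seeded_prompts` are all made up for illustration, and the step that sends each prompt to a generator model is left out entirely.

```python
import itertools
import random


def seeded_prompts(concepts, styles, n, seed=0):
    """Return n distinct generation prompts built from concept pairs.

    Hypothetical helper: pairing two seed concepts with a writing style is
    just one assumed way to push a generator toward more varied output.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    pairs = itertools.combinations(concepts, 2)
    combos = [(a, b, s) for (a, b) in pairs for s in styles]
    if n > len(combos):
        raise ValueError("not enough distinct combinations for n prompts")
    chosen = rng.sample(combos, n)  # sampling without replacement
    return [f"Write a {s} that connects {a} and {b}." for a, b, s in chosen]


if __name__ == "__main__":
    concepts = ["binary search", "tidal forces", "double-entry bookkeeping", "photosynthesis"]
    styles = ["textbook section", "worked exercise"]
    for prompt in seeded_prompts(concepts, styles, n=5, seed=1):
        print(prompt)
```

Sampling without replacement from explicit combinations guarantees no two prompts share the same (concept, concept, style) triple, which is the cheapest form of diversity control. Filtering out what a baseline model already knows would be a separate step on top of this.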
Exactly. Training data is key for any trained AI system. And I strongly suspect that every single company active in these modern AI systems is still struggling with how to tune their training data. It's easy to get something out of it, but it's hard to control what you get out of it.
Requiring a company to publish its production database for free is delusional. I haven't mentioned Musk anywhere in my comment; you must be obsessed with him.
Calling these models open source is like calling a binary open source because you can download it.
Which, in this day and age, isn't far from where we're at.