Meh, it depends a lot on the dataset, which is heavily skewed towards the main ...

mirekrusin · 2025-10-28T15:32:34 1761665554

But the only way to unskew it is to remove main language data because there isn't really any to add, no?

tensor · 2025-10-28T16:28:17 1761668897

You can also deliberately bias your sampling so that, when selecting new training instances, each language is chosen equally often. Generally, diversity of data is good, unless that data is "wrong", which, ironically, is probably most of the internet, but I digress.
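
A rough sketch of what I mean by balanced sampling, in plain Python. The names here (`balanced_sample`, the `"main"`/`"minor"` labels, the toy corpus sizes) are made up for illustration and not taken from any real pipeline: pick the language uniformly first, then pick an example within that language.

```python
import random
from collections import Counter

def balanced_sample(examples_by_lang, n, seed=0):
    """Draw n (lang, example) pairs: language chosen uniformly, then an example within it."""
    rng = random.Random(seed)
    langs = sorted(examples_by_lang)  # fixed order so the seed gives reproducible draws
    batch = []
    for _ in range(n):
        lang = rng.choice(langs)                    # every language equally likely
        example = rng.choice(examples_by_lang[lang])  # uniform within that language
        batch.append((lang, example))
    return batch

# Hypothetical toy corpus: the main language has 20x more data than the minor one.
corpus = {
    "main": [f"main-{i}" for i in range(1000)],
    "minor": [f"minor-{i}" for i in range(50)],
}

batch = balanced_sample(corpus, 10_000)
counts = Counter(lang for lang, _ in batch)
print(counts)  # both languages show up in roughly equal numbers
```

The catch is that nothing gets removed, but the smaller language's examples are repeated far more often than the main language's, so you are trading skew for repetition.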

RobotToaster · 2025-10-28T16:29:16 1761668956

Aren't they about as different as American English and British English?

svobodovic · 2025-10-28T19:40:13 1761680413

The difference is larger than, let's say, just a "dialect". They really are different languages, even though we generally understand each other quite well (younger generations less so). I've heard it's about as different as, e.g., Danish and Swedish; not sure if that comparison is helpful.