
Meh, it depends a lot on the datasets, which are heavily skewed towards the main languages. For example, models almost always confuse Czech and Slovak and often swap one for the other in the middle of chats.


But the only way to unskew it is to remove main language data because there isn't really any to add, no?


You can also bias your sampling to correct for the skew, so that when selecting new training instances each language is chosen with equal probability. Generally, diverse data is good, unless that data is "wrong", which, ironically, is probably most of the internet, but I digress.
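
To make that concrete, here is a rough sketch of what "each language is chosen equally" could look like. The function name `sample_balanced` and the `(lang, text)` tuple layout are made up for illustration, and this is not how any particular lab does it: pick the language uniformly first, then pick an example within that language.

```python
import random
from collections import defaultdict


def sample_balanced(examples, n, seed=0):
    """Draw n (lang, text) pairs so that every language is equally likely.

    `examples` is an iterable of (lang, text) tuples (hypothetical layout).
    Sampling is with replacement, so small languages get repeated.
    """
    rng = random.Random(seed)

    # Group the texts by language.
    by_lang = defaultdict(list)
    for lang, text in examples:
        by_lang[lang].append(text)

    langs = sorted(by_lang)  # sorted for reproducibility
    picked = []
    for _ in range(n):
        lang = rng.choice(langs)           # uniform over languages
        text = rng.choice(by_lang[lang])   # uniform within the language
        picked.append((lang, text))
    return picked
```

The catch is that a low-resource language gets its few examples repeated many times, so this rebalances exposure without adding any new information.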


Aren't they about as different as American English and British English?


The difference is larger than, let's say, just a "dialect". They really are different languages, even though we generally understand each other quite well (younger generations less so). I've heard it's about as different as, e.g., Danish and Swedish, though I'm not sure if that comparison is helpful.



