I've got a very real world use case I use DistilBERT for - learning how to label...

minimaxir · 2025-08-14T23:39:02 1755214742

ModernBERT may be a better base model if finetuning a model for a specific use case: https://huggingface.co/blog/modernbert

diwank · 2025-08-15T15:10:02 1755270602

Ettin is also a new favorite and a solid alternative: https://huggingface.co/jhu-clsp/ettin-encoder-1b

I'd encourage you to give SetFit a try, along with aggressively deduplicating your training set, finding the top ~2500 clusters per label, and using SetFit to train a multilabel classifier on those.
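A rough sketch of the dedup-and-cluster step, under assumptions that are not in the comment above: near-duplicates are grouped greedily by Jaccard similarity on word tokens, "top" is read as "largest clusters", and each example is a (text, label) pair. The names `normalize`, `jaccard`, `top_clusters_per_label`, and the 0.8 threshold are hypothetical. The greedy loop is quadratic, so a real training set would likely need embeddings or MinHash instead.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_clusters_per_label(examples, threshold=0.8, top_k=2500):
    """Return one representative per cluster, keeping the top_k largest clusters per label.

    examples: iterable of (text, label) pairs.
    threshold: assumed similarity cutoff for treating two texts as near-duplicates.
    """
    # label -> list of [representative_text, token_set, cluster_size]
    clusters = defaultdict(list)
    for text, label in examples:
        tokens = set(normalize(text).split())
        for cluster in clusters[label]:
            if jaccard(tokens, cluster[1]) >= threshold:
                cluster[2] += 1  # near-duplicate: grow the existing cluster
                break
        else:
            clusters[label].append([text, tokens, 1])  # start a new cluster

    result = []
    for label, label_clusters in clusters.items():
        # sorted() is stable, so ties keep their first-seen order
        largest = sorted(label_clusters, key=lambda c: c[2], reverse=True)[:top_k]
        result.extend((c[0], label) for c in largest)
    return result
```

The returned (text, label) pairs would then be the training set handed to SetFit; that training call is left out here.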

Either way, would love to know what worked for you! :)

ramoz · 2025-08-15T03:38:13 1755229093

Please provide updates when you have them.

weird-eye-issue · 2025-08-15T04:35:26 1755232526

It's going to perform badly unless you have very few tags and they're easy to classify.

AJRF · 2025-08-15T07:04:29 1755241469

You can solve this by training a model per taxonomy, then wrapping the individual models in a wrapper model that outputs joint probabilities. The largest number of labels I have in a taxonomy is 8.
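One way the wrapper could look, as a minimal sketch rather than the setup described above: each per-taxonomy model is assumed to be a callable that maps text to a dict of label probabilities, and the joint probability is assumed to be the product of the per-taxonomy probabilities (that is, the taxonomies are treated as independent). The class name `JointTaxonomyClassifier` and the stub models are hypothetical.

```python
from itertools import product
from math import prod


class JointTaxonomyClassifier:
    """Wraps one classifier per taxonomy and combines their outputs."""

    def __init__(self, models):
        # models: dict of taxonomy name -> callable(text) -> {label: probability}
        self.models = models

    def predict_per_taxonomy(self, text):
        """Run every per-taxonomy model on the same text."""
        return {name: model(text) for name, model in self.models.items()}

    def predict_joint(self, text):
        """Return {(label_1, ..., label_n): probability} over all label combinations.

        Assumes independence between taxonomies, so the joint probability
        is the product of the individual probabilities.
        """
        per_taxonomy = self.predict_per_taxonomy(text)
        names = list(per_taxonomy)
        joint = {}
        for combo in product(*(per_taxonomy[name].items() for name in names)):
            labels = tuple(label for label, _ in combo)
            joint[labels] = prod(p for _, p in combo)
        return joint


# Hypothetical stand-ins for trained models; real ones would run inference.
def topic_model(text):
    return {"sports": 0.75, "politics": 0.25}


def tone_model(text):
    return {"positive": 0.5, "negative": 0.5}


clf = JointTaxonomyClassifier({"topic": topic_model, "tone": tone_model})
joint = clf.predict_joint("some input text")
best = max(joint, key=joint.get)
```

The number of combinations grows multiplicatively with each taxonomy, so with many taxonomies it may be cheaper to take the top label from each model than to enumerate the full joint table.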