I've got a very real-world use case for DistilBERT: learning how to label WordPress articles. Tagging is one of those things that's kind of valuable, but not valuable enough to spend loads of compute on.
The great thing is I have enough data (100k+) to fine-tune and run a meaningful classification report over. The data is very diverse, and while the labels aren't totally evenly distributed, I can deal with the imbalance with a few tricks.
Can't wait to swap it out for this and see how the scores change. Will report back.
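The imbalance "tricks" above aren't spelled out, so as one illustrative possibility (an assumption, not necessarily what the poster uses), here is a minimal sketch of inverse-frequency class weights. The function name and example labels are hypothetical; the resulting weights would typically be passed to a weighted loss during fine-tuning.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical, deliberately skewed label distribution.
example_labels = ["news"] * 6 + ["recipes"] * 3 + ["travel"] * 1
weights = balanced_class_weights(example_labels)
```

The rarest label ends up with the largest weight, so mistakes on it cost more during training.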
I'd encourage you to give SetFit a try: aggressively deduplicate your training set, find the top ~2500 clusters per label, and use SetFit to train a multilabel classifier on that.
Either way, I'd love to know what worked for you! :)
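For the deduplication step suggested above, a rough sketch might look like the following. It only catches duplicates that match after simple normalization (case, punctuation, whitespace); the function names and sample records are hypothetical, and the clustering and SetFit training steps are left out because they depend on an embedding model.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def deduplicate(records):
    """Keep the first (text, labels) pair for each normalized text."""
    seen = set()
    unique = []
    for text, labels in records:
        key = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((text, labels))
    return unique


# Hypothetical records: the first two differ only in case and punctuation.
records = [
    ("10 Tips for Faster WordPress Sites", ["performance"]),
    ("10 tips for faster wordpress sites!", ["performance"]),
    ("A Beginner's Guide to Sourdough", ["recipes"]),
]
unique_records = deduplicate(records)
```

"Aggressive" deduplication would likely go further than this, for example by dropping near-duplicates based on embedding similarity, but exact-after-normalization matching is a cheap first pass.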
You can solve this by training one model per taxonomy, then wrapping the individual models in a wrapper model that outputs joint probabilities. The largest number of labels I have in a single taxonomy is 8.
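A minimal sketch of that wrapper idea, assuming each per-taxonomy model can be treated as a callable returning a label-to-probability dict, and assuming the taxonomies are independent so the joint probability is a simple product. The class name, taxonomy names, and stub models below are all hypothetical.

```python
from itertools import product
from math import prod


class JointTaxonomyClassifier:
    """Combine per-taxonomy classifiers into one joint distribution."""

    def __init__(self, models):
        # models: dict of taxonomy name -> callable(text) -> {label: prob}
        self.models = models

    def predict_proba(self, text):
        per_taxonomy = {name: model(text) for name, model in self.models.items()}
        names = list(per_taxonomy)
        joint = {}
        # Enumerate every label combination, one label per taxonomy.
        for combo in product(*(per_taxonomy[n].items() for n in names)):
            labels = tuple(label for label, _ in combo)
            # Independence assumption: multiply the per-taxonomy probabilities.
            joint[labels] = prod(p for _, p in combo)
        return names, joint


# Stub models standing in for fine-tuned classifiers (fixed outputs).
def topic_model(text):
    return {"tech": 0.7, "food": 0.3}


def tone_model(text):
    return {"formal": 0.6, "casual": 0.4}


wrapper = JointTaxonomyClassifier({"topic": topic_model, "tone": tone_model})
names, joint = wrapper.predict_proba("some article text")
best = max(joint, key=joint.get)
```

The number of combinations is the product of the per-taxonomy label counts, so with at most 8 labels per taxonomy full enumeration stays cheap for a handful of taxonomies but grows quickly beyond that.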