I didn't try the original BERT at all: I hadn't gotten good results from any LLMs on small document excerpts, so I assumed a substantial context was necessary for good results. The original BERT only accepts up to 512 tokens, while ModernBERT goes up to 8192. I ended up using a 2048-token limit.
Would you happen to know of any resources for how to distill a ModernBERT model out of a larger one? I'm interested in doing exactly what you did, but I don't know how to start.
I was trying to identify "evergreen" and "time-sensitive" kinds of writing -- basically, I wanted to figure out if web pages captured in 2016 would still have content that's interesting to read today or if the passage of time would have rendered them irrelevant.
Here's the training code that I used to fine-tune ModernBERT on the ~5000 pages I had labeled with Llama 3.3. It should be a good starting point if you have your own fine-tuning task like this. If you can get away with a smaller context than I used here, training will be much faster and you can use larger batches (requires experimentation).
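For anyone who wants the general shape of this without digging through my repo, here's a minimal sketch of the fine-tuning loop using Hugging Face `transformers`. It's an assumption-laden outline, not my exact code: the dataset filename, label names, and hyperparameters (learning rate, epochs, batch size) are placeholders you'd need to tune, and it assumes your Llama labels are stored as a JSONL file with `text` and `label` fields.

```python
# Sketch: fine-tuning ModernBERT as a binary "evergreen" vs. "time-sensitive"
# classifier. File names, label names, and hyperparameters are assumptions.

MODEL = "answerdotai/ModernBERT-base"
MAX_LEN = 2048  # smaller contexts train faster and let you raise the batch size
LABELS = {"time-sensitive": 0, "evergreen": 1}  # assumed label strings

def encode(batch, tokenizer):
    """Tokenize page text, truncating to MAX_LEN tokens, and map labels to ints."""
    enc = tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)
    enc["labels"] = [LABELS[l] for l in batch["label"]]
    return enc

def main():
    # Heavy imports kept here so the helpers above stay importable without a GPU.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    # Assumed input: one JSON object per line with "text" and "label" keys.
    ds = load_dataset("json", data_files="labeled_pages.jsonl")["train"]
    ds = ds.map(lambda b: encode(b, tokenizer), batched=True,
                remove_columns=ds.column_names)
    split = ds.train_test_split(test_size=0.1, seed=42)

    args = TrainingArguments(
        output_dir="modernbert-evergreen",
        learning_rate=5e-5,            # placeholder; tune for your data
        per_device_train_batch_size=8, # tune to fit your GPU at this context length
        num_train_epochs=3,
        eval_strategy="epoch",
        bf16=True,                     # assumes a GPU with bf16 support
    )
    Trainer(model=model, args=args,
            train_dataset=split["train"], eval_dataset=split["test"],
            processing_class=tokenizer).train()

if __name__ == "__main__":
    main()
```

The key lever is `MAX_LEN` together with `per_device_train_batch_size`: halving the context roughly lets you double the batch, which is the experimentation I mentioned above.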