I've been working on an editor built on top of ProseMirror to support saving web content in the form of rich text and predefined schemas. Given your academic research in web OCR, what is the current literature or tooling for saving web content using both the HTML and the visual cues from the rendered HTML? For example, <figure><img><figcaption> and <div><img><p> both look like captioned images when rendered, but they are represented differently in HTML. Is there a way to parse both into a simple [figure, [img], [figcaption]] schema?
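
To make the target concrete, here is a rough sketch of the HTML-only half of what I mean. It runs on a hypothetical toy tree type rather than real DOM or ProseMirror nodes; `El`, `el`, and `toFigure` are made-up names for illustration, and the list of caption-like tags is an assumption. It ignores visual cues entirely, which is the part I'm asking about.

```typescript
// Hypothetical minimal tree standing in for parsed HTML (not ProseMirror's Node type).
interface El {
  tag: string;
  children: El[];
}

// The target shape from the question: [figure, [img], [figcaption]].
type Figure = ["figure", ["img"], ["figcaption"]];

// Small helper to build toy elements.
const el = (tag: string, ...children: El[]): El => ({ tag, children });

// Assumed set of tags that may act as a caption; a real page would need more.
const CAPTION_TAGS = ["figcaption", "p", "span"];

// Structural heuristic only: a node counts as a captioned image when it has
// exactly one direct <img> child followed by at least one caption-like sibling.
function toFigure(node: El): Figure | null {
  const tags = node.children.map((c) => c.tag);
  const imgAt = tags.indexOf("img");
  if (imgAt === -1 || tags.lastIndexOf("img") !== imgAt) return null;
  const hasCaption = tags
    .slice(imgAt + 1)
    .some((t) => CAPTION_TAGS.indexOf(t) !== -1);
  return hasCaption ? ["figure", ["img"], ["figcaption"]] : null;
}

const semantic = el("figure", el("img"), el("figcaption"));
const generic = el("div", el("img"), el("p"));

console.log(JSON.stringify(toFigure(semantic)));
console.log(JSON.stringify(toFigure(generic)));
```

Both calls produce the same tuple, so the two markups collapse into one schema. The weakness is obvious: a <div> holding an image and an unrelated paragraph would also match, which is why I suspect rendered layout (position, size, proximity) has to be part of the signal.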