If it's dynamically updating based on a database of information that's not shipped to the app in its entirety, you either have to hope you've somehow seen and preserved all the data from exploring the app, or accept that some data may be lost.
> presumably JSON or CSV
That's presuming a lot. Even if it's accurate for most/all NYTimes infographics today, it doesn't mean it's accurate tomorrow, and it isn't accurate today for a lot of other sites.
> If it's dynamically updating based on a database of information that's not shipped to the app in its entirety, you either have to hope you've somehow seen and preserved all the data from exploring the app, or accept that some data may be lost.
Well, yeah, that's true of all normal websites too. That's precisely what web crawlers are for. If there's no index page that links to all pages, or some way of iterating through all the pages, you wouldn't be able to exhaustively archive any web site.
> Well, yeah, that's true of all normal websites too.
Not exactly. While you may miss data that isn't requested specifically, you can crawl the site and get most or all of what is accessible through links. Stuff only available through search results won't show up, but if it's discoverable through browsing, you can get it.
The same can't necessarily be said for custom interfaces that are JS heavy, possibly with non-link click actions, custom sliders, a graphical representation of a map that expects a click on a region, etc. An old style page that lists all the regions (like states, or counties in a state), or even that has a dropdown in a form? Those are much easier to crawl and archive.
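To make the distinction concrete, here's a minimal sketch of why link-based pages are easy to archive. It runs against a hypothetical in-memory "site" (the `PAGES` dict and its URLs are invented for illustration) rather than the network: anything reachable through `<a href>` links gets discovered by a plain breadth-first crawl, while an endpoint that's only reached via a JS click handler is invisible to it.

```python
# Sketch: a breadth-first link crawler over a toy in-memory site.
# PAGES and its URLs are hypothetical, purely for illustration.
from collections import deque
from html.parser import HTMLParser

PAGES = {
    "/": '<a href="/states">States</a>',
    "/states": '<a href="/states/ny">NY</a> <a href="/states/ca">CA</a>',
    "/states/ny": "<p>NY data</p>",
    "/states/ca": "<p>CA data</p>",
    # Only reachable via a JS click handler, so no <a href> points here:
    "/api/region?id=7": '{"data": "invisible to a link crawler"}',
}

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(start="/"):
    """Breadth-first crawl following only <a href> links."""
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(PAGES[url])
        queue.extend(parser.links)
    return seen

print(sorted(crawl()))  # the JS-only endpoint is never found
```

The crawl finds every page in the link graph but never sees `/api/region?id=7`, which is exactly the gap an archiver faces with click-driven interfaces.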
Sure, that's fair if they don't have a single call that fetches the whole dataset. Though I'd think an article would often be covering a specific, bounded dataset to make its point, and wouldn't need to query a table of indeterminate length.
We'd hope. Sometimes weird choices are made, and sometimes there are not-so-weird reasons (like another site in another country lifting the whole thing and presenting it as their own) for a site to choose to be a bit harder to scrape than you'd assume.