Interesting take on it. Some people probably wouldn't like to be called soft, but there is likely some truth to it.
I feel it really comes down to priorities.
Scraping has always been a means to an end for most companies: get data, then use it for something valuable. Before, getting the data was easy, but now it is getting increasingly harder.
I think the key here is highlighting that the era of cheap/easy/low-skill access to web data is ending. Companies either need to skill up on bypassing anti-bots or pay someone else to do it for them so they can focus on the data.
I just worry we're collapsing two things into one bucket:
harder in absolute terms vs harder relative to how much real engineering effort teams are willing to invest.
Those aren't the same, and to me the distinction matters.
One of the main ideas we explored here is how scraping has shifted from being mainly a technical challenge to an economic one:
- Infrastructure and proxies have gotten cheaper, but anti-bot defenses have evolved fast.
- Because of that, the real cost of scraping is now the cost per successful result, and spikes of 5x–20x can happen when defenses tighten.
- The bottleneck today isn’t just “can you scrape it?”, it’s whether you can do it profitably and efficiently.
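To make the "cost per successful result" framing concrete, here's a rough back-of-envelope sketch (all numbers are made up for illustration, not from any report):

```python
def cost_per_success(requests_sent, success_rate, cost_per_request):
    """Effective cost of each *usable* result, not each raw request."""
    successes = requests_sent * success_rate
    return (requests_sent * cost_per_request) / successes

# Cheap datacenter proxies, but anti-bot defenses block most attempts.
datacenter = cost_per_success(10_000, success_rate=0.05, cost_per_request=0.0002)

# Pricier residential proxies + headless browsers, far higher success rate.
residential = cost_per_success(10_000, success_rate=0.85, cost_per_request=0.002)

print(f"datacenter:  ${datacenter:.4f} per successful result")
print(f"residential: ${residential:.4f} per successful result")
```

Under these made-up numbers the "expensive" setup actually wins per successful result, which is the whole point: raw request price stops being the metric that matters once defenses tighten.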
I’d love to hear how folks here are dealing with rising scraping costs or what strategies have worked when data value doesn’t obviously outweigh defense costs.
Nice concept. I've definitely seen this play out in practice.
A lot of sites aren't impossible to scrape, but they're steadily getting more expensive. We're having to lean more on residential proxies, headless browsers, etc., just to get the same data that used to be straightforward...
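For anyone curious what "leaning on residential proxies" looks like in code, here's a minimal rotation helper using only the standard library. The proxy endpoints are placeholders (TEST-NET addresses), not a real provider; real residential providers typically hand out per-session gateways with credentials:

```python
import itertools
import urllib.request

# Hypothetical residential proxy pool; replace with your provider's gateways.
PROXIES = [
    "http://203.0.113.10:8000",
    "http://203.0.113.11:8000",
    "http://203.0.113.12:8000",
]

_rotation = itertools.cycle(PROXIES)

def opener_for_next_proxy():
    """Build a urllib opener that routes the next request through
    the next proxy in the pool, spreading load across exit IPs."""
    proxy = next(_rotation)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

# Each call rotates to the next IP:
first, _ = opener_for_next_proxy()
second, _ = opener_for_next_proxy()
```

In practice you'd layer retries and per-proxy failure tracking on top, but the core idea is just cycling exit IPs so no single address gets rate-limited.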
We just dropped the State of Web Scraping 2025 report. TL;DR: scraping is scaling—fast.
- Market boom: Web scraping is growing 15% YoY and projected to hit $13B by 2033. Web data is now a real asset class. The gold rush is on.
- AI + scraping: LLMs are getting surprisingly good at generating spiders, debugging selectors, and auto-healing. Still brittle, but improving fast. 2025 might be the year of the “self-healing” scraper.
- Bot wars intensify: Anti-bot tech is getting aggressive, with Cloudflare, decoy pages, forced JS rendering, and login walls. Scraping popular sites is now high-effort and high-cost.
- Proxy market shake-up: Residential/mobile proxy prices have dropped 25–50% in the last year, thanks to scrappy newcomers. But domain-level pricing is rising, creating more complexity and less transparency.
- Legal landscape: Lines are getting clearer: public data is generally safe; behind logins is risky. AI crawlers are under increasing scrutiny, and enforcement is likely to tighten.
- Scraping stack evolution: New tools are focusing on stealth, AI assistance, and integration into real data pipelines. The modern scraping stack looks more like infrastructure than hacked-together scripts.
Big picture: 2025 is shaping up to be a turning point. Smarter scrapers, tougher competition, higher stakes.
Instead of focusing solely on the "market boom" and technological advancements, we should be having a serious conversation about the societal implications and the long-term sustainability of this scraping free-for-all.
The "higher stakes" mentioned aren't just about cost and effort; they're about the potential erosion of online privacy, the destabilization of websites reliant on ad revenue, and the ethical responsibility of data practitioners.
It would be interesting to explore ways of profit sharing to cover loss of ad revenue, better caching to reduce the cost on websites themselves, and simple ways to opt in or opt out of personal information being used online!
Absolutely agree. AI's rapid evolution is making high-quality data more valuable than ever. As the saying goes, "data is the new gold," and in the AI race, it's the sharpest competitive edge. While LLMs are getting better at automating scraping tasks, they're not perfect yet, so reliable, precise data collection is still critical. Teams building AI tools need solid scraping pipelines to stay ahead, and in 2025, that's becoming less of a nice-to-have and more of a must-have.
Awesome! I agree. I've seen a huge spike in scraping demand lately, especially from teams building AI products, everything from training data to competitive monitoring.
LLMs are great for speeding up spider development, but maintenance is still tough with dynamic content and aggressive bot protection.
Scraping in 2025 isn’t just about data collection, it’ll be about reliability, scale, and staying compliant. Feels like we’re entering the enterprise era of scraping.
The ethics of these free VPNs and hidden proxy SDKs are very questionable. But they are crazy profitable for the proxy providers running them so unlikely to go away.
It is this type of attitude that is why websites are becoming so aggressive in blocking web scrapers.
Being an "ethical web scraper" is about your own ethics, not abusing other peoples servers/data, and preserving the open internet for everyone.
Yes, if you slam someone's website you will probably get the data you want, but it damages things for the collective. If everyone just slammed websites then:
1) Website owners would get pissed off and either shut data behind logins or make it inaccessible. This is basically what happened with LinkedIn.
2) They would make scraping much harder and costlier by deploying more advanced anti-bots, requiring everyone to use more expensive residential proxies and headless browsers to get past them.
Web scraping can be a burden for websites, so everyone should approach it in as responsible and ethical a way as possible.
I doubt many website owners feel they have an ethical responsibility to provide public APIs to their data so people wouldn't scrape their content. It is up to them what they want to do with it.
I just feel like the whole thing is a chicken and egg problem - both sides could be nicer to the other but no one wants to go first!
I do like the idea of identifying web scrapers in the user agent but I'm not sure how many websites would use it...
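For what it's worth, the "identify yourself" idea is cheap to implement. Here's a minimal polite-scraper sketch using only the standard library; the robots rules are inlined so the example is self-contained (normally you'd fetch them from the site's /robots.txt), and the bot name/URL are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Inlined for illustration; in practice use RobotFileParser.set_url()
# and .read() to fetch the live robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Identify the scraper honestly in the User-Agent, with a contact point.
HEADERS = {"User-Agent": "ExampleScraper/1.0 (+https://example.com/bot-info)"}

# Check each URL against the site's rules before fetching it.
allowed = parser.can_fetch(HEADERS["User-Agent"], "https://example.com/products")
blocked = parser.can_fetch(HEADERS["User-Agent"], "https://example.com/private/x")

# Respect the site's requested pacing between requests (default to 1s).
delay = parser.crawl_delay(HEADERS["User-Agent"]) or 1
```

It doesn't solve the trust problem either way, but it gives site owners a way to whitelist (or at least contact) you instead of reaching straight for the ban hammer.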
Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.
However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:
- Google: Ahrefs & SEMRush scrape Google so they can provide SEO analytics to companies looking to grow. Google's own keyword analytics aren't great, so Google has effectively outsourced providing a good analytics tool to Ahrefs & SEMRush, whose products increase the value of the Google SERPs ecosystem.
- Amazon + other e-commerce: Amazon wants brands and 3rd-party stores to list products on their site, and the companies scraping Amazon to provide product-placement tools to their users make it easier and more profitable to list products there, leading to more and more companies listing products on Amazon.
> Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.
Good point, I wouldn't say archiving is unethical at all... I was thinking more along the lines of someone scraping an entire segment of a website's data and reproducing it one-for-one on their own site with zero value added.
I don't think we can make broad statements saying that web scraping is ethical or unethical; it isn't that black and white. It really depends on what is being scraped, how it is being used, and the intention of the scraper.
Interesting!... I'm not a lawyer, so the content for this piece was based on commentary in the article below. It was written by their lawyer, but I would love to hear your counterpoint to it. It's always good to get multiple viewpoints on something.
The Zyte article isn't inaccurate; it's just a simplified assessment of a complicated issue. If you'd like a more nuanced perspective on this, please read my guest post on Prof. Goldman's blog.