Tuesday, December 13, 2016

Bulk web scraping a website and then reselling the content can land you in legal hot water

This interesting article on web scraping just came to my attention:

New York Times: Auction Houses Face Off in Website Data Scraping Lawsuit

Summary: An auction house in New York is suing an auction house in Dallas for copyright violations. The claim is that the Dallas house scraped the New York house's website listings, including the listing photos, and then SOLD those listings and photos in an aggregate database for profit.

As I'm the author of one of the most powerful and flexible web scraping toolkits (the Ultimate Web Scraper Toolkit), I have to reiterate the messaging found on the main documentation page: Don't use the toolkit for illegal purposes! If you are going to bulk scrape someone's website, you need to make sure you are legally in the clear to do so and that you respect reasonable rate limits and the like. Reselling the data acquired with a scraping toolkit seems like an extremely questionable thing to do from a legal perspective.

The problem with bulk web scraping is that it costs virtually nothing on the client end of things. Most pages with custom data these days are served dynamically to some extent. It takes not insignificant CPU time and RAM on the server side to build a response. If one bad actor is excessively scraping a website, it drags down the performance of the website for everyone else. If a website operator starts to rate limit requests, then they will run into additional technical issues (e.g. they'll end up blocking Googlebot and effectively delist themselves from Google search results).
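To make the rate-limiting trade-off concrete, here is a minimal sketch of a per-client token bucket, one common way a server might throttle requests. The class name, parameters, and client identifiers are all hypothetical and chosen for illustration; nothing here comes from a particular web server or from the toolkit. Note that a limiter like this treats every client identically, which is exactly how a busy crawler such as Googlebot can end up throttled along with the bad actors.

```python
import time


class TokenBucketLimiter:
    """Hypothetical per-client token bucket.

    Each client may burst up to `capacity` requests; tokens refill at
    `rate` tokens per second.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # client id -> (tokens remaining, time of last update)

    def allow(self, client_id):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        tokens, last = self.buckets.get(client_id, (self.capacity, now))
        # Refill for the time elapsed since the last request, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False
```

A server would typically call `allow()` with the client's IP address and answer with an HTTP 429 when it returns False. Exempting a crawler from the limit would require a separate, verified allowlist, which this sketch deliberately leaves out.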

The Ultimate Web Scraper Toolkit is capable of easily disguising requests so that they mimic a real web browser or a bot like Googlebot. It follows redirects, passes in HTTP Referer headers, transparently and correctly handles HTTP cookies, and can even extract forms and form fields from a page. It even has an interactive mode that comes about as close to the real thing from the command line as a person can get in a few thousand lines of code. I'm only aware of a few toolkits that even come close to the capabilities of the Ultimate Web Scraper Toolkit, so I won't be surprised if it comes to light that what I've built has been used for illegal purposes despite the very clear warnings. With great power comes great responsibility, and it appears that some people just aren't capable of handling that.

Personally, I limit my bulk scraping projects and have only gotten into trouble one time. I was web scraping a large amount of publicly available local government data for analysis and the remote host was running painfully slowly. I was putting a 1 second delay between each request, which seemed reasonable, but apparently my little scraper project was causing serious CPU issues on their end and they contacted me about it. The correct fix on their end was probably to apply a database index, but I wasn't going to argue for that. I had already retrieved enough data for what I needed to do, so I terminated the scraper to avoid upsetting them. Even though I was well within my legal rights to keep the scraper up and running, maintaining a healthy relationship with people I might need to work with in the future is important to me.
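A fixed 1 second delay can still be too aggressive when the remote host is already struggling. One possible refinement, sketched below under stated assumptions, is to stretch the pause when responses are slow. The function names, the multiplier of 5, and the 60 second cap are all made up for illustration; this is not how the toolkit behaves, just one way to be a more considerate client.

```python
import time
import urllib.request


def polite_delay(response_seconds, min_delay=1.0, load_factor=5.0, max_delay=60.0):
    """Pause at least `min_delay` seconds, and longer when the server is slow.

    The pause grows to `load_factor` times the last response time,
    capped at `max_delay`. All defaults are illustrative guesses.
    """
    return min(max_delay, max(min_delay, response_seconds * load_factor))


def fetch_all(urls, fetch=None, sleep=time.sleep, clock=time.monotonic):
    """Fetch a list of URLs one at a time, pausing between requests."""
    if fetch is None:
        # Default fetcher uses the standard library; swap in your own as needed.
        fetch = lambda url: urllib.request.urlopen(url, timeout=30).read()
    results = []
    for i, url in enumerate(urls):
        start = clock()
        results.append(fetch(url))
        elapsed = clock() - start
        if i < len(urls) - 1:  # no need to wait after the final request
            sleep(polite_delay(elapsed))
    return results
```

With these defaults, a server that answers in 0.1 seconds sees the usual 1 second gap, while one that takes 2 seconds to respond gets a 10 second breather before the next request.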

Here's an interesting twist: Googlebot is the worst offender when it comes to web scraping. I've seen Googlebot scrape little rinky-dink websites at over 30,000 requests per hour! (And that's counting only the requests in the server logs that came from Googlebot's officially published IP address range.) I'd hate to see Googlebot stats for larger websites. If you don't allow Googlebot to scrape your website, then you don't get indexed by Google. If you rate limit Googlebot, you get dropped off the first page of Google's search results. Ask any website systems administrator and they will tell you the same thing about Googlebot being a flagrant abuser of Internet infrastructure.

But no one is suing Google over the extreme abuse of our infrastructure because everyone wants/needs to be listed on Google so that various business operations function. Yet at least one auction house is happy to sue another auction house while, at the same time, both are happy to let Googlebot run unhindered over the same infrastructure. I believe FCC Net Neutrality rules could be used as a defense play here to limit the damages awarded. IMO, the Dallas auction house, if they did indeed do what was claimed, violated copyright law, and reselling scraped photos is the worst part. Then again, so did Google, and you simply can't play favorites under the FCC Net Neutrality rules. Those same listings are certainly web scraped and indexed by Google because the auction houses want people to find them, which means Google has similarly violated copyright law to be able to index the site. Google then resells those search results in the form of the AdWords platform and turns a profit at the expense of the New York auction house. No one's batting an eyelash at that legally questionable conflict of interest, and the path of logic to this point is arguably the basis of a monopoly claim against Google/Alphabet.

IMO, under FCC Net Neutrality rules, unless the New York auction house (Christie's) similarly sues Google/Alphabet, they shouldn't be allowed to claim full copyright damages from the Dallas auction house.
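For anyone who wants to check crawler request rates on their own server, a tally restricted to a known IP range can be put together in a few lines. The sketch below assumes access logs in the common/combined log format, and the function name is hypothetical. The CIDR ranges are supplied by the caller; the `192.0.2.0/24` range in the usage note is a reserved documentation range standing in as a placeholder, not a real Googlebot range.

```python
import ipaddress
from collections import Counter


def requests_per_hour(log_lines, cidr_ranges):
    """Count requests per hour from clients inside the given CIDR ranges.

    Assumes common/combined log format, e.g.:
    192.0.2.10 - - [13/Dec/2016:14:05:09 +0000] "GET / HTTP/1.1" 200 512
    """
    networks = [ipaddress.ip_network(c) for c in cidr_ranges]
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # not a recognizable log entry
        try:
            ip = ipaddress.ip_address(parts[0])
        except ValueError:
            continue  # first field is not an IP address
        if any(ip in net for net in networks):
            # Field 4 looks like "[13/Dec/2016:14:05:09"; keep date plus hour.
            hour = parts[3].lstrip("[")[:14]
            counts[hour] += 1
    return counts
```

Calling `requests_per_hour(open("access.log"), ["192.0.2.0/24"])` returns a `Counter` keyed by strings such as `13/Dec/2016:14`. Swap in the crawler's actual published ranges to get real numbers.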

At any rate, all of this goes to show that we need to be careful about what we scrape, and we definitely shouldn't sell or resell what we scrape. Stay cool (and legal), folks!
