labs.insert-title.com

Competitive Data Scraping

Brute Research

In my work I spend a lot of time gathering data from non-API sources. The upside is that it is very rewarding when you pull it off; the downside is that it forces you into potentially uncharted and ethically challenging territory.

Size up the target

What's the take?

How much data am I going to be scraping? If I am about to scrape a measly thousand pages, this doesn't take a lot of thought: just get in there and get it. But if I am confronted with hundreds of thousands of pages, or even millions, the strategy becomes very different. You ever try storing a few hundred thousand files in a single directory? Don't.
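One common way around the single-directory problem is to hash each URL and use the first few characters of the hash as nested folder names. The sketch below is only an illustration of that idea, not a description of any particular project: the `sharded_path` and `save_page` names, the `.html` suffix, and the two-level, two-character layout are all assumptions.

```python
import hashlib
from pathlib import Path


def sharded_path(root: Path, url: str, levels: int = 2, width: int = 2) -> Path:
    """Map a URL to a file path spread across nested subdirectories."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Consecutive slices of the hex digest name the nested directories,
    # e.g. <root>/ab/cd/<full digest>.html for levels=2, width=2.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, digest + ".html")


def save_page(root: Path, url: str, html: str) -> Path:
    """Write one scraped page to its sharded location and return the path."""
    path = sharded_path(root, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

With two levels of two hex characters there are 65,536 possible leaf directories, so even a million pages averages out to roughly fifteen files per directory.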

What's the security like?

Where are the guards, and how might I be noticed? Some websites employ mechanisms to detect the likes of me coming and will automatically block IP addresses for a certain amount of time. Showing up on this radar tends to lead to blacklisting. Always have a healthy list of proxy servers or a map of all the wireless hotspots in your area... your choice.
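For the proxy route, a rough sketch of what "a healthy list" can look like in practice is a round-robin pool that rests any proxy the target has just blocked. Everything here is an assumption for illustration: the `ProxyPool` and `opener_for` names, the ten-minute cooldown, and the idea that the caller decides when a proxy counts as blocked.

```python
import itertools
import time
import urllib.request


class ProxyPool:
    """Round-robin over proxies, skipping any that are cooling down."""

    def __init__(self, proxies, cooldown=600.0, clock=time.monotonic):
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)
        self._blocked_until = {}
        self._cooldown = cooldown  # seconds to rest a blocked proxy (assumed)
        self._clock = clock

    def mark_blocked(self, proxy):
        """Record that the target just blocked this proxy."""
        self._blocked_until[proxy] = self._clock() + self._cooldown

    def next_proxy(self):
        """Return the next proxy that is not cooling down."""
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if self._blocked_until.get(proxy, 0.0) <= self._clock():
                return proxy
        raise RuntimeError("every proxy is cooling down")


def opener_for(proxy):
    """Build a urllib opener that routes requests through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)
```

When a request comes back looking like a block, call `mark_blocked` on the proxy that made it and ask the pool for the next one.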

How much time should I take?

Can I get away with it all day, or should I wait until after dark when the customers are gone? A client has expectations, and you really have to be able to sit down and determine exactly what it will take to get the amount of information needed. When you are looking at scraping hundreds of thousands of website pages, you'll find that shaving off half a second per 10 pages, or some other minute improvement, adds up very quickly. I usually break such a project into phases:

  1. Amount of time to get the data
  2. Amount of time to parse out the data
  3. Amount of time to structure the data in a user-friendly format.
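To put numbers on the three phases above, a back-of-the-envelope calculation is usually enough. The sketch below assumes a job of 500,000 pages and made-up per-page timings for each phase; none of these figures come from a real project.

```python
def estimate_hours(pages, seconds_per_page):
    """Turn per-page timings for each phase into total hours per phase."""
    return {phase: pages * secs / 3600 for phase, secs in seconds_per_page.items()}


def hours_saved(pages, seconds_saved, per_pages):
    """Hours gained by saving `seconds_saved` on every `per_pages` pages."""
    return pages * (seconds_saved / per_pages) / 3600


PAGES = 500_000  # hypothetical job size
# Hypothetical per-page timings, in seconds, for the three phases.
TIMINGS = {"get": 1.0, "parse": 0.2, "structure": 0.1}

if __name__ == "__main__":
    for phase, hours in estimate_hours(PAGES, TIMINGS).items():
        print(f"{phase}: {hours:.1f} hours")
    print(f"saved: {hours_saved(PAGES, 0.5, 10):.1f} hours")
```

Under these assumptions, half a second saved per 10 pages is 25,000 seconds over the whole job, or roughly seven hours.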

Why?

Why would anyone want to do this? Setting aside malicious uses of screen scraping, like RSS scraping for content generation or just plain plagiarism, a legitimate business owner has two very good reasons.

First, some data is just not publicly available in a usable format. In this day and age screen scraping seems to be less and less of a practiced art, but there are still some darkened corners of the Internet that just beg for someone to come along and use that secluded data in a more productive manner.

Second, in many instances cost is a big factor. Certain organizations will sell you their data, or allow access to it on a subscription basis, when that same data is simply sitting on a public website somewhere waiting to be plucked. On top of that, the source often distributes its data in tiers, meaning you get the data and then find out that you have to pay more to get what you really wanted.

What to expect

Expect to need more hardware, more custom software and more coffee. Possibly some sort of counseling from time to time. Data scraping for profit can be painfully complex. The wear on you won't compare to the wear on your hardware, though.

Comments are temporarily disabled; you can find me @joemaddalone