Misc

MAY 30 2008

Random things this month
Author:

In the midst of running around closing "mad deals, yo" and laughing about the new l33tspeak terms like intarwebs and pwned, I ran across a few things in the last month or so.

A tacky URL

Upsell?

I really think the person who decided on this had spent the day with a marketing person who filled their head with the term "Upsell."


Some great software

Just try out the brainstorm function; I've never been able to realize ideas so fast. They have web sitemap software as well, but this is truly the "cat's meow" for those of us who need a substantial amount of planning for a client or ourselves.


The Greatest Browser Ever!

Data scrapers unite! Our prayers have been answered: the data gods have given us great manna from heaven, and it is called Kirix Strata.


Also

A really solid designer-illustrator guy someone needs to hire

0 Comments

APR 23 2008

Competitive Data Scraping
Author: Joe Maddalone

Brute Research

In my work I spend a lot of time gathering data from non-API sources. The upside is that it is very rewarding when you pull it off successfully; the downside is that it forces you into potentially uncharted and ethically challenging territory.

Size up the target

What's the take?

How much data am I going to be scraping? If I am about to scrape a measly thousand pages, this doesn't take a lot of thought: just get in there and get it. But if I am confronted with hundreds of thousands of pages to scrape, or even millions, the strategy becomes very different. You ever try storing a few hundred thousand files in a single directory? Don't.
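One common way around the single-directory problem is to spread files across nested subdirectories keyed by a hash of the URL. What follows is a minimal sketch of that idea, not a description of any particular tool: the names `shard_path` and `save_page`, the `.html` suffix, and the two-level, two-character layout are all assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def shard_path(root: Path, url: str, levels: int = 2, width: int = 2) -> Path:
    """Map a URL to a file path spread across nested subdirectories.

    Hypothetical helper: the SHA-1 hex digest of the URL names the file,
    and its leading characters pick the subdirectories.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # With the defaults: the first two hex chars name the top directory,
    # the next two name the directory below it (256 * 256 buckets).
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, digest + ".html")


def save_page(root: Path, url: str, html: str) -> Path:
    """Write one scraped page to its sharded location and return the path."""
    path = shard_path(root, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

With the defaults, a million pages spread over 65,536 buckets averages out to roughly 15 files per directory, which most filesystems and file browsers handle without complaint.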

What's the security like?

Where are the guards, and how might I be noticed? Some websites employ mechanisms to detect the likes of me coming and will block IP addresses automatically for a certain amount of time. Showing up on this radar tends to lead to blacklisting. Always have a healthy list of proxy servers or a map to all the local wireless spots in your area... your choice.

How much time should I take?

Can I get away with it all day, or should I wait until after dark when the customers are gone? A client has expectations, and you really have to be able to sit down and determine exactly what it will take to get the amount of information needed. When you are looking at scraping hundreds of thousands of pages, you'll find that shaving half a second off every 10 pages, or some other minute improvement, will add up very quickly. I usually break such a project into phases like:

  1. Amount of time to get the data
  2. Amount of time to parse out the data
  3. Amount of time to structure the data in a user-friendly format.
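To put a rough number on how quickly small savings add up in the first phase, here is a back-of-the-envelope sketch. The page count and the one-second baseline per page are made-up figures for illustration; only the "half a second per 10 pages" saving comes from the text above.

```python
def crawl_hours(pages: int, seconds_per_page: float) -> float:
    """Total fetch time in hours for a strictly sequential crawl."""
    return pages * seconds_per_page / 3600


pages = 500_000              # hypothetical job size
baseline = 1.0               # hypothetical seconds per page before tuning
saving_per_page = 0.5 / 10   # half a second shaved off every 10 pages

before = crawl_hours(pages, baseline)
after = crawl_hours(pages, baseline - saving_per_page)
print(f"{before:.1f} h -> {after:.1f} h, saved {before - after:.1f} h")
```

Under those assumptions the tweak saves 25,000 seconds, or close to seven hours of crawl time, before any parsing or formatting work has even started.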

Why?

Why would anyone want to do this? Setting aside malicious uses of screen scraping, like RSS scraping for content generation or just plain plagiarism, a legitimate business owner has two very good reasons.

First, some data is just not publicly available in a usable format. In this day and age it would seem that screen scraping is becoming less and less of a practiced art, but there are still some darkened corners of the Internet that just beg for someone to come along and use that secluded data in a more productive manner.

Second, in many instances cost is a big factor. There are certain organizations that will sell you their data or allow access to it on a subscription basis when that same data is simply sitting on a public website somewhere waiting to be plucked. On top of that, the data a source sells often comes with caveats: the information is distributed in tiers, meaning you get the data and find out that in order to get what you really wanted you have to pay more.

What to expect

Expect to need more hardware, more custom software and more coffee. Possibly some sort of counseling from time to time. Data scraping for profit can be painfully complex. The wear on you won't compare to the wear on your hardware, though.

0 Comments

JUL 29 2006

Google, baby, I'm sorry
Author: Joe Maddalone

I am a datascraper, my hands is permanently puckered... I am a datascraper.

The above line... it's an American Pop reference for my fellow film buffs out there. In the meantime, Google... baby... really... I won't do it again. I was drunk and she meant nothing to me... and she said she was 18... and the devil made me do it.

The above message is what I see regardless of what I search for at this point on Google... I am curious how long I am banned from even searching now. To be clear... I deserved this one.

3 Comments

OCT 28 2005

Google goes down, Google goes down
Author: Joe Maddalone

10/28/2005... about 4:10 PM (CST)

Saw this today, and it just tickled me.


For those of you who don't know it, this site is more or less blacklisted from Google... so I always get a warm fuzzy from things like this.

5 Comments

