As usual, rather than download a small program that I am certain is available in endless variety, I decided to reinvent the wheel to meet one of my client's needs.
The primary function is to split large CSV files and maintain the headers in order to create smaller files for uploading to a catalog website that limits file upload size.
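The splitting idea can be sketched in a few lines. This is not the .NET tool itself, just a minimal JavaScript illustration; `splitCsv` and `maxRows` are hypothetical names, a row count stands in for the real upload size limit, and the sketch assumes one record per line (no quoted fields containing line breaks).

```javascript
// Split CSV text into chunks of at most maxRows data rows,
// repeating the header row at the top of every chunk.
// Assumption: one record per line (no quoted fields with embedded newlines).
function splitCsv(csvText, maxRows) {
  if (!Number.isInteger(maxRows) || maxRows < 1) {
    throw new RangeError("maxRows must be a positive integer");
  }
  // Accept both Unix and Windows line endings and ignore blank lines.
  const lines = csvText.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length === 0) return [];
  const [header, ...rows] = lines;
  const chunks = [];
  for (let i = 0; i < rows.length; i += maxRows) {
    chunks.push([header, ...rows.slice(i, i + maxRows)].join("\n") + "\n");
  }
  return chunks;
}

const sample = "sku,name,price\n1,Widget,9.99\n2,Gadget,19.99\n3,Gizmo,4.50\n";
const parts = splitCsv(sample, 2);
console.log(parts.length); // 2
console.log(parts[1]);     // header line followed by the Gizmo row
```

Each chunk would then be written to its own file. A file containing only a header yields no chunks in this sketch.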
Built on .NET 2.0. The framework is not included, and no installation is required.
In my work I spend a lot of time gathering data from non-API sources, which has the upside of being very rewarding when you pull it off successfully, but has the downside of forcing you into potentially uncharted and ethically challenging territory.
What's the take?
How much data am I going to be scraping? If I am about to scrape a measly thousand pages, this doesn't take a lot of thought: just get in there and get it. If I am confronted with hundreds of thousands of pages, or even millions, the strategy becomes very different. You ever try storing a few hundred thousand files in a single directory? Don't.
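One common way around the single-directory problem is to spread files across nested folders chosen by a hash of the page's URL. The sketch below is only an illustration: `fnv1aHex` and `shardedPath` are hypothetical helpers, the hash is an FNV-1a style hash over UTF-16 code units, and the caller is assumed to supply a unique, filesystem-safe file name.

```javascript
// 32-bit FNV-1a style hash over UTF-16 code units, as 8 lowercase hex digits.
function fnv1aHex(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Pick a two-level directory (256 x 256 buckets) from the hash of the URL.
// The hash only chooses the folder; fileName must already be unique.
function shardedPath(url, fileName) {
  const hex = fnv1aHex(url);
  return `${hex.slice(0, 2)}/${hex.slice(2, 4)}/${fileName}`;
}

const examplePath = shardedPath("http://example.com/catalog?page=1", "page-1.html");
console.log(examplePath); // two hex-named folders, then page-1.html
```

With 65,536 buckets, a million files average out to roughly fifteen per directory, assuming the hash spreads the URLs evenly.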
What's the security like?
Where are the guards, and how might I be noticed? Some websites employ mechanisms to detect the likes of me coming and will automatically block IP addresses for a certain amount of time. Showing up on this radar tends to lead to blacklisting. Always have a healthy list of proxy servers or a map of all the local wireless spots in your area... your choice.
How much time should I take?
Can I get away with it all day, or should I wait until after dark when the customers are gone? A client has expectations, and you really have to be able to sit down and determine exactly what it will take to get the amount of information needed. When you are looking at scraping hundreds of thousands of website pages, you'll find that shaving off half a second per 10 pages, or some other minute improvement, adds up very quickly. I usually break such projects into phases like
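To put the half-second figure in perspective, here is the arithmetic as a small sketch. The 500,000-page total is a made-up example, and `secondsSaved` is a hypothetical helper.

```javascript
// Seconds saved over a whole crawl when each batch of pages gets faster.
function secondsSaved(totalPages, pagesPerBatch, secondsSavedPerBatch) {
  return (totalPages / pagesPerBatch) * secondsSavedPerBatch;
}

// Half a second saved per 10 pages, over a hypothetical 500,000 pages.
const saved = secondsSaved(500000, 10, 0.5);
console.log(saved);                     // 25000 (seconds)
console.log((saved / 3600).toFixed(1)); // "6.9" (hours)
```

So a tweak that looks trivial on a ten-page test run is worth most of a working day on a crawl of that size.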
Why would anyone want to do this? Setting aside malicious uses of screen scraping, such as RSS scraping for content generation or just plain plagiarism, a legitimate business owner has two very good reasons.
First, some data is just not publicly available in a usable format. In this day and age screen scraping seems to be less and less of a practiced art, but there are still some darkened corners of the Internet that just beg for someone to come along and put that secluded data to more productive use.
Second, in many instances cost is a big factor. Certain organizations will sell you their data or allow access to it on a subscription basis when that same data is simply sitting on a public website somewhere waiting to be plucked. On top of that, the source often distributes its data in tiers, meaning you get the data and then find out that you have to pay more to get what you really wanted.
Expect to need more hardware, more custom software and more coffee. Possibly some sort of counseling from time to time. Data scraping for profit can be painfully complex. The wear on you won't compare to the wear on your hardware though.
For shiggles, I wanted to add a nice little trendy date format to my postings, but everything I found seemed a bit too wordy for me. It always looks like this:
<div class="post-date"> <span class="month">10</span> <span class="day">04</span> <span class="year">1977</span> </div>
This site is dynamically generated, so it wouldn't take much effort to implement such a structure, but I'm stubborn... so here goes:
<%@ Import Namespace="System.Drawing" %>
<%@ Import Namespace="System.Drawing.Imaging" %>
<script language="VB" runat="server">
Sub Page_Load(sender As Object, e As EventArgs)
    ' Date to render, passed in the query string as dt.aspx?dt=...
    Dim strDt As String = Request.QueryString("dt")
    ' Three-letter, upper-case month abbreviation
    Dim strMonth As String = Left(MonthName(Month(strDt)), 3).ToUpper()
    ' Zero-pad the day to two digits
    Dim strDay As String = Day(strDt).ToString()
    If strDay.Length = 1 Then
        strDay = "0" & strDay
    End If
    Dim strYear As String = Year(strDt).ToString()
    Dim baseMap As Bitmap = New Bitmap(95, 13)
    '13 cuts it off, which looks cool -- see emersian.com
    Dim myGraphic As Graphics = Graphics.FromImage(baseMap)
    Dim upBrush As SolidBrush = New SolidBrush(Color.Black)
    Dim downBrush As SolidBrush = New SolidBrush(Color.SteelBlue)
    Dim MonthFont As Font = New Font("tahoma", 11, FontStyle.Bold)
    ' Set the rendering hint before drawing so the text is actually anti-aliased
    myGraphic.TextRenderingHint = System.Drawing.Text.TextRenderingHint.AntiAlias
    ' White background, then month, day and year side by side
    myGraphic.FillRectangle(Brushes.White, 0, 0, 100, 25)
    myGraphic.DrawString(strMonth, MonthFont, upBrush, 0, 0)
    myGraphic.DrawString(strDay, MonthFont, downBrush, 30, 0)
    myGraphic.DrawString(strYear, MonthFont, upBrush, 50, 0)
    ' Stream the bitmap back to the browser as a GIF
    Response.ContentType = "image/gif"
    baseMap.Save(Response.OutputStream, ImageFormat.Gif)
    ' Release the GDI+ resources
    MonthFont.Dispose()
    upBrush.Dispose()
    downBrush.Dispose()
    myGraphic.Dispose()
    baseMap.Dispose()
End Sub
</script>
<img src="dt.aspx?dt=DATE STRING HERE" />
The image to load is designated in the onLoad of the "holder" sprite:
img.loadMovie("your file here");
Don't lose sight of what is actually important to your survival and what is not. So many of us get caught up in the clutter of various advertising scenarios and side projects that we can easily forget how we started and who our bread and butter customers really are. I recently reviewed the last few years of accounts and came to the realization that one of my most neglected avenues of income had added up to equal the payments of my largest client. Needless to say, I have reinvested efforts into it and am beginning to see real results.
Somewhere I once read that in five years you will be the people you associate with, the books you read and the music you listen to. This sounds a bit harsh, but I have to admit that I have seen it firsthand and it's solid advice.
Surround yourself with the people you admire. Collaborate and invite critique. Seek out those who challenge and inspire you.
Getting dismayed is natural, but it's really those who keep an eye on the prize that prevail. Fundamentally, you can't lose if you don't play, but you can't win either. Keep doing what it is that you love, keep trying new things and your day will come.
Many, many of the most successful businesses around, especially all these web startups, are founded on the idea of fixing an existing problem or adding a functionality that was needed. See a problem, fix it. Just look at 37signals: who knew that Ta-da List would turn into Basecamp, Highrise, etc.?
Darren Rowse, who lives in "The House that Google Built", started only two years ago - sounds crazy, doesn't it?
Know your exit. Know your exit. Know your exit.
I am not saying break the rules, but be certain to push them all the way to their limits. No one ever got rich by not pushing the boundaries of either customers or industry. This is absolutely true of today's online businesses. Look at the most successful eBayers with thousands and thousands of items listed, or bloggers with hundreds of sites. Google isn't just sitting back and raking it in; they are constantly pushing the boundaries of what they can get away with. You should too.
A wrong is a wrong is a wrong. Don't be afraid to admit when you are wrong. As long as you learn from your wrong decisions, they are valuable decisions.
Maybe a bit of both is really needed, but so many seem to fall into the pattern of playing it safe when what is needed is a little bit of courage.
Despite what people may think, the Internet is still very much like the Wild Wild West.
I am certain I left out a ton of great lines from gangster movies that could be added here. Perhaps you know one?
Data visualization is such a help sometimes. I don't know why Google does not use the Chart API in its AdSense reporting, but having experimented a bit with the Chart API while developing this site, I decided to use it for some actually valuable research.
In reviewing Jan 1 thru Apr 15 this year and last, I have come to the conclusion that server downtime is my biggest enemy.
It's not really my hosting company's fault though. I am just endlessly tweaking code and trying it out on the production server when I should be using a test environment first. Well, that changes today: no more tweaking without testing. Hunkering down and reinvesting myself into all this has been great so far. I can't believe how motivated I am.
I hope you folks will bear with the breakneck pace of posts lately. I'm just trying to keep up with my own mind.
Now if I could only get past this -- Starting Sep 13, 2007 only websites with over 100,000 daily page views across user pages will be eligible to participate in the AdSense API program. I could really have something.
As I mentioned before, I have been out of the blogging/SEO mindset for about a year.
Just a few initial changes, such as better CSS, regular updates and getting involved in the communities I enjoy, have brought a significant increase in traffic to this site specifically. So here are my methodologies at this point:
Manual submission seems to be a long-dead concept, but I still think it is absolutely the way to go, despite the whole Web 2.0 notion that between Twitter and RSS feeds the world will take notice. That said, I am also a fan of WebCEO for that quick-fix initial submission to SERPs (and yes, that is an affiliate link, but the Free version is still great!).
Regular updates means just that. I recall Darren Rowse used to do his blog-a-thons where he blogged non-stop for 24 hours, and I commend his efforts, but outside of that, if you can only commit to posting once a week, just make sure you do it each week. Modern search engine spiders keep track of how regularly a site is updated and come back to check at intervals determined by that data. Keep them fed and happy. Your readers don't want to come back to the same page each day either. If they know you'll be updating at a regular interval, they will return at those intervals.
Keeping track of how users get to your site can give you a great deal of insight into whether what you are writing works or not. Keep an eye on the search terms users entered to get to your site and follow suit.
Read those comments. Answer the questions and be a part of your own community.
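As a rough illustration of mining search terms from referrers, the sketch below tallies phrases found in a list of referrer URLs. It assumes the search engine passes the query in a `q` parameter (or `p`, as Yahoo did), which is not guaranteed for every engine or every visit; `searchPhrase` and `countSearchPhrases` are hypothetical helper names, and the sample URLs are made up.

```javascript
// Extract the search phrase from a referrer URL, or null if there is none.
// Assumption: the engine puts the query in a "q" (or "p") parameter.
function searchPhrase(referrer) {
  let url;
  try {
    url = new URL(referrer);
  } catch {
    return null; // empty or malformed referrer
  }
  const phrase = url.searchParams.get("q") || url.searchParams.get("p");
  return phrase ? phrase.trim().toLowerCase() : null;
}

// Count how often each phrase appears, most frequent first.
function countSearchPhrases(referrers) {
  const counts = new Map();
  for (const referrer of referrers) {
    const phrase = searchPhrase(referrer);
    if (phrase) counts.set(phrase, (counts.get(phrase) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const sampleReferrers = [
  "http://www.google.com/search?q=split+csv+file",
  "http://search.yahoo.com/search?p=Split+CSV+File",
  "http://www.google.com/search?q=screen+scraping",
  "",
];
const phraseCounts = countSearchPhrases(sampleReferrers);
console.log(phraseCounts);
// [ [ 'split csv file', 2 ], [ 'screen scraping', 1 ] ]
```

The phrases at the top of that list are the ones worth writing more about.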
Microsoft Virtual Earth is this easy to implement:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script type="text/javascript"
src="http://dev.virtualearth.net/mapcontrol/v5/mapcontrol.js">
</script>
<script type="text/javascript">
var map = null;
function GetMap()
{
map = new VEMap('myMap');
map.LoadMap();
}
</script>
</head>
<body onload="GetMap();">
<div id='myMap'
style="position:relative; width:500px; height:400px;"></div>
</body>
</html>