Cheap computing and low-cost storage for large volumes of data have turned big data into big business. But amid the frenzy to gather that data, especially unstructured information scraped from or accessed by crawling web sites, companies might be pushing the boundaries of polite (or ethical) behavior. They may also be stealing valuable IP. So can it be stopped, and could the current solutions lead to the demise of the open web?
The web is full of all kinds of data, some in data-friendly formats such as CSV files and some in indecipherable text or pricing formats that companies must clean before loading the data into a database and using it. Companies such as Infochimps or Microsoft’s Windows Azure Marketplace are trying to take some of the messier files and offer them to people. Other companies, such as Factual and Datafiniti, are building businesses based on scraping content from sites and then creating customized databases for clients.
And scraping is complex. Indexing a web page and then pulling the data from it can be beneficial, such as when Google indexes your web site, but not everyone is a good scraper. When done without regard to the host site, scraping can suck up the site’s bandwidth or even appropriate its intellectual property. Some argue that the behavior is problematic, while others argue that preventing it hurts consumers, society and maybe even the open web. So should one need a license to crawl the web?
------------------------------------------------------
Source: gigaom.com
By: Stacy Higginbotham
