Doesn't die -- HTML being a specification in name only, there are a lot of really crazy web pages out there that render in browsers but are pathological edge cases for a parser.
Does a good job of distinguishing 'good' links from 'bad' links on a page -- Lots of pages have links that should not be followed. Some are easy to spot: they are rendered in the same color as the background (black-hat SEO link juice). Others lead to crawler traps.
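One cheap heuristic for the hidden-link case is to compare a link's rendered color against the page background. A minimal sketch, assuming you already have RGB values from a renderer; the distance threshold is an illustrative guess, not a value from any real crawler:

```python
def is_hidden_link(link_rgb, background_rgb, threshold=30):
    """Flag links rendered (nearly) the same color as the background.

    Colors are (r, g, b) tuples; `threshold` is an assumed tolerance
    for 'close enough to be invisible'.
    """
    distance = sum(abs(a - b) for a, b in zip(link_rgb, background_rgb))
    return distance < threshold
```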
Crawler traps come in many forms -- Rich Skrenta created a great example: a page that generated a random number and said "%d is an interesting number; here are two more interesting numbers: %d and %d", where each of those links went to a new URL ending in that number. If you tried to crawl that site exhaustively, you would fill your entire crawler cache with random-number pages.
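A sketch of one defense against that kind of trap: collapse runs of digits in URLs into a template, and refuse to enqueue more than a fixed number of URLs per template. The limit of 100 is an arbitrary assumption:

```python
import re
from collections import Counter

def url_template(url):
    """Collapse digit runs so /number/17 and /number/42 share a template."""
    return re.sub(r"\d+", "{N}", url)

class TrapGuard:
    """Refuse to enqueue more than `limit` URLs sharing one template."""
    def __init__(self, limit=100):
        self.limit = limit
        self.counts = Counter()

    def allow(self, url):
        t = url_template(url)
        self.counts[t] += 1
        return self.counts[t] <= self.limit
```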
Dynamic importance scaling -- You want to crawl the 'best' pages for a topic, so you need a way to measure which pages are important and which aren't. This was the secret sauce of Google's PageRank patent, but it has been gamed to death by SEO types, so now you need better heuristics to decide which links are the more important ones to follow.
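For context, the classic PageRank computation itself is just a power iteration over the link graph. This is a textbook sketch of that baseline, not the better heuristics described above:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Classic power-iteration PageRank.

    `links` maps every page to the list of pages it links to
    (all link targets must themselves appear as keys).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: distribute its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```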
Effective crawl frontier management -- For every billion pages you decide to crawl, there are probably 20 to 50 billion pages you "know about". These URIs that are known but not yet crawled are referred to as the 'crawl frontier'. Picking where in the crawl frontier to look for useful new pages is half art and half good machine learning.
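At its simplest, a frontier is a scored priority queue over known-but-uncrawled URLs. A minimal sketch; the scoring itself (the hard part) is supplied by the caller here:

```python
import heapq

class Frontier:
    """Known-but-not-yet-crawled URLs, fetched in priority order.

    Lower score = fetch sooner. In a real crawler the score would come
    from importance heuristics; here the caller provides it.
    """
    def __init__(self):
        self._heap = []
        self._enqueued = set()

    def add(self, url, score):
        if url not in self._enqueued:
            self._enqueued.add(url)
            heapq.heappush(self._heap, (score, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```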
Good algorithmic de-packing -- Many pages today are generated algorithmically from a set of rules, whether it is the product pages on Amazon or the posts in a PHP forum. If you can recognize the algorithm early, you can avoid crawling pages that are duplicates or not useful.
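One piece of that recognition is URL canonicalization: strip parameters that don't change the page and sort the rest, so algorithmically generated variants of the same page collapse to one URL. The tracking-parameter list below is an illustrative assumption:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Assumed set of parameters that don't affect page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}

def canonicalize(url):
    """Drop tracking params, sort the rest, discard the fragment."""
    parts = urlsplit(url)
    params = sorted((k, v) for k, v in parse_qsl(parts.query)
                    if k not in TRACKING_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), ""))
```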
Good page de-duping -- There is a lot of repetition on the web, whether it is the 'how to sign up' page of every phpBB site ever or the same product with 10 different keywords in the URI.
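Near-duplicate pages can be caught by comparing word shingles. A small Jaccard-similarity sketch; the shingle size (and whatever cutoff you'd apply to the score) are assumptions:

```python
def shingles(text, k=4):
    """The set of all k-word windows in `text`."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=4):
    """Jaccard similarity of the two texts' shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```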
Selective JS interpretation -- Sometimes the page exists in the JavaScript code, not in the HTML. Unless you want to store 'this page needs JavaScript enabled to run' in your crawler cache, you need to recognize this situation and extract the page from the JavaScript.
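A crude way to recognize those pages: if the HTML contains scripts but almost no visible text survives stripping the tags, the content probably lives in the JavaScript. The 20-word cutoff is an arbitrary assumption:

```python
import re

def needs_js(html, min_visible_words=20):
    """Heuristic: scripts present, but almost no visible text left after
    stripping <script> bodies and tags, suggests a JS-rendered page.
    """
    has_script = "<script" in html.lower()
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return has_script and len(text.split()) < min_visible_words
```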
When you say Google and Microsoft have this advantage in creating data sets, is it just the massive size of the web indices they are able to compile, or do they use their crawlers in specific ways for compiling structured data that would be more useful for certain ML projects than a general web index?
Are there any tweaks you'd make to a crawler if you sent it out with the purpose of creating a dataset for a specific AI/ML project, rather than a general-purpose web index?
> ... do they use their crawlers in specific ways for
> compiling structured data that would be more
> useful for certain ML projects than a general
> web index?
There are many uses for a large index. For example, they decode it into structured data for many of the 'one box' results: the small box that shows up on the search results page with the answer to your query, even though that answer came from a web page. This is good for the consumer, who gets the answer right away without clicking through to a web page, and it's good for Google, as it keeps the customer on the search results page with its advertising rather than having to go to some page on the web, potentially with someone else's advertising on it.
Google also post-processed crawl data to indicate the spread of flu in its experiment extracting health data from query logs.
> Are there any tweaks you'd make to a crawler if you
> sent it out with the purpose of creating a dataset
> for a specific AI / ML project, rather than a
> general purpose web index?
Yes, there are many. Some of them made it into the Watson crawler. One of Blekko's claims to fame was its notion of 'slashtags', which were curated lists of known 'good' pages on a topic. Using such pre-validated URI lists can help you improve the fidelity of the datasets you collect. There are also clever ways to use existing data to validate the new data you are looking at. I'm on a couple of patent applications in that space which, if they ever issue, will make things a bit more obvious than they are today :-).
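At its simplest, the slashtag idea reduces to filtering the frontier against a curated domain list. The domains below are made-up placeholders, not Blekko's actual lists:

```python
from urllib.parse import urlsplit

# Hypothetical curated list in the spirit of a slashtag; not real data.
SLASHTAGS = {
    "health": {"example-clinic.org", "example-health.gov"},
}

def on_slashtag(url, tag):
    """True if the URL's host is (a subdomain of) a curated domain."""
    host = urlsplit(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in SLASHTAGS[tag])
```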
That's just off the top of my head.