Doesn't die -- HTML being a specification in name only, there are a lot of really crazy web pages out there that render in browsers but are pathological edge cases for a parser.
Does a good job of distinguishing 'good' links from 'bad' links on a page -- Lots of pages have links that should not be followed. Some are easy to spot: they are rendered in the same color as the background (black-hat SEO link juice). Others lead to crawler traps.
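One cheap heuristic for the hidden-link case is to compare a link's rendered color against the page background. A minimal sketch, assuming you already have RGB values from a renderer; the distance threshold is an illustrative guess, not a value from any real crawler:

```python
def is_hidden_link(link_rgb, background_rgb, threshold=30):
    """Flag links rendered (nearly) the same color as the background.

    Colors are (r, g, b) tuples; `threshold` is an assumed tolerance
    for 'close enough to be invisible'.
    """
    distance = sum(abs(a - b) for a, b in zip(link_rgb, background_rgb))
    return distance < threshold
```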
Crawler traps come in many forms -- Rich Skrenta created a great example: a page that generated a random number and said "%d is an interesting number; here are two more interesting numbers: %d and %d", where each of those links went to a new URL ending in that number. If you tried to crawl that site exhaustively, you would fill your entire crawler cache with random-number pages.
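A sketch of one defense against that kind of trap: collapse runs of digits in URLs into a template, and refuse to enqueue more than a fixed number of URLs per template. The limit of 100 is an arbitrary assumption:

```python
import re
from collections import Counter

def url_template(url):
    """Collapse digit runs so /number/17 and /number/42 share a template."""
    return re.sub(r"\d+", "{N}", url)

class TrapGuard:
    """Refuse to enqueue more than `limit` URLs sharing one template."""
    def __init__(self, limit=100):
        self.limit = limit
        self.counts = Counter()

    def allow(self, url):
        t = url_template(url)
        self.counts[t] += 1
        return self.counts[t] <= self.limit
```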
Dynamic importance scaling -- You want to crawl the 'best' pages for a topic, so you need a way to measure which pages are important and which aren't. This was the secret sauce of Google's PageRank patent, but it has been gamed to death by SEO types, so now you need better heuristics to decide which links are the more important ones to follow.
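For context, the classic PageRank computation itself is just a power iteration over the link graph. This is a textbook sketch of that baseline, not the better heuristics described above:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Classic power-iteration PageRank.

    `links` maps every page to the list of pages it links to
    (all link targets must themselves appear as keys).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: distribute its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```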
Effective crawl frontier management -- For every billion pages you decide to crawl, there are probably 20 to 50 billion pages you "know about". These URIs that are known but not yet crawled are referred to as the 'crawl frontier'. Picking where in the crawl frontier to look for useful new pages is half art and half good machine learning.
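At its simplest, a frontier is a scored priority queue over known-but-uncrawled URLs. A minimal sketch; the scoring itself (the hard part) is supplied by the caller here:

```python
import heapq

class Frontier:
    """Known-but-not-yet-crawled URLs, fetched in priority order.

    Lower score = fetch sooner. In a real crawler the score would come
    from importance heuristics; here the caller provides it.
    """
    def __init__(self):
        self._heap = []
        self._enqueued = set()

    def add(self, url, score):
        if url not in self._enqueued:
            self._enqueued.add(url)
            heapq.heappush(self._heap, (score, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```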
Good algorithmic de-packing -- Many pages today are generated algorithmically from a set of rules, whether it is the product pages on Amazon or the posts in a PHP forum. If you can recognize the algorithm early, you can avoid crawling pages that are duplicates or not useful.
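One piece of that recognition is URL canonicalization: strip parameters that don't change the page and sort the rest, so algorithmically generated variants of the same page collapse to one URL. The tracking-parameter list below is an illustrative assumption:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Assumed set of parameters that don't affect page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}

def canonicalize(url):
    """Drop tracking params, sort the rest, discard the fragment."""
    parts = urlsplit(url)
    params = sorted((k, v) for k, v in parse_qsl(parts.query)
                    if k not in TRACKING_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), ""))
```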
Good page de-duping -- There is a lot of repetition on the web, whether it is the 'how to sign up' page of every phpBB site ever or the same product with 10 different keywords in the URI.
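Near-duplicate pages can be caught by comparing word shingles. A small Jaccard-similarity sketch; the shingle size (and whatever cutoff you'd apply to the score) are assumptions:

```python
def shingles(text, k=4):
    """The set of all k-word windows in `text`."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=4):
    """Jaccard similarity of the two texts' shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```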
Selective JS interpretation -- Sometimes the page exists in the JavaScript code, not in the HTML. Unless you want to store 'this page needs JavaScript enabled to run' in your crawler cache, you need to recognize this situation and extract the page from the JavaScript.
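A crude way to recognize those pages: if the HTML contains scripts but almost no visible text survives stripping the tags, the content probably lives in the JavaScript. The 20-word cutoff is an arbitrary assumption:

```python
import re

def needs_js(html, min_visible_words=20):
    """Heuristic: scripts present, but almost no visible text left after
    stripping <script> bodies and tags, suggests a JS-rendered page.
    """
    has_script = "<script" in html.lower()
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return has_script and len(text.split()) < min_visible_words
```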
When you say Google and Microsoft have this advantage in creating data sets, is it just the massive size of the web indices they are able to compile, or do they use their crawlers in specific ways for compiling structured data that would be more useful for certain ML projects than a general web index?
Are there any tweaks you'd make to a crawler if you sent it out with the purpose of creating a dataset for a specific AI/ML project, rather than a general-purpose web index?
> ... do they use their crawlers in specific ways for
> compiling structured data that would be more
> useful for certain ML projects than a general
> web index?
There are many uses for a large index. For example, they decode it into structured data for many of the 'one box' results: the small box that shows up on the search results page with the answer to your query, even though that answer came from a web page. This is good for the consumer, who gets the answer right away without clicking through to a web page, and it's good for Google, as it keeps the customer on the search results page with its advertising rather than having to go to some page on the web, potentially with someone else's advertising on it.
Google also post-processed crawl data to indicate the spread of flu in its experiment extracting health data from query logs.
> Are there any tweaks you'd make to a crawler if you
> sent it out with the purpose of creating a dataset
> for a specific AI / ML project, rather than a
> general purpose web index?
Yes, there are many. Some of them made it into the Watson crawler. One of Blekko's claims to fame was its notion of 'slashtags', which were curated lists of known 'good' pages on a topic. Using such pre-validated URI lists can help you improve the fidelity of the datasets you collect. There are also clever ways to use existing data to validate the new data you are looking at. I'm on a couple of patent applications in that space which, if they ever issue, will make things a bit more obvious than they are today :-).
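At its simplest, the slashtag idea reduces to filtering the frontier against a curated domain list. The domains below are made-up placeholders, not Blekko's actual lists:

```python
from urllib.parse import urlsplit

# Hypothetical curated list in the spirit of a slashtag; not real data.
SLASHTAGS = {
    "health": {"example-clinic.org", "example-health.gov"},
}

def on_slashtag(url, tag):
    """True if the URL's host is (a subdomain of) a curated domain."""
    host = urlsplit(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in SLASHTAGS[tag])
```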
That's just off the top of my head.