Marcus P. Zillman, M.S., A.M.H.A. Author/Speaker/Consultant
Internet Happenings, Events and Sources


Sunday, July 11, 2004  

Scraping the Web for Implied Data
http://searchenginewatch.com/searchday/article.php/3374821

Dr. Gary Flake, Principal Scientist & Head of Yahoo! Research Labs, thinks that there is more implied data (or inferable metadata) than "raw" data on the Web, and that we are barely scratching the surface of it. "Today, all search engines are scraping for some simple forms of implied data: language, locality, etc. What's missing from this list is a nearly infinite collection of relationships that are obvious to most any human reader but extremely difficult to infer from a single document." He gives the example of a very technical document about protein folding, which assumes that the reader knows the specification language and much else about the material being presented. An ordinary reader might sense that the document "makes reference to physics in a non-trivial way," while an expert would note even more implied facts ("the article may be out-dated by now," "the author is considered an authority in this domain," or "there's an expectation that diseases will be curable if these advances continue"). Flake says: "In total, all of the implied data amounts to the stuff that all of us carry in our heads but no one bothers to write down; yet these factoids are essential to understanding and meaning. Some people in AI have been trying to codify these factoids for decades (and in many forms, from ontologies to databases of common sense). We are now starting to scrape the web for these subtle relationships. The key insight is that it is not enough to look at words, concepts, or documents; one must also look at how all of these things relate to one another." This article has been added to the articles section of the Deep Web Research Subject Tracer™ Information Blog.

posted by Marcus Zillman | 4:05 AM