Tuesday, January 02, 2007

Google Granted Patent for Estimating Similarity

This patent has been in the works for a few years; it was originally filed on December 31, 2001. Google's Moses Samson Charikar is listed as the inventor of a new method for estimating similarity between web pages and documents, designed to filter duplicate content.

Abstract
A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight. The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
This is designed to reduce the number of redundant, or nearly redundant, documents crawled and returned in response to a user's search query. The method will also dump sites determined to be substantial duplicators of content.
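
To make the abstract a little more concrete, here is a rough Python sketch of the steps it describes: build a weighted vector for a document, multiply each weight by a predetermined hashing vector, sum the products, and reduce the sum to a compact sketch. Several details are my own assumptions and are not stated in the abstract: word counts as the weights, +1/-1 hashing vectors derived from SHA-256, a 64-bit sketch made from the sign of each summed coordinate, and comparison by counting matching bits. The function names are hypothetical, and this is not meant as Google's actual implementation.

```python
import hashlib
from collections import Counter

SKETCH_BITS = 64  # assumed sketch length; the abstract does not give one


def hashing_vector(feature, bits=SKETCH_BITS):
    """Return a fixed +1/-1 vector for a feature (assumption: derived from SHA-256)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    value = int.from_bytes(digest[: bits // 8], "big")
    return [1 if (value >> i) & 1 else -1 for i in range(bits)]


def sketch(text, bits=SKETCH_BITS):
    """Build a compact sketch of a text as an integer with `bits` bits."""
    # Assumption: each distinct word is a coordinate, its count is the weight.
    weights = Counter(text.lower().split())

    # Multiply each weight by its hashing vector and sum the product vectors.
    summed = [0] * bits
    for feature, weight in weights.items():
        for i, h in enumerate(hashing_vector(feature, bits)):
            summed[i] += weight * h

    # Assumption: keep only the sign of each summed coordinate as one bit.
    result = 0
    for i, total in enumerate(summed):
        if total > 0:
            result |= 1 << i
    return result


def similarity(sketch_a, sketch_b, bits=SKETCH_BITS):
    """Estimate similarity as the fraction of sketch bits that agree."""
    differing = bin(sketch_a ^ sketch_b).count("1")
    return 1.0 - differing / bits
```

Under these assumptions, two pages that share most of their words should usually get sketches that agree on most bits, while unrelated pages should agree on roughly half of them, so a crawler could flag near-duplicates by comparing short sketches instead of full documents.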
