Tuesday, January 19, 2010

Helping computers understand language

This post is the latest in an ongoing series about how we harness the data we collect to improve our products and services for our users. - Ed.

An irony of computer science is that tasks humans struggle with can be performed easily by computer programs, but tasks humans can perform effortlessly remain difficult for computers. We can write a computer program to beat the very best human chess players, but we can't write a program to identify objects in a photo or understand a sentence with anywhere near the precision of even a child.

Enabling computers to understand language remains one of the hardest problems in artificial intelligence. The goal of a search engine is to return the best results for your search, and understanding language is crucial to returning the best results. A key part of this is our system for understanding synonyms.

What is a synonym? An obvious example is that "pictures" and "photos" mean the same thing in most circumstances. If you search for [pictures developed with coffee] to see how to develop photographs using coffee grounds as a developing agent, Google must understand that even if a page says "photos" and not "pictures," it's still relevant to the search. While even a small child can identify synonyms like pictures/photos, getting a computer program to understand synonyms is enormously difficult, and we're very proud of the system we've developed at Google.

Our synonyms system is the result of more than five years of research within our web search ranking team. We constantly monitor the quality of the system, but recently we made a special effort to analyze the impact and quality of synonyms. Most of the time, you probably don't notice when your search involves synonyms, because it happens behind the scenes. However, our measurements show that synonyms affect 70 percent of user searches across the more than 100 languages Google supports. We took a set of these queries and analyzed how precise the synonyms were, and were happy with the results: for every 50 queries where synonyms significantly improved the search results, we had only one truly bad synonym.

An example of a bad synonym from this analysis is in the search [dell system speaker driver precision 360], where Google thinks "pc" is a synonym for "precision." Note that you can still see this on Google today, because while we know it's a bad synonym, we don't typically fix bad synonyms by hand. Instead, we try to discover general improvements to our algorithms that fix the underlying problems. We hope it will be fixed automatically by a future change.

We also recently made a change to how our synonyms are displayed. In our search result snippets, we bold the terms of your search. Historically, we have bolded synonyms such as stemming variants — like the word "picture" for a search with the word "pictures." Now, we've extended this to words that our algorithms very confidently think mean the same thing, even if they are spelled nothing like the original term. This helps you to understand why that result is shown, especially if it doesn't contain your original search term. In our [pictures developed with coffee] example, you can see that the first result has the word "photos" bolded in the title:


(Note that because our synonyms depend on the other words in your search and use many signals, you won't necessarily always see the word "photos" bolded for "pictures", only when our algorithms think it is useful and important to bold.)
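To make the bolding behavior concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `CONFIDENT_SYNONYMS` table and the `bold_terms` function are hypothetical names invented for this example, the table is hand-written, and the sketch ignores the query context and signals that the real system relies on.

```python
import re

# Hypothetical, hand-written synonym table for illustration only.
# A real system would derive synonyms from data and from the whole query.
CONFIDENT_SYNONYMS = {
    "pictures": {"picture", "photos", "photo"},
}


def bold_terms(query, snippet, synonyms=CONFIDENT_SYNONYMS):
    """Wrap query terms and their listed synonyms in <b> tags."""
    targets = set()
    for term in query.lower().split():
        targets.add(term)
        targets |= synonyms.get(term, set())

    def mark(match):
        word = match.group(0)
        # Compare case-insensitively, but keep the snippet's original casing.
        return f"<b>{word}</b>" if word.lower() in targets else word

    return re.sub(r"[A-Za-z0-9]+", mark, snippet)


print(bold_terms("pictures developed with coffee",
                 "Developing photos with coffee and vitamin C"))
```

In this toy version "photos" is bolded even though the query never contains it, which is the effect described above; deciding when such a synonym is confident enough to bold is the hard part and is not modeled here.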

We use many techniques to extract synonyms, which we've blogged about before. Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts. In the above example, "photos" was an obvious synonym for "pictures," but it's not always a good synonym. For example, it's important for us to recognize that in a search like [history of motion pictures], "motion pictures" means something special (movies), and "motion photos" doesn't make any sense.

Another example is the term "GM." Most people know the most prominent meaning: "General Motors." For the search [gm cars], you can see that Google bolds the phrase "General Motors" in the search results. This is an indication that for that search we thought "General Motors" meant the same thing as "GM." Are there any other meanings? Many people can think of a second meaning, "genetically modified," which is bolded when GM is used in queries about crops and food, as in the search results for [gm wheat]. It turns out that there are more than 20 other possible meanings of the term "GM" that our synonyms system knows something about. GM can mean George Mason in [gm university], gamemaster in [gm screen star wars], Gangadhar Meher in [gm college], general manager in [nba gm] and even gunner's mate in [navy gm].
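The idea that the same term expands differently depending on its neighbors can be sketched as a simple lookup. The table below contains only the GM examples from this post; the `GM_SENSES` and `expand_gm` names and the one-context-word matching rule are assumptions made for illustration, not a description of how the production system works.

```python
# Toy context table built from the examples in this post. The lookup
# scheme itself is an assumption for illustration, not Google's algorithm.
GM_SENSES = {
    "cars": "General Motors",
    "wheat": "genetically modified",
    "university": "George Mason",
    "screen": "gamemaster",
    "college": "Gangadhar Meher",
    "nba": "general manager",
    "navy": "gunner's mate",
}


def expand_gm(query):
    """Return the sense of 'gm' suggested by the other query words, or None."""
    words = query.lower().split()
    if "gm" not in words:
        return None
    for word in words:
        if word in GM_SENSES:
            return GM_SENSES[word]
    return None  # no known context word: leave the term unexpanded


for q in ["gm cars", "gm wheat", "nba gm", "gm"]:
    print(q, "->", expand_gm(q))
```

Returning `None` when no context word matches mirrors the point that a synonym should only be applied when the surrounding words support it.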

Here are screenshots of those disambiguations of GM in action:


As a nomenclatural note, even obvious term variants like "pictures" (plural) and "picture" (singular) would be treated as different search terms by a dumb computer, so we also include these types of relationships within our umbrella of synonyms. Pictures/picture are typically called stemming variants, which refers to the fact that they share the same word stem, or root. The same systems that need to understand that "pictures" and "photos" mean the same thing also need to understand that "pictures" and "picture" mean the same thing. This is even more obvious to a human, but it is still a difficult task for a computer. One example of the difficulty is the pair "animal" and "animation," which share the same stem and etymology but don't mean the same thing in standard use. Another tricky case that is very dependent on the other words in the query is "arm" vs. "arms." "Arms" might seem like the plural of "arm," but consider how it might be used in a search: [arm reduction] vs. [arms reduction]. Google search is smart enough to know that the former is about removing fat from one's arm and the latter is about reducing stockpiles of weaponry, and that arm/arms are dangerous synonyms in that case because they would change the meaning. These subtle differences between words that seem related are what make synonymy very hard to get right.
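A short Python sketch shows why stemming alone is not enough. The `naive_stem` function and its `SUFFIXES` rule set are hypothetical and deliberately crude; they stand in for any purely mechanical suffix-stripping rule and are not the stemming our systems use.

```python
# Hypothetical, deliberately crude suffix rules for illustration only.
SUFFIXES = ("ation", "al", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


pairs = [("picture", "pictures"), ("animal", "animation"), ("arm", "arms")]
for a, b in pairs:
    # All three pairs collapse to one stem, but only the first pair is
    # always safe to treat as the same search term.
    print(a, b, naive_stem(a) == naive_stem(b))
```

The mechanical rule treats all three pairs as equivalent. Telling the safe pair (picture/pictures) apart from the unsafe ones (animal/animation, or arm/arms next to "reduction") requires knowledge of meaning and of the rest of the query, which no suffix rule provides.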

Here are some other examples of synonyms we thought were interesting:

[song words], "lyrics" is bolded for "words".
[what state has the highest murder rate], "homicide" is bolded for "murder".
[himalayan kitten breeder], Google knows that "cat breeder" is the same as "kitten breeder".
[dura ace track bb axle njs], Google knows that "bb" here means "bottom bracket".
[software update on bb color id], "blackberry" is bolded for "bb".
[bb cream dark], Google knows here that bb means "blemish balm".
[southeastern usa bb fitness & figure], "bodybuilding" is bolded for "bb".

Lastly, language is used with as much variety and subtlety as is present in human culture, and our algorithms still make mistakes. We flinch when we find such mistakes; we're always working to fix them. One of the best ways for us to discover these problems is to get feedback from real users, which we then use to inspire improvements to our computer programs. If you have specific complaints about our synonyms system, you can post a question at the web search help center forum or you can tweet them with the hashtag #googlesyns. You can also turn off synonyms for a specific term by adding a "+" before it or by putting the words in quotation marks.
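As a rough illustration of those two opt-outs, the Python sketch below collects the words in a query that were marked with "+" or enclosed in quotation marks. The `exact_terms` function is a hypothetical helper written for this example; it only mimics the syntax described above and says nothing about how queries are actually parsed.

```python
import re


def exact_terms(query):
    """Return the words for which the searcher asked for no synonyms."""
    exact = set()
    # Words inside double quotation marks.
    for phrase in re.findall(r'"([^"]*)"', query):
        exact.update(phrase.lower().split())
    # Words prefixed with "+" at the start of the query or after a space.
    for word in re.findall(r'(?:^|\s)\+(\w+)', query):
        exact.add(word.lower())
    return exact


print(exact_terms('+pictures developed with coffee'))
print(exact_terms('history of "motion pictures"'))
```

A synonym step could then skip any term returned by this helper and expand only the remaining words.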
