Search Analytics for Your Site - Louis Rosenfeld
The Before-and-After Test
John focused on analyzing a few really common search queries to see how well they were performing—queries that represented needs that huge numbers of Vanguard’s intranet searchers wanted addressed. If you’re familiar with the “long tail,”[1] these would be considered the “short head.” (If you’re not, don’t worry—you’ll learn the basics in Chapter 2.) John wanted to compare how well these queries performed before and after—with the original search system and now with the new one.
Next, John needed some metrics for these common queries so he could compare them. He knew that no single metric would be perfect, so he hedged his bets and came up with two sets of metrics: relevancy and precision.[2] Relevancy measured how well the search engine returned a query’s best match at the top of all results. Precision measured how relevant the top results were. (To be fair, John didn’t invent precision; he borrowed it from the information retrieval researchers, who have been using it for years.) Let’s take a closer look at these two sets of metrics and how John used them.
So What’s Relevant?
John went through his list of common search queries. To test how relevant each would be, he had to make an informed judgment (also known as a guess) at what a reasonable searcher would want to find for each query. Reasonable, as in the results don’t seem like they were selected by a crazy person.
We’ve already seen one good example of such a situation: finding a colleague’s phone number in the staff directory. There’s a clear, obvious, and correct answer to this question. But in many cases where the answer wasn’t so obvious, John got out his red pen and deleted those queries from his relevancy test. He was now working with a cleaned-up set of queries that he was confident had “right answers”—ones like “company address.”
John determined the best matches for each remaining query. He then tested each query by recording where the best match ranked among the search results. Then he measured performance a few different ways. Was it the first result? If not, did it make the top five “critical” results? Each of these measurements had something to say about how well queries were performing. They helped in two ways: they revealed outliers that were problematic, and they helped track overall search system performance over time. Figure 1-1 shows the former: queries with high numbers, such as “job descriptions,” stand out from their peers and deserve some attention.
http://www.flickr.com/photos/rosenfeldmedia/5690980802/
Figure 1-1. In a relevancy test, queries ideally find the most reasonable result at position #1 on the search results page. A large distance from the top position suggests a poorly performing query.
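To make the relevancy test concrete, here is a minimal sketch in Python of the measurements described above: where the best match ranks, whether it comes first, and whether it makes the top five. This is not John's actual spreadsheet. The function names are hypothetical, and the sample ranks are invented for illustration; only the query names “company address” and “job descriptions” come from the text.

```python
from typing import Optional


def best_match_rank(results: list[str], best_match: str) -> Optional[int]:
    """Return the 1-based position of best_match in results, or None if it is absent."""
    try:
        return results.index(best_match) + 1
    except ValueError:
        return None


def relevancy_summary(ranks: dict[str, Optional[int]], top_n: int = 5) -> dict[str, float]:
    """Share of queries whose best match is the first result, and share in the top_n."""
    total = len(ranks)
    first = sum(1 for rank in ranks.values() if rank == 1)
    in_top = sum(1 for rank in ranks.values() if rank is not None and rank <= top_n)
    return {"first": first / total, "top_n": in_top / total}


# Invented ranks for illustration only; the book does not give these numbers.
ranks = {"company address": 1, "staff directory": 3, "job descriptions": 41}

summary = relevancy_summary(ranks)

# Outliers: the best match is missing or falls outside the top five "critical" results.
outliers = [query for query, rank in ranks.items() if rank is None or rank > 5]

print(summary)
print(outliers)
```

Running the same summary on the old and the new search system gives the before-and-after comparison, and the outlier list points to the queries that deserve attention first.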
John’s relevancy test turned out to be very helpful. Figure 1-1 shows which queries weren’t retrieving their ideal result at or near the top of the search engine results page.
Yet relevancy testing has two major limitations. First, it leaves out many queries that don’t have a “right answer,” even though those queries might be common and important. Second, this method relies on guessing what would be “right” for searchers, so it is a highly subjective measure. But a simple test like this one is a good starting point. Though it involves some subjective evaluation, it does so within a consistent framework. In this case, it allowed John to generate some simple test results from a representative sample. If a search engine fails this test, as Vanguard’s did, then it has some serious problems.
Precision: Getting Beyond Relevance
That’s why John decided to also introduce another set of metrics: precision. Precision measures the number of relevant search results divided by the total number of search results. It tells you how many of the search engine’s results are good ones. John specifically looked at the precision of the top five results—the critical ones that a searcher would likely scan before giving up.
To test precision, John developed a scale for rating each result that a tested query retrieved, based on the information the searcher provided.
Relevant (r): The result is completely relevant to the query.
Near (n): The result is not a perfect match, but it’s clearly reasonable for it to be ranked highly.
Misplaced (m): It’s reasonable for the search engine to have retrieved the result, but it shouldn’t be ranked highly.
Irrelevant (i): The result has no apparent relationship to the query.
Rather than guessing at what the searcher’s intent was, John was simply looking to assess how reasonable it was for the search engine to return each result, and whether or not the search engine put it in the right place. He recorded an r, n, m, or i for each result in a spreadsheet, as shown in Figure 1-2.
http://www.flickr.com/photos/rosenfeldmedia/5690980818/
Figure 1-2. Each result for each query was rated as Relevant, Near, Misplaced, or Irrelevant.
John then used a few different ways to calculate precision for each query. He came up with three simple standards—strict, loose, and permissive—to reflect a range of tolerances for different levels of precision.
Strict: Only results rated relevant (r) were counted.
Loose: Both relevant and near results were counted (r+n).
Permissive: Relevant, near, and misplaced results were counted (r+n+m).
You can see how each query scored differently for each of these three precision standards in Figure 1-3. For example, of the first five search results for the query “reserve room,” two were relevant (r), two were nearly relevant (n), and one was misplaced (m). In strict terms, precision was 40% (two of five results were relevant); in loose terms, 80% (four of five were relevant or nearly relevant); and in permissive terms, 100% (all five were relevant, nearly relevant, or misplaced).
http://www.flickr.com/photos/rosenfeldmedia/5690405259/
Figure 1-3. Each query’s precision scores were then calculated in three different ways: Strict, Loose, and Permissive.
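The three precision standards can be sketched in a few lines of Python. This is an illustration rather than John's actual method of calculation: the names STANDARDS and precision are hypothetical, and the order of the ratings in the “reserve room” example is assumed, since the text gives only the counts (two r, two n, one m).

```python
# Which ratings count as acceptable under each standard, as defined in the text.
STANDARDS = {
    "strict": {"r"},
    "loose": {"r", "n"},
    "permissive": {"r", "n", "m"},
}


def precision(ratings: list[str], standard: str) -> float:
    """Fraction of rated results that are acceptable under the given standard."""
    if not ratings:
        raise ValueError("at least one rated result is required")
    accepted = STANDARDS[standard]
    return sum(1 for rating in ratings if rating in accepted) / len(ratings)


# Top five results for "reserve room"; the order is assumed, only the counts are from the text.
reserve_room = ["r", "r", "n", "n", "m"]

for name in STANDARDS:
    print(f"{name}: {precision(reserve_room, name):.0%}")
# strict: 40%
# loose: 80%
# permissive: 100%
```

Because each standard is just a set of accepted ratings, scoring every query three ways needs no extra logic; an irrelevant (i) rating never counts toward any of them.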
[1] Chris Anderson’s excellent book The Long Tail (Hyperion, 2006) described the long tail phenomenon and its impact on commerce sites like Amazon and Netflix.
[2] In web analytics, these are referred to as accuracy and precision.