So John wasn’t satisfied. He’d tried to put the brakes on the search engine’s launch to avert a disaster and had failed.
Measuring the Unmeasurable
Of course, John wasn’t going to give up. Otherwise, this story would be a very boring way to kick off a book! Besides, a large IT investment—and people’s jobs—were at stake.
When John first started working on the project, his goal was to introduce user-centered thinking to the search engine selection process to complement the technical tests that IT would be using. To do so in an environment that was both technical and, as a corporation, driven by the bottom line, he had to wade into some treacherous waters—he’d have to come up with some metrics to quantify the experience of using the current search engine.
Now you might wonder what the big deal was. Either the search engine found the damned thing, or it didn’t—should be pretty easy to measure, right? Well, not quite.... There certainly are searches that work that way, for example, looking up a colleague’s phone number in the Vanguard staff directory. But many—probably most—searches don’t have a single “right” answer. “Parking,” “benefits,” and “experts” are all common queries on the Vanguard Intranet. They are also questions that have many answers—some more right than the others, but none that are ideal or perfect. From the perspective of users, relevance is very often relative.
Most designers know that it’s difficult to measure search performance and, well, just about any aspect of the user experience. In fact, being asked to do so causes droplets of sweat to form on many a designer’s brow. It just doesn’t feel right. Experience is difficult to boil down to a few simple, measurable actions. Considering that most of those in the field don’t have advanced degrees in statistics—and probably experienced similarly sweaty moments during high school algebra—it’s not surprising.
Yet, here was John Ferrara, with a bachelor’s degree in communications, sallying forth to measure the user experience of Vanguard’s search system.
The Before-and-After Test
John focused on analyzing a few really common search queries to see how well they were performing—queries that represented needs that huge numbers of Vanguard’s intranet searchers wanted addressed. If you’re familiar with the “long tail,”[1] these would be considered the “short head.” (If you’re not, don’t worry—you’ll learn the basics in Chapter 2.) John wanted to compare how well these queries performed before and after—with the original search system and now with the new one.
Next, John needed some metrics for these common queries so he could compare them. He knew that there wasn’t a single metric that would be perfect, so he hedged his bets and came up with two sets of metrics: relevancy and precision.[2] Relevancy measured how close to the top of the results the search engine placed a query’s best match. Precision measured how relevant the top results were. (To be fair, John didn’t invent precision; he borrowed it from the information retrieval researchers, who have been using it for years.) Let’s take a closer look at these two sets of metrics and how John used them.
So What’s Relevant?
John went through his list of common search queries. To test how relevant each would be, he had to make an informed judgment (also known as a guess) at what a reasonable searcher would want to find for each query. Reasonable, as in the results don’t seem like they were selected by a crazy person.
We’ve already seen one good example of such a situation: finding a colleague’s phone number in the staff directory. There’s a clear, obvious, and correct answer to this question. But in many cases where the answer wasn’t so obvious, John got out his red pen and deleted those queries from his relevancy test. He was now working with a cleaned-up set of queries that he was confident had “right answers”—ones like “company address.”
John determined the best matches for each remaining query. He then tested each query by recording where the best match ranked among the search results. Then he measured performance a few different ways. Was it the first result? If not, did it make the top five “critical” results? Each of these measurements had something to say about how well queries were performing. They helped in two ways: they revealed outliers that were problematic, and they helped track overall search system performance over time. Figure 1-1 shows the former: queries such as “job descriptions,” whose best matches rank far from the top, stand out from their peers and deserve some attention.
http://www.flickr.com/photos/rosenfeldmedia/5690980802/
Figure 1-1. In a relevancy test, each query ideally finds its most reasonable result at position #1 on the search results page. A large distance from the top position suggests a poorly performing query.
John’s relevancy test turned out to be very helpful. As Figure 1-1 shows, we can see which queries weren’t retrieving their ideal result at or near the top of the search engine results page.
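A test like this is simple enough to sketch in a few lines of Python. What follows is only an illustration of the idea, not John’s actual method or tooling: the result lists, page names, and the “holiday schedule” query are made up, and the cutoff of five comes from the “critical” results described above.

```python
from typing import Dict, List, Optional

# Made-up sample data: each query maps to its ranked list of result IDs.
search_results: Dict[str, List[str]] = {
    "company address": ["address-page", "contact-us", "site-map"],
    "job descriptions": ["careers", "benefits", "hr-home", "org-chart",
                         "job-postings", "hr-forms", "job-descriptions"],
    "holiday schedule": ["hr-home", "payroll", "holiday-schedule"],
}

# The analyst's judgment of the single best match for each query (hypothetical).
best_matches: Dict[str, str] = {
    "company address": "address-page",
    "job descriptions": "job-descriptions",
    "holiday schedule": "holiday-schedule",
}

CRITICAL_CUTOFF = 5  # the top five "critical" results


def best_match_rank(results: List[str], best_match: str) -> Optional[int]:
    """Return the 1-based position of the best match, or None if it is missing."""
    try:
        return results.index(best_match) + 1
    except ValueError:
        return None


# Record where each query's best match landed.
ranks = {q: best_match_rank(search_results[q], best_matches[q])
         for q in best_matches}

# Measure performance a few different ways.
first_place = [q for q, r in ranks.items() if r == 1]
in_top_five = [q for q, r in ranks.items()
               if r is not None and r <= CRITICAL_CUTOFF]
outliers = [q for q, r in ranks.items()
            if r is None or r > CRITICAL_CUTOFF]

for query, rank in ranks.items():
    print(f"{query}: best match at position {rank}")
print(f"Best match ranked first: {len(first_place)} of {len(ranks)}")
print(f"Best match in top {CRITICAL_CUTOFF}: {len(in_top_five)} of {len(ranks)}")
print(f"Queries needing attention: {outliers}")
```

Rerunning the same script against a new search engine, with the same queries and best matches, gives a rough before-and-after comparison.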
Yet there are two major limitations with relevancy testing. First, it leaves out many queries that don’t have a “right answer,” including queries that might be common and important. Second, this method relies on guessing what would be “right” for searchers, so it is a highly subjective measure. But a simple test like this one is a good starting point. Though it involves some subjective evaluation, it does so within a consistent framework. In this case, it allowed John to generate some simple test results from a representative sample. If a search engine fails this test, as Vanguard’s did, then it has some serious problems.
Precision: Getting Beyond Relevance
That’s why John decided to introduce another set of metrics: precision. Precision is the number of relevant search results divided by the total number of search results retrieved. It tells you what share of the search engine’s results are good ones. John specifically looked at the precision of the top five results, the critical ones that a searcher would likely scan before giving up.
To test precision, John developed a scale for rating each result that a tested query retrieved, based on the information the searcher provided:
Relevant (r): The result is completely relevant to the query and deserves its high ranking.
Near (n): The result is not a perfect match, but it’s clearly reasonable for it to be ranked highly.
Misplaced (m): It’s reasonable for the search engine to have retrieved the result, but it shouldn’t be ranked highly.
Irrelevant (i): The result has no apparent relationship to the query.
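Here is a rough sketch, in Python, of how ratings on this scale could be turned into a precision score for the top five results. It rests on an assumption not spelled out above: that the analyst decides which rating codes count as “relevant” for scoring purposes. The function name, the `counted` parameter, and the sample ratings are all hypothetical.

```python
from typing import FrozenSet, List

# The four rating codes from the scale above.
VALID_RATINGS = {"r", "n", "m", "i"}


def precision_at_k(ratings: List[str],
                   counted: FrozenSet[str] = frozenset({"r"}),
                   k: int = 5) -> float:
    """Share of the top-k results whose rating is in `counted`.

    `counted` is an assumption: the analyst chooses which ratings
    count as relevant. By default only "r" counts.
    """
    top = ratings[:k]
    for code in top:
        if code not in VALID_RATINGS:
            raise ValueError(f"unknown rating code: {code!r}")
    if not top:
        return 0.0  # no results retrieved, so nothing relevant was shown
    # Divide by the number of results actually examined, which may be
    # fewer than k when the query retrieves only a handful of results.
    return sum(1 for code in top if code in counted) / len(top)


# Made-up ratings for one query's top five results, in ranked order.
sample = ["r", "n", "m", "i", "r"]

only_relevant = precision_at_k(sample)                            # counts r
relevant_or_near = precision_at_k(sample, frozenset({"r", "n"}))  # counts r and n

print(f"Counting only r:  {only_relevant:.0%}")
print(f"Counting r and n: {relevant_or_near:.0%}")
```

Scoring the same ratings under more than one definition of “relevant” shows how much a precision figure depends on how generous the judge is.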