Search Analytics for Your Site - Louis Rosenfeld
Your Secret Weapon
Thank your lucky stars: SSA remains safely under the radar. No one owns it, and the people in most organizations who are closest to it (the IT folks who manage the search engine) aren’t likely to worry much about things like user intent. So if you can crack open the data, you (and your organization) will own the keys to a very powerful secret weapon. Read ahead.
Anatomy of a Search Log Entry
Avi Rappoport, Search Tools Consulting, http://searchtools.com/
Though most of us now use analytics applications that provide some SSA reporting functionality, you may have to create your own reports, either because your analytics application doesn’t support your specific needs or because you don’t have access to one. In both cases, you’ll need to process the data yourself.
Working with search engine transaction logs, you’ll find the search query, any search parameters (such as language or date), and the number of matches retrieved by the search engine. Most also contain the date and time, and some kind of searcher identifier. Understanding the format makes it easier to understand search analytics reports, recognize what they can and can’t tell you, and perform special processing for unusual questions.
Many search engines conform to the NCSA extended Web server log format,[7] so that’s what we’ll cover here. These text files have a standard field order, with spaces between the fields. A field that contains internal spaces is enclosed in double quotes or square brackets.
However, there’s no place in the NCSA extended format for the hit count (the number of items matched in the search), so search engines tend to slide it in the middle or hang it off the end. If your search log format is not documented, you may need to do some sleuthing: you can figure this out by entering several unique searches that you know will generate no matches, and then looking in the search log for those terms.
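The sleuthing step can be partly automated. The following is a minimal sketch, assuming the log is split on whitespace and that a search with no matches is logged with a hit count of exactly 0. The function name `zero_fields`, the probe terms, and the sample log lines are hypothetical and only illustrate the idea.

```python
def zero_fields(log_lines, sentinel_terms):
    """Return whitespace-token positions that are '0' in every sentinel line."""
    candidates = None
    for line in log_lines:
        # Only look at entries for the searches we know returned nothing.
        if not any(term in line for term in sentinel_terms):
            continue
        zeros = {i for i, token in enumerate(line.split()) if token == "0"}
        candidates = zeros if candidates is None else candidates & zeros
    return sorted(candidates or [])

# Hypothetical log lines: one normal search and two zero-match probe searches.
log_lines = [
    'XX.XX.XX.14 - - [10/Jul/2010:10:24:13 -0800] "GET /search?q=noise HTTP/1.1" 200 9429 111',
    'XX.XX.XX.14 - - [10/Jul/2010:10:25:01 -0800] "GET /search?q=zzqxprobe1 HTTP/1.1" 200 512 0',
    'XX.XX.XX.14 - - [10/Jul/2010:10:25:09 -0800] "GET /search?q=zzqxprobe2 HTTP/1.1" 200 512 0',
]
positions = zero_fields(log_lines, ["zzqxprobe1", "zzqxprobe2"])
print(positions)
```

The positions returned count whitespace-separated tokens starting from 0, not the logical field numbers used in the tables, because the timestamp and the request each contain internal spaces.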
BASIC FIELDS
A simple query entry in this log format looks like this:
XX.XX.XX.14 - - [10/Jul/2010:10:24:13 -0800] "GET /search?q=noise HTTP/1.1" 200 9429 111
We can break that down into fields for better analysis, as shown in Table 2-2.
Table 2-2. http://www.flickr.com/photos/rosenfeldmedia/5826101122/
Fields By Position | ||||||||
---|---|---|---|---|---|---|---|---|
position | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 |
meaning | ip | - | - | date/timestamp | search request | response code | bytes | hits |
example | XX.XX.XX.14 | - | - | [10/Jul/2010:10:24:13 -0800] | "GET /search?q=noise HTTP/1.1" | 200 | 9429 | 111 |
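If you need to split entries like this yourself, a regular expression is one common approach. The sketch below assumes the eight-field layout shown in Table 2-2, with the hit count as an optional last field. The pattern, the function name `parse_basic_entry`, and the field names are illustrative choices, not part of any log specification.

```python
import re

# Pattern for the eight-field layout in Table 2-2. The trailing hit count is
# non-standard, so it is treated as optional here.
BASIC_ENTRY = re.compile(
    r'(?P<host>\S+) (?P<auth_user>\S+) (?P<user_name>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: (?P<hits>\d+))?'
)

def parse_basic_entry(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = BASIC_ENTRY.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    if fields["hits"] is not None:
        fields["hits"] = int(fields["hits"])
    return fields

sample = ('XX.XX.XX.14 - - [10/Jul/2010:10:24:13 -0800] '
          '"GET /search?q=noise HTTP/1.1" 200 9429 111')
entry = parse_basic_entry(sample)
print(entry["request"], entry["hits"])
```

If your search engine places the hit count somewhere other than the end of the line, the pattern would need to be adjusted to match.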
Table 2-3 provides even more detail on each field.
Table 2-3. http://www.flickr.com/photos/rosenfeldmedia/5826101190/
Details About Fields | |||
---|---|---|---|
Position | Field | Example | Meaning |
#1 | IP or host name | XX.XX.XX.14 | ID of the computer sending the search. |
#2 | auth. user | - | usually empty, RFC931 authentication |
#3 | user name | - | usually empty |
#4a | date | [10/Jul/2010 | date of the query in standard form |
#4b | time | :10:24:13 | time of the query in standard form |
#4c | offset | -0800] | offset time from GMT[a] |
#5a | request | "GET | HTTP request method (form action) |
#5b | URL | /search | search results page URL |
#5c | parameters | ?q=noise | search terms and other options |
#5d | version | HTTP/1.1" | version (always the same) |
#6 | response code | 200 | server response code (if it’s not 200, you are in trouble) |
#7 | bytes | 9429 | bytes returned (the size of the search results HTML page) |
#8 (non-standard but widely used) | hit count | 111 | number of matches found[b] |
[a] The GMT offset is important because you must have accurate timestamps to look for patterns of usage, such as spikes of traffic at lunchtime. Tracking the time relative to GMT lets analytics systems merge search logs from multiple time zones, which is especially important when adjusting for Daylight Saving Time.
[b] Some search engines return the approximate number of hits, rather than provide a definitive number. This is usually because they are reserving the option to check whether the user has security access to additional documents. If you don’t have confidential documents, you may be able to disable the access check and get a real number.
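To illustrate why the offset matters when merging logs, here is a minimal sketch that converts a log timestamp to UTC using Python's standard `datetime` module. The function name `to_utc` is hypothetical, and the sketch assumes English month abbreviations (the `%b` directive depends on the system locale).

```python
from datetime import datetime, timezone

def to_utc(timestamp):
    """Convert a log timestamp such as '[10/Jul/2010:10:24:13 -0800]' to UTC."""
    # Strip the square brackets if the raw field was passed in unchanged.
    cleaned = timestamp.strip("[]")
    local = datetime.strptime(cleaned, "%d/%b/%Y:%H:%M:%S %z")
    return local.astimezone(timezone.utc)

west_coast = to_utc("[10/Jul/2010:10:24:13 -0800]")
# A second, made-up timestamp from a server in a different time zone.
central = to_utc("10/Jul/2010:13:24:13 -0500")
print(west_coast.isoformat(), west_coast == central)
```

Once every entry is expressed in UTC, entries from servers in different time zones can be sorted and compared directly.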
WHAT EXTENDED LOG ENTRIES LOOK LIKE
Optional fields can be quite helpful as well. These include the “referer” field (it should be “referrer,” but the spec spelled it wrong, so now we’re stuck with this misspelling), which can offer insights into site navigation problems; the user-agent for recognizing various platforms using the search; and an optional cookie, which is better than IP address for tracking searchers. To conform to other Web log formats, these fields might come before the hit count and time taken fields.
An extended log entry could look like this (detailed below in Table 2-4):
XX.XX.XX.14 - - [10/Jul/2010:10:24:13 -0800] "GET /search?q=noise HTTP/1.1" 200 9429 111 0028 "http://search.example.com/search?q=sound HTTP/1.1" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5G77 Safari/525.20" "USERID=CustomerACooke;IMPID=01234"
Table 2-4. http://www.flickr.com/photos/rosenfeldmedia/5826101254/
Extended Fields | |||
---|---|---|---|
Position | Field | Example | Meaning |
#9 | referer URL | http://search.example.com/search?q=sound | The page that the user was on when he searched: in this case, from a search results page for the query “sound”. |
#10 | user-agent | “Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2 like Mac OS... | The browser or app that sent the query. These are most useful for getting client metrics (especially mobile) and recognizing robot crawlers. |
#11 | cookie | “USERID=CustomerA; IMPID=01234” | Cookie for server session (rare). |
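The extended fields can be pulled out the same way as the basic ones. This sketch assumes the field order of the sample entry above, and it assumes that the number following the hit count (0028) is the time taken. The sample line is modeled on that entry, with a shortened user-agent string and the referer and cookie values from Table 2-4. The names `EXTENDED_ENTRY` and `parse_cookie` are hypothetical.

```python
import re

# Assumed order: basic fields, hit count, time taken, then three quoted fields.
EXTENDED_ENTRY = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'(?P<hits>\d+) (?P<time_taken>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<cookie>[^"]*)"'
)

def parse_cookie(cookie):
    """Split 'NAME=value;NAME=value' into a dict."""
    pairs = (item.split("=", 1) for item in cookie.split(";") if "=" in item)
    return {name.strip(): value.strip() for name, value in pairs}

sample = (
    'XX.XX.XX.14 - - [10/Jul/2010:10:24:13 -0800] '
    '"GET /search?q=noise HTTP/1.1" 200 9429 111 0028 '
    '"http://search.example.com/search?q=sound" '
    '"Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2 like Mac OS X; en-us)" '
    '"USERID=CustomerA; IMPID=01234"'
)
entry = EXTENDED_ENTRY.match(sample).groupdict()

# Prefer the cookie's user ID for tracking searchers; fall back to the IP or host.
searcher = parse_cookie(entry["cookie"]).get("USERID", entry["host"])
is_iphone = "iPhone" in entry["user_agent"]
print(searcher, entry["referer"], is_iphone)
```

Falling back to the host when no cookie is present reflects the point made above: a cookie is the better searcher identifier, but it is not always available.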
SEARCH PARAMETERS
Most search engines stick to the common format for additional options and settings (such as language) in the search part of the request. The parameters start after the results page URL with a question mark. Each one is a code followed by an equal sign followed by a value, and the code-value pairs are separated from one another by an ampersand (or comma or semicolon), like this:
search.html?qq=noise&zone=all
There’s no standard, so the query parameter might be q, qq, qt, qry, query, w, words, s, st, search, or something else entirely. This, and all the other codes, should be documented by the search vendor or open-source group. (We’ve provided an example below, as well as details in Table 2-5.) You’ll find this information useful if you need to “teach” your analytics application what to look for to identify (and parse out) actual queries from your logs. Here is an example of a request with several parameters:
search?q=noise&l=fi&s=21&p=20&v=housewares&i=1
Table 2-5. http://www.flickr.com/photos/rosenfeldmedia/5826101316/
Query Parameters | |||
---|---|---|---|
Code | Field | Example | Meaning |
q | query | q=noise | The search terms, in this case “noise” |
l | language | l=fi | The searcher’s language, here it’s Finnish |
s | start | s=21 | Start the display at result number 21 |
p | per page | p=20 | Show 20 results per page |
v | section | v=housewares | Limit the query to the housewares section |
i | simple | i=1 | Show the simple search interface |
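Splitting a request into its parameter codes and values takes only a few lines. The sketch below assumes the codes from Table 2-5 and accepts an ampersand, semicolon, or comma as the delimiter. The function name `search_parameters` is hypothetical, and the sketch assumes that any delimiter character appearing inside a value has been percent-encoded.

```python
import re
from urllib.parse import urlsplit, unquote_plus

def search_parameters(url, delimiters="&;,"):
    """Return the code/value pairs after the question mark as a dict."""
    query_string = urlsplit(url).query
    params = {}
    # Split before decoding, so encoded delimiters inside values stay intact.
    for pair in re.split("[" + re.escape(delimiters) + "]", query_string):
        if not pair:
            continue
        code, _, value = pair.partition("=")
        params[code] = unquote_plus(value)
    return params

params = search_parameters("search?q=noise&l=fi&s=21&p=20&v=housewares&i=1")
print(params["q"], params["l"], params["s"])
```

Because the code for the query differs from one search engine to the next, the caller decides which key holds the search terms (here, `params["q"]`). Note that `s` means "start" in this example, even though some engines use it for the query itself.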
The contents of the log file enable site search analytics: the entries provide the evidence needed to deduce how your users are searching and how well the site search is helping them. Cherish the logs or at least keep an archive: you may need to go back someday.
[7] The NCSA combined/extended log format is documented at http://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html#combined and http://httpd.apache.org/docs/2.2/mod/mod_log_config.html#examples