Saturday, 20 August 2005
Anatomy of Blog Analytics: Data Collection
[For those of you just tuning in, you might want to read this and this first. I'm going to be talking about this diagram.]
Let's say you're a marketer and you want to know:
"What percentage of Gen X Females mention Johnny Depp?"
According to my wife, the answer is "all of them" but let's pretend for a moment that we didn't know and that we wanted to determine the answer using Blog Analytics. How would we get to an answer? How accurate would the answer be? What would the answer mean?
These are all important questions that should be asked when applying any research methodology. With traditional research techniques such as telephone surveys or focus groups, the answers are pretty obvious to marketers and market researchers, but blog analytics, I suspect, looks a little more like voodoo than science to lots of these folks.
When we ask these questions, what we really want to know is whether the findings of the research will be valid, or in other words, to what extent the findings will be affected by error. All market research methodologies have error: sampling error, error from question bias, order bias, respondents lying, technical error (data keying error, etc.) and so forth. Because the mechanics of blog analytics are so new, I don't think they are deeply understood by marketers. I think one area where marketers and analysts need to be better informed is the sources of error possible in blog analytics. Not to discredit the methodology but to provide insight into it. Understanding the sources of error in Blog Analytics can help marketers and analysts better recognize appropriate opportunities for applying this important technology. It will also help them to scrutinize results in meaningful ways – much like they do today with survey results, for example. A good survey researcher can not only spot suspect data but can often predict its cause based on a deep understanding of the methodology.
So let's jump in with sampling.
Blog Analytics Sampling
Everyone seems to be obsessed with estimating the size of the blogosphere and aspiring to collect and index each and every post. While this is an admirable goal, I'm not sure it's even possible (and suspect it's not). What this means is that no one is measuring all of the blogosphere. This is okay. Outside the blogosphere we don't measure everyone – we just measure some people and assume that these folks' opinions represent everyone else's. In other words, we sample.
All Blog Analytics companies sample (even if they try to get all the posts – they're bound to miss a bunch). Some firms prioritize the source and give preferential treatment based on some criteria: popularity as measured by links, blog host (index last/don't index Live Journal for example). Some firms only collect posts from specific blogs, such as known political bloggers. Some only collect posts that are in a particular language. The point being, everyone samples – either by choice or by technical limitation.
In my recent evaluation of Umbria Communications, I noted that they claim to monitor millions of blogs (this is a common claim). But in their published works such as their coverage of the 2004 Presidential Election they provide very little information on how many or which blogs were included in the analysis. For example, they have a graph showing total posts about Bush vs. those about Kerry. On 11/7 they show approx 114K mentions for Bush and 72K mentions of Kerry. Then they show this information as a percentage of total mentions – approx 61% Bush, 39% Kerry. (Before you grab a calculator: 114 / (114 + 72) = .6129 so the math works, but keep reading).
To the general public this looks fine – but to a market researcher, there are lots of unanswered questions...
The percentages imply that the mentions were mutually exclusive – either a post was about Bush OR Kerry.
What if a post was about both Bush and Kerry?
Did this post get thrown out?
Was it counted twice – once for each candidate?
If a post can count twice then what do the percentages mean? (Think of marbles – 1 black, 1 white and 9 black & white swirl. Are 50% of the marbles white?)
Umbria does some interesting speaker analysis, namely using NLP (supposedly) to determine demographics and identifying "new speakers" based on post frequency.
Does this analysis include posts coming from speakers younger than voting age? (Young folks would be excluded in survey research).
If a single speaker posted multiple times per day was that counted once or each time?
During the 10-week period of the study, Umbria identified anywhere from about 20K to 180K postings per week. How many speakers does this represent?
Umbria claims that they have NLP tools that help them determine context.
Are posters talking about the presidential candidates or the baked bean folks and the director?
So are we looking at people that are talking about the candidates or any mention?
Does a mention of Bush or Kerry count if it is not in reference to the election?
I think you're probably getting the point. When looking at Blog Analytics, before any of the findings – especially quantitative ones – are used, an analyst needs an understanding of what is represented by the base.
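To make the counting questions concrete, here's a rough sketch in Python. The posts and speaker names are made up for illustration (they are not Umbria's data), and the three counting rules are just my guesses at what a vendor could plausibly be doing. The point is that the same handful of posts gives three different "percentages" depending on the rule.

```python
from collections import Counter

# Made-up posts for illustration: (speaker, candidates mentioned in the post).
posts = [
    ("alice", {"Bush"}),
    ("alice", {"Bush"}),
    ("alice", {"Bush", "Kerry"}),
    ("bob", {"Kerry"}),
    ("carol", {"Bush", "Kerry"}),
    ("dave", {"Bush"}),
]

def share(counts):
    """Turn raw counts into percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Rule 1: a post that mentions both candidates counts once for each.
double_counted = Counter(c for _, mentioned in posts for c in mentioned)

# Rule 2: posts that mention both candidates are thrown out.
exclusive = Counter(next(iter(m)) for _, m in posts if len(m) == 1)

# Rule 3: each speaker counts at most once per candidate, however often they post.
by_speaker = Counter()
seen = set()
for speaker, mentioned in posts:
    for candidate in mentioned:
        if (speaker, candidate) not in seen:
            seen.add((speaker, candidate))
            by_speaker[candidate] += 1

print(share(double_counted))  # Bush 62.5, Kerry 37.5
print(share(exclusive))       # Bush 75.0, Kerry 25.0
print(share(by_speaker))      # Bush 50.0, Kerry 50.0
```

Same six posts, three different headlines. Without knowing which rule produced the base, the percentage on the chart doesn't tell you much.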
Source of Sampling Error: Post Collection Bias
Of course, understanding your sample doesn't mean that it is without error. So what kind of error can creep into a blog analytic sample? I'm glad you asked. The first opportunity is the point at which the blog posts are collected. Where do the blog posts come from? Is the source adequate and appropriate for the study? What posts are missed using a particular collection strategy? Does the collection strategy impose artificial bias? These are all valid questions for an analytics vendor.
To relate back to my original graph, here's a deconstructed version to help illustrate what I'm talking about. (Remember, our hypothetical question is "What percentage of Gen X Females mention Johnny Depp?" )

So what's common practice and how can bias be introduced? Well, I can't speak in detail about any particular vendor's methodology but I can present some generalities. I think most Blog Analytics vendors are relying on RSS or Atom feeds instead of scraping web pages, although I think how the feeds are discovered and the priority in which they are checked vary widely. Some rely heavily on ping servers; some poll a list of known feeds; some follow links in other posts and crawl for likely feeds. I haven't investigated it, but the big companies probably use all of the above. At this point I don't think finding feeds is the problem – I think collecting all of the new posts is the problem. Somebody has to be left out.
Who that somebody is again depends on strategy – but I think in most cases these are so-called long-tail folks. In the blogosphere, this means folks with no links to them, probably with low post volume, probably on a blog host known for giggly prepubescent chatter (Live Journal for example) or low-hanging fruit for spam bloggers, inevitably and sadly now being called sploggers (Google's Blogger for example), or people with blogs running on their own server who haven't registered them with blog search engines and don't use ping servers. These are the lone voices that are most likely – although not always – left out.
These voices are usually not speaking to the general public – they're talking to close friends or family. There are lots of abandoned blogs in this category too. Some vendors understand that these voices may be important in aggregate but don't have the bandwidth or storage to collect all of the seemingly mindless chatter so they take a true sampling approach, collecting data from these blogs only after everything else is collected and then only a limited amount with strategies designed to collect a representative sample of this category of blog.
Furthermore, a post collection strategy which might work for a general purpose public search engine such as Technorati might be lousy for market research of a particular demographic or community. [DISCLAIMER: I don't know how Technorati actually chooses what to index – I'm just using them as an example of a public search engine.] This is because it is in the search engine's best interest (and that of their target customer) to index the most popular sites first – as these are the most likely target of the search. This strategy biases against the aforementioned prepubescent monoposters. This of course doesn't matter unless you're trying to measure the angst of 13-year-old kids – which you might be.
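Here's a minimal sketch of what that "popular first, then sample the tail" approach could look like. To be clear, this is my own illustration and not any vendor's actual method: the function name plan_collection, the inbound-link threshold, the polling budget and the tail cap are all assumptions I made up for the example.

```python
import random

def plan_collection(feeds, budget, tail_cap, link_threshold=10, seed=0):
    """Pick which feeds to poll in one pass (hypothetical strategy).

    feeds: list of (feed_url, inbound_link_count) pairs.
    Feeds at or above link_threshold are collected first, most linked first.
    Whatever budget remains goes to a simple random sample of the long tail,
    capped at tail_cap feeds.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    head = sorted(
        (f for f in feeds if f[1] >= link_threshold),
        key=lambda f: f[1],
        reverse=True,
    )
    tail = [f for f in feeds if f[1] < link_threshold]

    chosen_head = head[:budget]
    remaining = budget - len(chosen_head)
    tail_slots = max(0, min(remaining, tail_cap, len(tail)))
    chosen_tail = rng.sample(tail, tail_slots)

    return [url for url, _ in chosen_head], [url for url, _ in chosen_tail]

# Made-up feed list: 3 well-linked blogs and 100 blogs nobody links to.
feeds = [(f"head-{i}", 50 + i) for i in range(3)]
feeds += [(f"tail-{i}", 0) for i in range(100)]

head, tail = plan_collection(feeds, budget=10, tail_cap=5)
print(head)       # the 3 popular feeds, most linked first
print(len(tail))  # 5 long-tail feeds chosen at random
```

Notice what this does to the base: every popular blog is in, but only 5 of the 100 quiet ones are. If you later count posts without weighting the tail back up, the long-tail folks are underrepresented by design.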
There are errors on the other side of the fence too – meaning over-collection. In the rush to collect all of the posts, many firms scarf up spam blogs and foreign language posts and automated feeds generated from aggregating other feeds (essentially double counting the original post). For the purpose of searching, foreign language blogs are important – but for Blog Analytics where language dependent techniques are used to count posts, languages other than the expected one can create wonky results. I've written about this using BlogPulse here and here. Technorati has recently introduced language filters to help combat this issue. It is worth mentioning that BlogPulse is a wonderful free search service provided by Intelliseek and the fact that they collect blogs in all languages doesn't mean that they skew the results of their commercial services. Them kids over at Intelliseek is bright.
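As a rough illustration of cleaning up over-collection, here's a sketch that drops posts in an unexpected language and posts whose text has already been seen (the aggregated-feed double count). The post format is an assumption on my part: I'm pretending each post is a dict with a "text" field and a "lang" field that came from feed metadata, which real feeds often don't supply. Exact-match hashing is also the crudest possible duplicate check and won't catch a re-syndicated post that has been trimmed or reworded.

```python
import hashlib
import re

def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def filter_posts(posts, expected_lang="en"):
    """Keep the first copy of each post that is in the expected language."""
    seen = set()
    kept = []
    for post in posts:
        if post.get("lang") != expected_lang:
            continue  # wrong or unknown language: leave it out of the base
        fp = fingerprint(post["text"])
        if fp in seen:
            continue  # same text already counted, e.g. via an aggregator feed
        seen.add(fp)
        kept.append(post)
    return kept

# Made-up posts for illustration.
sample = [
    {"source": "original-blog", "lang": "en", "text": "Johnny Depp was great."},
    {"source": "aggregator", "lang": "en", "text": "Johnny  Depp was GREAT. "},
    {"source": "french-blog", "lang": "fr", "text": "Johnny Depp est formidable."},
    {"source": "other-blog", "lang": "en", "text": "Saw a movie last night."},
]

kept = filter_posts(sample)
print([p["source"] for p in kept])  # original-blog and other-blog survive
```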
All of these approaches can lead to bias for or against some group. The real question is:
Is the post collection strategy used by your analytics vendor capable of producing a representative sample for your target study group?
In the above picture, is the green oval (or an extractable subset) representative of the teal one?
Stay Tuned
Next post, I'll talk about natural language processing and paring down collected blog posts into the base of a study.
