Saturday, 20 August 2005

Anatomy of Blog Analytics: Data Collection

[For those of you just tuning in, you might want to read this and this first. I'm going to be talking about this diagram.]

Let's say you're a marketer and you want to know:

"What percentage of Gen X Females mention Johnny Depp?"

According to my wife, the answer is "all of them" but let's pretend for a moment that we didn't know and that we wanted to determine the answer using Blog Analytics. How would we get to an answer? How accurate would the answer be? What would the answer mean?

These are all important questions that should be asked when applying any research methodology. With traditional research techniques such as telephone surveys or focus groups, the answers are pretty obvious to marketers and market researchers, but blog analytics, I suspect, looks a little more like voodoo than science to lots of these folks.

When we ask these questions, what we really want to know is whether the findings of the research will be valid, or in other words, to what extent the findings will be affected by error. All market research methodologies have error: sampling error, error from question bias, order bias, respondents lying, technical error (data keying error, etc.) and so forth. Because the mechanics of blog analytics are so new, I don't think they are deeply understood by marketers. I think one area where marketers and analysts need to be better informed is the sources of error possible in blog analytics. Not to discredit the methodology but to provide insight into it. Understanding the sources of error in Blog Analytics can help marketers and analysts better recognize appropriate opportunities for applying this important technology. It will also help them to scrutinize results in meaningful ways – much like they do today with survey results for example. A good survey researcher can not only spot suspect data but can often predict its cause based on a deep understanding of the methodology.

So let's jump in with sampling.

Blog Analytics Sampling

Everyone seems to be obsessed with estimating the size of the blogosphere and aspiring to collect and index each and every post. While this is an admirable goal, I'm not sure it's even possible (and suspect it's not). What this means is that no one is measuring all of the blogosphere. This is okay. Outside the blogosphere we don't measure everyone – we just measure some people and assume that these folks' opinions represent everyone else's. In other words, we sample.

All Blog Analytics companies sample (even if they try to get all the posts – they're bound to miss a bunch). Some firms prioritize the source and give preferential treatment based on some criteria: popularity as measured by links, or blog host (index Live Journal last, or don't index it at all, for example). Some firms only collect posts from specific blogs, such as known political bloggers. Some only collect posts that are in a particular language. The point being, everyone samples – either by choice or by technical limitation.

In my recent evaluation of Umbria Communications, I noted that they claim to monitor millions of blogs (this is a common claim). But in their published works such as their coverage of the 2004 Presidential Election they provide very little information on how many or which blogs were included in the analysis. For example, they have a graph showing total posts about Bush vs. those about Kerry. On 11/7 they show approx 114K mentions for Bush and 72K mentions of Kerry. Then they show this information as a percentage of total mentions – approx 61% Bush, 39% Kerry. (Before you grab a calculator: 114 / (114 + 72) = .6129 so the math works, but keep reading).

To the general public this looks fine – but to a market researcher, there are lots of unanswered questions...

The percentages imply that the mentions were mutually exclusive – either a post was about Bush OR Kerry.

  • What if a post was about both Bush and Kerry?

  • Did this post get thrown out?

  • Was it counted twice – once for each candidate?

  • If a post can count twice then what do the percentages mean? (Think of marbles – 1 black, 1 white and 9 black & white swirl. Are 50% of the marbles white?)
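To make the marble question concrete, here is a minimal sketch of three ways the same items could be counted. The function names and the tagging scheme are my own illustration; I have no idea which convention (if any of these) Umbria actually uses.

```python
from fractions import Fraction

# Hypothetical data: each marble (or post) carries a set of labels.
# 1 black, 1 white and 9 black & white swirls.
marbles = [{"black"}, {"white"}] + [{"black", "white"}] * 9


def share_of_mentions(items, label):
    """Share of all label occurrences; a swirl counts once per label."""
    total = sum(len(tags) for tags in items)
    hits = sum(1 for tags in items if label in tags)
    return Fraction(hits, total)


def share_of_items(items, label):
    """Share of items that carry the label at all."""
    hits = sum(1 for tags in items if label in tags)
    return Fraction(hits, len(items))


def share_exclusive(items, label):
    """Share among items with exactly one label (swirls thrown out)."""
    singles = [tags for tags in items if len(tags) == 1]
    hits = sum(1 for tags in singles if label in tags)
    return Fraction(hits, len(singles))


print(share_of_mentions(marbles, "white"))  # 1/2
print(share_of_items(marbles, "white"))     # 10/11
print(share_exclusive(marbles, "white"))    # 1/2
```

Counting mentions and throwing out the swirls both say "50% white", while counting marbles says 10 of 11 have white in them. Same data, very different story, which is why the base has to be stated.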

Umbria does some interesting speaker analysis: it reportedly uses NLP to determine demographics and identifies "new speakers" based on post frequency.

  • Does this analysis include posts coming from speakers younger than voting age? (Young folks would be excluded in survey research).

  • If a single speaker posted multiple times per day was that counted once or each time?

  • During the 10 week period of the study, Umbria identified anywhere from about 20K to 180K postings per week. How many speakers does this represent?
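The posts-versus-speakers question can be illustrated with a tiny sketch. The speakers and posts below are made up, and treating "speaker" as a simple author name is an assumption for the sake of the example.

```python
from collections import Counter

# Hypothetical (speaker, candidate) pairs, one pair per post.
posts = [
    ("ann", "Bush"), ("ann", "Bush"), ("ann", "Bush"),
    ("bob", "Kerry"),
    ("cy", "Kerry"),
]

# Base = posts: every post counts, so a prolific speaker counts many times.
post_counts = Counter(candidate for _, candidate in posts)

# Base = speakers: each (speaker, candidate) pair counts once.
speaker_counts = Counter(candidate for _, candidate in set(posts))

print(post_counts)     # Bush 3, Kerry 2
print(speaker_counts)  # Kerry 2, Bush 1
```

With posts as the base Bush leads 3 to 2; with speakers as the base Kerry leads 2 to 1. Neither is wrong, but they answer different questions.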

Umbria claims that they have NLP tools that help them determine context.

  • Are posters talking about the presidential candidates or the baked bean folks and the director?

  • So are we looking at people that are talking about the candidates or any mention?

  • Does a mention of Bush or Kerry count if it is not in reference to the election?

I think you're probably getting the point. When looking at Blog Analytics, before any of the findings – especially quantitative ones – are used, an analyst needs an understanding of what is represented by the base.

Source of Sampling Error: Post Collection Bias

Of course, understanding your sample doesn't mean that it is without error. So what kind of error can creep into a blog analytic sample? I'm glad you asked. The first opportunity is the point at which the blog posts are collected. Where do the blog posts come from? Is the source adequate and appropriate for the study? What posts are missed using a particular collection strategy? Does the collection strategy impose artificial bias? These are all valid questions for an analytics vendor.

To relate back to my original graph, here's a deconstructed version to help illustrate what I'm talking about. (Remember, our hypothetical question is "What percentage of Gen X Females mention Johnny Depp?" )




So what's common practice and how can bias be introduced? Well, I can't speak in detail about any particular vendor's methodology but I can present some generalities. I think most Blog Analytics vendors are relying on RSS or Atom feeds instead of scraping web pages, although I think how the feeds are discovered and the priority in which they are checked vary widely. Some rely heavily on ping servers; some poll a list of known feeds; some follow links in other posts and crawl for likely feeds. I haven't investigated it, but the big companies probably use all of the above. At this point I don't think finding feeds is the problem – I think collecting all of the new posts is the problem. Somebody has to be left out.

Who that somebody is again depends on strategy – but I think in most cases these are so-called long-tail folks. In the blogosphere, this means folks with no links to them, probably with low post volume, probably on a blog host known for giggly prepubescent chatter (Live Journal for example) or for being low-hanging fruit for spam bloggers, inevitably and sadly now being called sploggers (Google's Blogger for example), or people with blogs running on their own server who haven't registered them with blog search engines and don't use ping servers. These are the lone voices that are most likely – although not always – left out.

These voices are usually not speaking to the general public – they're talking to close friends or family. There are lots of abandoned blogs in this category too. Some vendors understand that these voices may be important in aggregate but don't have the bandwidth or storage to collect all of the seemingly mindless chatter so they take a true sampling approach, collecting data from these blogs only after everything else is collected and then only a limited amount with strategies designed to collect a representative sample of this category of blog.
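Here is a rough sketch of what that "head first, then sample the tail" idea might look like. Everything in it is hypothetical: the `plan_collection` name, the link threshold, and the notion of a fixed polling budget are my assumptions, not any vendor's actual strategy.

```python
import random


def plan_collection(feeds, budget, tail_threshold=1, seed=0):
    """Pick which feeds to poll under a fixed budget.

    feeds: list of (feed_id, inbound_links) tuples.
    Feeds with more than tail_threshold links are the "head" and are
    taken first, most linked first. Whatever budget is left is spent
    on a simple random sample of the long tail.
    """
    head = [f for f in feeds if f[1] > tail_threshold]
    tail = [f for f in feeds if f[1] <= tail_threshold]
    head.sort(key=lambda f: f[1], reverse=True)

    chosen = head[:budget]
    remaining = budget - len(chosen)
    if remaining > 0:
        rng = random.Random(seed)  # seeded so the plan is repeatable
        chosen += rng.sample(tail, min(remaining, len(tail)))
    return chosen


feeds = [("h1", 50), ("h2", 20), ("h3", 5)] + [(f"t{i}", 0) for i in range(10)]
plan = plan_collection(feeds, budget=5)
print(plan)  # the three head feeds, then two randomly chosen tail feeds
```

A simple random sample is only representative of the tail as a whole. If the study cares about a particular slice of the tail (a host, an age group), the sample would need to be stratified on that, and the tail results weighted back up by the sampling fraction.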

Furthermore, a post collection strategy which might work for a general purpose public search engine such as Technorati might be lousy for market research of a particular demographic or community. [DISCLAIMER: I don't know how Technorati actually chooses what to index – I'm just using them as an example of a public search engine.] This is because it is in the search engine's best interest (and that of their target customer) to index the most popular sites first – as these are the most likely target of the search. This strategy biases against the aforementioned prepubescent monoposters. This of course doesn't matter unless you're trying to measure the angst of 13 year old kids – which you might be.

There are errors on the other side of the fence too – meaning over-collection. In the rush to collect all of the posts, many firms scarf up spam blogs, foreign language posts and automated feeds generated from aggregating other feeds (essentially double counting the original post). For the purpose of searching, foreign language blogs are important – but for Blog Analytics, where language-dependent techniques are used to count posts, languages other than the expected one can create wonky results. I've written about this using BlogPulse here and here. Technorati has recently introduced language filters to help combat this issue. It is worth mentioning that BlogPulse is a wonderful free search service provided by Intelliseek, and the fact that they collect blogs in all languages doesn't mean that they skew the results of their commercial services. Them kids over at Intelliseek is bright.
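As a minimal sketch of guarding against two of these over-collection problems, the snippet below drops posts outside the expected language and posts whose text has already been seen (for example, via an aggregator feed). It assumes each post already arrives with a `lang` field, say from feed metadata; real language detection and near-duplicate matching are harder than this.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def filter_posts(posts, language="en"):
    """Keep posts in the expected language, skipping exact re-posts.

    posts: list of dicts with hypothetical "lang" and "text" keys.
    """
    seen = set()
    kept = []
    for post in posts:
        if post.get("lang") != language:
            continue
        digest = hashlib.sha1(normalize(post["text"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text already counted once
        seen.add(digest)
        kept.append(post)
    return kept


posts = [
    {"lang": "en", "text": "Johnny Depp was great"},
    {"lang": "en", "text": "johnny  depp was GREAT"},   # aggregator copy
    {"lang": "fr", "text": "Johnny Depp est formidable"},
]
print(len(filter_posts(posts)))  # 1
```

Hashing only catches copies that are identical after normalization. An aggregator that truncates the post or wraps it in boilerplate would slip through.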

All of these approaches can lead to bias for or against some group. The real question is:

Is the post collection strategy used by your analytics vendor capable of producing a representative sample for your target study group?

In the above picture is the green oval (or an extractable subset) representative of the teal one?

Stay Tuned

Next post, I'll talk about natural language processing and paring down collected blog posts into the base of a study.

Posted by Matt Galloway at 10:36 PM in Word-o-Mouth

Anatomy of Blog Analytics: Background

In a post a few days ago, I presented a diagram to set the stage for a discussion about evaluating blog analytics methodologies. There's a lot to be said and I'll probably be writing a couple posts on the topic. But I think it would be helpful to start with some back story and a definition of scope.

Definition of Scope

I'm going to focus on blog analytics. Most, if not all, of the stuff I'm going to talk about can be (and is being) applied to other mediums – online forums, news groups – even computer analysis of open ends collected from traditional research, I'd imagine. But my focus is going to be on blogs, not so much because they're trendy but because the technical aspects of collecting blog data are a little more clear cut (which is arguably why so many folks are doing it). But as Matthew Hurst points out and Jonathan Carson agrees (in comments on Matthew's blog) – blogs are just the tip of the online consumer generated content iceberg.

I think blog analytics are becoming an important part of the market research landscape – but certainly not the whole landscape. Generally speaking, there needs to be a deeper understanding of blog analytic techniques among consumers of MR data before they can really be applied effectively.

Survey research techniques, for example, have evolved over decades and decades. The basics are pretty simple to understand – even intuitive. (Note to traditional MR folks and statisticians: don't freak out. There are lots of very complex nuances with survey work but I'm talking about concepts – the basic idea is pretty straight forward.)

Online consumer generated content analytics, on the other hand, are far from straight forward. From understanding the mechanics and limitations of blog post data collection, to having reasonable expectations of data mining techniques such as natural language processing and machine learning – I expect most marketing folks think of blog analytics as more black art than science.

Maybe I'm wrong – if you're a marketer and you feel that you and your colleagues have a good grasp on this, please email me.

But assuming I'm right, the Blog Analytics industry needs to be thinking about education. This is the thought driving my discussion of analytic techniques.

Blog Analytics: Quantitative or Qualitative?

One big question most analysts should be asking – is blog analytics quantitative or qualitative research?

I love this question. I love it because it forces an important point to center stage: Blog Analytics is not like anything we've done before - it's new – it's both qualitative and quantitative. Unfortunately, this might make it a hard sell to some marketing folks because they tend to be polar about which camp they're in.

Think about a simple example of blog mining – do a BlogPulse trend graph on "Johnny Depp". It's easy to see this is a quantitative analysis: there are percentages and a neat graph. If you were a survey researcher you might think about this as a sort of "unaided awareness." But what do these percentages mean? They are mentions in unsolicited, unprompted narratives. This is more like a focus group – only without the focus. This is a more qualitative approach to data collection. Furthermore, to gain any sort of actual insight into why something is rising or falling in mention, a researcher needs to start reading posts (or apply a computerized qualitative algorithm).

When considering a Blog Analytics vendor, I think I'd ask the question "Is your approach qualitative or quantitative?" The answer should be revealing.
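For a sense of what sits behind a trend graph like that, here is a minimal sketch that computes, for each day, the percentage of collected posts mentioning a phrase. This is my own guess at the general idea, not BlogPulse's actual method, and the `mention_trend` name and the plain substring match are assumptions.

```python
from collections import defaultdict


def mention_trend(posts, phrase):
    """Percent of each day's posts that mention the phrase.

    posts: iterable of (date_string, text) pairs.
    Returns {date_string: percent}, in date-string sort order.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = phrase.lower()
    for day, text in posts:
        totals[day] += 1
        if needle in text.lower():
            hits[day] += 1
    return {day: 100.0 * hits[day] / totals[day] for day in sorted(totals)}


posts = [
    ("2005-08-19", "Saw the new Johnny Depp movie"),
    ("2005-08-19", "Did laundry today"),
    ("2005-08-20", "Nothing to report"),
]
print(mention_trend(posts, "Johnny Depp"))
# {'2005-08-19': 50.0, '2005-08-20': 0.0}
```

Notice that the denominator is "posts collected that day", so the number moves with whatever the collection strategy happened to pick up, not just with how much people are talking about Johnny Depp.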

My discussion here is going to focus on both qualitative and quantitative approaches of collection and analysis with quantitative reporting. This is what Umbria Communications seems to be doing based on their published findings.

Blog Analytics: Long Tail or Influential?

After I posted my diagram, Jonathan Carson of BuzzMetrics and Matthew Hurst of Intelliseek jumped ahead – partly in response to Danger's comments on my original post. All three are interesting perspectives and I recommend you read them. But for the under-motivated, here's the Cliff Notes version:

Danger: Blogs can't answer specific questions unless bloggers offer the answers without being asked. (Good point.) Blogs may not reflect the thoughts and opinions of the "general population who live in the hinterlands of white space outside of the cozy confines of your diagramed blogosphere." (Also good point – lots of folks are trying to answer this.)

Jonathan: Thinks Danger has missed the point. "Analyzing buzz is very specifically NOT about extrapolating out to the general population." he tells us. Instead, he asserts, buzz (the blogosphere in our case) is about reaching early adopters and the early majority. Jonathan's approach is an Influentials strategy – also quite valid I think.

Matthew: Thinks Jonathan's perspective is a bit narrow and that blog trends can be translated outside the blogosphere if supported with a model built on background (traditional) research.

I think these three guys are all correct (no, no, no, not about Danger missing the point or Jonathan being narrow). The lesson here is that, depending on the particular application, Blog Analytics can be either or both influencer focused and long-tail focused. To Jonathan's point, though, most research I've seen indicates online posters are far more likely to exhibit Influential characteristics than the general public – so by virtue of measuring blogs alone you might skew towards early adopters. Matthew's point is also a good one – when looking at products or services for which online posters are not necessarily acting as Influentials – such as movies or movie stars – the blogosphere probably acts a lot like the long-tail, and perhaps by extension, the general public. Furthermore, there are techniques to diminish or amplify these phenomena depending on the focus of the study.

Again, it depends on approach. This is why marketers need to understand the options and methodologies – so they can determine what a prospective vendor is measuring.

Traditional Analogies

While Blog Analytics is new and different, there are lots of analogies that can be drawn to tried and true traditional research methodologies. For Blog Analytics to make their way onto the MR landscape, something else will need to be displaced. To figure out what, marketers will need to understand where Blog Analytics can outperform traditional methods (either with better data or lower cost). To help understand these strengths and weaknesses, I think it's helpful to draw analogies from new analytics techniques to traditional techniques such as telephone survey, intercept survey, focus groups, etc.

An example related to the long tail/influencer discussion above might be an analogy to sample selection for a traditional phone survey. If a researcher wants "general population" opinion using a phone survey, they might use an RDD sampling methodology. To do this, the subject of the research needs to be high incidence, meaning that if you randomly call someone they'll know what you're talking about. A good example might be an awareness study of a CPG product – something like "Have you heard of the laundry detergent X?"

In the blogosphere, laundry detergent doesn't tend to be a hot topic or the theme of specific blogs. If detergent is mentioned at all, it's in passing. These folks are not laundry detergent Influentials. A blog analytics approach might yield similar results in revealing unaided impressions of the product. It might also be able to gauge effectiveness of a marketing campaign by measuring change in mentions over time. This approach could be adjusted to focus only on those speakers who have posted about the product in the last 12 months – this would be more like an "of those who are unaided aware" study or maybe a customer sat study.

If, on the other hand, a researcher is doing a political study, they might use a phone survey and call registered voters. The blogosphere is skewed Influential when it comes to politics so you'd probably get different results (arguably more valuable results – but I'm stealing Jonathan's thunder). But since politics is so popular online you could screen out the most influential voices (measured by links perhaps, or by frequency of mention) and get a more general population view. This is analogous to screener questions in a phone survey – "Have you volunteered to help with a political campaign in the last 12 months?" might be used in a phone survey to throw out those whose viewpoint is less than objective and therefore potentially non-representative.
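A blog-side "screener" along these lines could be as simple as the sketch below, which drops the most-linked speakers before any counting happens. The `inbound_links` field and the 10% cutoff are placeholders I picked for illustration; where to draw the line is a judgment call for the study design.

```python
def screen_out_influentials(speakers, top_fraction=0.1):
    """Drop the most-linked speakers, keeping the rest.

    speakers: list of dicts with a hypothetical "inbound_links" key.
    top_fraction: share of speakers (ranked by links) to screen out.
    """
    ranked = sorted(speakers, key=lambda s: s["inbound_links"], reverse=True)
    cut = int(len(ranked) * top_fraction)  # rounds down
    return ranked[cut:]


speakers = [{"name": f"s{i}", "inbound_links": i} for i in range(10)]
kept = screen_out_influentials(speakers, top_fraction=0.1)
print(len(kept))  # 9; the speaker with 9 inbound links is screened out
```

Flip the slice around (keep `ranked[:cut]`) and you have the opposite study: Influentials only.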

The point being, marketers might understand Blog Analytic techniques better, if they were presented in the framework of the research that they already understand deeply.

Stay Tuned...

Okay, enough with the back story and besides this post is already getting ridiculously long – a point for which I continue to get criticized. So I'm gonna stop here. With my next post, I'll hit the ground running and dig into my fancy graph.

Posted by Matt Galloway at 5:35 PM in Word-o-Mouth