Friday, 24 June 2005

BlogPulse is Broken!


Something is terribly wrong with BlogPulse. To illustrate this, here is a chart of the last 2 months for "he or she or I or me or said or wrote or blog" that I ran today:




I picked this phrase because I knew that it would yield a lot of hits across all types of posts. As we've discussed, I originally thought that the gradual decline might represent diversification of the blogosphere, but now I'm forming a new theory...

BlogPulse has gotten a lot of blog buzz lately. Technorati reports 1,177 links to BlogPulse. As far as I can tell, folks have only been talking about BlogPulse for a fairly short period of time, and it's a very powerful and addictive tool. But if you graph anything right now (if you can get one to run), the numbers plummet on Monday, June 20, 2005. At first I thought that they might have added a huge number of additional blog entries, but what little analysis I can do indicates that they've stayed in the 360,000-425,000 blog entries/day range. For these reasons, I think that BlogPulse is suffering some major load problems and that these problems have been artificially skewing results for at least the last few months.

My theory is that BlogPulse's indexer is not able to keep up with the massive quantity of new blog posts. Because of its trending capabilities, it is critical to keep all blog index data for the trend period – not just the recent stuff. The more the index database grows, the harder it is for the indexer to keep up. The result is that a decreasing percentage of total blog entries gets indexed each day – even though they are being collected and stored. Since the vertical axis of the trend graphs is a percentage, the fact that not all blog entries were indexed should not have a significant impact on trends – assuming that the blog entries that were indexed are a representative sample (for the sake of argument, let's assume that this is reasonable). However, if BlogPulse calculates the percentages using the total number of blog entries collected, instead of the total number indexed, as the denominator (base), the result would be a gradual artificial downward trend. I now think this is what has been happening.

This trend continues until there is a catastrophic failure in the indexing process and less than half of what normally gets indexed (even with the decline) is indexed – POOF! – you get "Black Monday". I think this follows and I'm betting I'm not far off.
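To make the denominator argument concrete, here is a minimal Python sketch of the theory. Everything in it is an assumption for illustration: the function name, the daily post counts, the indexed fractions and the 50% match rate are made up, and this is not BlogPulse's actual calculation. It simply compares the percentage you get when dividing by posts indexed with the one you get when dividing by posts collected, as the indexer falls behind and then fails.

```python
# Hypothetical numbers throughout; none of these come from BlogPulse.

def trend_percentages(collected, indexed_fraction, true_rate):
    """Return (by_indexed, by_collected) percentages, one value per day."""
    by_indexed, by_collected = [], []
    for total, frac in zip(collected, indexed_fraction):
        indexed = total * frac       # posts the indexer actually got through
        hits = indexed * true_rate   # matching posts among the indexed ones
        by_indexed.append(100 * hits / indexed)    # base: posts indexed
        by_collected.append(100 * hits / total)    # base: posts collected
    return by_indexed, by_collected

# Five days of steady collection; the indexer slips a bit each day,
# then fails badly on the last day (the "Black Monday" case).
collected = [400_000] * 5
indexed_fraction = [1.0, 0.9, 0.8, 0.7, 0.3]

by_indexed, by_collected = trend_percentages(
    collected, indexed_fraction, true_rate=0.5
)

for day, (good, skewed) in enumerate(zip(by_indexed, by_collected), start=1):
    print(f"day {day}: indexed base {good:.1f}%   collected base {skewed:.1f}%")
```

With the indexed count as the base, the line stays flat no matter how far behind the indexer falls. With the collected count as the base, the same underlying interest drifts downward and then falls off a cliff on the last day.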

The sad thing is that BlogPulse isn't yet talking about it. I can't believe that the folks at BlogPulse don't realize that something bad is happening. As folks who are measuring blogs, you would think they would get transparency – especially given the nature of this wonderful product.

In light of all the recent buzz on using blogs for market research, this is bad news. Pete Blackshaw has been evangelizing these techniques all over the place and I agree with him that this is powerful and disruptive – but it has to be credible and statistically sound. If the data is flawed, so are the insights drawn from it. If this research is to be taken seriously, Mr. Blackshaw and Intelliseek need to step up to the keyboard and blog an explanation.

Posted by Matt Galloway at 12:57 AM in Technology & Culture

Comments on this entry:

Left by Natalie Glance at Fri, 24 Jun 9:49 AM

Matt,

You are absolutely correct that it is necessary to normalize with respect to the number of blog posts indexed per day (vs. number collected). This is what we do. The downward trend is a bug in indexing that occurred on June 20th, not a systematic error.
We're working on fixing those posts now.
