Thursday, 30 June 2005
Dissecting BlogPulse: Part 1
Whew! I've been distracted lately by all the flag waving and burning going on around here. Speaking of which, be sure to check out the latest comments on Flag Desecration Amendment Follow-Up.
But now it's time to get back to everybody's favorite free blog trending tool: BlogPulse.com. Late last week I announced that BlogPulse was broken and since then I keep hoping that they'll straighten things out - but so far, no chicken. Ever since June 20 - Black Monday – their trend graphs have been erroneously low. They completely messed up my Jeff Jarvis experiment. New graphs are no fun to run 'cause I know they're wrong. In short, they're muckin' with my chi.
...but alas, the internet never sleeps and my public awaits. [Yes Joey, I may have just jumped the shark.]
Being unable to rely on BlogPulse data over the last week has prompted me to take a new tack. Instead of focusing on what BlogPulse can tell us about, well, the blog pulse, I've been focusing on what BlogPulse can reveal about itself: how do it work and how do it don't. The comments we've seen from Intelliseek have assured us that the problem is not systematic and it is localized to June 20 and the (increasing number of) days immediately following, but I'm still skeptical and I've indicated that I would share why – this is where I'll begin.
In several of the trend graphs that I examined I noticed a general downward trend that I couldn't necessarily explain. After Black Monday I started trying to find a way to illustrate this phenomenon and explore possible explanations. To this end I started thinking about search phrases that should maintain a reasonably constant % hit no matter what happens to the universe size of the blogosphere. The idea first came to me when I did the masculine vs. feminine word trend graph in my original Who are those pesky bloggers anyway? post. I noticed that these trends stayed roughly parallel but were both declining. At the time I speculated about the changing composition of the blogosphere affecting these numbers, but the more that I thought about it, the less probable I thought it was. No matter what you write about, you still use "he" and "she" about the same amount. Still, this was just a hunch and not very scientific.
So I started working on a more scientific test. After some head scratching I came across this 100 Most Frequently Used Words list on About.com. This list is not terribly official and the word frequencies depend on the docs you analyze, but I ultimately felt that these 100 Most Frequent were as good as any. My premise was that, over time, the percentage of blog posts in which at least one of these words appeared should stay roughly constant regardless of what people were writing about, unless something dramatic changed (more on what might change later).
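To make the premise concrete, here is a minimal Python sketch of the measurement: for each day, the share of posts that contain at least one word from a fixed list. This is only an illustration, not how BlogPulse computes its trends (that isn't documented here). The function name `daily_hit_percentage`, the sample posts, and the short word list are all made up for the example.

```python
import re
from collections import defaultdict

def daily_hit_percentage(posts, words):
    """Return {day: percent of that day's posts containing at least one word}.

    posts is an iterable of (day, text) pairs; words is the fixed word list.
    """
    wanted = {w.lower() for w in words}
    totals = defaultdict(int)  # posts seen per day
    hits = defaultdict(int)    # posts per day with at least one wanted word
    for day, text in posts:
        totals[day] += 1
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if tokens & wanted:
            hits[day] += 1
    return {day: 100.0 * hits[day] / totals[day] for day in totals}

# Made-up sample data, purely for illustration.
sample_posts = [
    ("2005-06-19", "I think he said so"),
    ("2005-06-19", "Photo dump"),
    ("2005-06-20", "What a day"),
    ("2005-06-20", "lol"),
    ("2005-06-20", "zzz"),
    ("2005-06-20", "hmm"),
]
sample_words = ["he", "she", "what", "I"]

trend = daily_hit_percentage(sample_posts, sample_words)
```

Because the result is a ratio, doubling the number of posts on a given day changes nothing as long as the mix of posts stays the same. That is why a steady decline in this number is hard to blame on blogosphere growth alone.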
Now before I charge off into my findings we need a quick lesson in index-based full-text searching. There's a biscillion techniques used for full-text searching, but one of the most common is the removal of "noise words" from both the index and the query. Noise words are words that don't help distinguish one document from another. There are doctoral theses written about how to select noise words, but in simple general-purpose text search engines they are usually the words that are used very frequently. Since these words (for example: a, the, an, to, in, is) are used in almost every blog entry, they have no value in terms of distinguishing documents during search and as such they are thrown out. If you do a search for one of these words individually you will get a flat line on zero. Furthermore, these words are thrown out of combinational queries, so searches for "cat hat" return the same result set as "cat in the hat" because "in" and "the" are thrown out.
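The noise-word behavior described above can be sketched in a few lines of Python. This is a toy inverted index under stated assumptions, not BlogPulse's implementation: the `NOISE_WORDS` set is just the six example words from the paragraph above, the sample posts are invented, and a multi-word query is assumed to mean "all remaining words must appear".

```python
import re

# Assumption: only the six example noise words; the real list is unknown.
NOISE_WORDS = {"a", "the", "an", "to", "in", "is"}

def strip_noise(text):
    """Lowercase, split into words, and drop the noise words."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in NOISE_WORDS]

def build_index(posts):
    """Map each non-noise word to the set of post ids that contain it."""
    index = {}
    for post_id, text in posts.items():
        for word in strip_noise(text):
            index.setdefault(word, set()).add(post_id)
    return index

def search(index, query):
    """Return ids of posts containing every non-noise word in the query."""
    terms = strip_noise(query)
    if not terms:
        return set()  # a noise-only query matches nothing: the flat line on zero
    return set.intersection(*(index.get(t, set()) for t in terms))

# Invented sample posts, purely for illustration.
posts = {
    1: "The cat in the hat came back",
    2: "My cat is asleep",
    3: "A hat is a hat",
}
index = build_index(posts)

same_results = search(index, "cat hat") == search(index, "cat in the hat")
```

Here `same_results` is true because both queries reduce to the terms "cat" and "hat" before the index is ever consulted, and a search for "the" alone returns nothing at all.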
I didn't want to waste my time with noise words, so I evaluated each of my 100 Most Frequently Used Words to determine whether or not it was noise. This analysis yielded 31 noise words, leaving 69 "signal words". They are:
about, after, all, an, been, called,
can, could, did, do, down, each, find,
first, from, had, has, have, he, her,
him, his, how, I, its, just, know, like,
little, long, made, make, many, may,
more, most, my, now, one, only, other,
out, over, people, said, see, she, so,
some, than, them, time, two, up, use,
very, water, way, we, were, what, when,
where, which, who, words, would, you,
your
I then created a 6 month BlogPulse trend of all entries that contain at least one of these 69 words. This is the resulting graph:

As you can see there has been a gradual but steady downward trend of the frequency of posts with at least one of these common words. This is concerning. We need to understand what is going on here if we are to use this data set for anything serious. What could cause this percentage to drop 10 to 12% in six months? I don't think writing style could change that dramatically nor do I believe that topical diversification could have an impact of this magnitude considering the breadth of the search phrase I'm using. But what is it?
Well, I'm still not exactly sure, and you'll have to wait for another post for the exploration of my hypotheses.
Oh, but before I sign off for the night, notice the lowest point on the graph – that's Black Monday. Now notice that the crash began a couple of days earlier but didn't plummet until June 20. Even if we discover an explanation for the general downward trend, the fact is that BlogPulse was well into failure days before Black Monday but it still took until June 27 before they posted anything on their site acknowledging something was wrong. Two words for you BlogPulse: transparency
Stay tuned for Part 2...
