Tuesday, 5 July 2005

Blog spam is responsible for the decline of god

(or Dissecting BlogPulse: Part 2)

In Dissecting BlogPulse: Part 1 I discussed a downward trend in the number of blog postings that contain at least one of 69 common English words but stopped short of discussing the possible root cause(s) of this trend. This is what I'm investigating today.

Revisiting Our Common Words

When running trend graphs on BlogPulse, or any search engine for that matter, the time a query takes is directly related to its complexity. ORing 69 words together makes for a slow query; when I did this for the previous post, it took several attempts to get the query to complete. So before adding complexity to my query, I needed to simplify the one I had. One of the interesting things about the 69 common words I used in my original graphs is that their frequencies stay almost unnaturally parallel. Here are some examples:






The parallel nature of these trends further suggests that the decrease is not due to a change in writing style or topical diversification, because these types of phenomena would upset the frequency of these words relative to each other and thus disturb the parallel trends. On a personal note, I find it a little disturbing that when you take 300,000 to 500,000 blog entries from 10 million blog authors on any given day, the frequency of entries that contain any one of these words stays about constant. We're so predictable. Anyway, what this also tells us is that any subset of these words will yield a trend that approximates the trend of the original set of 69 words. I picked the six words with the highest usage: I, have, you, all, they and my.

Here's the trend for the original set of 69 words:



And here is the trend for the subset of (I, have, you, all, they, my):



As you can see, other than a magnitude difference of 10-15 points, the trends are very similar.
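For anyone who would rather check "very similar" with numbers than by eyeballing two graphs, here is a minimal sketch. It assumes you have copied the daily percentages off both graphs into two equal-length lists; the function name and the values below are made up for illustration and are not BlogPulse data.

```python
def offset_spread(full_set, subset):
    """Return (smallest, largest) day-by-day gap between two trend lines."""
    if len(full_set) != len(subset):
        raise ValueError("both trends must cover the same days")
    gaps = [a - b for a, b in zip(full_set, subset)]
    return min(gaps), max(gaps)


# Hypothetical daily percentages, NOT real BlogPulse numbers.
full_69_words = [80.0, 78.0, 75.0, 73.0, 70.0]
six_word_subset = [68.0, 65.0, 63.0, 60.0, 58.0]

low, high = offset_spread(full_69_words, six_word_subset)
print(f"gap between the trends stays within {low} to {high} points")
```

If the gap stays inside a narrow band across the whole period, the two lines are roughly parallel and the subset is a fair stand-in for the full set.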

Query Strategy

So now that we have a simplified common word query, we need some hypotheses to test and a strategy to test them. Let's look at these in reverse order. Strategy: my first impulse was to simply use a query like "NOT (i OR have OR you OR all OR they OR my)" and examine samples of results to look for evidence to support a hypothesis. This strategy fails for two reasons: 1.) BlogPulse does not support standalone negative queries, i.e. you can say "a AND NOT b" but you can't just say "NOT b"; BlogPulse will return an empty set. 2.) BlogPulse typically indexes 300,000 to 500,000 blog entries per day. Trying to manually analyze 25-50% of them is not practical, and my access to this data is limited to their web interface.

So what strategy do we use? Well, we sample. I looked for queries that yield approximately constant frequencies across our time period of interest. We then subdivide the matching entries into two groups: those that contain at least one of our six common words and those that don't. This allows us to use the query pattern "a AND NOT b" and yields smaller data sets that we can reasonably evaluate manually.
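The "a AND NOT b" pattern is easy to generate mechanically. The sketch below builds the three query strings for a sample term. The helper names are hypothetical, and the AND / OR / NOT syntax with parentheses is assumed from the query examples in this post rather than taken from any BlogPulse documentation.

```python
COMMON_WORDS = ("i", "have", "you", "all", "they", "my")


def or_group(words):
    """Join words into a parenthesized OR clause."""
    return "(" + " OR ".join(words) + ")"


def split_queries(term, common_words=COMMON_WORDS):
    """Build the three queries used to subdivide a sample term."""
    group = or_group(common_words)
    return {
        "all": term,
        "with_common": f"{term} AND {group}",
        "without_common": f"{term} AND NOT {group}",
    }


# The sample term is arbitrary; any roughly constant-frequency term works.
queries = split_queries("honda")
for label, query in queries.items():
    print(f"{label}: {query}")
```

Because the negative clause is always attached to a positive term, none of these is a standalone "NOT b" query.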

Hypotheses

So what are we looking for? We've already ruled out a dramatic change in writing style and topic diversification. So what else could it be? I think the decline in commonly used English words is the byproduct of three trends:

  • the increase of spam blogs

  • the increase in non-English language blogs

  • the increase in RSS/Atom feeds that contain empty or partial description and/or content payload

The first of these I'm going to tackle today. The second I hope to tackle tomorrow. The third I'm going to label "negligible" because, quite frankly, I'm not that interested in it and it's my blog, damn it.

The Rise of Blog Spam

Blog spam has been the topic of several recent articles as seen here, here and here. Most of the focus however has been on how it makes legitimate blogs harder to find, causes performance problems for the host or is generally annoying. For trending in blogs, however, blog spam is more serious – it has the potential to completely undermine confidence in consumer insights mined from legitimate blogs.

So, why do I think spam is part of our problem and how can we validate this idea? Blog spam tends to be designed to get search engine hits to drive the user to some sleazy website. To accomplish this, blog spammers create RSS feeds that contain little more than a series of words that are related to the product or service they are promoting. This results in posts that are not written in conversational English and subsequently have a disproportionately low representation of words that occur frequently in normal writing.

To test this idea we need a word or phrase that would normally be used in blog discussions but might also be a likely target for spammers. When I first noticed spam blogs I was doing some analysis of automobile manufacturers. It turns out that spammers frequently flood the blogosphere on a single day with dozens or even hundreds of posts containing lists of automotive brand names, model names and numbers, and auto descriptions. For example, if we look at all mentions of the word "honda" versus mentions of "honda" plus at least one of our common English words (I, have, you, all, they, my) versus mentions of "honda" with none of our common words we get this:



(Click here to run this trend yourself.)

As it turns out, all three of those big spikes are spam and not buzz. The one on April 12 happens to contain the word "all", so it gets past our very primitive spam filter. While it's overstating and oversimplifying to say that the orange line is all spam, from observation (after reading many of the underlying posts) I think it is fair to say that it is representative of spam. The important point here is that "spam" is increasing and artificially inflating the percentage of mentions of "honda". In at least three cases, spam spikes could lead us to the false conclusion that Honda was the buzz when in fact it was the bait.
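To make the "very primitive spam filter" concrete, here is a rough local equivalent of the split. It is only a sketch: I only have BlogPulse's web interface, so this assumes you have post text in hand, the function names are hypothetical, and the sample posts are invented for illustration.

```python
import re

COMMON_WORDS = {"i", "have", "you", "all", "they", "my"}


def has_common_word(text, common_words=COMMON_WORDS):
    """True if the text contains at least one of the common words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in common_words for token in tokens)


def split_posts(posts, term):
    """Split posts mentioning term into (with common word, without)."""
    term = term.lower()
    with_common, without_common = [], []
    for post in posts:
        if term not in post.lower():
            continue  # post does not mention the term at all
        if has_common_word(post):
            with_common.append(post)
        else:
            without_common.append(post)
    return with_common, without_common


# Invented sample posts, not real blog entries.
posts = [
    "I test drove a Honda and you should too.",
    "honda civic accord cheap parts dealer discount",
    "honda accord civic all models cheap financing",
    "Nothing about cars here, but I have opinions.",
]
with_common, without_common = split_posts(posts, "honda")
print(len(with_common), "with a common word;", len(without_common), "without")
```

Note that the third sample post is a keyword list that happens to include "all", so it lands in the "with common word" group, the same way the April 12 spike slips through.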

The Honda example is a microcosmic look at how spam can adversely affect brand analysis, but it is a small part of a greater trend. The trends here are slight but real and growing, and it's happening with every brand that spammers feel they can use to bait readers. To discover more dramatic evidence, let's examine more traditional spam words. Here is a trend of entries that contain various traditional spam phrases or words:



(Click here to run this trend yourself.)

Again, this is just a small sampling of spam words; the volume of all spam is much bigger. It's worth mentioning that not all blog entries that contain one or more of these words or phrases are spam. But since the trend of their use is rising significantly against the use of the most commonly used English words, and since we know from email that these are common spam phrases, it's probably pretty representative. I encourage you to run this trend on BlogPulse, click on the points of the graph, read the underlying blog entries and decide for yourself. Some of these blogs are amazingly clever, embedding dozens of hyperlinks into news stories and computer-generated prose to make them look like legitimate, human-written blogs.

In my earlier post Who are those pesky bloggers anyway? I included a trend that showed the decline of the word "god" in blogs and wondered aloud as to the root cause. I now offer this explanation:



(Click here to run this trend yourself.)

There you have it. Blog spam is responsible for the decline of god. Spammers, it turns out, seem to avoid "god" in spam blog entries. When legitimate bloggers write about "god" it tends to be in conversational language and as such includes the use of our common words. To validate this we can trend:



(Click here to run this trend yourself.)

Since the frequency of the word "god" is not propped up by spam as we saw with "honda", proportionate use of the word "god" is in steep decline.
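The arithmetic behind this is simple dilution: the percentage is matching posts divided by all posts, so if spam swells the total while leaving the matching count alone, the percentage drops even though nobody is writing about "god" any less. The sketch below works through it with entirely hypothetical daily counts (not BlogPulse figures).

```python
def share(matching_posts, total_posts):
    """Percentage of all indexed posts that match a term."""
    return 100.0 * matching_posts / total_posts


# Hypothetical daily counts for illustration only.
legit_posts = 300_000
god_posts = 6_000  # assumed constant; all come from legitimate blogs

for spam_posts in (0, 100_000, 200_000):
    total = legit_posts + spam_posts
    print(f"spam={spam_posts:>7}: god share = {share(god_posts, total):.1f}%")
```

A term like "honda" that spammers do use gets extra matching posts along with the extra total, which is why its percentage is propped up instead.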

Conclusions

Spam is definitely on the rise in the blogosphere and is threatening to diminish the potential of data mining from blog data.

Stay tuned, in part 3 we will look at non-English language blogs...

Posted by Matt Galloway at 8:32 PM in Technology & Culture