Wednesday, 31 August 2005
Hillary Clinton Writes Like A Man!
I've been thinking a lot about determining demographics with text analytics. I'd suspect there's a fairly big margin of error in these types of algorithms, but I didn't know for sure – or how big, even if I was sure. So I've started digging a little more into research in this area. But first off, some hands-on fun...
Shawn Lea pointed me to a cool toy – The Gender Genie – which, given a small writing sample, applies some simple algorithms to try to determine gender of the writer. Go ahead and go try...I'll wait...
Okay, pretty neat, huh? So, now to illustrate my point that these things have some error, I started copying and pasting the posts of women into the Gender Genie to see who writes like a girl. The results, to me, were surprising – I actually had trouble finding female posts that the Genie thought were written by females. Here's a chart of a few examples. The links go to the particular posts I used. For each test I used the Genie's "blog entry" setting.
| Writes Like A Man | Writes Like A Woman |
|---|---|
I measured all of Hillary's posts, just to be fair, and it was unanimous – Hillary Clinton writes like a man! (but so do most of the other women I measured).
Interestingly, I couldn't find any males that wrote like females according to the Gender Genie. It seems the Genie has a masculine bias. Sexist pig.
Novelty, you say? Maybe, but it's loosely based on some serious science. Namely, that of a couple of serious scientists: Moshe Koppel of Bar-Ilan University in Israel, and Shlomo Argamon of Illinois Institute of Technology. The Gender Genie is based on their paper "Gender, Genre, and Writing Style in Formal Written Texts". For those of you interested in applied blog analytics, this paper discusses characteristics of formal writing that could be used to determine gender from text – just like Umbria Communications. Speaking of Umbria, it looks like they are sponsoring a serious academic Symposium on Computational Approaches to Analyzing Weblogs at Stanford University in March 2006. Moshe Koppel is on the program committee.
Mr. Koppel has also written on determining sentiment from text analysis. Sentiment is the word the NLP crowd uses to mean whether a particular text passage discusses its subject in a positive or negative way. This is another of Umbria's claims to fame. Alas, sentiment is a topic for another post.
Koppel and Argamon's paper is focused on formal writing because previous work in the area has been centered on informal writing. The conventional wisdom has been that informal writing reveals gender cues more easily than formal writing does. Koppel and Argamon had good results. They reportedly developed an algorithm that correctly identified the gender of 80% of their test corpus authors. Their test corpus was well defined, with each work averaging over 42,000 words and with equal representation of male and female writers, balanced nearly down to the subject level.
In other words, their process has 20% error in laboratory conditions.
In academic terms this is amazing to me. Frankly, 80% accuracy is much higher than I expected. But as a commercial market research tool I'm not sure of the value: 20% is a big margin of error when you're looking for significant differences in market research data. To be fair, Koppel and Argamon made no assertions as to the commercial value, nor did they suggest an application. Their work is impressive and important, and I want to be clear that I'm not suggesting otherwise. I'm merely asking the question: if a blog analytics vendor had 20% error in a gender-recognizing algorithm, would it be commercially valuable?
Consider this:
You collect blog posts from 200 bloggers that mention your product.
Automated text analysis identifies 100 as male and 100 as female.
70% of the males' posts indicate that they like your product.
50% of the females indicate that they like your product.
So, males like your product more than females, right? Wrong. These results are inconclusive. Why? Because what if 20 of the 100 "males", all of them bloggers who liked your product, were really misidentified females, and 20 of the 100 "females", all of them bloggers who didn't like your product, were really misidentified males? Then 70% of (actual) females like your product and 50% of (actual) males do. With 20% error the measured difference is not statistically significant.
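The arithmetic in that scenario can be checked with a few lines of Python. This is only a sketch of the worst case described above, reading it as 20 of the 100 bloggers in each labeled group being misidentified; the names `measured`, `relabel` and `like_rate` are made up for illustration and have nothing to do with any vendor's actual method.

```python
# Measured counts from the example: (labeled gender, likes product) -> bloggers
measured = {
    ("male", True): 70,
    ("male", False): 30,
    ("female", True): 50,
    ("female", False): 50,
}


def like_rate(counts, gender):
    """Fraction of bloggers with this gender label who like the product."""
    liked = counts[(gender, True)]
    total = liked + counts[(gender, False)]
    return liked / total


def relabel(counts, from_gender, to_gender, likes, n):
    """Return a copy of counts with n bloggers moved to the other gender label."""
    if n > counts[(from_gender, likes)]:
        raise ValueError("cannot relabel more bloggers than the cell contains")
    updated = dict(counts)
    updated[(from_gender, likes)] -= n
    updated[(to_gender, likes)] += n
    return updated


# Assumed worst case: 20 "males" who liked the product are really female,
# and 20 "females" who did not like it are really male.
actual = relabel(measured, "male", "female", likes=True, n=20)
actual = relabel(actual, "female", "male", likes=False, n=20)

print("measured:", like_rate(measured, "male"), like_rate(measured, "female"))
print("actual:  ", like_rate(actual, "male"), like_rate(actual, "female"))
```

Under those assumptions the measured 70/50 split in favor of males turns into a 50/70 split in favor of females, with both groups still 100 bloggers each.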
With that said, you need a measured difference of more than 40 percentage points (twice the potential error) before it can be considered significant. If the measured difference is that big, chances are you won't need an analytics tool to measure it.
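Here is that rule of thumb as a small Python sketch. It is a crude worst-case bound, not a proper statistical significance test, and it assumes each group's measured rate could be off by the full error rate in the least favorable direction. The function names are hypothetical, and all values are whole percentage points.

```python
def worst_case_gap(rate_a, rate_b, error):
    """Smallest gap (a minus b) still consistent with the measurement,
    if each rate may be off by `error` points in the worst direction."""
    return (rate_a - error) - (rate_b + error)


def is_conclusive(rate_a, rate_b, error):
    """True only if the measured gap exceeds twice the potential error."""
    return abs(rate_a - rate_b) > 2 * error


# The example from this post: 70% vs 50% with 20% identification error.
print(worst_case_gap(70, 50, 20))  # the gap could be as low as -20 points
print(is_conclusive(70, 50, 20))   # a 20-point gap does not clear the 40-point bar
print(is_conclusive(95, 50, 20))   # a 45-point gap does
```

A negative worst-case gap means the true ordering of the two groups could be the reverse of what was measured.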
And keep in mind, 80% accuracy is in laboratory conditions – consistent target audience, correct grammar and complete sentences, reasonable-length works (> 42,000 words). The average blog post is something like 200 words. Half the time the blogger is quoting someone else or using poor grammar, incomplete sentences and images of cartoons on business cards. Uh, wait, that last one is just Hugh Macleod. The point is that the blogosphere is anything but laboratory conditions. I can't imagine anyone is accurately determining gender for 80% of all posts in the blogosphere.
Koppel and Argamon also tell us that it is more difficult to determine gender from formal writing, and that females tend to be "involved" writers while males tend to be "informational" writers. These differences become more subtle as the genre moves from fiction to non-fiction, making the determination more difficult. Also remember that the simplified implementation in the Gender Genie skews masculine.
Thinking again about how this information might apply to blog analytics, it would seem to me that the younger the writer, the less formal the writing would tend to be and the easier it would be to determine gender. Among older writers, or more specifically, older bloggers, I would think the writing would start to skew towards more informational, utilitarian styles. I'm guessing that Gen X and Boomer women who are blogging, on average, aren't writing diary-style blogs at LiveJournal but instead are writing about Java and Passionate Users or Marketing and Public Relations. And if you were talking politics? What then...
Well, Umbria provided Buzz Reports to CNN for the 2004 Presidential Election; the site is here. They showed the demographic breakdown of blog posters that mentioned Bush in the pie chart that follows:

Now, look at the distribution of men to women, in total and among the three age groups. Does that look right to you? How could this change with 20% error in gender identification? What's Umbria's estimated error? I don't know. Thus far, they're not saying. But if you're considering buying research from anyone using these types of text analytics, it is a question you should ask.
