Thursday, May 21, 2009

The World's Worst Person?

A certain fascination exists with identifying the public figure with the lowest overall sentiment ranking. It tells us something about a given society to discover who the most demonized figure is, the person spoken about with the greatest anger or rancor.

Although our system does not make it easy to extract people by negative sentiment, any active news reader should be able to construct a rogues' gallery of evil and destructive people -- like mass murderers Adolf Hitler and Osama bin Laden, or the swindler Bernard Madoff. But as shown by these sentiment polarity charts, all their reported vileness pales next to that of Lori Drew.

Who is Lori Drew? She is a mother whose cyber-bullying drove a fragile 13-year-old girl (a rival of her daughter) to suicide, through messages sent from a fake MySpace profile. The public outrage over this pushes many buttons -- the general hatred of bullying defenseless children, public attitudes against overzealous parenting, fear of social network sites, and more. As of this writing the judge is trying to sentence her appropriately, a task complicated by the fact that she has not been convicted of anything stronger than violating the MySpace terms of service.

Now don't get me wrong. She is clearly guilty of such horrifying behavior that it is difficult to see how she can live with herself. But it is obviously an overreaction to mention her in the company of Hitler and bin Laden.

Realize that sentiment analysis aims at capturing what the world is thinking, not necessarily what it should be thinking or what is objectively true. Lydia sentiment signals measure interesting social and cultural phenomena, but their proper interpretation requires an understanding of context and the nature of the underlying news sources.

Edison Chen and the Computer Technician

One of the pleasures of my sabbatical year in Hong Kong has been reading the local English newspaper (The South China Morning Post) and getting exposed to a new universe of locally interesting characters. Edison Chen and the Computer Technician has been my favorite story of the year. Lydia news analysis provides very interesting insights into the story and, by proxy, the culture here in Hong Kong.

Edison Chen, son of a local tycoon, became a Cantopop (Cantonese pop music) singer and general entertainment/media personality. Think a male Paris Hilton, with a similar set of unseemly incidents involving various fights with people and, in one case, a taxi. This explains his generally negative sentiment scores (shown above) up to January 2008, when he took his computer in for repairs and hit the big time.

The computer technician (Ho Chun Sze) found a nice collection of sex photos of Mr. Chen with several female Cantopop stars, actresses, and models (Gillian Chung, Cecilia Cheung, Bobo Chan). The technician showed them to his girlfriend, who showed them to somebody else, and then they ended up on the Internet. All of these figures show up prominently as statistically juxtaposed with Edison Chen. In the wake of this scandal, Edison Chen's sentiment score suddenly turns positive, resulting from respect for his healthy social life, approval of his apologetic behavior (including retiring from the Hong Kong scene to live quietly in Vancouver), and sympathy for the fact that he was ultimately blameless for the release of the photos.

Perhaps most interesting of all are the international heatmaps displaying the spatial reference frequency (left) and sentiment (right). The frequency map shows the most intense interest in China, with secondary interest in countries with significant Cantonese communities (Canada and Australia). Chen's Chinese name was the number 1 search term in China in 2008. The sentiment map shows a negative reputation in all countries except China! Indeed, Chen finished second to Barack Obama in the Hong Kong Person of 2008 poll by RTHK radio, with almost 30% of the vote.

Monday, May 18, 2009

Sentiment: United States vs. China

Today's graphs give me a very queasy feeling. I decided to compare the sentiment in the United States vs. China, and let's say the results are not very good for the home team.

In particular, United States sentiment has been highly negative since the beginning of the dailies depository in November 2004, a rating I would like to attribute at least initially to the war in Iraq and the Bush administration in general. Indeed, the months marking Obama's election (November 2008) and his inauguration (January 2009) represent peaks in sentiment despite the economic crisis. But what floored me was the sharp negative spike in April 2009. I attribute this to kvetching about the long-term strength of the dollar, but regardless, U.S. sentiment polarity hit a new dailies low during this month.

By comparison, check out the sentiment graph for China over the same period. It has been generally positive since November 2004, with the exception of the Sichuan Earthquake in Spring 2008. The big positive spike in August 2008 results from the enormously successful Beijing Olympics, which also gave a nice boost to the U.S. that month. Negative sentiment from the world economic crisis rules the next several months, but the Chinese funk lifted in April as the U.S. continued to descend.

Now these generally negative U.S. and positive China sentiments reflect the longer-term time series from the thirty-year archival depository. Negative news always gets more play than positive news in U.S. newspapers, so it is the changes in sentiment which are more revealing than the absolute sign. The biggest plunge in U.S. sentiment in this period occurred (appropriately) in September 2001.

Friday, May 15, 2009

British News Processing

I gave a demo today to Sean Carey, a professor of political science at the University of Sheffield. He is interested in using our news analysis in cahoots with polling data from the British Election Study, to better understand how voters make their decisions, and why. They are gearing up for the next national election, likely to occur in Spring 2010.

Since their study revolves around why British voters vote the way they do, they are only concerned with data from British newspapers. The source set tab of TextMap Access makes it easy to create a source set from the dailies depository consisting of all newspapers from the United Kingdom. Once this source set is named and registered, it will appear as a new depository entry, ready for use in the frequency and sentiment tabs.

One interesting discovery in playing with it was the fraction of references to a local entity like "Gordon Brown" that came from British sources. The answer proved to be about 60%, which is quite impressive considering that less than 10% of our total spidered sources are from the United Kingdom. But it makes sense that he would be what the local readership is interested in...
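To put those two percentages side by side, here is a minimal sketch in Python of an over-representation ratio. The function name is made up for illustration and is not part of TextMap, and the calculation assumes every source contributes about the same volume of text, which is surely not exactly true. Since the source share is "less than 10%", plugging in 10% gives only a lower bound on the ratio.

```python
def overrepresentation(share_of_references: float, share_of_sources: float) -> float:
    """Ratio of a source set's share of an entity's references
    to that source set's share of all sources (hypothetical helper)."""
    if not 0 < share_of_sources <= 1:
        raise ValueError("share_of_sources must be in (0, 1]")
    if not 0 <= share_of_references <= 1:
        raise ValueError("share_of_references must be in [0, 1]")
    return share_of_references / share_of_sources


# Figures quoted in the post: about 60% of references, under 10% of sources.
# Using 10% as the source share makes this a lower bound.
ratio = overrepresentation(0.60, 0.10)
print(f"British sources are over-represented by a factor of at least {ratio:.0f}")
```

A ratio near 1 would mean British papers mention Brown no more than their numbers predict; anything well above 1 marks him as a local interest.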

Particularly amusing was to look at the entities juxtaposed with Gordon Brown at different time scales. Tony Blair, whom he served faithfully as Chancellor of the Exchequer, proves his strongest association over the full dailies depository (left column). The past year (center column) more strongly reflects his activities as Prime Minister, including interactions with world leaders (Obama, Sarkozy, Merkel). The strongest associations over the past month (right column) reflect recent activities. We were puzzled a bit by the strong association with Carol Ann Duffy, but a little reading revealed that Brown had just appointed her as the first female Poet Laureate.

One minor complication of British news processing is that the spelling and word usage are slightly different from what is used in the United States. The lexical resources we employ include both British and American spellings, and I expect that our NLP performance will be quite similar on British texts.

Thursday, May 14, 2009

Trends in Medline/Pubmed

Medline/Pubmed is a database of the abstracts from over 17 million published journal articles in the biomedical sciences. Processing such a corpus is short work for the current version of the Lydia system, taking only a day or so on our 28-node cluster computer. Our Medline depository provides a good example of how our news/blog processing system can be easily applied to any large-scale text corpus, with interesting results. Here are two interesting discoveries from my explorations this afternoon.

One of the most frequent entities in this depository is cancer, and one of the most frequent cancers is breast cancer. The figure above shows the pubmed frequency graph and rugplots for this disease. The log-scale frequency graph shows the relentless exponential growth in research on breast cancer since the mid-1970s. The regular, small-scale bumps over the years reflect the periodicities with which journals are issued, such as quarterly or semiannually.
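Exponential growth shows up as a straight line on a log-scale graph, so the slope of that line gives a doubling time. The sketch below, in Python, fits such a line by ordinary least squares. It is only an illustration: the counts are synthetic, generated to double every five years, and are not Medline figures, and the function name is my own.

```python
import math


def doubling_time(years, counts):
    """Least-squares fit of log(count) against year.

    Returns the number of years it takes the fitted curve to double.
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx  # growth rate per year on the log scale
    return math.log(2) / slope


# Synthetic series (NOT real Medline counts): starts at 100 articles
# in 1975 and doubles every 5 years by construction.
years = list(range(1975, 2009))
counts = [100 * 2 ** ((y - 1975) / 5) for y in years]

estimate = doubling_time(years, counts)
print(f"Estimated doubling time: {estimate:.2f} years")
```

On real frequency data the periodic bumps from journal issue schedules would add noise around the line, so fitting on yearly totals rather than monthly counts would likely give a steadier estimate.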

The rug plot (shown below the time series) shows the distribution of articles identified as news, business, sports, entertainment, or other by our statistical classification methods. Now these classifiers were tuned for news articles, and we do not expect too many sports/entertainment articles appearing in scientific journals (at least the ones I read). But still the results are quite interesting. There is a clear transition in the distribution starting in 1975, when the systematic collection of full text abstracts began.

My other experiment involved a sentiment plot for AIDS since the beginnings of the epidemic around 1982. Both the pubmed and archival depositories show the sentiment polarity of AIDS gradually but definitely drifting towards greater neutrality. The scientific sentiment represented by pubmed has improved from -0.72 in March 1983 to -0.58 in December 2009. The public sentiment about AIDS has risen from -0.78 to -0.48 over the same period. Now AIDS remains a horrible, incurable disease, but it has become regarded more as a chronic condition which can be treated than a deadly plague -- and our sentiment metrics are sensitive enough to pick up on this trend.
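The size of each drift is easy to check from the four polarity scores quoted above. This small Python sketch just subtracts start from end, assuming both depositories score polarity on the same scale; the dictionary labels are my own shorthand.

```python
# Start and end polarity scores as quoted in the post.
readings = {
    "pubmed": (-0.72, -0.58),
    "archival news": (-0.78, -0.48),
}


def polarity_shift(start: float, end: float) -> float:
    """Change in polarity; positive means sentiment became less negative."""
    return end - start


# Round to two decimals, the precision of the quoted scores.
shifts = {name: round(polarity_shift(start, end), 2)
          for name, (start, end) in readings.items()}

for name, shift in shifts.items():
    print(f"{name}: {shift:+.2f}")
```

By this measure public sentiment moved roughly twice as far as scientific sentiment, though both remain well below zero.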

Wednesday, May 13, 2009

Welcome to the TextMap Blog!

Hello World! This is the first posting of a blog on developments revolving around the Lydia / TextMap news and blog analysis project at Stony Brook University. These will include:
  • Descriptions of newly available functionality on the TextMap website.
  • Interesting little discoveries on how the world works, derived from TextMap Access data.
  • Reports on social science research based on TextMap analysis.
  • Publication announcements of Lydia-oriented research out of our lab.
  • Developments at General Sentiment LLC, a startup company based on Lydia technology.
The time is right to start this blog, because a lot is now happening in the Lydia/TextMap world. Several interesting new analysis depositories (including a longer and more comprehensive newspaper corpus, PubMed abstracts, patents, and Supreme Court decisions) have just come on line as our infrastructure matures. Our TextMap Access interface now provides instant access to this vast amount of data and analysis. I am now spending (wasting?) substantial amounts of time playing with our data, so this blog is the perfect place to relate my discoveries.

Substantial collaborations relying on our analysis have already begun with political scientists, sociologists, and historians, but this is hopefully just the start of several beautiful friendships. Thanks for coming on board. I look forward to reading (and making) news together.