Monday, October 26, 2009
Alumni Reunion
The Lydia/TextMap system was built in collaboration with my graduate students. A lot of graduate students. Indeed over thirty of them to date, all properly recognized on the team webpage. I've grown quite close to them over the years, and we try to keep in touch through our annual Lydia Alumni Banquet in Manhattan.
The 2009 banquet was this past weekend, and attracted a swarm of 18 loyal Lydia-oids. I am proud to see that all are doing very well indeed, with careers progressing nicely despite the recession. Most are somehow connected to the finance industry, including a growing number in hedge funds, but several others work at technology companies such as Google and Microsoft. A number are starting families, with new engagements (Levon, Andrew, and Lohit) on top of recent weddings (Prachi, Namtrata, and Jai). I hereby move that the first alumni child be named ``Lydia'' (or, if the parents prefer, TextMap :-) ).
I also include a December 2008 photo of myself with three of the Lydia alums who could not attend this year's banquet, and look forward to seeing everyone at next year's banquet.
Thursday, July 2, 2009
Ethnicity detection and the origin of Skiena
Although trends apparent in single-entity time series are revealing, more subtle analysis is possible by aggregating the signals of all the entities in a given group (say women, businessmen, Africans, etc.). But we first need to identify which entities are members of the group we are interested in.
This motivates our paper ``Name-Ethnicity Classification from Open Sources'', just presented at the 15th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining in Paris, France. We developed a statistical classifier/HMM to map person names to likely ethnicities. Given `Hu Jintao', we want to return `Chinese'. Given `Dmitry Medvedev', we want to return `Russian' or at least `Eastern European'. Given `Abdullah bin Abdul Aziz', we want to return `Muslim'.
Our classifier is not perfect, but it gives us a tool to answer questions like ``How did news sentiment towards Muslims change in the wake of 9/11?'' or ``How do attitudes towards Hispanics vary across the U.S.?''. Our results are quite interesting, and we believe that entity-based ethnicity and nationality classification has many applications in social science research.
My coauthors were Anurag Ambekar, Charles Ward, Jahangir Mohammed, and Swapna Male.
I encourage you to play with our ethnic name classifier at http://www.textmap.com/ethnicity to see how it works.
This week I was thrilled to see the official 1920 and 1930 census pages for my grandparents. The original Skiena (my grandfather Sol) arrived in the U.S. in 1911, but with no clear English spelling for his name. Indeed, no fewer than four distinct spellings are relevant for interpreting these census pages. I list them with the primary and secondary ethnicities identified by our classifier to give you some idea of our purported roots:
- Sheaner - Jewish (0.76) and British (0.22)
- Skeaner - British (0.86) and Jewish (0.08)
- Sciaaner - Hispanic (0.48) and Nordic (0.24)
- Skiena - Eastern European (0.67) and British (0.21)
My grandfather is from Russia, so Eastern European is indeed the correct answer. For the record, this is a very hard case for the classifier; usually we are also given first names and deal with more common surnames.
Labels:
ethnic names,
ethnicity detection,
KDD,
Skiena
Thursday, June 18, 2009
Ahmadinejad Goes Down!
Like much of the world, I have been following the presidential election in Iran and its aftermath with great excitement. The election was crudely stolen by the incumbent Ahmadinejad after a surprisingly open campaign, but the people of Iran have bravely taken to the streets in support of Mousavi -- the real winner. It is too early to tell who will prevail in this bare-knuckle power struggle, but you can probably guess who I am rooting for.
The sentiment polarity graph tells the interesting story. Ebbs and flows of the campaign are reflected before the vote, particularly Ahmadinejad's widely-panned debate performance on June 4 and the increasing sense that Mousavi could win. The election on June 12 drew enormous turnout, followed too quickly by the announcement of a landslide Ahmadinejad victory. But within 24 hours, Mousavi's claim of fraud gained credence, and Ahmadinejad's sentiment (at least) went down.
Thursday, June 11, 2009
SBIR Award for General Sentiment
General Sentiment, the startup company which licensed Lydia technology from Stony Brook, has just received a $100,000 Small Business Innovation Research (SBIR) Phase I grant from the National Science Foundation (NSF) entitled ``Identifying and Interpreting Trends through News/Blog Analysis''.
Special thanks go to Barack Obama, as this award was funded under the American Recovery and Reinvestment Act of 2009 (ARRA).
Lydia at the Hadoop Summit!
My student Mikhail Bautin just presented his work on the Lydia processing architecture to over 700 people at the 2009 Hadoop Summit in Santa Clara, CA. He found it to be a great conference (better, he says, than the more academic venues I've sent him to before). There is enormous energy in the Hadoop world today as it becomes the primary system for web-type parallel processing and cloud computing in general.
Hadoop is a distributed processing system inspired by Google's MapReduce paradigm. Computations proceed in rounds of mapping (sending data tuples to particular machines based on identification keys) and reducing (crunching these tuples down to a particular result). Such problems arise frequently in Lydia. For example, we can imagine mapping all the sentences in our news corpus keyed to the names of the entities within them, so we can then use reduce to count the number of occurrences of each entity and the other entities it is juxtaposed with. Hadoop manages all the messy stuff of parallel processing, such as load balancing and distributed data structures.
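To make the map and reduce rounds concrete, here is a minimal single-machine sketch in Python of the entity-counting example. It is not Lydia's actual Hadoop code: the function names (`map_sentence`, `shuffle`, `reduce_counts`, `run`) and the toy input are hypothetical, and the sketch assumes entity recognition has already turned each sentence into a list of entity names.

```python
from collections import defaultdict
from itertools import combinations

def map_sentence(entities):
    """Map step: emit (key, 1) for each entity and each co-occurring pair.

    Each entity is counted at most once per sentence.
    """
    unique = sorted(set(entities))
    for entity in unique:
        yield ("count", entity), 1
    for a, b in combinations(unique, 2):
        # Pairs are emitted in sorted order so (a, b) and (b, a) share a key.
        yield ("pair", a, b), 1

def shuffle(pairs):
    """Group emitted values by key (Hadoop does this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce step: collapse each key's list of values to a single total."""
    return {key: sum(values) for key, values in groups.items()}

def run(sentences):
    emitted = (pair for s in sentences for pair in map_sentence(s))
    return reduce_counts(shuffle(emitted))

# Toy input (made up): each sentence reduced to the entities it mentions.
sentences = [
    ["Gordon Brown", "Tony Blair"],
    ["Gordon Brown", "Barack Obama"],
    ["Gordon Brown", "Tony Blair", "Barack Obama"],
]
totals = run(sentences)
```

Here `totals[("count", "Gordon Brown")]` holds the number of sentences mentioning that entity, and the `("pair", ...)` keys hold the juxtaposition counts. In a real Hadoop job the shuffle is performed by the framework, which is exactly the messy part it takes off our hands.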
It is hard to overstate the difference that Hadoop has made to the Lydia project, with efforts that are now rapidly bearing fruit. Expect to hear me report soon on results from enormous blog depositories that we have spidered for years yet have never previously been able to analyze. Further, we now regularly do large-scale analysis of our analysis using Hadoop, for example in studying trends over all entities across nationalities or ethnic groups.
It is equally hard to overstate the efforts Mikhail has made getting us there with our system. I can ask nothing more of my other students except that they try to "be like Mike".
Thursday, May 21, 2009
The World's Worst Person?
A certain fascination exists with identifying the public figure with the lowest overall sentiment ranking. It tells us something about a given society to discover who the most demonized figure is, the person spoken about with the greatest anger or rancor.
Although our system does not make it easy to extract people by negative sentiment, any active news reader should be able to construct a rogues' gallery of evil and destructive people -- like mass murderers Adolf Hitler and Osama bin Laden, or the swindler Bernard Madoff. But as shown by these sentiment polarity charts, all their reported vileness pales next to that of Lori Drew.
Who is Lori Drew? She is a mother whose cyber-bullying drove a fragile 13-year-old girl (a rival of her daughter) to suicide, through messages sent from a fake MySpace profile. The public outrage over this pushes many buttons -- from the general hatred of bullying defenseless children to public attitudes against overzealous parenting, fear of social network sites, and more. As of this writing the judge is trying to appropriately sentence her, a task complicated by the fact that she has not been convicted of anything stronger than violating the MySpace terms of service.
Now don't get me wrong. She is clearly guilty of such horrifying behavior that it is difficult to see how she can live with herself. But it is obviously an overreaction to mention her in the company of Hitler and Bin Laden.
Realize that sentiment analysis aims at capturing what the world is thinking, not what it necessarily should be thinking or that which is objectively true. Lydia sentiment signals measure interesting social and cultural phenomena, but their proper interpretation requires an understanding of context and the nature of the underlying news sources.
Edison Chen and the Computer News Processing
One of the pleasures of my sabbatical year in Hong Kong has been reading the local English newspaper (The South China Morning Post) and getting exposed to a new universe of locally-interesting characters. Edison Chen and the Computer Technician has been my favorite story of the year. Lydia news analysis provides very interesting insights into the story and by proxy the culture here in Hong Kong.
Edison Chen, son of a local tycoon, became a Cantopop (Cantonese pop music) singer and general entertainment/media personality. Think a male Paris Hilton, with a similar set of unseemly incidents involving various fights with people and, in one case, a taxi. This explains his generally negative sentiment scores (shown above) up to January 2008, when he took his computer in for repairs and hit the big time.
The computer technician (Ho Chun Sze) found a nice collection of sex photos of Mr. Chen with several female Cantopop stars, actresses, and models (Gillian Chung, Cecilia Cheung, Bobo Chan). The technician showed them to his girlfriend, who showed them to somebody else, and then they ended up on the Internet. All of these figures show up prominently as statistically juxtaposed with Edison Chen. In the wake of this scandal, Edison Chen's sentiment score suddenly turns positive, resulting from respect for his healthy social life, approval of his apologetic behavior (including retiring from the Hong Kong scene to live quietly in Vancouver), and sympathy for the fact that he was ultimately blameless for the release of the photos.
Perhaps most interesting of all are the international heatmaps displaying the spatial reference frequency (left) and sentiment (right). The frequency map shows the most intense interest in China, with secondary interest in countries with significant Cantonese communities (Canada and Australia). Chen's Chinese name was the number 1 search term in China in 2008. The sentiment map shows a negative reputation in all countries except China!
Indeed, Chen finished second to Barack Obama in the Hong Kong Person of 2008 poll by RTHK radio, with almost 30% of the vote.
Monday, May 18, 2009
Sentiment: United States vs. China
Today's graphs give me a very queasy feeling. I decided to compare the sentiment in the United States vs. China, and let's say the results are not very good for the home team.
In particular, United States sentiment has been highly negative since the beginnings of the dailies depository in November 2004, a rating I would like to attribute at least initially to the war in Iraq and the Bush administration in general. Indeed, the months marking Obama's election (November 2008) and his inauguration (January 2009) represent peaks in sentiment despite the economic crisis. But what floored me was the sharp negative spike in April 2009. I attribute this to kvetching about the long-term strength of the dollar, but regardless U.S. sentiment polarity hit a new dailies low during this month.
By comparison, check out the sentiment graph for China over the same period. It has been generally positive since November 2004, with the exception of the Sichuan Earthquake in Spring 2008. The big positive spike in August 2008 results from the enormously successful Beijing Olympics, which also gave a nice boost to the U.S. that month. Negative sentiment from the world economic crisis rules the next several months, but the Chinese funk lifted in April as the U.S. continued to descend.
Now these generally negative U.S. and positive China sentiments reflect the longer-term time series from the thirty-year archival depository. Negative news always gets more play than positive news in U.S. newspapers, so it is the changes in sentiment which are more revealing than the absolute sign. The biggest plunge in U.S. sentiment in this period occurred (appropriately) in September 2001.
Friday, May 15, 2009
British News Processing
I gave a demo today to Sean Carey, a professor of political science at the University of Sheffield. He is interested in using our news analysis in cahoots with polling data from the British Election Study, to better understand how voters make their decisions, and why. They are gearing up for the next national election, likely to occur in Spring 2010.
Since their study revolves around why British voters vote the way they do, they are only concerned with data from British newspapers. The source set tab of TextMap Access makes it easy to create a source set from the dailies depository consisting of all newspapers from the United Kingdom. Once this source set is named and registered, it will appear as a new depository entry, ready for use in the frequency and sentiment tabs.
One interesting discovery in playing with it was the fraction of references to a local entity like `Gordon Brown' that came from British sources. The answer proved to be about 60%, which is quite impressive considering that less than 10% of our total spidered sources are from the United Kingdom. But it makes sense that he would be what the local readership is interested in...
Particularly amusing was to look at the entities juxtaposed with Gordon Brown at different time scales. Tony Blair, whom he served faithfully as Chancellor of the Exchequer, proves his strongest association over the full dailies depository (left column). The past year (center column) more strongly reflects his activities as Prime Minister, including interactions with world leaders (Obama, Sarkozy, Merkel). The strongest associations over the past month (right column) reflect recent activities. We were puzzled a bit by the strong association with Carol Ann Duffy, but a little reading revealed that Brown had just appointed her as the first female Poet Laureate.
One minor complication of British news processing is that the spelling and word usage are slightly different from what is used in the United States. The lexical resources we employ include both British and American spellings, and I expect that our NLP performance will be quite similar on British texts.
Labels:
Gordon Brown,
political science,
United Kingdom
Thursday, May 14, 2009
Trends in Medline/Pubmed
Medline/Pubmed is a database of the abstracts from over 17 million published journal articles in the biomedical sciences. Processing such a corpus is short work for the current version of the Lydia system, taking only a day or so on our 28-node cluster computer. Our Medline depository provides a good example of how our news/blog processing system can be easily applied to any large-scale text corpus, with interesting results. Here are two interesting discoveries from my explorations this afternoon.
One of the most frequent entities in this depository is cancer, and one of the most frequent cancers is breast cancer. The figure above shows the pubmed frequency graph and rug plots for this disease. The log-scale frequency graph shows the relentless exponential growth in research on breast cancer since the mid-1970's. The regular, small-scale bumps over the years reflect the periodicities with which journals are issued, such as quarterly or semiannually.
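As a brief aside on why the log scale suits this graph: if the article frequency grows roughly exponentially, it plots as a straight line on a log scale. The symbols below ($f_0$ for the frequency at a reference year $t_0$, and $r$ for the growth rate) are illustrative assumptions, not quantities fitted to our data.

```latex
f(t) = f_0 \, e^{r (t - t_0)}
\quad\Longrightarrow\quad
\log f(t) = \log f_0 + r \, (t - t_0)
```

The slope of that line is the growth rate $r$, and the corresponding doubling time is $\ln 2 / r$.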
The rug plot (shown below the time series) shows the distribution of articles identified as news, business, sports, entertainment, or other by our statistical classification methods. Now these classifiers were tuned for news articles, and we do not expect too many sports/entertainment articles appearing in scientific journals (at least the ones I read). But still the results are quite interesting. There is a clear transition in the distribution starting in 1975, when the systematic collection of full text abstracts began.
My other experiment involved a sentiment plot for AIDS since the beginnings of the epidemic around 1982. Both the pubmed and archival depositories show the sentiment polarity of AIDS gradually but definitely drifting towards greater neutrality. The scientific sentiment represented by pubmed has improved from -0.72 in March 1983 to -0.58 in December 2009. The public sentiment about AIDS has risen from -0.78 to -0.48 over the same period. Now AIDS remains a horrible, incurable disease, but it has become regarded more as a chronic condition which can be treated than a deadly plague -- and our sentiment metrics are sensitive enough to pick up on this trend.
Wednesday, May 13, 2009
Welcome to the TextMap Blog!
Hello World! This is the first posting of a blog on developments revolving around the Lydia / TextMap news and blog analysis project at Stony Brook University. These will include:
- Descriptions of newly available functionality on the TextMap website
- Interesting little discoveries on how the world works, derived from TextMap Access data
- Reports on social science research based on TextMap analysis
- Publication announcements of Lydia-oriented research out of our lab
- Developments at General Sentiment LLC, a startup company based on Lydia technology
The time is right to start this blog, because a lot is now happening in the Lydia/TextMap world. Several interesting new analysis depositories (including a longer and more comprehensive newspaper corpus, PubMed abstracts, patents, and Supreme Court decisions) have just come on line as our infrastructure matures. Our TextMap Access interface now provides instant access to this vast amount of data and analysis. I am now spending (wasting?) substantial amounts of time playing with our data, so this blog is the perfect place to relate my discoveries.
Substantial collaborations relying on our analysis have already begun with political scientists, sociologists, and historians, but this is hopefully just the start of several beautiful friendships. Thanks for coming on board. I look forward to reading (and making) news together.