Monday, October 26, 2009
Alumni Reunion
The Lydia/TextMap system was built in collaboration with my graduate students. A lot of graduate students. Indeed over thirty of them to date, all properly recognized on the team webpage. I've grown quite close to them over the years, and we try to keep in touch through our annual Lydia Alumni Banquet in Manhattan.
The 2009 banquet was this past weekend, and attracted a swarm of 18 loyal Lydia-oids. I am proud to see that all are doing very well indeed, with careers progressing nicely despite the recession. Most are somehow connected to the finance industry, including a growing number in hedge funds, but several others work in technology companies such as Google and Microsoft. A number are starting families, with three engagements (Levon, Andrew, and Lohit) on top of recent weddings (Prachi, Namtrata, and Jai). I hereby move that the first alumni child be named ``Lydia'' (or, if the parents prefer, TextMap :-) ).
I also include a December 2008 photo of myself with three of the Lydia alums who could not attend this year's banquet, and look forward to seeing everyone at next year's banquet.
Thursday, July 2, 2009
Ethnicity detection and the origin of Skiena
Although trends apparent in single-entity time series are revealing, more subtle analysis is possible by aggregating the signals of all the entities in a given group (say women, businessmen, Africans, etc.). But we first need to identify which entities are members of the group we are interested in.
This motivates our paper ``Name-Ethnicity Classification from Open Sources'', just presented at the 15th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining in Paris, France. We developed a statistical classifier/HMM to map person names to likely ethnicities. Given `Hu Jintao', we want to return `Chinese'. Given `Dmitry Medvedev', we want to return `Russian' or at least `Eastern European'. Given `Abdullah bin Abdul Aziz', we want to return `Muslim'.
Our classifier is not perfect, but it gives us a tool to answer questions like ``How did news sentiment towards Muslims change in the wake of 9/11?'' or ``How do attitudes towards Hispanics vary across the U.S.?''. Our results are quite interesting, and we believe that entity-based ethnicity and nationality classification has many applications in social science research.
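To make the aggregation idea concrete, here is a minimal Python sketch, not the Lydia code. It assumes we already have a group label for each entity (for example from a name classifier) and a stream of per-entity daily sentiment scores; the function name `group_sentiment` and the tuple layout are hypothetical choices made for illustration.

```python
from collections import defaultdict


def group_sentiment(entity_group, observations):
    """Average daily sentiment over all entities assigned to each group.

    entity_group: dict mapping entity name -> group label (assumed input,
                  e.g. the output of a name classifier)
    observations: iterable of (entity, day, sentiment_score) tuples
    Returns a dict mapping (group, day) -> mean sentiment score.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (group, day) -> [sum, count]
    for entity, day, score in observations:
        group = entity_group.get(entity)
        if group is None:
            continue  # skip entities that have no group label
        bucket = totals[(group, day)]
        bucket[0] += score
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

A plain mean treats every entity equally; weighting each entity by how often it is mentioned would be an equally reasonable choice.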
My coauthors were Anurag Ambekar, Charles Ward, Jahangir Mohammed, and Swapna Male.
I encourage you to play with our ethnic name classifier at http://www.textmap.com/ethnicity to see how it works.
This week I was thrilled to see the official 1920 and 1930 census pages for my grandparents. The original Skiena (my grandfather Sol) arrived in the U.S. in 1911, but with no clear English spelling for his name. Indeed, no fewer than four distinct spellings are relevant for interpreting these census pages. I list them with the primary and secondary ethnicities identified by our classifier to give you some idea of our purported roots:
- Sheaner - Jewish (0.76) and British (0.22)
- Skeaner - British (0.86) and Jewish (0.08)
- Sciaaner - Hispanic (0.48) and Nordic (0.24)
- Skiena - Eastern European (0.67) and British (0.21)
My grandfather is from Russia, so Eastern European is indeed the correct answer. For the record, this is a very hard case for the classifier; usually we are also given first names and deal with more common surnames.
Labels: ethnic names, ethnicity detection, KDD, Skiena
Thursday, June 18, 2009
Ahmadinejad Goes Down!
Like much of the world, I have been following the presidential election in Iran and its aftermath with great excitement. The election was crudely stolen by the incumbent Ahmadinejad after a surprisingly open campaign, but the people of Iran have bravely taken to the streets in support of Mousavi -- the real winner. It is too early to tell who will prevail in this bare-knuckle power struggle, but you can probably guess who I am rooting for.
The sentiment polarity graph tells an interesting story. The ebbs and flows of the campaign are reflected before the vote, particularly Ahmadinejad's widely panned debate performance on June 4 and the increasing sense that Mousavi could win. The election on June 12 drew enormous turnout, followed too quickly by the announcement of a landslide Ahmadinejad victory. But within 24 hours, Mousavi's claim of fraud gains credence, and Ahmadinejad's sentiment (at least) goes down.
Thursday, June 11, 2009
SBIR Award for General Sentiment
General Sentiment, the startup company which licensed Lydia technology from Stony Brook, has just received a $100,000 Small Business Innovation Research (SBIR) phase I grant from the National Science Foundation (NSF) entitled ``Identifying and Interpreting Trends through News/Blog Analysis''.
Special thanks go to Barack Obama, as this award was funded under the American Recovery and Reinvestment Act of 2009 (ARRA).
Lydia at the Hadoop Summit!
My student Mikhail Bautin just presented his work on the Lydia processing architecture to over 700 people at the 2009 Hadoop Summit in Santa Clara, CA. He found it to be a great conference (better, he says, than the more academic venues I've sent him to before). There is enormous energy in the Hadoop world today as it becomes the primary system for web-type parallel processing and cloud computing in general.
Hadoop is a distributed processing system inspired by Google's MapReduce paradigm. Computations proceed in rounds of mapping (sending data packets to particular machines based on identification keys) and reducing (crunching these tuples down to a particular result). Such problems arise frequently in Lydia. For example, we can imagine mapping all the sentences in our news corpus keyed to the names of the entities within them, so we can then use reduce to count the number of occurrences of each entity and the other entities it is juxtaposed with. Hadoop manages all the messy stuff of parallel processing, like load balancing and distributed data structures.
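The entity-counting example can be sketched in a few lines of plain Python. This is a single-machine illustration of the map, shuffle, and reduce steps, not Hadoop itself and not the actual Lydia pipeline. It assumes entity extraction has already happened, so each sentence arrives as a list of entity names, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def map_sentence(entities):
    """Map step: emit (key, value) pairs for one sentence's entities."""
    unique = sorted(set(entities))  # count each entity once per sentence
    for entity in unique:
        yield entity, None  # None marks one occurrence of the entity
    for a, b in permutations(unique, 2):
        yield a, b  # b appears in the same sentence as a


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_entity(values):
    """Reduce step: count occurrences and juxtapositions for one entity."""
    occurrences = sum(1 for v in values if v is None)
    juxtaposed = Counter(v for v in values if v is not None)
    return occurrences, juxtaposed


def run(sentences):
    """Run map, shuffle, and reduce over a list of per-sentence entity lists."""
    pairs = (pair for entities in sentences for pair in map_sentence(entities))
    return {entity: reduce_entity(values)
            for entity, values in shuffle(pairs).items()}
```

Calling `run([["Ahmadinejad", "Mousavi"], ["Mousavi"]])` on this made-up input returns, for each entity, its sentence count and a `Counter` of the entities it shares sentences with. In a real Hadoop job the shuffle is done by the framework across machines, which is what lets the reduce step for each entity run independently.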
It is hard to overstate the contribution that Hadoop has made to the Lydia project, and our efforts are now rapidly bearing fruit. Expect to hear me report soon on results from enormous blog repositories we have spidered for years yet have never previously been able to analyze. Further, we now regularly do large-scale analysis of our analysis using Hadoop, for example in studying trends over all entities, across nationalities or ethnic groups.
It is equally hard to overstate the efforts Mikhail has made getting us there with our system. I can ask nothing more of my other students except that they try to "be like Mike".
Thursday, May 21, 2009
The World's Worst Person?
A certain fascination exists with identifying the public figure with the lowest overall sentiment ranking. It tells us something about a given society to discover who the most demonized figure is, the person spoken about with the greatest anger or rancor.
Although our system does not make it easy to extract people by negative sentiment, any active news reader should be able to construct a rogues' gallery of evil and destructive people -- like mass murderers Adolf Hitler and Osama bin Laden, or the swindler Bernard Madoff. But as shown by these sentiment polarity charts, all their reported vileness pales next to Lori Drew.
Who is Lori Drew? She is a mother whose cyber-bullying drove a fragile 13-year-old girl (a rival of her daughter) to suicide, through messages sent from a fake MySpace profile. The public outrage over this pushes many buttons -- the general hatred of bullying defenseless children, public attitudes against overzealous parenting, fear of social network sites, and more. As of this writing the judge is trying to appropriately sentence her, a task complicated by the fact that she has not been convicted of anything stronger than violating the MySpace terms of service.
Now don't get me wrong. She is clearly guilty of such horrifying behavior that it is difficult to see how she can live with herself. But it is obviously an overreaction to mention her in the company of Hitler and Bin Laden.
Realize that sentiment analysis aims at capturing what the world is thinking, not what it necessarily should be thinking or that which is objectively true. Lydia sentiment signals measure interesting social and cultural phenomena, but their proper interpretation requires an understanding of context and the nature of the underlying news sources.
Edison Chen and the Computer News Processing
One of the pleasures of my sabbatical year in Hong Kong has been reading the local English newspaper (The South China Morning Post) and getting exposed to a new universe of locally-interesting characters. Edison Chen and the Computer Technician has been my favorite story of the year. Lydia news analysis provides very interesting insights into the story and by proxy the culture here in Hong Kong.
Edison Chen, son of a local tycoon, became a Cantopop (Cantonese pop music) singer and general entertainment/media personality. Think a male Paris Hilton, with a similar set of unseemly incidents involving various fights with people and, in one case, a taxi. This explains his generally negative sentiment scores (shown above) up to January 2008, when he took his computer in for repairs and hit the big time.
The computer technician (Ho Chun Sze) found a nice collection of sex photos of Mr. Chen with several female Cantopop stars, actresses, and models (Gillian Chung, Cecilia Cheung, Bobo Chan). The technician showed them to his girlfriend, who showed them to somebody else, and then they ended up on the Internet. All of these figures show up prominently as statistically juxtaposed with Edison Chen. In the wake of this scandal, Edison Chen's sentiment score suddenly turns positive, resulting from respect for his healthy social life, approval of his apologetic behavior (including retiring from the Hong Kong scene to live quietly in Vancouver), and sympathy for the fact that he was ultimately blameless for the release of the photos.
Perhaps most interesting of all are the international heatmaps displaying the spatial reference frequency (left) and sentiment (right). The frequency map shows the most intense interest in China, with secondary interest in countries with significant Cantonese communities (Canada and Australia). Chen's Chinese name was the number 1 search term in China in 2008. The sentiment map shows a negative reputation in all countries except China!
Indeed, Chen finished second to Barack Obama in the Hong Kong Person of 2008 poll by RTHK radio, with almost 30% of the vote.