Thursday, July 2, 2009

Ethnicity detection and the origin of Skiena

Although trends apparent in single-entity time series are revealing, more subtle analysis is possible by aggregating the signals of all the entities in a given group (say women, businessmen, Africans, etc.). But we first need to identify which entities are members of the group we are interested in.

This motivates our paper ``Name-Ethnicity Classification from Open Sources'', just presented at the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining in Paris, France. We developed a statistical classifier/HMM to map person names to likely ethnicities. Given `Hu Jintao', we want to return `Chinese'. Given `Dmitry Medvedev', we want to return `Russian' or at least `Eastern European'. Given `Abdullah bin Abdul Aziz', we want to return `Muslim'.

Our classifier is not perfect, but it gives us a tool to answer questions like ``How did news sentiment towards Muslims change in the wake of 9/11?'' or ``How do attitudes towards Hispanics vary across the U.S.?''. Our results are quite interesting, and we believe that entity-based ethnicity and nationality classification has many applications in social science research.
My coauthors were Anurag Ambekar, Charles Ward, Jahangir Mohammed, and Swapna Male.
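
To make the group-level questions above a little more concrete, here is a minimal Python sketch of the aggregation step. It is only an illustration, not the pipeline from the paper: the function name `aggregate_by_group`, the entity names, the period labels, and all sentiment values are made up, and it assumes each entity has already been assigned a group label (for example, by a name classifier).

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_group(entity_series, entity_group):
    """Average per-entity sentiment into one time series per group.

    entity_series: {entity: {period: sentiment score}}
    entity_group:  {entity: group label}  (hypothetical classifier output)
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for entity, series in entity_series.items():
        group = entity_group.get(entity)
        if group is None:
            continue  # skip entities with no group label
        for period, score in series.items():
            buckets[group][period].append(score)
    # Collapse each bucket of scores to its mean, with periods in sorted order.
    return {
        group: {period: mean(scores) for period, scores in sorted(periods.items())}
        for group, periods in buckets.items()
    }


# Toy data, invented purely for illustration.
toy_series = {
    "entity_a": {"t1": 0.5, "t2": -0.5},
    "entity_b": {"t1": 0.0, "t2": -1.0},
    "entity_c": {"t1": 1.0},
}
toy_groups = {"entity_a": "G1", "entity_b": "G1", "entity_c": "G2"}

group_series = aggregate_by_group(toy_series, toy_groups)
```

A simple mean treats every entity equally; weighting entities by how often they are mentioned would be an equally plausible choice.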

I encourage you to play with our ethnic name classifier to see how it works.

This week I was thrilled to see the official 1920 and 1930 census pages for my grandparents. The original Skiena (my grandfather Sol) arrived in the U.S. in 1911, but with no clear English spelling for his name. Indeed, no fewer than four distinct spellings are relevant for interpreting these census pages. I list them with the primary and secondary ethnicities identified by our classifier to give you some idea of our purported roots:

  • Sheaner - Jewish (0.76) and British (0.22)
  • Skeaner - British (0.86) and Jewish (0.08)
  • Sciaaner - Hispanic (0.48) and Nordic (0.24)
  • Skiena - Eastern European (0.67) and British (0.21)
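
For readers wondering how a ``primary'' and ``secondary'' label fall out of a classifier's output, here is a small Python sketch. It assumes the classifier yields one score per ethnicity and that the scores sum to 1; that assumption, the helper name `top_labels`, and the lumped ``(all other classes)'' entry are mine for illustration, not details from the paper. Only the two Skiena scores come from the list above.

```python
def top_labels(scores, k=2):
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# The two reported scores for `Skiena'; the remainder is lumped together
# under a placeholder label on the assumption that scores sum to 1.
skiena_scores = {
    "British": 0.21,
    "(all other classes)": 0.12,
    "Eastern European": 0.67,
}

primary, secondary = top_labels(skiena_scores)
```

Here `primary` is the Eastern European entry and `secondary` the British one, matching the last bullet.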

My grandfather is from Russia, so Eastern European is indeed the correct answer. For the record, this is a very hard case for the classifier; usually we are also given first names and deal with more common surnames.