Thursday, June 11, 2009

Lydia at the Hadoop Summit!

My student Mikhail Bautin just presented his work on the Lydia processing architecture to over 700 people at the 2009 Hadoop Summit in Santa Clara, CA. He found it to be a great conference (better, he says, than the more academic venues I've sent him to before). There is enormous energy in the Hadoop world today as it becomes the primary system for web-type parallel processing and cloud computing in general.

Hadoop is a distributed processing system inspired by Google's MapReduce paradigm. Computations proceed in rounds of mapping (emitting key-value tuples that are routed to particular machines according to their identification keys) and reducing (crunching the tuples that share a key down to a particular result). Such problems arise frequently in Lydia. For example, we can imagine mapping all the sentences in our news corpus, keyed by the names of the entities within them, so we can then use reduce to count the number of occurrences of each entity and the other entities it is juxtaposed with. Hadoop manages all the messy stuff of parallel processing, like load balancing, distributed data structures and the like.
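
The entity-counting example can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map, group-by-key, and reduce steps, not Lydia's actual code and not the Hadoop API. The function names, the assumption that each sentence arrives as a list of already-recognized entity names, and the tiny corpus itself are all made up for illustration.

```python
from collections import Counter, defaultdict


def map_sentence(sentence_entities):
    """Map step: for each entity in one sentence, emit (entity, co-occurring entities)."""
    unique = sorted(set(sentence_entities))
    for entity in unique:
        others = [e for e in unique if e != entity]
        yield entity, others


def reduce_entity(values):
    """Reduce step: count sentences mentioning an entity and tally its neighbors."""
    occurrences = 0
    juxtaposed = Counter()
    for others in values:
        occurrences += 1
        juxtaposed.update(others)
    return occurrences, juxtaposed


def run_mapreduce(corpus):
    """Simulate one round on a single machine: map, group by key, reduce."""
    grouped = defaultdict(list)  # stands in for the routing of tuples by key
    for sentence_entities in corpus:
        for key, value in map_sentence(sentence_entities):
            grouped[key].append(value)
    return {key: reduce_entity(values) for key, values in grouped.items()}


# Hypothetical corpus: each inner list holds the entities found in one sentence.
corpus = [
    ["Alice", "Bob"],
    ["Alice", "Acme"],
    ["Acme"],
]
result = run_mapreduce(corpus)
for entity, (occurrences, juxtaposed) in sorted(result.items()):
    print(entity, occurrences, dict(juxtaposed))
```

In a real Hadoop job the grouping dictionary disappears: the framework itself delivers all tuples with the same key to the same reducer, which is exactly the messy part it takes off our hands.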

It is hard to overstate the difference that Hadoop has made to the Lydia project, and our efforts are now rapidly bearing fruit. Expect to hear me report soon on results from enormous blog repositories that we have spidered for years yet have never previously been able to analyze. Further, we now regularly do large-scale analysis of our analysis using Hadoop, for example in studying trends over all entities across nationalities or ethnic groups.

It is equally hard to overstate the efforts Mikhail has made getting us there with our system. I can ask nothing more of my other students except that they try to "be like Mike".
