Thursday, May 14, 2009
Trends in Medline/Pubmed
Medline/Pubmed is a database of the abstracts from over 17 million published journal articles in the biomedical sciences. Processing such a corpus is short work for the current version of the Lydia system, taking only about a day on our 28-node cluster computer. Our Medline depository provides a good example of how our news/blog processing system can be applied to any large-scale text corpus. Here are two interesting discoveries from my explorations this afternoon.
One of the most frequent entities in this depository is cancer, and one of the most frequent cancers is breast cancer. The figure above shows the Pubmed frequency graph and rug plot for this disease. The log-scale frequency graph shows the relentless exponential growth in research on breast cancer since the mid-1970s. The regular, small-scale bumps over the years reflect the periodicities with which journals are issued, such as quarterly or semiannually.
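Exponential growth shows up as a straight line on a log-scale plot, so the slope of that line gives the growth rate and hence a doubling time. Here is a minimal sketch of that calculation in Python. The function name and the yearly counts are made up for illustration (a synthetic series that doubles every five years); they are not the actual Pubmed numbers, and this is not how Lydia computes its graphs.

```python
import math

def fit_exponential_growth(years, counts):
    """Least-squares fit of log(count) = a + b * year.

    Returns (b, doubling_time), where b is the growth rate per year
    and doubling_time = ln(2) / b.
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    rate = sxy / sxx
    return rate, math.log(2) / rate

# Synthetic illustration only: counts that double every 5 years from 1975.
years = list(range(1975, 2010))
counts = [100 * 2 ** ((y - 1975) / 5) for y in years]

rate, doubling_time = fit_exponential_growth(years, counts)
print(f"growth rate per year: {rate:.4f}")
print(f"doubling time: {doubling_time:.2f} years")
```

With real article counts the points would scatter around the line, and the periodic bumps from journal issue schedules would appear as residuals from the fit.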
The rug plot (shown below the time series) shows the distribution of articles identified as news, business, sports, entertainment, or other by our statistical classification methods. Now these classifiers were tuned for news articles, and we do not expect many sports/entertainment articles to appear in scientific journals (at least the ones I read). But still the results are quite interesting. There is a clear transition in the distribution starting in 1975, when the systematic collection of full-text abstracts began.
My other experiment involved a sentiment plot for AIDS since the beginning of the epidemic around 1982. Both the Pubmed and archival depositories show the sentiment polarity of AIDS gradually but definitely drifting towards greater neutrality. The scientific sentiment represented by Pubmed has improved from -0.72 in March 1983 to -0.58 in December 2009. The public sentiment about AIDS has risen from -0.78 to -0.48 over the same period. Now AIDS remains a horrible, incurable disease, but it has become regarded more as a chronic condition which can be treated than as a deadly plague, and our sentiment metrics are sensitive enough to pick up on this trend.
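To put the two drifts side by side, here is a small sketch of the arithmetic using only the four polarity values quoted above. The helper name is hypothetical, and I am assuming that polarity is a signed score where 0 is neutral, so a positive shift means a move towards neutrality from the negative side.

```python
def polarity_shift(start, end):
    """Change in sentiment polarity between two time points."""
    return end - start

# Values quoted in the text above.
pubmed_shift = polarity_shift(-0.72, -0.58)   # scientific sentiment
public_shift = polarity_shift(-0.78, -0.48)   # public sentiment

print(f"Pubmed shift: {pubmed_shift:+.2f}")
print(f"public shift: {public_shift:+.2f}")
print(f"ratio (public / Pubmed): {public_shift / pubmed_shift:.1f}")
```

By these figures, public sentiment moved roughly twice as far as scientific sentiment over the period, although both remain clearly negative.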