ICML Trends
I recently attended the 24th Annual International Conference on Machine Learning held at Oregon State University in Corvallis, Oregon. The last time I went was in 2002 when it was held at my alma mater here in Sydney. It turns out that five years is a long time in machine learning, so I decided to do a little data-mining on paper titles from the conference over the last 20 years to see if I could spot any trends.
At the conference dinner the program chair, Zoubin Ghahramani, presented the usual statistics for ICML 2007: the number of papers submitted, the number accepted, the number of reviewers, etc. As a bonus, he also did an analysis showing the most probable keyword given that a paper was accepted and the most probable given it was rejected.
(Update: the presentation is now available from Zoubin’s site as PDF slides. Thanks to Ricardo Silva for pointing this out over at hunch.net).
Collection
I can’t remember the exact keywords, but I thought the idea was a fun one, so the next morning I hacked together a quick Ruby script to grab the accepted paper titles for ICML from 1988 to 2006 from the list at DBLP. The titles for 2007 aren’t up there yet, so I also had my script scrape titles from the ICML 2007 site.
I then tokenised the titles and applied some stemming and stop-word removal to the 1782 paper titles using a great little stemming library and a list of stop-words I found via Wikipedia. The result is a table with 1640 rows, one per term, with each row containing the number of papers with that term in it for each of the 20 years from 1988 to 2007.
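To make the shape of that table concrete, here is a minimal sketch of how such a term-by-year count could be built. It is an illustration rather than the script used for this post: `titles_by_year` and its made-up titles are placeholders, and stemming and stop-word removal are left out to keep it short.

```ruby
# Hypothetical input: year => list of paper titles (made-up examples).
titles_by_year = {
  1988 => ["Learning Concepts from Examples"],
  2007 => ["Kernel Methods for Learning", "Learning with Kernel Machines"]
}

years = titles_by_year.keys.sort

# term => { year => number of papers whose title contains the term }
counts = Hash.new { |hash, term| hash[term] = Hash.new(0) }

titles_by_year.each do |year, titles|
  titles.each do |title|
    # uniq so a term repeated within one title is only counted once
    terms = title.downcase.split(/\W+/).reject(&:empty?).uniq
    terms.each { |term| counts[term][year] += 1 }
  end
end

# One row per term, one column per year
rows = counts.keys.sort.map do |term|
  [term] + years.map { |year| counts[term][year] }
end

rows.each { |row| puts row.join(",") }
```

Counting each term at most once per title is what makes a cell mean "number of papers with this term" rather than "number of occurrences".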
Results
The first thing we can do with this data is see how many papers were accepted into the conference each year. This gives some idea of either the vitality of the field over time or the choosiness of the program committee.

The following graph shows the rise of Bayesian and kernel methods over the last decade. Especially impressive is the tripling of papers mentioning “kernel” over the last six years considering it was unheard of before 1995.

This shift in focus can also be seen when looking at the trends for “theori”, “concept” and “model”. The first two terms are more commonly used to describe symbolic models, whereas the last is favoured by statisticians.

Also interesting is the shift in emphasis from knowledge to data. I guess this can be partially explained by the ever-increasing computing power we have available and how inexpensive it now is to store vast amounts of data. Peter Norvig discussed this trend in a seminar he gave at Berkeley late last year.

I’m sure there are plenty more interesting finds to make in the data. If you want to dive in yourself, you can grab the results as a 75kb comma-separated file here: term_counts.csv. A 32kb file containing all the raw paper titles can be downloaded here: titles.tar.gz.
Let me know if you find anything interesting.
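If you want a starting point, something like the following sketch would load the counts and print how a term changed between the first and last year. The column layout (a `term` column followed by one column per year) is an assumption about the file, and the numbers in `sample_csv` are placeholders for illustration, not real counts.

```ruby
require 'csv'

# Placeholder data standing in for term_counts.csv. Both the layout and
# the numbers are assumptions, not values taken from the real file.
sample_csv = <<~DATA
  term,2005,2006,2007
  kernel,4,6,9
  bayesian,3,5,5
DATA

table = CSV.parse(sample_csv, headers: true)

# term => { year => count }
trends = {}
table.each do |row|
  by_year = row.to_h
  term = by_year.delete("term")
  trends[term] = by_year.transform_values(&:to_i)
end

trends.each do |term, by_year|
  first_year, last_year = by_year.keys.minmax
  puts "#{term}: #{by_year[first_year]} (#{first_year}) -> " \
       "#{by_year[last_year]} (#{last_year})"
end
```

To run it against the real download, swapping `CSV.parse(sample_csv, headers: true)` for `CSV.read("term_counts.csv", headers: true)` should be enough, provided the file has a header row like the one assumed here.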
Scripting and Scraping
As mentioned earlier, all of the data collection and analysis performed here was done using Ruby scripts. The ease with which really good libraries such as WWW::Mechanize and stemmer can be installed and quickly put to use is a testament to Ruby and worth a final couple of remarks.
For example, have a look at the entire code for the cleaning, stop-word removal and stemming of paper titles:
@terms = title.split(/\W/).collect { |word| word.downcase.strip }
@terms.reject! { |word| word.empty? || STOP_WORDS.include?(word) }
@terms.map! { |word| word.stem }
@terms.uniq!
The only slightly Perl-ish nastiness there is the \W regular expression, which splits the title on non-word characters. The rest, in my opinion, is both terse and clear. Hooray for iterators and closures!
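If you want to try the same pipeline without installing anything, here is a self-contained sketch. The short `STOP_WORDS` list and the `crude_stem` helper are stand-ins I have made up for illustration: `crude_stem` only strips a few suffixes and is not the algorithm the stemmer library implements.

```ruby
require 'set'

# A tiny stand-in stop-word list; a real list would be much longer.
STOP_WORDS = Set.new(%w[a an and for in of on the to with])

# Crude stand-in for a real stemmer: strips a few common suffixes only.
# This is NOT the algorithm that a proper stemming library implements.
def crude_stem(word)
  word.sub(/(ing|es|s)\z/, '')
end

# Same steps as above: split, downcase, drop stop-words, stem, de-duplicate.
def terms_for(title)
  terms = title.split(/\W+/).map { |word| word.downcase }
  terms.reject! { |word| word.empty? || STOP_WORDS.include?(word) }
  terms.map! { |word| crude_stem(word) }
  terms.uniq
end

p terms_for("Learning the Kernels of Kernel Machines")
# prints ["learn", "kernel", "machin"]
```

Stems such as "machin" are not real words, which is why terms like "theori" show up in the results above.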
The process for getting all of the 1988-2006 paper titles from the DBLP site is as follows:
Navigate to the ICML DBLP site and, for each link labelled ‘Contents’, check whether the link’s target has a four-digit year in it. If so, create a new conference structure, follow the link, and on the resulting page take the text between the ‘:’ and the ‘.’ in each ‘li’ element as the title and add it to the conference.
Thanks to the WWW::Mechanize and hpricot libraries it’s almost longer to write out like that than it is to code:
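agent.get(ICML_DBLP).links.text('Contents').each do |link|
  # Skip 'Contents' links whose target has no four-digit year in it
  year = link.href[/\d{4}/]
  next unless year
  confs << Conference.new(year)
  page = agent.click link
  page.search('//li').each do |li|
    # The title is the text between the ':' and the next '.'
    # (double quotes so that "\n" is a real newline, not a backslash and an n)
    confs.last << $1 if li.inner_text.gsub("\n", '') =~ /:([^.]+)\./
  end
end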
Once again there’s a bit of ugliness with regular expressions but that’s more or less the price you pay when dealing with text. The rest of the code is there to navigate the DBLP web pages by simulating clicking on links and using XPath-like patterns to find the relevant parts of the resulting HTML documents.
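The two regular expressions can be tried on their own, without touching the network. In this sketch the `entry` text and the link target are made-up examples in the shape the scraper expects, not text copied from DBLP, and `extract_title` and `extract_year` are hypothetical helper names.

```ruby
# Made-up text in the "Authors: Title. pages" shape the scraper assumes.
entry = "A. Author, B. Writer:\nAn Example Paper Title. 1-8"

# Returns the text between the first ':' and the next '.', or nil.
def extract_title(text)
  match = text.gsub("\n", "").match(/:([^.]+)\./)
  match && match[1].strip
end

# Returns the first run of four digits in the link target, or nil.
def extract_year(href)
  href[/\d{4}/]
end

puts extract_title(entry)
puts extract_year("db/conf/icml/icml2006.html")
```

One limitation worth knowing about: because the pattern stops at the first full stop after the colon, a title that itself contains a '.' would be cut short.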
These sorts of tools, and the level of abstraction that languages like Ruby and Python afford, remind me why I enjoy programming so much. I can go from a question like “I wonder what sort of trends there are in ICML titles” to answers in less time than it took me to write this blog post.
eight comments
For instance, here’s the rise of “Kernel” in relation to “Bayesian” over the years http://ats.cs.ut.ee/u/kt/stuff/scholartr..
It doesn’t surprise me that someone had done this sort of thing already. web-full of data about machine learning + web-full of machine learning people = pretty graphs about machine learning. :)
Thanks for the links. I might try to make the ICML data a bit more interactive in the future.
To make it even easier, you can grab it here: ruby script