Wednesday, November 7, 2007

anti-hiatus

ORKUT | CRAIGSLIST | DAYLIFE

This has been a busy time of the year. With 'interview season' drawing to a close (my sincere thanks going out to HPTI, Microsoft, IBM, Google, and NetApp for their interest), I have finally found the time, motivation, and mental state(?) to update the research blog with the goings-on of the past weeks. So here we go...

Orkut - following the previous post, we ran into insurmountable privacy issues with the data collection. The project and all associated data have been piped to the bit bucket, so to speak.

Craigslist - in a similar vein to the Orkut scrap collection, Professor Ramakrishnan suggested applying some of our existing data mining code to the interestingly large and largely interesting postings on Craigslist, to track how users interact through time, as well as a million other things. In practice, this meant writing a spider to trawl all locations for all jobs for all time. Assuming there is a Phase B of the project, I am at Phase B. :)
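For the curious, here is a minimal sketch of the general shape such a spider might take. It is not the code actually used: the names `seed_urls`, `crawl`, and `fetch`, the "location/category" seed format, and the idea that each listing page yields posting IDs plus an optional next-page link are all assumptions made for illustration. The fetch step is passed in as a function, so nothing here touches a real site.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Set, Tuple

# Hypothetical page shape: (posting IDs on this page, next-page URL or None).
Page = Tuple[List[str], Optional[str]]


def seed_urls(locations: Iterable[str], category: str) -> List[str]:
    """Build one starting listing per location (the URL format is made up)."""
    return [f"{loc}/{category}" for loc in locations]


def crawl(seeds: Iterable[str], fetch: Callable[[str], Page]) -> List[str]:
    """Visit each listing once, follow next-page links, collect unique postings."""
    frontier = deque(seeds)            # listings still to visit, in FIFO order
    seen_urls: Set[str] = set()        # guards against loops and repeat visits
    seen_postings: Set[str] = set()    # guards against duplicate postings
    postings: List[str] = []

    while frontier:
        url = frontier.popleft()
        if url in seen_urls:
            continue
        seen_urls.add(url)

        ids, next_url = fetch(url)     # caller supplies the actual download/parse
        for posting_id in ids:
            if posting_id not in seen_postings:
                seen_postings.add(posting_id)
                postings.append(posting_id)

        if next_url is not None:
            frontier.append(next_url)

    return postings
```

Keeping `fetch` as a parameter means the traversal logic can be tried against an in-memory dictionary of fake pages before pointing it at anything real; rate limiting and error handling would live inside that function.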

Daylife - If I had a preference (don't CS majors convert all preferences to optimization problems?), I would have to say that this has been the most interesting. Some background - daylife.com is a startup that seeks to be a better news aggregator than other sites, such as news.google.com. They shared a small, 499,795,812-byte (roughly 500 MB) subset of their articles with Professor Ramakrishnan, from whom I received it with the goal of implementing storytelling for this type of data. The problem is more challenging in this case because it's a database of general-purpose news articles, which show a higher degree of ambiguity, spurious connections, etc. than the scientific texts to which the algorithm was, from my understanding, originally applied (very successfully).