Wednesday, November 7, 2007

anti-hiatus

ORKUT | CRAIGSLIST | DAYLIFE

This has been a busy time of year. With 'interview season' finally drawing to a close (my sincere thanks go out to HPTI, Microsoft, IBM, Google, and NetApp for their interest), I have found the time, motivation, and mental state(?) to update the research blog with the goings-on of the past few weeks. So here we go...

Orkut - following up on the previous post, we ran into insurmountable privacy issues with the data collection. The project and all associated data have been piped to the bit bucket, so to speak.

Craigslist - in a similar vein to the Orkut scrap collection, Professor Ramakrishnan suggested applying some of our existing data mining code to the interestingly large and largely interesting postings on Craigslist, to track how users interact over time, as well as a million other things. In practice, this meant writing a spider to trawl all locations for all jobs for all time. Assuming there is a Phase B of the project, I am at Phase B. :)
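
The gist of the spider, stripped of the paging, throttling, and storage details, is just "for every city, fetch the all-jobs index and then each posting it links to." A rough sketch of that idea follows; the city list, the '/jjj/' all-jobs path, and the link format are assumptions for illustration, not the real crawler:

    # illustrative sketch only: the real spider also pages through older
    # results and archives each posting; the city names and the '/jjj/'
    # (all jobs) index path are assumptions
    import re
    import time
    import urllib2
    import urlparse

    cities = ['newyork', 'sfbay', 'washingtondc']
    for city in cities:
        index = 'http://%s.craigslist.org/jjj/' % city
        html = urllib2.urlopen(index).read()
        # posting links end in a numeric id + '.html' (format assumed)
        for path in re.findall(r'href="([^"]+\d+\.html)"', html):
            posting = urllib2.urlopen(urlparse.urljoin(index, path)).read()
            # ... parse title, date, and body; hand off to the mining code ...
            time.sleep(1)  # be polite between requests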

Daylife - If I had a preference (don't CS majors convert all preferences to optimization problems?), I would have to say this has been the most interesting. Some background: daylife.com is a startup that seeks to be a better news aggregator than other sites, such as news.google.com. They shared a small, 499795812-byte (roughly 500 MB) subset of their articles with Professor Ramakrishnan, from whom I received it with the goal of implementing storytelling for this type of data. The problem is more challenging here because it is a database of general-purpose news articles, which show a higher degree of ambiguity, more spurious connections, etc. than the scientific texts to which the algorithm was (from my understanding) originally, and very successfully, applied.
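
For intuition, storytelling asks for a chain of articles leading from a start document to an end document in which every consecutive pair overlaps "enough." A toy version of that idea is below; the real algorithm, its similarity measure, and its thresholds are considerably more sophisticated, and Jaccard overlap on raw words is only a stand-in:

    # toy illustration of storytelling: breadth-first search for a chain of
    # articles connecting 'start' to 'end' in which consecutive articles
    # share enough vocabulary; the production algorithm is far more refined
    from collections import deque

    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / float(len(a | b))

    def story(articles, start, end, theta=0.2):
        """articles: dict id -> text. Returns a list of ids, or None."""
        frontier = deque([[start]])
        seen = set([start])
        while frontier:
            chain = frontier.popleft()
            if chain[-1] == end:
                return chain
            for cand in articles:
                if cand not in seen and \
                   jaccard(articles[chain[-1]], articles[cand]) >= theta:
                    seen.add(cand)
                    frontier.append(chain + [cand])
        return None

Breadth-first search keeps the returned chain as short as possible, which matches the "tell the story in as few hops as you can" flavor of the problem; the news setting mostly makes the "overlaps enough" test much harder to get right.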

Sunday, September 23, 2007

bash: /bin/rm: Argument list too long

Orkut was chosen as the social network for this research for several reasons: much more information tends to be public, we are not interested in the data that is most often private (identity, etc.), and Orkut's scrap feature (similar to the Facebook wall, but more heavily used) is precisely what will be analyzed.

The first step was to develop a suitably large scrap corpus. The web crawler is written in Python and uses the mechanize module (the Python one, not the Perl one) for login. Originally, everything was integrated into one script, but this didn't scale well and made concurrency difficult. The solution was to split the crawler into a user id listing section (which parses friends lists to a given depth) and a user info downloading section (which parses the friends list, scrapbook, and communities).
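
For reference, the login step with mechanize is only a few lines; something along these lines, where the login URL, form index, and field names are from memory and best treated as assumptions:

    # minimal mechanize login sketch; the URL, form index, and field names
    # are assumptions (Orkut uses the Google accounts login form)
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)      # orkut.com's robots.txt would block us
    br.open('https://www.orkut.com/GLogin.aspx')
    br.select_form(nr=0)             # assume the login form is the first form
    br['Email'] = 'someone@gmail.com'
    br['Passwd'] = 'secret'
    br.submit()
    # 'br' now carries the session cookies, so later br.open() calls can
    # fetch friends lists, scrapbooks, and community pages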

Another design mistake was how the parsed information was handled. In what was a very clean design choice at low starting depths but ultimately impractical (the branching factors of social networks are typically very high), I had delayed the output and cleanup stages (mechanize creates a file in /tmp for each page access, which must be removed before exiting) until the script had finished traversing the network to the desired depth. The problem wasn't obvious until I terminated a depth-3 crawl (once the date changed, my code's assumptions about when scraps were written would have been wrong) and was left with 524282 useless tmpfiles and no data. Cheers to ReiserFS.

    # how to delete 524282 randomly* named tmpfiles
    # (batching the rm calls by the first two characters of the random
    #  suffix keeps each argument list short enough for the shell)
    import os

    chars = '_-abcdefghijklmnopqrstuvwxyz' \
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    cmds = []
    for i in chars:
        for j in chars:
            cmds.append('rm -f /tmp/tmp' + i + j + '*\n')
    script = open('xxx', 'w')      # don't shadow the builtin 'file'
    script.writelines(cmds)
    script.close()                 # flush before bash reads the script
    os.system('bash xxx')
    os.remove('xxx')
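
The fix for future crawls is structural rather than clever: flush output and remove mechanize's tmpfiles after every page instead of once at the very end, so an interrupted run still leaves usable data behind. In sketch form (fetch_user() is a hypothetical stand-in for the real download-and-parse step, not actual crawler code):

    # sketch of the revised loop: write results and clean /tmp per page
    # instead of at the end; fetch_user() is a hypothetical stand-in for
    # the real download-and-parse step
    import glob
    import os

    def crawl(user_ids, results_path):
        out = open(results_path, 'a')
        for uid in user_ids:
            record = fetch_user(uid)   # friends list, scrapbook, communities
            out.write(record + '\n')
            out.flush()                # partial results survive a kill
            for tmp in glob.glob('/tmp/tmp*'):
                os.remove(tmp)         # blunt, but /tmp never piles up
        out.close()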

Wednesday, September 19, 2007

init()

I am a student in the Department of Computer Science at Virginia Tech. Blog topic: my reflections on studying artificial intelligence.

Hello from VT! I am a computer science student at Virginia Polytechnic Institute... this blog is about my research in the field of artificial intelligence.

I study computer science at VT. This blog is about artificial intelligence - particularly, changes in relational indicators within social networks and their effects. Welcome!

So this is the first of hopefully many productive posts to my very own research blog. What kind of research? I am studying how social networks evolve in a popular social networking website, and whether indicators exist that will allow us to make valid predictions about future member actions within the network. Comments are welcome and much appreciated!