data mining social networks: 2007/09

Orkut was chosen as the social network for this research for several reasons: much more of its information tends to be public, the data that is most often private (identity, etc.) is not what we are interested in, and its scrap feature (similar to the Facebook wall, but used more often) is what will be analyzed.

The first step was to develop a suitably large scrap corpus. The web crawler is written in Python and uses the mechanize module (the Python one, not the Perl one) for login. Originally everything was integrated in one script, but this didn't scale well and made concurrency difficult. The solution was to split the crawler into two parts: user id listing, which parses friends lists to a given depth, and user info downloading, which parses each user's friends list, scrapbook, and communities.
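
The user id listing half boils down to a depth-limited breadth-first walk over friends lists. The following is only a minimal sketch of that idea, not the actual crawler: `list_user_ids`, `get_friends`, and the toy graph are hypothetical names made up for illustration, and the real version would parse friends-list pages instead of reading a dictionary.

```python
from collections import deque


def list_user_ids(start_id, max_depth, get_friends):
    """Breadth-first walk of friends lists.

    Returns every user id reachable from start_id within max_depth hops.
    get_friends is a hypothetical callable: user id -> iterable of friend ids.
    """
    seen = {start_id}
    queue = deque([(start_id, 0)])
    while queue:
        user_id, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand users sitting at the depth limit
        for friend_id in get_friends(user_id):
            if friend_id not in seen:
                seen.add(friend_id)
                queue.append((friend_id, depth + 1))
    return seen


# Toy friends graph standing in for parsed friends-list pages (assumption).
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}


def toy_get_friends(user_id):
    """Look up friends in the toy graph; unknown users have no friends."""
    return TOY_GRAPH.get(user_id, [])
```

Because the listing step only produces ids, the downloading step can take that set and be run separately (or in several processes) without re-walking the network.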

Another design mistake was the handling of the parsed information. I had delayed the output and cleanup stages until the script had finished traversing the network to the desired depth (mechanize creates a file in /tmp for each page access, and these must be removed before exiting). This was a very clean design at low starting depths but ultimately impractical, since the branching factors of social networks are typically very high. The problem wasn't obvious until I terminated a depth 3 crawl (once the date changed, my code's assumptions about the dates scraps were written would be wrong) and was left with 524282 useless tmpfiles and no data. Cheers to ReiserFS.
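
The obvious alternative is to write output and clean up after every user instead of at the end. Here is a rough sketch of that pattern, under assumptions: `crawl_incrementally` and `fake_fetch_user` are hypothetical names, the JSON-lines output format is my own choice for the example, and the fetch function is assumed to report which temp files it created.

```python
import json
import os
import tempfile


def crawl_incrementally(user_ids, fetch_user, out_path):
    """Append one JSON line per user and delete that user's temp files
    right away, so an interrupted crawl keeps its data and leaves no litter.

    fetch_user is a hypothetical callable returning (record, tmp_paths).
    Returns the number of records written.
    """
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for user_id in user_ids:
            record, tmp_paths = fetch_user(user_id)
            try:
                out.write(json.dumps(record) + "\n")
                out.flush()  # hand the line to the OS before the next fetch
                written += 1
            finally:
                # clean up this user's temp files even if the write failed
                for path in tmp_paths:
                    if os.path.exists(path):
                        os.remove(path)
    return written


def fake_fetch_user(user_id):
    """Stand-in for a page download (assumption): creates one temp file,
    mimicking the per-page file in /tmp, and returns a tiny record."""
    fd, path = tempfile.mkstemp(prefix="tmp")
    os.close(fd)
    return {"id": user_id, "scraps": []}, [path]
```

With this shape, killing the script midway costs at most the user being processed at that moment.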

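    # how to delete 524282 randomly* named tmpfiles
    # one `rm /tmp/tmp*` overflows the argument list, so issue one rm
    # per two-character name prefix instead
    import os
    # characters the tmpfile names are drawn from
    c = '_-abcdefghijklmnopqrstuvwxyzABCDEFGHIJK' \
        'LMNOPQRSTUVWXYZ0123456789'
    cmds = []
    for i in c:
        for j in c:
            cmds.append('rm /tmp/tmp' + i + j + '*\n')
    # write the batch script; the with-block closes (and flushes) it
    # before bash reads it
    with open('xxx', 'w') as script:
        script.writelines(cmds)
    os.system('bash xxx')
    os.remove('xxx')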
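
The "Argument list too long" error comes from the shell expanding the glob into one enormous argument list for a single rm, which exceeds the kernel's limit on exec arguments. A shell-free variant sidesteps that entirely, since each deletion is its own call. This is just a sketch of that alternative, not what I actually ran; `remove_tmpfiles` is a made-up helper name.

```python
import os


def remove_tmpfiles(directory, prefix="tmp"):
    """Delete regular files in `directory` whose names start with `prefix`.

    Subdirectories and symlinks are left alone.
    Returns the number of files removed.
    """
    # Collect the matching paths first, then delete, so the directory
    # is not modified while it is still being listed.
    with os.scandir(directory) as entries:
        paths = [
            entry.path
            for entry in entries
            if entry.name.startswith(prefix)
            and entry.is_file(follow_symlinks=False)
        ]
    for path in paths:
        os.remove(path)
    return len(paths)
```

Calling `remove_tmpfiles('/tmp')` would remove every regular file named `tmp...` in /tmp, including ones belonging to other programs, so it is only appropriate when nothing else is using that prefix.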

I am a student in the computer science department at Virginia Tech. Blog topic: my notes on learning artificial intelligence.

Greetings from VT! I am a computer science student at Virginia Polytechnic Institute... this blog is about my research in the field of artificial intelligence.

I study computer science at VT. This blog is about artificial intelligence, in particular changes in relational indicators within social networks and their effects. Welcome!

So this is the first of hopefully many productive posts to my very own research blog. What kind of research? I am studying how social networks evolve on a popular social networking website, and whether there are indicators (and if so, which) that will allow us to make valid predictions about future member actions within the network. Comments are welcome and much appreciated!


Sunday, September 23, 2007

bash: /bin/rm: Argument list too long

Wednesday, September 19, 2007

init()
