Sunday, September 23, 2007

bash: /bin/rm: Argument list too long

Orkut was chosen as the social network for this research for several reasons: much more information tends to be public there, the data that is most often private (identity, etc.) is not what we are interested in, and Orkut's scrap feature (similar to the Facebook wall, but more heavily used) is exactly what will be analyzed.

The first step was to develop a suitably large scrap corpus. The web crawler is written in Python and uses the mechanize module (the Python one, not the Perl one) for login, roughly as sketched below. Originally, everything was integrated into a single script, but this didn't scale well and made concurrency difficult. The solution was to split the crawler into two stages: user ID listing (which walks friends lists to a given depth) and user info downloading (which parses each user's friends list, scrapbook, and communities).
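For reference, a minimal sketch of the mechanize login step. The login URL and the Email/Passwd form field names are assumptions standing in for Google's login page of the time, not the actual crawler code:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)   # Orkut's robots.txt would block the crawl otherwise
    br.open('https://www.google.com/accounts/Login')
    br.select_form(nr=0)          # assume the login form is the first on the page
    br['Email'] = 'someone@gmail.com'   # assumed field names
    br['Passwd'] = 'secret'
    br.submit()                   # br now carries an authenticated session
    resp = br.open('http://www.orkut.com/')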

Another design mistake was how the parsed information was handled. In what was a clean design choice at low starting depths but ultimately impractical (the branching factors of social networks are typically very high), I had delayed both the output stage and the cleanup stage (mechanize creates a file in /tmp for each page access, which must be removed before exiting) until the script had finished traversing the network to the desired depth. The problem wasn't obvious until I terminated a depth 3 crawl (once the date changed, my code's assumptions about when scraps were written would have been wrong) and was left with 524282 useless tmpfiles and no data. Cheers to ReiserFS. Worse, a plain rm /tmp/tmp* fails with the error in the title, because the shell-expanded glob blows past the kernel's argument-length limit; the script below splits the deletion into one rm per two-character prefix.

    # how to delete 524282 randomly* named tmpfiles
    import os
    # all characters that show up in the random part of the names
    c = '_-abcdefghijklmnopqrstuvwxyzABCDEFGHIJK' \
        'LMNOPQRSTUVWXYZ0123456789'
    cmds = []
    for i in c:
        for j in c:
            # one rm per two-character prefix keeps each argument
            # list under the limit; -f silences prefixes that
            # match nothing
            cmds.append('rm -f /tmp/tmp' + i + j + '*\n')
    f = open('xxx', 'w')      # 'file' would shadow the builtin
    f.writelines(cmds)
    f.close()                 # flush before bash reads the script
    os.system('bash xxx')
    os.remove('xxx')
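The real fix, of course, is structural: flush parsed data and clean up tmpfiles per page rather than per crawl, so that killing the script loses at most one user's worth of work. A rough sketch of that loop; fetch_user is a hypothetical stand-in for the actual mechanize fetch-and-parse code:

    # sketch of the repaired crawl loop: output and /tmp cleanup
    # happen per user, not once at the end
    import glob, os

    def fetch_user(uid):
        # hypothetical stand-in for the mechanize fetch/parse step;
        # returns (scraps, friend ids)
        return [], []

    def cleanup_tmpfiles():
        # remove mechanize's leftovers while the glob is still small;
        # note this sweeps anything in /tmp matching tmp*, so a
        # private tempdir would be safer
        for path in glob.glob('/tmp/tmp*'):
            os.remove(path)

    def crawl(seed_ids, depth, out):
        frontier = list(seed_ids)
        for level in range(depth):
            nxt = []
            for uid in frontier:
                scraps, friends = fetch_user(uid)
                out.write('%s\t%r\n' % (uid, scraps))
                out.flush()          # data survives a killed crawl
                cleanup_tmpfiles()   # /tmp never piles up
                nxt.extend(friends)
            frontier = nxt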
