Saturday, March 22, 2008

term-document modeling

Following up on the conversation last night, I've incorporated variant weighting schemes into the term-document modeling. This required rewrites in two areas. First, the input parameters were getting unwieldy, so I wrote a fallback interface for choosing valid parameters and moved parameter validation itself as early as possible. Second, the load_weights() function mentioned in the last post has been expanded as described here.

variant term weighting schemes
In the new design, load_weights() is a thin shell layering convenience over a more orthogonal function with more knobs. The desired weighting scheme can be tf-idf, sublinear tf scaling (is 20 occurrences really 20 times as significant?), maximum tf normalization (documents tend to repeat the same words over and over), or a custom triple of weightings for term frequency, document frequency, and normalization factor, following the design here. load_weights() realizes the first three as configurations of triples (ntn, ltn, atn, respectively) and, after validation, passes every kind of input down to the underlying function load_weights_triple().
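
As a rough sketch of that layering (the scheme strings, names, and parameter lists below are placeholders, not the actual signatures in load.cpp), load_weights() can reduce each named scheme to a SMART-style triple and hand everything to load_weights_triple():

    // Hypothetical sketch -- the real parameter lists live in load.cpp.
    #include <cassert>
    #include <string>

    struct WeightTriple { char tf, df, norm; };   // e.g. {'n','t','n'} == "ntn"

    static WeightTriple make_triple(char tf, char df, char norm) {
        WeightTriple w; w.tf = tf; w.df = df; w.norm = norm; return w;
    }

    // Stand-in declaration for the real worker in load.cpp, which also takes
    // the word/document structures.
    void load_weights_triple(const WeightTriple& w /*, words, pdocs, ... */);

    void load_weights(const std::string& scheme /*, words, pdocs, ... */) {
        WeightTriple w = make_triple('n', 't', 'n');                        // default: tf-idf (ntn)
        if      (scheme == "sublinear-tf") w = make_triple('l', 't', 'n');  // ltn
        else if (scheme == "max-tf-norm")  w = make_triple('a', 't', 'n');  // atn
        else if (scheme.size() == 3)                                        // custom triple, e.g. "lnc"
            w = make_triple(scheme[0], scheme[1], scheme[2]);
        else
            assert(scheme == "tf-idf" && "unknown weighting scheme");       // earlier validation should catch this
        load_weights_triple(w /*, words, pdocs, ... */);
    }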

tf/df/normalization
Term weights are then simply the product of the three selected weighting components. I wasn't able to implement pivoted unique normalization (Section 6.4.4) for lack of information, but the rest are there, namely the following (a small sketch combining the three components appears after the lists):

term frequency =
- natural = tf
- logarithm = tf > 0 ? 1 + log(tf) : 0
- augmented = a + (1-a) * tf / tf_max
- boolean = tf > 0 ? 1 : 0
- log ave = (1 + log(tf)) / (1 + log(ave(tf)))

document frequency =
- no = 1
- idf = log(N / df)
- prob_idf = max(0, (N-df)/df)

normalization =
- none = 1
- cosine = 1 / sqrt(w0^2 + w1^2 + w2^2 + ...)
- pivoted unique = ?
- byte size = CharLength^α, α < 1
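
Putting the lists together, a minimal sketch (the function names are mine, the constant a and the byte-size exponent α are exactly the open questions below, and I take the reciprocal of CharLength^α so it scales the same direction as cosine, which is an assumption):

    #include <algorithm>
    #include <cmath>

    // One component from each column; the final term weight is their product.
    double tf_component(char code, double tf, double tf_max, double tf_ave, double a = 0.4) {
        switch (code) {
            case 'n': return tf;                                              // natural
            case 'l': return tf > 0 ? 1.0 + std::log(tf) : 0.0;               // logarithm
            case 'a': return a + (1.0 - a) * tf / tf_max;                     // augmented (a = .4 or .5?)
            case 'b': return tf > 0 ? 1.0 : 0.0;                              // boolean
            case 'L': return tf > 0 ? (1.0 + std::log(tf)) / (1.0 + std::log(tf_ave))
                                    : 0.0;                                    // log ave
        }
        return tf;
    }

    double df_component(char code, double N, double df) {
        switch (code) {
            case 'n': return 1.0;                                // no
            case 't': return std::log(N / df);                   // idf
            case 'p': return std::max(0.0, (N - df) / df);       // prob idf, as listed above
        }
        return 1.0;
    }

    double norm_component(char code, double sum_w_sq, double char_length, double alpha) {
        switch (code) {
            case 'n': return 1.0;                                 // none
            case 'c': return 1.0 / std::sqrt(sum_w_sq);           // cosine
            case 'b': return 1.0 / std::pow(char_length, alpha);  // byte size; alpha < 1, value open
        }
        return 1.0;
    }

    // weight(term, doc) = tf_component(...) * df_component(...) * norm_component(...)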

confusion and gnashing of teeth
My main problems here are: pivoted unique normalization; the fact that, by my interpretation, (N - df)/df can never be less than zero (so why the max?); the constants introduced for 'augmented' (.4 or .5) and 'byte size' (no recommended value given), including how to tune them and whether they should be user-configurable; and adequately testing that all of my code does what it was meant to do. Apologies for missing the last meeting due to the foolish wisdom-teeth-removal issue. I'm also aiming for the Monday deadline for the similarity search improvement, and hope to have another blog post by then. Cheers.

This version of the 'load code,' sans input data, is available here.

Monday, March 17, 2008

storytelling pipeline, statistics section

Returning to the use of this blog...

overview
My design consists of a largely input-format-agnostic document-statistics section whose output feeds a path-finder over the processed documents according to various document similarity measures. The resulting paths can then be pruned and rated using the documents to ensure acceptable relevance.

scalability
The statistics section is a serial (its stages are interdependent), heavily tuned set of stages. Runtime scales linearly with #documents * #words for almost all subsections (details for the Daylife and PubMed corpora). Memory usage is the overarching, but not insurmountable, limitation on increasing the size of the input data. The low-hanging fruit is removing all memory leaks and adopting more careful allocation schemes; the long-term goal is a parallel rewrite, though only if that is the best use of time, of course, given the many other interesting opportunities in this project.

callgraph
main
- load(input_type, input_filename, documents, words)
- - load_file(filetype, filename, documents)
- - load_fixup(documents)
- - load_dict(documents, dictionary)
- - load_index(dictionary, words)
- - load_pack(documents, dictionary, pdocs, words_sz_max)
- - load_stats(words, pdocs)
- - load_weights(words, pdocs, words_sz_max)
- dump(words, documents, output_stats)
- show(words, documents, output_stats)

substages
load_file() reads the documents from the file. Each line is passed to a boolean function load_document(doc, line, type) that returns whether it was able to load a document from that string; depending on the input type, it calls load_document_pubmed, load_document_daylife, or assert(0). Document texts are loaded into a vector for use by the following functions.
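
In sketch form (the enum and signatures are approximations of what is in load.cpp):

    #include <cassert>
    #include <string>

    struct document;   // defined in document.h

    // Both defined elsewhere in load.cpp.
    bool load_document_pubmed(document& doc, const std::string& line);
    bool load_document_daylife(document& doc, const std::string& line);

    enum input_type { PUBMED, DAYLIFE };

    // Returns whether a document could be loaded from this line.
    bool load_document(document& doc, const std::string& line, input_type type) {
        switch (type) {
            case PUBMED:  return load_document_pubmed(doc, line);
            case DAYLIFE: return load_document_daylife(doc, line);
        }
        assert(0);     // unknown input type
        return false;
    }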

load_fixup() converts the document texts to a more usable form by removing DNA sequences, converting to lower case, removing stopwords, and stemming each document (it runs in #docs * #words and is the slowest stage, though not by much).
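
A hedged sketch of the cleanup order (the DNA check in particular is a placeholder guess, and porter_stem/stopwords stand in for the adapted Porter code and stopword list mentioned in the summary):

    #include <algorithm>
    #include <cctype>
    #include <set>
    #include <sstream>
    #include <string>

    std::string porter_stem(const std::string& word);   // adapted Porter stemmer, elsewhere
    extern std::set<std::string> stopwords;             // loaded from the stopwords list

    // Placeholder heuristic: long tokens drawn only from ACGT. The real check may differ.
    static bool looks_like_dna(const std::string& w) {
        if (w.size() < 10) return false;
        for (size_t i = 0; i < w.size(); ++i)
            if (w[i] != 'A' && w[i] != 'C' && w[i] != 'G' && w[i] != 'T') return false;
        return true;
    }

    std::string fixup_text(const std::string& text) {
        std::istringstream in(text);
        std::string word, out;
        while (in >> word) {
            if (looks_like_dna(word)) continue;                                 // drop DNA sequences
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);  // lower-case
            if (stopwords.count(word)) continue;                                // drop stopwords
            out += porter_stem(word);                                           // stem the rest
            out += ' ';
        }
        return out;
    }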

load_dict() builds a dictionary (a C++ std::map) of words from the documents (runs in #docs log #words).
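
The dictionary pass is essentially a counting map; a sketch (the real structures are in document.h and word.h):

    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Count every (already fixed-up) word across all document texts.
    std::map<std::string, int> build_dictionary(const std::vector<std::string>& doc_texts) {
        std::map<std::string, int> dict;
        for (size_t d = 0; d < doc_texts.size(); ++d) {
            std::istringstream in(doc_texts[d]);
            std::string word;
            while (in >> word)
                ++dict[word];        // O(log #distinct words) per lookup
        }
        return dict;
    }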

load_index() exists because indexing words through a map on every document traversal was hurting performance by a factor of ~50x over the entire program. Instead, the subsequent functions re-represent documents not as words separated by spaces in strings, but as arrays of integer dictionary indexes. load_index() traverses the (word -> count) dictionary from load_dict, building a new full word-statistics structure from that information and reindexing the dictionary to store (word -> index into the new structure).
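
Roughly (the names are mine; the real layout is in word.h), the reindexing looks like:

    #include <map>
    #include <string>
    #include <vector>

    struct word_stat {              // stand-in for the full word-statistics structure
        std::string text;
        int corpus_count;
        int doc_count;              // filled in later by load_stats
    };

    // Turn (word -> count) into a flat stats array plus (word -> index).
    std::vector<word_stat> build_index(std::map<std::string, int>& dict) {
        std::vector<word_stat> words;
        words.reserve(dict.size());
        for (std::map<std::string, int>::iterator it = dict.begin(); it != dict.end(); ++it) {
            word_stat w;
            w.text = it->first;
            w.corpus_count = it->second;
            w.doc_count = 0;
            words.push_back(w);
            it->second = (int)words.size() - 1;   // dictionary now maps word -> index
        }
        return words;
    }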

load_pack() then walks the documents, converting each one into an array of uint32_t values, each the word's dictionary index bit-or'd with the shifted count of that word's occurrences in the document (verifying no overflow). This removes duplicate word entries and algorithmically cuts out a log n map access, but most importantly it saves on cache misses, which pays off for every subsequent access in the next two stages.
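
A plausible packing (the bit split is my guess, not necessarily the one in load.cpp):

    #include <cassert>
    #include <stdint.h>

    // Low bits hold the word's dictionary index, high bits the per-document count.
    static const uint32_t INDEX_BITS = 24;                  // assumed split
    static const uint32_t INDEX_MASK = (1u << INDEX_BITS) - 1;

    uint32_t pack_entry(uint32_t word_index, uint32_t count) {
        assert(word_index <= INDEX_MASK);                   // verify no overflow
        assert(count < (1u << (32 - INDEX_BITS)));
        return (count << INDEX_BITS) | word_index;
    }

    uint32_t packed_index(uint32_t e) { return e & INDEX_MASK; }
    uint32_t packed_count(uint32_t e) { return e >> INDEX_BITS; }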

load_stats(), using the re-formed document and word lists, then builds documents-per-word, words-per-document, and standard-deviation statistics for the word-statistic structures (documents-per-word is needed for the following tf-idf computation), updating the stddev in multiples where it is zero.
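
Sketched over the packed representation (the zero-stddev adjustment mentioned above is glossed over here, and the assumed 24-bit split matches the packing sketch):

    #include <stdint.h>
    #include <vector>

    // Matches the packing sketch above (assumed 24-bit index).
    static inline uint32_t packed_index(uint32_t e) { return e & ((1u << 24) - 1); }

    // One pass over the packed documents: documents-per-word and words-per-document.
    void gather_stats(const std::vector<std::vector<uint32_t> >& pdocs,
                      std::vector<int>& docs_per_word,      // sized to #words
                      std::vector<int>& words_per_doc)      // sized to #docs
    {
        for (size_t d = 0; d < pdocs.size(); ++d) {
            words_per_doc[d] = (int)pdocs[d].size();         // distinct words in this document
            for (size_t i = 0; i < pdocs[d].size(); ++i)
                ++docs_per_word[packed_index(pdocs[d][i])];  // each packed entry is a distinct word
        }
    }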

load_weights() computes and stores tf-idf scores (technically unneeded except for weight calculation, but interesting) and term weights per word per document.

summary
I would go over the code in detail, but it's much more efficient to read it in your preferred editor with your own syntax-highlighting scheme than to see dislocated snippets on blogger.com. For the code matching these descriptions, see /load/load.cpp. Logging is in log.h, document structures in document.h, and word structures in word.h, alongside the stopwords list, the Porter stemmer code, and some sample outputs. The code is clean and, with the exception of some pointer arithmetic and the adapted Porter stemmer, quite readable. I compile with "g++ load.cpp -O3 -o load". Comments and insight are much appreciated.

Wednesday, November 7, 2007

anti-hiatus

ORKUT | CRAIGSLIST | DAYLIFE

This has been a busy time of the year. With 'interview season' finally drawing toward a close (my sincere thanks going out to HPTI, Microsoft, IBM, Google, and NetApp for their interest), I have finally found the time, motivation, and mental state(?) to update the research blog with the goings-on of the past weeks. So here we go...

Orkut - following the previous post, we ran into insurmountable privacy issues with the data collection. The project and all associated data have been piped to the bit bucket, so to speak.

Craigslist - in a similar vein to the Orkut scrap collection, Professor Ramakrishnan suggested applying some of our data mining code to the interestingly large and largely interesting postings on Craigslist, to track how users interact over time, among a million other things. In practice, this meant writing a spider to trawl all locations for all jobs for all time. Assuming there is a Phase B of the project, I am at Phase B. :)

Daylife - If I had to state a preference (don't CS majors convert all preferences to optimization problems?), this has been the most interesting. Some background: daylife.com is a startup that seeks to be a better news aggregator than other sites, such as news.google.com. They shared a small, 499795812-byte subset of their articles with Professor Ramakrishnan, from whom I received it with the goal of implementing storytelling for this type of data. The problem is more challenging in this case because it's a database of general-purpose news articles, which show a higher degree of ambiguity, spurious connections, etc. than the scientific texts to which the algorithm was, to my understanding, originally (and very successfully) applied.

Sunday, September 23, 2007

bash: /bin/rm: Argument list too long

Orkut was chosen as the social network for this research for several reasons: much more of its information tends to be public, we are not interested in the data that is most often private (identity, etc.), and its scrap feature (similar to the Facebook wall, but more often used) is exactly what will be analyzed.

The first step was to develop a suitably large scrap corpus. The web crawler is written in Python and uses the mechanize module (the Python one, not the Perl one) for login. Originally everything was integrated, but this didn't scale well and made concurrency difficult. The solution was to split the crawler into a user-id listing section (which parses friends lists to a given depth) and a user-info downloading section (which parses the friends list, scrapbook, and communities).

Another design mistake was how the parsed information was handled. In what was a very clean design choice at shallow starting depths but ultimately impractical (the branching factors of social networks are typically very high), I had delayed the output and cleanup stages (mechanize creates a file in /tmp for each page access, which must be removed before exiting) until the script had finished traversing its network to the desired depth. This wasn't obvious until I had to terminate a depth-3 crawl (once the date changed, my code's assumptions about when scraps were written would be wrong) and was left with 524282 useless tmpfiles and no data. Cheers to ReiserFS.

    # how to delete 524282 randomly* named tmpfiles
    # (each rm gets only a two-character prefix glob, so no single
    #  command line is long enough to trip "Argument list too long")
    import os
    # characters that can appear in the tmpfile names
    chars = '_-abcdefghijklmnopqrstuvwxyzABCDEFGHIJK' \
            'LMNOPQRSTUVWXYZ0123456789'
    cmds = []
    for i in chars:
        for j in chars:
            cmds.append('rm /tmp/tmp' + i + j + '*\n')
    out = open('xxx', 'w')      # dump all the rm commands into a scratch script
    out.writelines(cmds)
    out.close()                 # flush before bash reads it
    os.system('bash xxx')
    os.remove('xxx')            # clean up the scratch script itself

Wednesday, September 19, 2007

init()

I am a student, studying in the computer science department at Virginia Tech. Blog topic: my reflections on studying artificial intelligence.

Hello from VT! I am a computer science student at Virginia Polytechnic Institute... this blog is about my research in the field of artificial intelligence.

I study computer science at VT. This blog is about artificial intelligence - particularly, changes in relational indicators within social networks and their effects. Welcome!

So this is the first of hopefully many productive posts to my very own research blog. What kind of research? I am studying how social networks evolve on a popular social networking website, and what indicators, if any, exist that allow us to make valid predictions about members' future actions within the network. Comments are welcome and much appreciated!