Saturday, March 22, 2008

term-document modeling

Following up on the conversation last night, I've incorporated variant weighting schemes into the term-document modeling. This required rewrites in two areas -- specifically, input parameters are getting possibly too unwieldy, so I wrote a fallback interface to choose valid parameters and moved parameter validation itself, wherever possible, earlier. Secondly, the load_weights() function mentioned in the last post has been expanded as to be described here.

variant term weighting schemes
In the new design, load_weights is a shell layering functionality over a more orthagonal function with more knobs. Desired weighting schemes can be tf-idf, sublinear tf scaling (is 20 occurrences really 20 times as significant?), maximum tf normalization (documents tend to repeat the same words over and over), or a custom triple of weightings for term frequency, document frequency, and normalization factor following the design here. load_weights() is able to realize the first three as configurations of triples (ntn, ltn, atn, respectively) and process/pass all types of input, following validation, to the underlying function load_weights_triple.

tf/df/normalization
Term weights then are simply multiples resulting from the three selected weighting schemes. I wasn't able to implement the pivoted unique normalization "(Section 6.4.4)" due to lack of information, but the rest are there, namely

term frequency =
- natural = tf
- logarithm = tf > 0 ? 1 + log(tf) : 0
- augmented = a + (1-a) * tf / tf_max
- boolean = tf > 0 ? 1 : 0
- log ave = (1 + log(tf)) / (1 + log(ave(tf)))

document frequency =
- no = 1
- idf = log(N / df)
- prob_idf = max(0, (N-df)/df)

normalization =
- none = 1
- cosine = 1 / sqrt(w02 + w12 + w22 + ...)
- pivoted unique = ?
- byte size = charlengthα, α < 1

confusion and gnashing of teeth
My main problems here are pivoted unique, the fact that by my interpretation, (N - df)/df should never be less than zero (?), the constants introduced for 'augment' (.4 or .5) and 'byte size' (unknown reco.) (how to tune, should they be user-configurable), and adequately testing that all my code does what it was meant to do. Apologies for missing the last meeting due to the foolish wisdom teeth removal issue. I'm also aiming for the Monday deadline for the similarity search improvement, and hope to have another blog post by then. Cheers.

This version of the 'load code,' sans input data, is available here.

Monday, March 17, 2008

storytelling pipeline, statistics section

Returning to the use of this blog...

overview
My design consists of a fairly input format-agnostic document statistics section whose output feeds a path-finder over the processed documents according to various document similarity measures. Resulting paths can then be pruned and rated using the documents to ensure acceptable relevance.

scalability
The statistics section is a serial (interdependent), very tuned set of stages. Runtime scales linearly over #documents * #words for almost all subsections (details for Daylife, PubMed corpora). Memory usage is the overarching but not insurmountable limitation on increasing the size of input data, with the low-hanging fruit being the removal of all memory leaks and more careful allocation schemes, and the long-term goal of rewriting it in parallel, only if this is the best use of time, of course, given the many other interesting opportunities in this project.

callgraph
main
- load(input_type, input_filename, documents, words)
- - load_file(filetype, filename, documents)
- - load_fixup(documents)
- - load_dict(documents, dictionary)
- - load_index(dictionary, words)
- - load_pack(documents, dictionary, pdocs, words_sz_max)
- - load_stats(words, pdocs)
- - load_weights(words, pdocs, words_sz_max)
- dump(words, documents, output_stats)
- show(words, documents, output_stats)

substages
load_file() reads the documents from the file. Each line is passed into a custom boolean function load_document(doc, line, type) that returns whether it was able to load a document from that string, which in turn calls load_document_pubmed, load_document_daylife, or assert(0). Document texts are loaded into a vector to be used by following functions.

load_fixup() converts document texts to a more usable form by removing DNA sequences, converting to lower case, removing stopwords, and stemming for each document (runs in #docs * #words, and is the slowest stage but not considerably so).

load_dict() builds a dictionary (a C++ stl::map) of words from the documents (runs in #docs log #words).

load_index() exists because indexing words into a map for each document traversal was hurting performance by a factor of ~50x over the entire program. Instead, the next functions re-represent documents not as words separated by spaces in strings, but as arrays of integer dictionary indexes. load_index traverses the dictionary of (word -> count) from the load_dict, building a new full word statistics structure from this information and reindexing the dictionary to store (word -> index in the new structure).

load_pack() then walks the documents, converting the words into an array of word index bit-or the shifted number of the word's occurrences in that document packed into a uint_32t (verifying no overflow), thus removing duplicate word entries and cutting out a log n map access algorithmically, but most importantly, saving in cache misses. This saves much time for every subsequent access in the next two stages.

load_stats(), using the re-formed documents and words lists, then builds documents-per-word, words-per-document, and standard deviation statistics for the word statistic structures (documents-per-word is needed for the following tf-idf computation), updating stddev in multiples where zero.

load_weights() computes and stores tf-idf scores (technically unneeded except for weight calculation, but interesting) and term weights per word per document.

summary
I would go over the code in detail, but it seems it's much more efficient to use your preferred editor and syntax highlighting scheme than to see dislocated snippets on blogger.com. For the code to match the descriptions, visit /load/load.cpp. Logging is in log.h, document structures in document.h, word structures in word.h, in addition to the stopwords list, Porter code, and some sample outputs. The code is clean, and with the exception of some pointer arithmetic and the adapted Porter stemmer, quite readable. I compile with "g++ load.cpp -O3 -o load". Comments/insight are very appreciated.