Monday, March 17, 2008

storytelling pipeline, statistics section

Returning to the use of this blog...

overview
My design consists of a largely input-format-agnostic document statistics section whose output feeds a path-finder that connects the processed documents according to various document similarity measures. The resulting paths can then be pruned and rated against the documents to ensure acceptable relevance.

scalability
The statistics section is a serial (interdependent), heavily tuned set of stages. Runtime scales linearly in #documents * #words for almost all subsections (details for the Daylife and PubMed corpora). Memory usage is the overarching but not insurmountable limitation on increasing the size of the input data: the low-hanging fruit is removing all memory leaks and adopting more careful allocation schemes, and the long-term goal is a parallel rewrite, but only if that is the best use of time given the many other interesting opportunities in this project.

callgraph
main
- load(input_type, input_filename, documents, words)
- - load_file(filetype, filename, documents)
- - load_fixup(documents)
- - load_dict(documents, dictionary)
- - load_index(dictionary, words)
- - load_pack(documents, dictionary, pdocs, words_sz_max)
- - load_stats(words, pdocs)
- - load_weights(words, pdocs, words_sz_max)
- dump(words, documents, output_stats)
- show(words, documents, output_stats)

substages
load_file() reads the documents from the file. Each line is passed to a custom boolean function load_document(doc, line, type), which returns whether it was able to load a document from that string and in turn calls load_document_pubmed, load_document_daylife, or assert(0). Document texts are loaded into a vector used by the following functions.
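A minimal sketch of that dispatch follows; the Document fields and the parser bodies here are placeholders, not the real document.h/load.cpp code:

#include <cassert>
#include <string>

// Placeholder document record; the real fields live in document.h.
struct Document {
    std::string text;
};

enum InputType { INPUT_PUBMED, INPUT_DAYLIFE };

// Stub parsers: the real versions understand their respective formats.
bool load_document_pubmed(Document &doc, const std::string &line)  { doc.text = line; return !line.empty(); }
bool load_document_daylife(Document &doc, const std::string &line) { doc.text = line; return !line.empty(); }

// Dispatch on the input type; any other type is a programming error.
bool load_document(Document &doc, const std::string &line, InputType type) {
    switch (type) {
        case INPUT_PUBMED:  return load_document_pubmed(doc, line);
        case INPUT_DAYLIFE: return load_document_daylife(doc, line);
    }
    assert(0);
    return false;
}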

load_fixup() converts document texts to a more usable form by removing DNA sequences, converting to lower case, removing stopwords, and stemming each document (runs in #docs * #words, and is the slowest stage, though not by much).
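A minimal sketch of that normalization; the stopword handling and stem() below are stand-ins for the stopwords list and the adapted Porter stemmer, and the DNA-sequence filter is omitted:

#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Stand-in for the adapted Porter stemmer.
std::string stem(const std::string &word) { return word; }

// Lowercase, drop stopwords, and stem one document's text.
std::string fixup_text(const std::string &text, const std::set<std::string> &stopwords) {
    std::istringstream in(text);
    std::string word, out;
    while (in >> word) {
        std::transform(word.begin(), word.end(), word.begin(), ::tolower);
        if (stopwords.count(word)) continue;  // skip stopwords
        if (!out.empty()) out += ' ';
        out += stem(word);
    }
    return out;
}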

load_dict() builds a dictionary (a C++ std::map) of words from the documents (runs in #docs log #words).
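Roughly, assuming the fixed-up texts are plain whitespace-separated strings at this point:

#include <map>
#include <sstream>
#include <string>
#include <vector>

// Count word occurrences across all fixed-up document texts.
std::map<std::string, int> build_dictionary(const std::vector<std::string> &texts) {
    std::map<std::string, int> dict;
    for (size_t d = 0; d < texts.size(); ++d) {
        std::istringstream in(texts[d]);
        std::string word;
        while (in >> word)
            ++dict[word];  // O(log #unique words) per insertion/lookup
    }
    return dict;
}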

load_index() exists because indexing words into a map on every document traversal was hurting performance by a factor of ~50x over the entire program. Instead, the remaining functions re-represent documents not as space-separated words in strings, but as arrays of integer dictionary indexes. load_index() traverses the (word -> count) dictionary from load_dict(), building a new full word-statistics structure from this information and reindexing the dictionary to store (word -> index into the new structure).
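Something along these lines; the WordStat fields are placeholders for the real structure in word.h:

#include <map>
#include <string>
#include <vector>

// Placeholder per-word statistics record; the real layout is in word.h.
struct WordStat {
    std::string word;
    int total_count;  // occurrences across the corpus
    int doc_count;    // documents containing the word, filled in by load_stats()
};

// Flatten the (word -> count) dictionary into a statistics array and
// rewrite the map values to be indexes into that array.
std::vector<WordStat> build_index(std::map<std::string, int> &dict) {
    std::vector<WordStat> words;
    words.reserve(dict.size());
    for (std::map<std::string, int>::iterator it = dict.begin(); it != dict.end(); ++it) {
        WordStat w;
        w.word = it->first;
        w.total_count = it->second;
        w.doc_count = 0;
        words.push_back(w);
        it->second = (int)words.size() - 1;  // word -> index from now on
    }
    return words;
}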

load_pack() then walks the documents, converting each word into its word index bit-or'd with the shifted count of the word's occurrences in that document, packed into a uint32_t (verifying no overflow). This removes duplicate word entries and algorithmically cuts out a log-n map access, but most importantly it saves on cache misses, which saves considerable time for every subsequent access in the next two stages.
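The packing itself might look like this; the 24/8 bit split is an assumption for illustration, since the real code only needs to verify that neither field overflows:

#include <assert.h>
#include <stdint.h>

const unsigned COUNT_BITS = 8;  // high bits hold the count, low bits the index

uint32_t pack_word(uint32_t index, uint32_t count) {
    assert(index < (1u << (32 - COUNT_BITS)));  // index fits in the low bits
    assert(count < (1u << COUNT_BITS));         // count fits in the high bits
    return index | (count << (32 - COUNT_BITS));
}

uint32_t packed_index(uint32_t packed) { return packed & ((1u << (32 - COUNT_BITS)) - 1); }
uint32_t packed_count(uint32_t packed) { return packed >> (32 - COUNT_BITS); }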

load_stats(), using the re-formed document and word lists, then builds documents-per-word, words-per-document, and standard deviation statistics for the word statistic structures (documents-per-word is needed for the tf-idf computation that follows), updating the standard deviation in multiples where it is zero.
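A sketch of the counting half of this, where pdocs holds the packed documents from load_pack() and packed_index()/packed_count() are the helpers sketched above; the standard deviation pass is omitted:

#include <stdint.h>
#include <vector>

// Helpers from the load_pack() sketch above.
uint32_t packed_index(uint32_t packed);
uint32_t packed_count(uint32_t packed);

// Accumulate documents-per-word and words-per-document totals.
// docs_per_word must already be sized to the number of distinct words.
void build_stats(const std::vector< std::vector<uint32_t> > &pdocs,
                 std::vector<int> &docs_per_word,
                 std::vector<int> &words_per_doc) {
    words_per_doc.assign(pdocs.size(), 0);
    for (size_t d = 0; d < pdocs.size(); ++d) {
        for (size_t i = 0; i < pdocs[d].size(); ++i) {
            uint32_t w = pdocs[d][i];
            ++docs_per_word[packed_index(w)];          // each packed entry is a distinct word
            words_per_doc[d] += (int)packed_count(w);  // total occurrences in the document
        }
    }
}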

load_weights() computes and stores tf-idf scores (technically unneeded except for the weight calculation, but interesting in their own right) and term weights per word per document.
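For reference, the textbook tf-idf form these weights are based on; the exact weighting formula in load.cpp may differ:

#include <cmath>

// Term frequency in the document times log-scaled inverse document frequency.
double tf_idf(int count_in_doc, int words_in_doc, int docs_with_word, int total_docs) {
    double tf  = (double)count_in_doc / (double)words_in_doc;
    double idf = std::log((double)total_docs / (double)docs_with_word);
    return tf * idf;
}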

summary
I would go over the code in detail, but it is much more efficient to read it in your preferred editor with your own syntax highlighting scheme than to see dislocated snippets on blogger.com. For the code matching these descriptions, see /load/load.cpp. Logging is in log.h, document structures are in document.h, and word structures are in word.h, alongside the stopwords list, the Porter code, and some sample outputs. The code is clean and, with the exception of some pointer arithmetic and the adapted Porter stemmer, quite readable. I compile with "g++ load.cpp -O3 -o load". Comments and insights are very much appreciated.
