Saturday, March 22, 2008

term-document modeling

Following up on last night's conversation, I've incorporated variant weighting schemes into the term-document modeling. This required rewrites in two areas. First, the input parameters were getting unwieldy, so I wrote a fallback interface for choosing valid parameters and moved parameter validation itself as early as possible. Second, the load_weights() function mentioned in the last post has been expanded, as described below.

variant term weighting schemes
In the new design, load_weights() is a shell layering convenience over a more orthogonal function with more knobs. The desired weighting scheme can be tf-idf, sublinear tf scaling (is 20 occurrences really 20 times as significant as one?), maximum tf normalization (documents tend to repeat the same words over and over), or a custom triple of weightings for term frequency, document frequency, and normalization factor, following the design here. load_weights() realizes the first three as configurations of triples (ntn, ltn, and atn, respectively) and, after validation, passes all types of input through to the underlying function load_weights_triple().
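
To make the shape of this concrete, here is a minimal Python sketch of the dispatch. The mnemonic map, scheme names, and error handling are my own assumptions; only the names load_weights and load_weights_triple come from the actual code:

    NAMED_SCHEMES = {
        "tf-idf":    ("natural",   "idf", "none"),  # ntn
        "sublinear": ("logarithm", "idf", "none"),  # ltn
        "max-tf":    ("augmented", "idf", "none"),  # atn
    }

    def load_weights(scheme):
        # Accept a named scheme or an explicit (tf, df, normalization)
        # triple, and validate as early as possible.
        if isinstance(scheme, str):
            if scheme not in NAMED_SCHEMES:
                raise ValueError("unknown weighting scheme: %r" % scheme)
            triple = NAMED_SCHEMES[scheme]
        else:
            triple = tuple(scheme)
            if len(triple) != 3:
                raise ValueError("expected a (tf, df, normalization) triple")
        # The real code would now hand off to load_weights_triple(*triple);
        # the triple is returned here so the sketch stands alone.
        return triple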

tf/df/normalization
Term weights, then, are simply products of the three selected component weightings. I wasn't able to implement pivoted unique normalization (Section 6.4.4) due to lack of information, but the rest are there (I'll sketch each group in code as I go), namely

term frequency =
- natural = tf
- logarithm = tf > 0 ? 1 + log(tf) : 0
- augmented = a + (1-a) * tf / tf_max
- boolean = tf > 0 ? 1 : 0
- log ave = (1 + log(tf)) / (1 + log(ave(tf)))
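
A hypothetical rendering of these tf components in Python; the function names and the guard for tf = 0 in log ave are my choices:

    import math

    def tf_natural(tf):
        return float(tf)

    def tf_logarithm(tf):
        return 1.0 + math.log(tf) if tf > 0 else 0.0

    def tf_augmented(tf, tf_max, a=0.5):
        # a is the smoothing constant discussed in the last section
        return a + (1.0 - a) * tf / tf_max

    def tf_boolean(tf):
        return 1.0 if tf > 0 else 0.0

    def tf_log_ave(tf, tf_ave):
        # tf_ave = average tf over the terms of the document
        return (1.0 + math.log(tf)) / (1.0 + math.log(tf_ave)) if tf > 0 else 0.0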

document frequency =
- no = 1
- idf = log(N / df)
- prob_idf = max(0, log((N-df)/df))
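
The document-frequency components, sketched the same way, with the log restored in prob idf:

    import math

    # N is the number of documents in the collection,
    # df the number of documents containing the term.
    def df_no(df, N):
        return 1.0

    def df_idf(df, N):
        return math.log(N / df)

    def df_prob_idf(df, N):
        # The log is what makes max(0, ...) meaningful: for df > N/2 the
        # ratio (N - df) / df falls below 1 and its log goes negative.
        return max(0.0, math.log((N - df) / df))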

normalization =
- none = 1
- cosine = 1 / sqrt(w0^2 + w1^2 + w2^2 + ...)
- pivoted unique = ?
- byte size = 1 / CharLength^α, α < 1
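
And the normalization components, with a worked example of the final product; char_length and the default alpha are placeholders of mine:

    import math

    # 'weights' is a document's vector of raw term weights,
    # char_length its length in characters.
    def norm_none(weights, char_length):
        return 1.0

    def norm_cosine(weights, char_length):
        return 1.0 / math.sqrt(sum(w * w for w in weights))

    def norm_byte_size(weights, char_length, alpha=0.75):
        # the book only says alpha < 1; 0.75 is an arbitrary default
        return 1.0 / char_length ** alpha

    # The final weight is the product of the three components. For ltn
    # with tf = 3, df = 10, N = 1000 (natural logs, as above):
    #   (1 + log 3) * log(1000 / 10) * 1  ~=  2.0986 * 4.6052  ~=  9.66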

confusion and gnashing of teeth
My main problems here: pivoted unique normalization; the constants introduced for 'augmented' (0.4 or 0.5) and 'byte size' (no value recommended), including how to tune them and whether they should be user-configurable; and adequately testing that all my code does what it was meant to do. The prob idf component also confused me at first, since (N - df)/df on its own can never be negative, but the log resolves it: with N = 100 and df = 80, the ratio is 0.25 and its log is negative, which is exactly what the max(0, ...) guards against. Apologies for missing the last meeting due to the foolish wisdom-teeth-removal issue. I'm also aiming for the Monday deadline for the similarity-search improvement, and hope to have another blog post by then. Cheers.

This version of the 'load code,' sans input data, is available here.
