iglu.ir
Class TermVector

java.lang.Object
  |
  +--iglu.ir.TermVector
All Implemented Interfaces:
java.lang.Cloneable, java.io.Serializable

public class TermVector
extends java.lang.Object
implements java.io.Serializable, java.lang.Cloneable

A term vector is a mapping of all words in a language to values. Each term vector represents some concept by having various values for the terms. A term vector can represent the content of a document, a user's preferences, or any other semantic concept. This class is very similar to ValueSortedMap, but the methods and error checking are geared more to term vectors than general maps.

The new version of this class uses an ObjectPager so that a large number of term vectors can be treated as if they were all in memory at the same time. For most purposes, nothing special needs to be done. But if you plan on having a large number of TermVectors in memory at the same time (several thousand large vectors), you can call setObjectPager with a FileObjectPager.

After you do this, no special handling is required; treat the TermVectors normally. Behind the scenes, the TermVector class will swap vectors out to the ObjectPager when too many are created, and swap them back in when needed.
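The swapping behaviour described above can be pictured with a small, self-contained sketch. It is not the iglu ObjectPager API: the class name PagerSketch, its methods, the use of integer ids, and the least-recently-used eviction policy are all assumptions made for illustration, and a plain in-memory map stands in for the disk storage a FileObjectPager would presumably use.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of paging; not the iglu ObjectPager interface.
class PagerSketch {
    private final int capacity;
    // Stand-in for disk storage; a file-backed pager would serialize here.
    private final Map<Integer, double[]> swapped = new HashMap<>();
    // Access-ordered map, so the eldest entry is the least recently used.
    private final LinkedHashMap<Integer, double[]> resident;

    PagerSketch(int capacity) {
        this.capacity = capacity;
        this.resident = new LinkedHashMap<Integer, double[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, double[]> eldest) {
                if (size() > PagerSketch.this.capacity) {
                    // Too many vectors in memory: swap the eldest one out.
                    swapped.put(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    void put(int id, double[] vector) {
        swapped.remove(id);
        resident.put(id, vector);
    }

    double[] get(int id) {
        double[] vector = resident.get(id);
        if (vector == null && swapped.containsKey(id)) {
            // Swap the vector back in; this may push another one out.
            vector = swapped.remove(id);
            resident.put(id, vector);
        }
        return vector;
    }

    int residentCount() { return resident.size(); }

    int swappedCount() { return swapped.size(); }

    public static void main(String[] args) {
        PagerSketch pager = new PagerSketch(2);
        for (int id = 1; id <= 3; id++) {
            pager.put(id, new double[] {id});
        }
        System.out.println("resident=" + pager.residentCount()
                + " swapped=" + pager.swappedCount());
    }
}
```

Callers only ever use put and get; whether a vector is currently resident or swapped out is invisible to them, which is the property the class description relies on.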

Version:
2.0
Author:
Ryan Scherle, Travis Bauer
See Also:
ValueSortedMap, ObjectPager, FileObjectPager, Serialized Form

Field Summary
private static ObjectPager cache
          A pager for TermVectors.
static TermVector EMPTY
          An empty term vector, for use when you don't want to do anything to the vector.
private  java.lang.Object termsId
          A reference to this vector's ValueSortedMap
 
Constructor Summary
TermVector()
          Constructs a term vector with all terms in the world having value 0.
TermVector(java.lang.String someWords)
          Constructs a simple term vector with the string given.
TermVector(java.lang.String someWords, java.lang.String someDelimeters)
          Constructs a simple term vector with the string given.
 
Method Summary
 void clear()
          Clears all the terms from the vector
 java.lang.Object clone()
          Creates and returns a copy of this object.
 double cosineSim(TermVector tv)
          Gives the cosine similarity between this vector and the given one.
 boolean equals(java.lang.Object o)
          Tests for equality.
protected  void finalize()
          Deletes this vector from the cache.
 double get(java.lang.String term)
          Returns the value associated with term.
private  ValueSortedMap getAllTerms()
          Get this TermVector's ValueSortedMap from the pager.
 void increment(java.lang.String term)
          Adds one to the value of a term.
 void linearlyScale()
          Linearly scales the vector, to skew the data.
static void main(java.lang.String[] args)
          Runs some test cases on this class.
 void normalize()
          Normalizes the vector.
 void put(java.lang.String term, double value)
          Associates value with term in the vector.
 void putAll(TermVector additional)
          Adds the contents of another term vector to this one.
private  void readObject(java.io.ObjectInputStream in)
          Read the VSM from the input stream
 void removeStopWords(StopList stopList)
          Removes from the list all words found in the given stoplist, as well as one-character words and words that are longer than 20 characters.
 void scaleBy(double n)
          Scales all terms of the vector by the given value.
private  void setAllTerms(ValueSortedMap vsm)
          Set this item's ValueSortedMap in the pager.
static void setObjectPager(ObjectPager newObjectPager)
          Set the pager for the TermVectors.
 int size()
          Returns the number of terms with non-zero weight in the vector.
 void subtract(java.lang.String theTerm)
          Removes a single term from the list of terms.
 void subtract(TermVector subWords)
          Performs a set difference on the list of terms.
 java.util.Iterator termIterator()
          Returns an iterator for the non-zero terms contained in this vector.
static void test()
          Runs some test cases on this class.
 TermVector topN(int n)
          Returns a new (clone) TermVector containing the top n words in the TermVector, along with their values.
 java.lang.String toString()
          Returns a string representation of the vector.
 void truncateTo(int numTerms)
          Truncates this term vector to the given length.
private  void writeObject(java.io.ObjectOutputStream out)
          Write the VSM to the output stream, not only the id
 
Methods inherited from class java.lang.Object
getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

EMPTY

public static final TermVector EMPTY
An empty term vector, for use when you don't want to do anything to the vector.


termsId

private java.lang.Object termsId
A reference to this vector's ValueSortedMap


cache

private static ObjectPager cache
A pager for TermVectors. By default it is a RAMObjectPager, which means that vectors are not swapped out.

Constructor Detail

TermVector

public TermVector()
Constructs a term vector with all terms in the world having value 0.


TermVector

public TermVector(java.lang.String someWords)
Constructs a simple term vector with the string given. All terms in the string are given weight 1, all other words in the world are given weight 0.


TermVector

public TermVector(java.lang.String someWords,
                  java.lang.String someDelimeters)
Constructs a simple term vector with the string given. All terms in the string are given weight 1, all other words in the world are given weight 0.
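As a rough picture of what the string constructors build, here is a minimal sketch that uses a plain map in place of TermVector. The helper names fromString and get are hypothetical. The sketch also assumes that someDelimeters is a set of delimiter characters in the style of java.util.StringTokenizer and that a repeated word still ends up with weight 1; neither assumption is confirmed by this page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the string constructors; not the iglu source.
class TermVectorFromString {
    // Every token found in the string gets weight 1.
    static Map<String, Double> fromString(String someWords, String someDelimiters) {
        Map<String, Double> weights = new HashMap<>();
        StringTokenizer tokens = new StringTokenizer(someWords, someDelimiters);
        while (tokens.hasMoreTokens()) {
            // Assumption: a repeated word is set to 1 again, not incremented.
            weights.put(tokens.nextToken(), 1.0);
        }
        return weights;
    }

    // Terms never seen have weight 0, like "all other words in the world".
    static double get(Map<String, Double> weights, String term) {
        return weights.getOrDefault(term, 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> v = fromString("the cat, the hat", " ,");
        System.out.println(v.size() + " distinct terms; cat=" + get(v, "cat")
                + " dog=" + get(v, "dog"));
    }
}
```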

Method Detail

setObjectPager

public static void setObjectPager(ObjectPager newObjectPager)
Set the pager for the TermVectors. If you plan to have a large number of TermVectors in memory at the same time, choose something like FileObjectPager to swap them out to disk.


getAllTerms

private ValueSortedMap getAllTerms()
Get this TermVector's ValueSortedMap from the pager. Used internally.


setAllTerms

private void setAllTerms(ValueSortedMap vsm)
Set this item's ValueSortedMap in the pager. Used internally.


clear

public void clear()
Clears all the terms from the vector


size

public int size()
Returns the number of terms with non-zero weight in the vector.


put

public void put(java.lang.String term,
                double value)
Associates value with term in the vector.


putAll

public void putAll(TermVector additional)
Adds the contents of another term vector to this one. If the vectors both contain nonzero values for the same term, the values are added together.
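The merge rule can be sketched with plain maps standing in for term vectors. The class and method below are hypothetical illustrations, not the iglu implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TermVector.putAll, using plain maps.
class PutAllSketch {
    // Adds every weight in 'additional' to 'target', summing shared terms.
    static void putAll(Map<String, Double> target, Map<String, Double> additional) {
        for (Map.Entry<String, Double> e : additional.entrySet()) {
            target.merge(e.getKey(), e.getValue(), Double::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("x", 1.0);
        a.put("y", 2.0);
        Map<String, Double> b = new HashMap<>();
        b.put("y", 3.0);
        b.put("z", 4.0);
        putAll(a, b);
        System.out.println(a);
    }
}
```

Terms present in only one of the two vectors keep their single value; only shared terms are summed.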


increment

public void increment(java.lang.String term)
Adds one to the value of a term.


get

public double get(java.lang.String term)
Returns the value associated with term. Terms that haven't had explicit values set will return a value of zero.


normalize

public void normalize()
Normalizes the vector. The resulting vector will have a (Euclidean) length of 1.
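A minimal sketch of Euclidean normalization over a plain map follows. The helper is hypothetical, and leaving an all-zero vector unchanged is an assumption, since this page does not say how the real method handles that case.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TermVector.normalize, using a plain map.
class NormalizeSketch {
    // Divides every weight by the vector's Euclidean length.
    static void normalize(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        final double length = Math.sqrt(sumOfSquares);
        if (length == 0.0) {
            return; // Assumption: an empty or all-zero vector is left as-is.
        }
        v.replaceAll((term, w) -> w / length);
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();
        v.put("a", 3.0);
        v.put("b", 4.0);
        normalize(v);
        System.out.println(v);
    }
}
```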


scaleBy

public void scaleBy(double n)
Scales all terms of the vector by the given value.


linearlyScale

public void linearlyScale()
Linearly scales the vector, to skew the data. All terms whose weights are nonzero will have their weights changed to consecutive positive integers, with the lowest value being 1, the second-lowest being 2, etc.
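The rank replacement described here can be sketched as follows, with a plain map standing in for the vector. The helper is hypothetical, and the way ties are broken (arbitrarily, by map iteration order) is an assumption; this page does not specify it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TermVector.linearlyScale, using a plain map.
class LinearlyScaleSketch {
    // Replaces each nonzero weight with its 1-based rank, lowest weight first.
    static void linearlyScale(Map<String, Double> v) {
        List<Map.Entry<String, Double>> nonZero = new ArrayList<>();
        for (Map.Entry<String, Double> e : v.entrySet()) {
            if (e.getValue() != 0.0) {
                nonZero.add(e);
            }
        }
        nonZero.sort(Map.Entry.comparingByValue());
        int rank = 1;
        for (Map.Entry<String, Double> e : nonZero) {
            e.setValue((double) rank++); // Writes through to the backing map.
        }
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();
        v.put("a", 0.5);
        v.put("b", 10.0);
        v.put("c", 2.5);
        linearlyScale(v);
        System.out.println(v);
    }
}
```

After scaling, the ordering of terms is preserved but the gaps between weights are all equal, so a single very large weight no longer dominates.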


subtract

public void subtract(TermVector subWords)
Performs a set difference on the list of terms.
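One reading of this set difference is sketched below with plain maps: every term that appears in subWords is removed from this vector, regardless of its value in either vector. That reading is an assumption, and the helper is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TermVector.subtract(TermVector), using plain maps.
class SubtractSketch {
    // Removes from 'v' every term that also appears in 'subWords'.
    static void subtract(Map<String, Double> v, Map<String, Double> subWords) {
        v.keySet().removeAll(subWords.keySet());
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();
        v.put("cat", 2.0);
        v.put("hat", 1.0);
        Map<String, Double> sub = new HashMap<>();
        sub.put("hat", 5.0);
        subtract(v, sub);
        System.out.println(v);
    }
}
```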


subtract

public void subtract(java.lang.String theTerm)
Removes a single term from the list of terms.


removeStopWords

public void removeStopWords(StopList stopList)
Removes from the list all words found in the given stoplist, as well as one-character words and words that are longer than 20 characters.
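The three removal rules can be sketched with a plain map and a java.util.Set standing in for the StopList class, whose interface is not shown on this page. The helper below is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for TermVector.removeStopWords; a Set replaces StopList.
class RemoveStopWordsSketch {
    // Drops stop words, one-character words, and words longer than 20 characters.
    static void removeStopWords(Map<String, Double> v, Set<String> stopWords) {
        v.keySet().removeIf(term ->
                stopWords.contains(term) || term.length() == 1 || term.length() > 20);
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();
        v.put("the", 1.0);
        v.put("a", 1.0);
        v.put("cat", 1.0);
        v.put("abcdefghijklmnopqrstuvwxyz", 1.0);
        Set<String> stop = new HashSet<>();
        stop.add("the");
        removeStopWords(v, stop);
        System.out.println(v.keySet());
    }
}
```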


termIterator

public java.util.Iterator termIterator()
Returns an iterator for the non-zero terms contained in this vector. Only the terms are included in the iterator, but they are ordered by their value. If the value is needed, it can be recalled with the get() method. This iterator does not support the remove operation.


truncateTo

public void truncateTo(int numTerms)
Truncates this term vector to the given length. The top numTerms terms, ordered by value, are kept.


clone

public java.lang.Object clone()
Creates and returns a copy of this object.

Overrides:
clone in class java.lang.Object

toString

public java.lang.String toString()
Returns a string representation of the vector. Only the terms with non-zero values are shown.

Overrides:
toString in class java.lang.Object

topN

public TermVector topN(int n)
Returns a new (clone) TermVector containing the top n words in the TermVector, along with their values.
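A minimal sketch of selecting the top n terms into a new container follows, with plain maps standing in for term vectors. The helper is hypothetical, and it assumes "top" means highest value; ties are broken arbitrarily.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TermVector.topN, using plain maps.
class TopNSketch {
    // Returns a new map holding the n highest-valued terms; 'v' is not modified.
    static Map<String, Double> topN(Map<String, Double> v, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(v.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        int keep = Math.max(0, Math.min(n, entries.size()));
        Map<String, Double> top = new LinkedHashMap<>(); // Keeps descending order.
        for (Map.Entry<String, Double> e : entries.subList(0, keep)) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();
        v.put("a", 1.0);
        v.put("b", 5.0);
        v.put("c", 3.0);
        System.out.println(topN(v, 2));
    }
}
```

Because the result is a separate map, the original is left intact, which matches the "new (clone)" wording in the description.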


equals

public boolean equals(java.lang.Object o)
Tests for equality.

Overrides:
equals in class java.lang.Object

cosineSim

public double cosineSim(TermVector tv)
Gives the cosine similarity between this vector and the given one. If both of the vectors are empty, it returns 1. If only one of the vectors is empty, it returns 0.
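Going by the method name and the documented return values for empty vectors, the following sketch treats the result as a cosine similarity: the dot product of the two vectors divided by the product of their Euclidean lengths. The helper is hypothetical and uses plain maps that are assumed to hold only nonzero weights.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TermVector.cosineSim, using plain maps.
class CosineSimSketch {
    static double length(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Assumes both maps store only nonzero weights.
    static double cosineSim(Map<String, Double> a, Map<String, Double> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // Documented: two empty vectors give 1.
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // Documented: exactly one empty vector gives 0.
        }
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot / (length(a) * length(b));
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("x", 1.0);
        a.put("y", 1.0);
        Map<String, Double> b = new HashMap<>();
        b.put("x", 1.0);
        System.out.println(cosineSim(a, b));
    }
}
```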


test

public static void test()
Runs some test cases on this class.


finalize

protected void finalize()
                 throws java.io.IOException
Deletes this vector from the cache.

Overrides:
finalize in class java.lang.Object
Throws:
java.io.IOException

writeObject

private void writeObject(java.io.ObjectOutputStream out)
                  throws java.io.IOException
Write the VSM to the output stream, not only the id

Throws:
java.io.IOException

readObject

private void readObject(java.io.ObjectInputStream in)
                 throws java.io.IOException,
                        java.lang.ClassNotFoundException
Read the VSM from the input stream

Throws:
java.io.IOException
java.lang.ClassNotFoundException

main

public static void main(java.lang.String[] args)
Runs some test cases on this class.