iglu.ir
Class TFIDFVectorCreator

java.lang.Object
  |
  +--iglu.ir.AbstractVectorCreator
        |
        +--iglu.ir.TrainableVectorCreator
              |
              +--iglu.ir.TFIDFVectorCreator
All Implemented Interfaces:
java.io.Serializable, VectorCreator

public class TFIDFVectorCreator
extends TrainableVectorCreator
implements java.io.Serializable

Generates TFIDF vectors from a document set. The document frequencies are built automatically from the documents passed in through addDoc, or you can initialize the class with your own document frequency information. Each term is weighted with the standard formula termFrequency * log_2(N / documentFrequency), where N is the number of documents in the corpus.
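
The weighting formula above can be illustrated with a minimal stand-alone sketch. It does not use the iglu.ir classes: plain maps stand in for TermVector, and the class name TfidfFormulaSketch, its methods, and the rule of skipping terms with no document frequency are all assumptions made for illustration, not the library's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for iglu.ir.TermVector.
class TfidfFormulaSketch {

    // weight = termFrequency * log_2(numDocs / documentFrequency)
    static double weight(double termFrequency, int numDocs, double documentFrequency) {
        double idf = Math.log((double) numDocs / documentFrequency) / Math.log(2.0);
        return termFrequency * idf;
    }

    // Applies the formula to every term in one document's frequency map.
    // Terms without a document frequency are skipped (an assumption of this sketch).
    static Map<String, Double> weigh(Map<String, Integer> termFreq,
                                     Map<String, Integer> docOccurs,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            Integer df = docOccurs.get(e.getKey());
            if (df == null || df == 0) {
                continue;
            }
            weights.put(e.getKey(), weight(e.getValue(), numDocs, df));
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreq = new HashMap<>();
        termFreq.put("vector", 3);
        termFreq.put("the", 5);

        Map<String, Integer> docOccurs = new HashMap<>();
        docOccurs.put("vector", 2); // appears in 2 of 8 documents
        docOccurs.put("the", 8);    // appears in every document

        System.out.println(weigh(termFreq, docOccurs, 8));
    }
}
```

In this example "vector" receives a weight of 3 * log_2(8/2) = 6, while "the" appears in every document and so receives 5 * log_2(1) = 0.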

Author:
Travis Bauer, Ryan Scherle
See Also:
JDBCVectorCreator, Serialized Form

Field Summary
private  TermVector docOccurs
          A term vector indicating the number of documents in which a term appears.
private  int highestRank
           
private  int numDocs
          The number of documents in the corpus.
 
Fields inherited from class iglu.ir.AbstractVectorCreator
 
Constructor Summary
TFIDFVectorCreator()
          Create a new TFIDFVectorCreator with no data.
TFIDFVectorCreator(TermVector docOccurs, int numDocs)
          Create a new TFIDFVectorCreator using the supplied information.
 
Method Summary
 void addDoc(Document d)
          Add a document to the corpus.
 void addDoc(TermVector freqVec)
          Add a document to the corpus, when the term frequencies are known.
 void addDocSet(DocumentSet ds)
          Add an entire document set.
 TermVector getDocOccurs()
          Returns a term vector indicating the number of documents in which each term appears.
 int getNumDocs()
          Returns the number of documents in the corpus.
 TermVector getVector(Document d)
          Get a vector for the given document.
 TermVector getVector(TermVector freqVec)
          Get a vector for the given document when the term frequencies are known.
static void main(java.lang.String[] args)
          Runs some tests on this class.
 void setLimitTopN(int highestRank)
          Limits generated vectors to only the top N most frequently occurring terms.
 void setNumDocs(int n)
          Sets the number of documents in the corpus.
static void test()
          Runs some tests on this class.
 java.lang.String toString()
          Returns a string representation of this object.
 
Methods inherited from class iglu.ir.AbstractVectorCreator
cleanUp, setDictionary, setLinearlyScale, setMaxSize, setNormalize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

docOccurs

private TermVector docOccurs
A term vector indicating the number of documents in which a term appears.


highestRank

private int highestRank

numDocs

private int numDocs
The number of documents in the corpus.

Constructor Detail

TFIDFVectorCreator

public TFIDFVectorCreator()
Create a new TFIDFVectorCreator with no data.


TFIDFVectorCreator

public TFIDFVectorCreator(TermVector docOccurs,
                          int numDocs)
Create a new TFIDFVectorCreator using the supplied information. Lets you begin generating vectors right away.

Parameters:
docOccurs - A TermVector in which the value of each term is the number of documents in which it appears.
numDocs - The number of documents in the corpus.
Method Detail

setLimitTopN

public void setLimitTopN(int highestRank)
Limits generated vectors to only the top N most frequently occurring terms, where N is the supplied highestRank.
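
As a rough illustration of what limiting a vector to its top N terms might look like, the following sketch keeps the highestRank entries with the largest values. The class TopNSketch and its limitTopN method are hypothetical, a plain map stands in for TermVector, and ranking by the stored value is an assumption; the library may rank terms differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a plain map stands in for iglu.ir.TermVector.
class TopNSketch {

    // Returns a new map holding only the highestRank entries with the largest values,
    // in descending order of value.
    static Map<String, Double> limitTopN(Map<String, Double> values, int highestRank) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(values.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        Map<String, Double> kept = new LinkedHashMap<>();
        for (int i = 0; i < entries.size() && i < highestRank; i++) {
            kept.put(entries.get(i).getKey(), entries.get(i).getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> values = new HashMap<>();
        values.put("tfidf", 3.0);
        values.put("the", 1.0);
        values.put("corpus", 2.0);

        // Keeps "tfidf" and "corpus", dropping the lowest-valued term.
        System.out.println(limitTopN(values, 2));
    }
}
```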


addDoc

public void addDoc(Document d)
Add a document to the corpus.

Specified by:
addDoc in class TrainableVectorCreator
Parameters:
d - the Document to add to the corpus

addDoc

public void addDoc(TermVector freqVec)
Add a document to the corpus, when the term frequencies are known. Documents that have been pre-processed can use this method to create term vectors more quickly.

Parameters:
freqVec - a TermVector that indicates the frequency of each term in this document
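
To make the bookkeeping concrete, here is a minimal sketch of how adding a pre-computed frequency vector could update the document frequency information. DocFrequencySketch is a hypothetical stand-in, not the library class: it uses a plain map instead of TermVector, and it assumes that each term with a positive frequency counts once per document regardless of how often it occurs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of document frequency bookkeeping; not the iglu.ir implementation.
class DocFrequencySketch {
    private final Map<String, Integer> docOccurs = new HashMap<>();
    private int numDocs = 0;

    // Registers one document described by its term frequencies.
    void addDoc(Map<String, Integer> freqVec) {
        for (Map.Entry<String, Integer> e : freqVec.entrySet()) {
            if (e.getValue() > 0) {
                // A term counts once per document, however often it occurs.
                docOccurs.merge(e.getKey(), 1, Integer::sum);
            }
        }
        numDocs++;
    }

    Map<String, Integer> getDocOccurs() {
        return docOccurs;
    }

    int getNumDocs() {
        return numDocs;
    }

    public static void main(String[] args) {
        DocFrequencySketch corpus = new DocFrequencySketch();

        Map<String, Integer> first = new HashMap<>();
        first.put("term", 4);
        first.put("vector", 1);
        corpus.addDoc(first);

        Map<String, Integer> second = new HashMap<>();
        second.put("term", 2);
        corpus.addDoc(second);

        System.out.println(corpus.getNumDocs() + " documents, counts: " + corpus.getDocOccurs());
    }
}
```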

addDocSet

public void addDocSet(DocumentSet ds)
Add an entire document set.


toString

public java.lang.String toString()
Returns a string representation of this object.

Overrides:
toString in class java.lang.Object

getDocOccurs

public TermVector getDocOccurs()
Returns a term vector indicating the number of documents in which each term appears. (This is the document frequency vector used by the TFIDF formula.)


setNumDocs

public void setNumDocs(int n)
Sets the number of documents in the corpus.

getNumDocs

public int getNumDocs()
Returns the number of documents in the corpus.

getVector

public TermVector getVector(Document d)
Get a vector for the given document.

Specified by:
getVector in interface VectorCreator
Parameters:
d - the Document to generate a vector for
Returns:
the TFIDF TermVector for the given document

getVector

public TermVector getVector(TermVector freqVec)
Get a vector for the given document when the term frequencies are known. Documents that have been pre-processed can use this method to create term vectors more quickly.

Parameters:
freqVec - a TermVector that indicates the frequency of each term in this document

test

public static void test()
Runs some tests on this class.


main

public static void main(java.lang.String[] args)
Runs some tests on this class.