iglu.ir
Class GeneralVectorCreator

java.lang.Object
  |
  +--iglu.ir.AbstractVectorCreator
        |
        +--iglu.ir.TrainableVectorCreator
              |
              +--iglu.ir.GeneralVectorCreator
All Implemented Interfaces:
java.io.Serializable, VectorCreator

public class GeneralVectorCreator
extends TrainableVectorCreator
implements java.io.Serializable

Generates vectors using various standard indexing schemes. It can be configured to compute vectors according to the following algorithms:

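Local Weights
BINARY, LOGARITHMIC, AUGMENTEDNORMALIZED, TERMFREQUENCY

Global Weights
NONE, ENTROPY, IDF, GFIDF, NORMAL, PROBABILISTICINVERSE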
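The local and global weighting constants defined by this class correspond to well-known term-weighting schemes. The sketch below uses their common textbook definitions; the exact variants used by GeneralVectorCreator (logarithm base, smoothing, handling of zero counts) are not stated on this page, so treat every formula here as an assumption. The class WeightingSketch and its method names are hypothetical and are not part of iglu.ir.

```java
import java.util.Locale;

/** Hypothetical helper, not part of iglu.ir: textbook local and global term weights. */
final class WeightingSketch {
    // Local weights depend only on tf, the term's frequency within one document.
    static double binary(int tf) { return tf > 0 ? 1.0 : 0.0; }

    static double logarithmic(int tf) { return Math.log(1.0 + tf); }

    // maxTf is the highest term frequency in the same document (assumed > 0).
    static double augmentedNormalized(int tf, int maxTf) {
        return tf > 0 ? 0.5 + 0.5 * tf / maxTf : 0.0;
    }

    static double termFrequency(int tf) { return tf; }

    // Global weights below need only corpus-level counts (df is assumed > 0):
    // numDocs = documents in the corpus, df = documents containing the term,
    // gf = total occurrences of the term in the corpus.
    static double idf(int numDocs, int df) { return Math.log((double) numDocs / df); }

    static double gfIdf(int gf, int df) { return (double) gf / df; }

    // Tends to negative infinity when the term appears in every document.
    static double probabilisticInverse(int numDocs, int df) {
        return Math.log((double) (numDocs - df) / df);
    }

    public static void main(String[] args) {
        int tf = 3, df = 10, numDocs = 1000;
        // A vector entry is typically the product of a local and a global weight.
        double weight = logarithmic(tf) * idf(numDocs, df);
        System.out.printf(Locale.ROOT, "log-tf x idf = %.4f%n", weight);
    }
}
```

Keeping the local and global parts as separate functions mirrors the two-argument constructor, which takes one local scheme and one global scheme.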

This vector creator can also be configured to index only the top N occurring terms (i.e. the N terms which occur in the largest number of documents, not the N terms which occur the most frequently overall). In this case, the global and local weights are still computed according to one of the above algorithms, but weights are generated only for the N terms with the highest document frequency in the data set.
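The difference between "occurs in the most documents" and "occurs most often overall" can be illustrated with a small sketch. It assumes plain maps from term to count in place of TermVector, whose API is not shown on this page; TopNSketch, topNByDocumentFrequency and restrictTo are hypothetical names, and the alphabetical tie-break is an arbitrary choice made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of iglu.ir: selects terms by document frequency. */
final class TopNSketch {
    /** Returns the n terms that appear in the most documents (ties broken alphabetically). */
    static Set<String> topNByDocumentFrequency(Map<String, Integer> documentFrequency, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(documentFrequency.entrySet());
        entries.sort((a, b) -> {
            int byDf = Integer.compare(b.getValue(), a.getValue()); // higher df first
            return byDf != 0 ? byDf : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, entries.size()));
        Set<String> top = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, limit)) {
            top.add(e.getKey());
        }
        return top;
    }

    /** Returns a copy of a document's frequency vector that keeps only the valid terms. */
    static Map<String, Integer> restrictTo(Map<String, Integer> freqVec, Set<String> validTerms) {
        Map<String, Integer> kept = new HashMap<>(freqVec);
        kept.keySet().retainAll(validTerms);
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("data", 3);   // appears in three documents
        df.put("spam", 1);   // appears in one document, however often
        Set<String> valid = topNByDocumentFrequency(df, 1);

        Map<String, Integer> doc = new HashMap<>();
        doc.put("spam", 50); // very frequent overall, but only in this document
        doc.put("data", 1);
        System.out.println(restrictTo(doc, valid));
    }
}
```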

Two of the global weighting schemes, Entropy and Normal, require the entire corpus at once in order to compute global weights. For this reason, if you are going to use one of these two, you must configure the vector creator by calling the corresponding method, setEntropyGlobalWeights(DocumentSet d) or setNormalizedGlobalWeights(DocumentSet d), before using it, and you must never change its state (by adding more documents) after that.
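To see why these two schemes need the whole corpus, consider their usual textbook definitions: both sum over a term's frequency in every individual document, so running totals such as document frequency are not enough. The sketch below is an illustration under that assumption, not the library's implementation; CorpusWeightsSketch and its methods are hypothetical, and each document is represented as a plain map from term to frequency rather than as a DocumentSet.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of iglu.ir: global weights that need per-document counts. */
final class CorpusWeightsSketch {
    /** Entropy weight: 1 + sum_j(p_ij * ln p_ij) / ln N, with p_ij = tf_ij / gf_i. */
    static Map<String, Double> entropyWeights(List<Map<String, Integer>> docs) {
        // First pass: total frequency of each term over the corpus (gf).
        Map<String, Integer> gf = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                if (e.getValue() > 0) gf.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Second pass: accumulate p * ln(p) for every (term, document) pair.
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                int tf = e.getValue();
                if (tf <= 0) continue;
                double p = (double) tf / gf.get(e.getKey());
                sums.merge(e.getKey(), p * Math.log(p), Double::sum);
            }
        }
        double logN = Math.log(docs.size());
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            // With a single document ln N is 0, so fall back to a weight of 1.
            weights.put(e.getKey(), docs.size() > 1 ? 1.0 + e.getValue() / logN : 1.0);
        }
        return weights;
    }

    /** Normal weight: 1 / sqrt(sum_j tf_ij^2). */
    static Map<String, Double> normalWeights(List<Map<String, Integer>> docs) {
        Map<String, Double> squares = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                int tf = e.getValue();
                if (tf > 0) squares.merge(e.getKey(), (double) tf * tf, Double::sum);
            }
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Double> e : squares.entrySet()) {
            weights.put(e.getKey(), 1.0 / Math.sqrt(e.getValue()));
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("a", 1);
        d1.put("b", 2);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("a", 1);
        System.out.println(entropyWeights(List.of(d1, d2)));
        System.out.println(normalWeights(List.of(d1, d2)));
    }
}
```

Under these definitions, adding one more document changes N and every p_ij, which is consistent with the requirement above that the corpus stay fixed once the weights are set.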

Author:
Travis Bauer
See Also:
JDBCVectorCreator, Serialized Form

Field Summary
private  int _NEWENTROPY
           
private  int _NEWNORMALIZED
           
private  int _OKAY
           
private  int _RECOMPUTE
           
static int AUGMENTEDNORMALIZED
           
static int BINARY
           
private  int dataStatus
           
private  TermVector documentFrequency
          A term vector indicating the number of documents in which a term appears.
static int ENTROPY
           
static int GFIDF
           
private  TermVector globalFrequency
           
static java.lang.String[] globalNames
           
private  TermVector globalWeight
           
private  int globalWeightScheme
           
private  int highestDocFreq
           
private  double highestRank
           
private  int highestTermFreq
           
static int IDF
           
static java.lang.String[] localNames
           
private  int localWeightScheme
           
static int LOGARITHMIC
           
static int NONE
           
static int NORMAL
           
private  int numDocs
          The number of documents in the corpus
static int PROBABILISTICINVERSE
           
static int TERMFREQUENCY
           
private  TermVector validIndexingTerms
           
 
Fields inherited from class iglu.ir.AbstractVectorCreator
 
Constructor Summary
GeneralVectorCreator()
          Create a new GeneralVectorCreator with no data.
GeneralVectorCreator(int localWeight, int globalWeight)
           
GeneralVectorCreator(TermVector globalFrequency, TermVector documentFrequency, int numDocs)
          Create a new GeneralVectorCreator using the supplied information.
 
Method Summary
 void addDoc(Document d)
          Add a document to the corpus.
 void addDoc(TermVector freqVec)
          Add a document to the corpus, when the term frequencies are known.
 void addDocSet(DocumentSet ds)
          Add an entire document set.
 int getNumDocs()
           
 TermVector getVector(Document d)
          Get a vector for the given document.
 TermVector getVector(TermVector freqVec, int docLength)
          Get a vector for the given document when the term frequencies are known.
static void main(java.lang.String[] args)
          Runs some tests on this class.
private  void reComputeInternalData()
          Recomputes the internal data that can change when documents are added.
 void setEntropyGlobalWeights(DocumentSet d)
          Sets global weights using the Entropy algorithm.
 void setLimitTopN(int highestRank)
          Limits generated vectors to the top N most frequently occurring terms, where "most frequent" means highest document frequency.
 void setNormalizedGlobalWeights(DocumentSet d)
          Sets global weights using the Normal algorithm.
 void setNumDocs(int n)
           
static void test()
          Runs some tests on this class.
 java.lang.String toString()
          Returns a string representation of this object.
 
Methods inherited from class iglu.ir.AbstractVectorCreator
cleanUp, setDictionary, setLinearlyScale, setMaxSize, setNormalize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

documentFrequency

private TermVector documentFrequency
A term vector indicating the number of documents in which a term appears.


globalFrequency

private TermVector globalFrequency

globalWeight

private TermVector globalWeight

validIndexingTerms

private TermVector validIndexingTerms

highestRank

private double highestRank

_RECOMPUTE

private int _RECOMPUTE

_OKAY

private int _OKAY

_NEWENTROPY

private int _NEWENTROPY

_NEWNORMALIZED

private int _NEWNORMALIZED

dataStatus

private int dataStatus

numDocs

private int numDocs
The number of documents in the corpus


highestDocFreq

private int highestDocFreq

highestTermFreq

private int highestTermFreq

BINARY

public static final int BINARY
See Also:
Constant Field Values

LOGARITHMIC

public static final int LOGARITHMIC
See Also:
Constant Field Values

AUGMENTEDNORMALIZED

public static final int AUGMENTEDNORMALIZED
See Also:
Constant Field Values

TERMFREQUENCY

public static final int TERMFREQUENCY
See Also:
Constant Field Values

localNames

public static final java.lang.String[] localNames

NONE

public static final int NONE
See Also:
Constant Field Values

ENTROPY

public static final int ENTROPY
See Also:
Constant Field Values

IDF

public static final int IDF
See Also:
Constant Field Values

GFIDF

public static final int GFIDF
See Also:
Constant Field Values

NORMAL

public static final int NORMAL
See Also:
Constant Field Values

PROBABILISTICINVERSE

public static final int PROBABILISTICINVERSE
See Also:
Constant Field Values

globalNames

public static final java.lang.String[] globalNames

localWeightScheme

private int localWeightScheme

globalWeightScheme

private int globalWeightScheme

Constructor Detail

GeneralVectorCreator

public GeneralVectorCreator()
Create a new GeneralVectorCreator with no data.


GeneralVectorCreator

public GeneralVectorCreator(int localWeight,
                            int globalWeight)

GeneralVectorCreator

public GeneralVectorCreator(TermVector globalFrequency,
                            TermVector documentFrequency,
                            int numDocs)
Create a new GeneralVectorCreator using the supplied information. Lets you begin generating vectors right away.

Parameters:
globalFrequency - Global Frequency Counts for the corpus
documentFrequency - Document Frequency counts for the corpus
numDocs - The number of documents in the corpus.

Method Detail

setLimitTopN

public void setLimitTopN(int highestRank)
Limits generated vectors to the top N most frequently occurring terms, where "most frequent" means highest document frequency.


addDoc

public void addDoc(Document d)
Add a document to the corpus.

Specified by:
addDoc in class TrainableVectorCreator
Parameters:
d - a Document value

addDoc

public void addDoc(TermVector freqVec)
Add a document to the corpus, when the term frequencies are known. Documents that have been pre-processed can use this method to create term vectors more quickly.
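
The corpus bookkeeping that adding a pre-counted document implies can be sketched as follows. The field names mirror the private fields documentFrequency, globalFrequency and numDocs listed for this class, but the update logic is an assumption, since the method body is not shown on this page. CorpusStatsSketch is a hypothetical class, and a plain map from term to count stands in for TermVector.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of iglu.ir: running corpus statistics. */
final class CorpusStatsSketch {
    // Number of documents in which each term appears.
    final Map<String, Integer> documentFrequency = new HashMap<>();
    // Total number of occurrences of each term across the corpus.
    final Map<String, Integer> globalFrequency = new HashMap<>();
    // Number of documents added so far.
    int numDocs = 0;

    /** Adds one document given its term-frequency vector. */
    void addDoc(Map<String, Integer> freqVec) {
        for (Map.Entry<String, Integer> e : freqVec.entrySet()) {
            int tf = e.getValue();
            if (tf <= 0) continue;                        // ignore absent terms
            documentFrequency.merge(e.getKey(), 1, Integer::sum);
            globalFrequency.merge(e.getKey(), tf, Integer::sum);
        }
        numDocs++;
    }

    public static void main(String[] args) {
        CorpusStatsSketch stats = new CorpusStatsSketch();
        Map<String, Integer> first = new HashMap<>();
        first.put("a", 2);
        first.put("b", 1);
        Map<String, Integer> second = new HashMap<>();
        second.put("a", 3);
        stats.addDoc(first);
        stats.addDoc(second);
        System.out.println(stats.numDocs + " docs, df=" + stats.documentFrequency
                + ", gf=" + stats.globalFrequency);
    }
}
```

Statistics of this kind are sufficient for weights such as IDF or GFIDF, which is why they can be updated one document at a time.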

Parameters:
freqVec - a TermVector that indicates the frequency of each term in this document

addDocSet

public void addDocSet(DocumentSet ds)
Add an entire document set.


toString

public java.lang.String toString()
Returns a string representation of this object.

Overrides:
toString in class java.lang.Object

reComputeInternalData

private void reComputeInternalData()
Recomputes the internal data that can change when documents are added.


setNumDocs

public void setNumDocs(int n)

getNumDocs

public int getNumDocs()

getVector

public TermVector getVector(Document d)
Get a vector for the given document.

Specified by:
getVector in interface VectorCreator
Parameters:
d - a Document value
Returns:
a TermVector value

getVector

public TermVector getVector(TermVector freqVec,
                            int docLength)
Get a vector for the given document when the term frequencies are known. Documents that have been pre-processed can use this method to create term vectors more quickly.

Parameters:
freqVec - a TermVector that indicates the frequency of each term in this document

setNormalizedGlobalWeights

public void setNormalizedGlobalWeights(DocumentSet d)
Sets global weights using the Normal algorithm.


setEntropyGlobalWeights

public void setEntropyGlobalWeights(DocumentSet d)
Sets global weights using the Entropy algorithm.


test

public static void test()
Runs some tests on this class.


main

public static void main(java.lang.String[] args)
Runs some tests on this class.