java.lang.Object
  |
  +--iglu.ir.AbstractVectorCreator
        |
        +--iglu.ir.TrainableVectorCreator
              |
              +--iglu.ir.GeneralVectorCreator
Generates vectors using various standard indexing schemes. It can be configured to compute vectors according to the following algorithms:
This vector creator can also be configured to index only the top N occurring terms (i.e. the N terms that occur in the largest number of documents, not the N terms that occur most frequently overall). In this case, the global and local weights are still computed according to one of the above algorithms, but weights are generated only for the top N terms by document frequency in the data set.
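The difference between the two rankings can be sketched as follows. This is a minimal, standalone illustration and not this class's implementation: the class name TopNSketch, its methods, and the use of plain java.util collections in place of TermVector are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from iglu.ir.
public class TopNSketch {

    /** Counts, for each term, the number of documents that contain it at least once. */
    public static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // The set collapses repeats, so a document counts once per term.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Counts every occurrence of each term across the whole corpus. */
    public static Map<String, Integer> globalFrequency(List<List<String>> docs) {
        Map<String, Integer> gf = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                gf.merge(term, 1, Integer::sum);
            }
        }
        return gf;
    }

    /** Returns the n terms with the highest counts; ties are broken alphabetically. */
    public static List<String> topN(Map<String, Integer> counts, int n) {
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("spam", "spam", "spam", "spam", "eggs"),
            Arrays.asList("eggs", "ham"),
            Arrays.asList("eggs", "ham"));
        System.out.println("by document frequency: " + topN(documentFrequency(docs), 2));
        System.out.println("by overall frequency:  " + topN(globalFrequency(docs), 2));
    }
}
```

In this toy corpus "spam" occurs four times but in only one document, so it ranks first by overall frequency yet falls outside the top two by document frequency, which is the ranking the description above refers to.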
Two of the global weighting schemes, Entropy and Normal, require the
entire corpus at once in order to compute global weights. For this reason,
if you are going to use one of these two, you must configure it by calling
the corresponding method, setEntropyGlobalWeights(DocumentSet d) or
setNormalizedGlobalWeights(DocumentSet d),
before using the vector creator, and you must never change the state of the
vector creator (by adding more documents) after that.
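Why such a scheme needs the whole corpus can be seen in a small sketch. This page does not give the formula the class uses, so the example assumes the commonly cited log-entropy weight, g(t) = 1 + (sum over documents j of p(t,j) * ln p(t,j)) / ln(n), where p(t,j) is the term's frequency in document j divided by its global frequency and n is the number of documents. EntropyWeightSketch and its methods are hypothetical names, and plain maps stand in for TermVector and DocumentSet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an entropy-style global weight; not the iglu.ir code.
public class EntropyWeightSketch {

    /** Builds a term-to-frequency map for one document from its tokens. */
    public static Map<String, Integer> count(String... terms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : terms) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    /** Computes one weight per term; every document in the corpus contributes. */
    public static Map<String, Double> entropyWeights(List<Map<String, Integer>> corpus) {
        int n = corpus.size();

        // Pass 1: global frequency of each term over all documents.
        Map<String, Integer> gf = new HashMap<>();
        for (Map<String, Integer> doc : corpus) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                if (e.getValue() > 0) {
                    gf.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        }

        // Pass 2: accumulate p * ln(p), which needs the totals from pass 1.
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Integer> doc : corpus) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                if (e.getValue() > 0) {
                    double p = e.getValue() / (double) gf.get(e.getKey());
                    sums.merge(e.getKey(), p * Math.log(p), Double::sum);
                }
            }
        }

        Map<String, Double> weights = new HashMap<>();
        for (String term : gf.keySet()) {
            // With fewer than two documents ln(n) is zero, so fall back to 1.0.
            double w = n > 1 ? 1.0 + sums.get(term) / Math.log(n) : 1.0;
            weights.put(term, w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> corpus = new ArrayList<>();
        corpus.add(count("a", "b", "b", "b"));
        corpus.add(count("a"));
        System.out.println("two documents:   " + entropyWeights(corpus));

        corpus.add(count("b"));
        System.out.println("three documents: " + entropyWeights(corpus));
    }
}
```

Adding the third document changes the weight of every term, including terms it does not contain, because n changes. Under this assumed formula, weights computed before a document is added are stale afterwards, which is consistent with the restriction described above.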
See Also: JDBCVectorCreator, Serialized Form
Field Summary | |
private int | _NEWENTROPY |
private int | _NEWNORMALIZED |
private int | _OKAY |
private int | _RECOMPUTE |
static int | AUGMENTEDNORMALIZED |
static int | BINARY |
private int | dataStatus |
private TermVector | documentFrequency | A term vector indicating the number of documents in which a term appears. |
static int | ENTROPY |
static int | GFIDF |
private TermVector | globalFrequency |
static java.lang.String[] | globalNames |
private TermVector | globalWeight |
private int | globalWeightScheme |
private int | highestDocFreq |
private double | highestRank |
private int | highestTermFreq |
static int | IDF |
static java.lang.String[] | localNames |
private int | localWeightScheme |
static int | LOGARITHMIC |
static int | NONE |
static int | NORMAL |
private int | numDocs | The number of documents in the corpus. |
static int | PROBABILISTICINVERSE |
static int | TERMFREQUENCY |
private TermVector | validIndexingTerms |
Fields inherited from class iglu.ir.AbstractVectorCreator |
|
Constructor Summary | |
GeneralVectorCreator() | Create a new GeneralVectorCreator with no data. |
GeneralVectorCreator(int localWeight, int globalWeight) | |
GeneralVectorCreator(TermVector globalFrequency, TermVector documentFrequency, int numDocs) | Create a new GeneralVectorCreator using the supplied information. |
Method Summary | |
void | addDoc(Document d) | Add a document to the corpus. |
void | addDoc(TermVector freqVec) | Add a document to the corpus, when the term frequencies are known. |
void | addDocSet(DocumentSet ds) | Add an entire document set. |
int | getNumDocs() | |
TermVector | getVector(Document d) | Get a vector for the given document. |
TermVector | getVector(TermVector freqVec, int docLength) | Get a vector for the given document when the term frequencies are known. |
static void | main(java.lang.String[] args) | Runs some tests on this class. |
private void | reComputeInternalData() | Fixes the data which can change when documents are added. |
void | setEntropyGlobalWeights(DocumentSet d) | Sets global weights using the Entropy algorithm. |
void | setLimitTopN(int highestRank) | Restricts returned vectors to the top N most frequently occurring terms, where "most frequent" means highest document frequency. |
void | setNormalizedGlobalWeights(DocumentSet d) | Sets global weights using the Normal algorithm. |
void | setNumDocs(int n) | |
static void | test() | Runs some tests on this class. |
java.lang.String | toString() | Returns a string representation of this object. |
Methods inherited from class iglu.ir.AbstractVectorCreator |
cleanUp, setDictionary, setLinearlyScale, setMaxSize, setNormalize |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
private TermVector documentFrequency
private TermVector globalFrequency
private TermVector globalWeight
private TermVector validIndexingTerms
private double highestRank
private int _RECOMPUTE
private int _OKAY
private int _NEWENTROPY
private int _NEWNORMALIZED
private int dataStatus
private int numDocs
private int highestDocFreq
private int highestTermFreq
public static final int BINARY
public static final int LOGARITHMIC
public static final int AUGMENTEDNORMALIZED
public static final int TERMFREQUENCY
public static final java.lang.String[] localNames
public static final int NONE
public static final int ENTROPY
public static final int IDF
public static final int GFIDF
public static final int NORMAL
public static final int PROBABILISTICINVERSE
public static final java.lang.String[] globalNames
private int localWeightScheme
private int globalWeightScheme
Constructor Detail |
public GeneralVectorCreator()
public GeneralVectorCreator(int localWeight, int globalWeight)
public GeneralVectorCreator(TermVector globalFrequency, TermVector documentFrequency, int numDocs)
Parameters:
globalFrequency - Global frequency counts for the corpus
documentFrequency - Document frequency counts for the corpus
numDocs - The number of documents in the corpus
Method Detail |
public void setLimitTopN(int highestRank)
public void addDoc(Document d)
Overrides: addDoc in class TrainableVectorCreator
Parameters: d - a Document value
public void addDoc(TermVector freqVec)
Parameters: freqVec - a TermVector that indicates the frequency of each term in this document
public void addDocSet(DocumentSet ds)
public java.lang.String toString()
Overrides: toString in class java.lang.Object
private void reComputeInternalData()
public void setNumDocs(int n)
public int getNumDocs()
public TermVector getVector(Document d)
Specified by: getVector in interface VectorCreator
Parameters: d - a Document value
Returns: a TermVector value
public TermVector getVector(TermVector freqVec, int docLength)
Parameters: freqVec - a TermVector that indicates the frequency of each term in this document
public void setNormalizedGlobalWeights(DocumentSet d)
public void setEntropyGlobalWeights(DocumentSet d)
public static void test()
public static void main(java.lang.String[] args)
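As an illustration of what getVector(TermVector freqVec, int docLength) might do with a frequency vector, the sketch below applies four local weightings whose names mirror the constants BINARY, LOGARITHMIC, AUGMENTEDNORMALIZED and TERMFREQUENCY. The formulas are the textbook definitions and are an assumption here, since this page does not state the ones the class uses. LocalWeightSketch, its Scheme enum and the map-based vector are hypothetical, and the docLength parameter is not modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of local term weighting; not the iglu.ir code.
public class LocalWeightSketch {

    /** Stand-ins for the int constants of the same names (assumed semantics). */
    public enum Scheme { BINARY, LOGARITHMIC, AUGMENTEDNORMALIZED, TERMFREQUENCY }

    /** Maps each term with a positive frequency to its local weight. */
    public static Map<String, Double> weigh(Map<String, Integer> freqVec, Scheme scheme) {
        // The augmented scheme scales by the largest frequency in the document.
        int maxTf = 0;
        for (int tf : freqVec.values()) {
            maxTf = Math.max(maxTf, tf);
        }

        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : freqVec.entrySet()) {
            int tf = e.getValue();
            if (tf <= 0) {
                continue; // absent terms get no entry
            }
            double w;
            switch (scheme) {
                case BINARY:
                    w = 1.0;                      // presence only
                    break;
                case LOGARITHMIC:
                    w = Math.log(1.0 + tf);       // dampens large counts
                    break;
                case AUGMENTEDNORMALIZED:
                    w = 0.5 + 0.5 * tf / maxTf;   // falls in (0.5, 1.0]
                    break;
                default:
                    w = tf;                       // raw term frequency
            }
            weights.put(e.getKey(), w);
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqVec = new HashMap<>();
        freqVec.put("a", 4);
        freqVec.put("b", 1);
        for (Scheme s : Scheme.values()) {
            System.out.println(s + ": " + weigh(freqVec, s));
        }
    }
}
```

In a full weighting pipeline each local weight would typically be multiplied by the term's global weight; that step is left out of the sketch.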