java.lang.Object
  |
  +--iglu.ir.AbstractVectorCreator
        |
        +--iglu.ir.JDBCVectorCreator
Generates TFIDF vectors from a document set. All intermediate information is kept in a database. This class is intended for data sets large enough that the document frequency vector will not fit in memory, so getDocOccurs() always returns null.
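As a rough illustration of the kind of weighting involved, the sketch below assumes the common tf * ln(N / df) form with optional unit-length normalization. The source does not state IGLU's exact formula, so treat this as one plausible variant. TfIdfSketch, weigh, and normalize are hypothetical names, not part of IGLU. Document frequencies are looked up one term at a time, in the spirit of getDocOccurrances(String) and getNumDocs(), so no full document frequency vector has to be held in memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch, not IGLU code: assumes weight = tf * ln(numDocs / df).
class TfIdfSketch {

    // Weighs each term frequency by its inverse document frequency.
    // docFreq is queried once per term, standing in for a database lookup.
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     ToIntFunction<String> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreq.applyAsInt(e.getKey());
            if (df <= 0) {
                continue; // term never seen in the corpus: no weight assigned
            }
            double idf = Math.log((double) numDocs / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Scales the vector to unit Euclidean length; an all-zero vector is left alone.
    static void normalize(Map<String, Double> vec) {
        double sumSquares = 0.0;
        for (double w : vec.values()) {
            sumSquares += w * w;
        }
        if (sumSquares == 0.0) {
            return;
        }
        final double norm = Math.sqrt(sumSquares);
        vec.replaceAll((term, w) -> w / norm);
    }

    public static void main(String[] args) {
        // Made-up corpus statistics: 4 documents, "cat" in 1, "the" in all 4.
        Map<String, Integer> corpusDf = new HashMap<>();
        corpusDf.put("cat", 1);
        corpusDf.put("the", 4);

        // Term frequencies of one document.
        Map<String, Integer> doc = new HashMap<>();
        doc.put("cat", 2);
        doc.put("the", 5);

        Map<String, Double> vec = weigh(doc, t -> corpusDf.getOrDefault(t, 0), 4);
        normalize(vec);
        System.out.println(vec);
    }
}
```

Under this assumed formula a term that occurs in every document gets weight zero, which is why very common words contribute nothing to the resulting vector.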
Term counts are accumulated in RAM and periodically merged into the database, where they are stored in a JDBCMisc table.
This class is relatively slow, but it is the best option (within IGLU) for very large document sets.
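The buffer-then-merge pattern described above can be sketched as follows. This is a minimal illustration, not the IGLU implementation: BufferedDocFrequencyCounter is a hypothetical class, a HashMap stands in for the database table that the real class reaches through JDBC, and the rule of merging once the buffer exceeds the configured maximum is an assumption. The field names localTerms, localNumDocs, and maxLocalTerms are borrowed from this class only to make the correspondence easier to follow.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not IGLU code: buffers document frequencies in RAM
// and merges them into a backing store when the buffer grows too large.
class BufferedDocFrequencyCounter {
    // Stand-in for the database table (the real class uses JDBC storage).
    private final Map<String, Integer> store = new HashMap<>();
    private int storedNumDocs = 0;

    // In-memory buffer of counts not yet merged.
    private final Map<String, Integer> localTerms = new HashMap<>();
    private int localNumDocs = 0;
    private int maxLocalTerms;
    private int mergeCount = 0;

    BufferedDocFrequencyCounter(int maxLocalTerms) {
        this.maxLocalTerms = maxLocalTerms;
    }

    void setMaxTerms(int numTerms) {
        this.maxLocalTerms = numTerms;
    }

    // Records one document, given the set of distinct terms it contains.
    void addDoc(Set<String> distinctTerms) {
        for (String term : distinctTerms) {
            localTerms.merge(term, 1, Integer::sum);
        }
        localNumDocs++;
        // Assumed trigger: merge once the buffer holds more than the maximum.
        if (localTerms.size() > maxLocalTerms) {
            mergeIntoStore();
        }
    }

    // Forces any buffered information into the backing store.
    void flush() {
        if (localNumDocs > 0 || !localTerms.isEmpty()) {
            mergeIntoStore();
        }
    }

    private void mergeIntoStore() {
        for (Map.Entry<String, Integer> e : localTerms.entrySet()) {
            store.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        storedNumDocs += localNumDocs;
        localTerms.clear();
        localNumDocs = 0;
        mergeCount++;
    }

    int getNumDocs() {
        return storedNumDocs + localNumDocs;
    }

    // Counts both merged and still-buffered occurrences.
    int getDocOccurrences(String term) {
        return store.getOrDefault(term, 0) + localTerms.getOrDefault(term, 0);
    }

    int getMergeCount() {
        return mergeCount;
    }

    public static void main(String[] args) {
        BufferedDocFrequencyCounter counter = new BufferedDocFrequencyCounter(2);
        counter.addDoc(new HashSet<>(Arrays.asList("a", "b")));
        counter.addDoc(new HashSet<>(Arrays.asList("a", "c")));
        counter.flush();
        System.out.println(counter.getNumDocs() + " docs, 'a' occurs in "
                + counter.getDocOccurrences("a"));
    }
}
```

A larger buffer means fewer round trips to the store at the cost of more memory, which is the trade-off a setting like setMaxTerms(int) exposes.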
See Also: TFIDFVectorCreator
Field Summary
private boolean dataChanged
private java.sql.Connection db
private int localNumDocs
private TermVector localTerms
private int maxLocalTerms
private JDBCMisc misc
private java.lang.String tablePrefix
private JDBCMisc terms
Fields inherited from class iglu.ir.AbstractVectorCreator
Constructor Summary
JDBCVectorCreator(java.sql.Connection c, java.lang.String tablePrefix, boolean create)
    Create a new JDBCVectorCreator using the supplied information.
Method Summary
void addDoc(Document d)
    Add a document to the corpus.
void addDoc(TermVector freqVec)
    Add a document to the corpus, when the term frequencies are known.
void flush()
    Force all information to be written to the database.
int getDocOccurrances(java.lang.String term)
    Returns the number of documents in which a term occurs.
java.lang.String getName()
int getNumDocs()
    Returns the number of documents contained in this VectorCreator.
TermVector getVector(Document d)
    Get a vector for the given document.
TermVector getVector(TermVector freqVec)
    Get a vector for the given document when the term frequencies are known.
private void initializeDB()
static void main(java.lang.String[] args)
    Runs some tests on this class.
private void mergeIntoDatabase()
    Adds information from the local term vector and local document count into the main database.
void setMaxTerms(int numTerms)
    Sets the maximum number of terms to hold in RAM.
java.lang.String toString()
    Returns a string representation of this object.
Methods inherited from class iglu.ir.AbstractVectorCreator
cleanUp, setDictionary, setLinearlyScale, setMaxSize, setNormalize
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Field Detail
private int maxLocalTerms
private java.sql.Connection db
private JDBCMisc misc
private JDBCMisc terms
private java.lang.String tablePrefix
private boolean dataChanged
private int localNumDocs
private TermVector localTerms
Constructor Detail
public JDBCVectorCreator(java.sql.Connection c, java.lang.String tablePrefix, boolean create)
    Parameters:
        c - A connection to a database.
        tablePrefix - The prefix of the name of the tables used by this object. Must be a valid SQL table name.
        create - Whether the tables need to be created.
Method Detail
public void setMaxTerms(int numTerms)
private void initializeDB()
public java.lang.String getName()
public int getNumDocs()
public void addDoc(Document d)
public void addDoc(TermVector freqVec)
    Parameters:
        freqVec - a TermVector that indicates the frequency of each term in this document
private void mergeIntoDatabase()
public java.lang.String toString()
    Overrides:
        toString in class java.lang.Object
public TermVector getVector(Document d)
public TermVector getVector(TermVector freqVec)
    Parameters:
        freqVec - a TermVector that indicates the frequency of each term in this document
public int getDocOccurrances(java.lang.String term)
public void flush()
public static void main(java.lang.String[] args)