iglu.ir
Class JDBCVectorCreator

java.lang.Object
  |
  +--iglu.ir.AbstractVectorCreator
        |
        +--iglu.ir.JDBCVectorCreator
All Implemented Interfaces:
VectorCreator

public class JDBCVectorCreator
extends AbstractVectorCreator

Generates TF-IDF vectors from a document set. All intermediate information is kept in a database. This class is intended for document sets so large that the document frequency vector does not fit in memory; for that reason, getDocOccurs() always returns null.

Term counts are accumulated in RAM and periodically merged into the database, where they are stored in a JDBCMisc table.

This class is relatively slow, but it is the best option (within IGLU) for very large document sets.
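
The accumulate-in-RAM, merge-periodically scheme described above can be illustrated with a minimal sketch. This is not the IGLU implementation: the class BufferedDocFrequencies and all of its members are hypothetical, and a plain in-memory Map stands in for the JDBCMisc table so that the example has no database dependency. It assumes a merge is triggered once the number of buffered terms reaches a limit, in the spirit of setMaxTerms.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of IGLU: buffers document frequencies in RAM
// and merges them into a backing store once the buffer grows too large.
class BufferedDocFrequencies {
    // Stands in for the database table; the real class writes to a JDBCMisc table.
    private final Map<String, Integer> store = new HashMap<>();
    // Counts gathered since the last merge.
    private final Map<String, Integer> localTerms = new HashMap<>();
    private int storedNumDocs = 0;
    private int localNumDocs = 0;
    private final int maxLocalTerms;

    BufferedDocFrequencies(int maxLocalTerms) {
        this.maxLocalTerms = maxLocalTerms;
    }

    // Record one document: each distinct term raises its document count by one.
    void addDoc(Set<String> distinctTerms) {
        for (String term : distinctTerms) {
            localTerms.merge(term, 1, Integer::sum);
        }
        localNumDocs++;
        // Assumed trigger: merge as soon as the buffer reaches the limit.
        if (localTerms.size() >= maxLocalTerms) {
            flush();
        }
    }

    // Push all buffered counts into the store and empty the buffer.
    void flush() {
        for (Map.Entry<String, Integer> e : localTerms.entrySet()) {
            store.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        storedNumDocs += localNumDocs;
        localTerms.clear();
        localNumDocs = 0;
    }

    int getNumDocs() {
        return storedNumDocs + localNumDocs;
    }

    // Combines merged and still-buffered counts for a term.
    int getDocOccurrences(String term) {
        return store.getOrDefault(term, 0) + localTerms.getOrDefault(term, 0);
    }

    int bufferedTermCount() {
        return localTerms.size();
    }
}
```

A larger limit means fewer merges at the cost of more memory, which is the trade-off that a setting such as setMaxTerms exposes.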

Author:
Ryan Scherle
See Also:
TFIDFVectorCreator

Field Summary
private  boolean dataChanged
           
private  java.sql.Connection db
           
private  int localNumDocs
           
private  TermVector localTerms
           
private  int maxLocalTerms
           
private  JDBCMisc misc
           
private  java.lang.String tablePrefix
           
private  JDBCMisc terms
           
 
Fields inherited from class iglu.ir.AbstractVectorCreator
 
Constructor Summary
JDBCVectorCreator(java.sql.Connection c, java.lang.String tablePrefix, boolean create)
          Create a new JDBCVectorCreator using the supplied information.
 
Method Summary
 void addDoc(Document d)
          Add a document to the corpus.
 void addDoc(TermVector freqVec)
          Add a document to the corpus, when the term frequencies are known.
 void flush()
          Force all information to be written to the database.
 int getDocOccurrances(java.lang.String term)
          Returns the number of documents in which a term occurs.
 java.lang.String getName()
           
 int getNumDocs()
          Returns the number of documents contained in this VectorCreator.
 TermVector getVector(Document d)
          Get a vector for the given document.
 TermVector getVector(TermVector freqVec)
          Get a vector for the given document when the term frequencies are known.
private  void initializeDB()
           
static void main(java.lang.String[] args)
          Runs some tests on this class.
private  void mergeIntoDatabase()
          Adds information from the local term vector and local document count into the main database.
 void setMaxTerms(int numTerms)
          Sets the maximum number of terms to hold in RAM.
 java.lang.String toString()
          Returns a string representation of this object.
 
Methods inherited from class iglu.ir.AbstractVectorCreator
cleanUp, setDictionary, setLinearlyScale, setMaxSize, setNormalize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

maxLocalTerms

private int maxLocalTerms

db

private java.sql.Connection db

misc

private JDBCMisc misc

terms

private JDBCMisc terms

tablePrefix

private java.lang.String tablePrefix

dataChanged

private boolean dataChanged

localNumDocs

private int localNumDocs

localTerms

private TermVector localTerms
Constructor Detail

JDBCVectorCreator

public JDBCVectorCreator(java.sql.Connection c,
                         java.lang.String tablePrefix,
                         boolean create)
Create a new JDBCVectorCreator using the supplied information. If the database has been initialized previously, vectors can be generated right away.

Parameters:
c - A connection to a database.
tablePrefix - The prefix of the name of the tables used by this object. Must be a valid SQL table name.
create - Whether the tables need to be created.
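
Because tablePrefix must be a valid SQL table name, a caller may want to check it before constructing the object. The following is a minimal sketch under a conservative assumption: the prefix starts with a letter and contains only ASCII letters, digits, and underscores. The class TablePrefixCheck and the method isSafePrefix are hypothetical and not part of IGLU; the names a given database actually accepts (length limits, reserved words, quoting) may differ.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of IGLU.
class TablePrefixCheck {
    // Conservative assumption: a letter, then letters, digits, or underscores.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    // Returns true only if the whole prefix matches the conservative pattern.
    static boolean isSafePrefix(String prefix) {
        return prefix != null && SAFE.matcher(prefix).matches();
    }
}
```

A prefix that passes this check can then be handed to the constructor together with an open java.sql.Connection.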
Method Detail

setMaxTerms

public void setMaxTerms(int numTerms)
Sets the maximum number of terms to hold in RAM. This allows the efficiency of the class to be adjusted based on the amount of RAM available.


initializeDB

private void initializeDB()

getName

public java.lang.String getName()

getNumDocs

public int getNumDocs()
Returns the number of documents contained in this VectorCreator.


addDoc

public void addDoc(Document d)
Add a document to the corpus.


addDoc

public void addDoc(TermVector freqVec)
Add a document to the corpus, when the term frequencies are known. Documents that have been pre-processed can be added more quickly through this method.

Parameters:
freqVec - a TermVector that indicates the frequency of each term in this document
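
As an illustration of what such a frequency vector contains, the sketch below counts how often each term occurs in an already tokenized document. The TermVector API is not shown on this page, so a plain Map stands in for it; the class TermFrequencies and the method count are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of IGLU: a Map stands in for TermVector.
class TermFrequencies {
    // Counts the occurrences of each token in one document.
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }
}
```

Tokenization, stemming, and stop-word handling are assumed to have happened before this step.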

mergeIntoDatabase

private void mergeIntoDatabase()
Adds information from the local term vector and local document count into the main database.


toString

public java.lang.String toString()
Returns a string representation of this object.

Overrides:
toString in class java.lang.Object

getVector

public TermVector getVector(Document d)
Get a vector for the given document.


getVector

public TermVector getVector(TermVector freqVec)
Get a vector for the given document when the term frequencies are known. Documents that have been pre-processed can use this method to create term vectors more quickly.

Parameters:
freqVec - a TermVector that indicates the frequency of each term in this document
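
This page does not say which TF-IDF variant the class uses. As a rough sketch only, the code below applies the common textbook weighting tf * ln(N / df), where tf is the term's frequency in the document, N is the number of documents in the corpus, and df is the number of documents containing the term. The class TfIdfSketch, the method weigh, and the use of Map in place of TermVector are all assumptions; normalization and scaling (see setNormalize and setLinearlyScale) are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of IGLU: textbook TF-IDF over plain Maps.
class TfIdfSketch {
    // weight = tf * ln(numDocs / docFreq); terms with no known docFreq are skipped.
    static Map<String, Double> weigh(Map<String, Integer> termFreq,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df <= 0) {
                continue; // term never seen in the corpus, so no weight is defined
            }
            double idf = Math.log((double) numDocs / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}
```

Under this weighting, a term that occurs in every document gets a weight of zero, while rare terms are weighted more heavily.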

getDocOccurrances

public int getDocOccurrances(java.lang.String term)
Returns the number of documents in which a term occurs.


flush

public void flush()
Force all information to be written to the database.


main

public static void main(java.lang.String[] args)
Runs some tests on this class.