java.lang.Object
|
+--iglu.ir.AbstractDocument
|
+--iglu.ir.HTMLDocument
A document that can identify HTML tags and extract HTML information. In addition to the SimpleDocument methods, this class can find specific HTML elements when they are present. It does not assume the HTML is well formed, so poorly formed markup does not cause it to fail.
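The tolerance for malformed markup described above can be illustrated with a small standalone sketch. This is not the iglu.ir implementation: the class name `LenientTextSketch`, the method `stripTags`, and the regular expressions are assumptions chosen for illustration. The sketch treats anything that looks like a tag as a separator and never requires tags to be balanced or even closed.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of lenient text extraction; not the iglu.ir code. */
final class LenientTextSketch {
    // A '<' followed by any run of non-'>' characters and an optional '>'.
    // Making the '>' optional lets a tag truncated at end of input match too.
    private static final Pattern TAG = Pattern.compile("<[^>]*>?");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    /** Replaces tag-like spans with spaces, then collapses whitespace. */
    static String stripTags(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return SPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // None of the opened tags are closed, yet extraction still succeeds.
        System.out.println(stripTags("<p>Unclosed <b>bold and a stray <i>tag"));
    }
}
```

One limitation of this simplification: a literal `<` in running text (for example `a < b`) is also treated as the start of a tag.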
| Field Summary | |
static int |
STYLE_ANCHOR
|
static int |
STYLE_H1
|
static int |
STYLE_H2
|
static int |
STYLE_H3
|
static int |
STYLE_H4
|
| Fields inherited from class iglu.ir.AbstractDocument |
|
| Fields inherited from interface iglu.ir.Document |
STYLE_BOLD, STYLE_DEEMPHASIZED, STYLE_EMPHSIZED, STYLE_ITALIC |
| Constructor Summary | |
HTMLDocument(java.lang.String contents)
Creates a new HTMLDocument. |
|
| Method Summary | |
private java.lang.String |
calculateIndexibleContent()
Calculates the indexible content text. |
java.lang.String |
getBodyText()
Returns all of the text in body tags. |
java.lang.String |
getIndexibleContent()
Returns a string containing the content of this document which might be indexible. |
java.lang.String[] |
getLinks()
Returns a list of all the links in the document. |
java.lang.String |
getStylizedText(int style)
Returns the text marked with the given style. |
java.lang.String |
getTitle()
Returns title, if present. |
static void |
main(java.lang.String[] argv)
Takes a filename as a parameter and prints the indexible content to stdout. |
java.util.TreeMap |
metaTagInfo()
Returns the information stored in the meta tags of an HTML document. |
| Methods inherited from class iglu.ir.AbstractDocument |
getFullContent, numOccurs, numUniqueWords, numWords, setFullContent, setIndexibleContent, toString |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
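The summary above lists getTitle() and getLinks() without saying how they locate their targets. A minimal standalone sketch of one possible approach follows. Everything in it is an assumption for illustration: the class `TitleAndLinksSketch`, its method names, the regular expressions, and the choice to return an empty string when no title is found are not taken from the iglu.ir source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of title and link extraction; not the iglu.ir code. */
final class TitleAndLinksSketch {
    // Text between <title> and </title>, case-insensitive, may span lines.
    private static final Pattern TITLE =
            Pattern.compile("(?is)<title[^>]*>(.*?)</title\\s*>");
    // href value of an <a> tag: double-quoted, single-quoted, or unquoted.
    private static final Pattern HREF = Pattern.compile(
            "(?is)<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    /** Returns the trimmed title text, or "" if no complete title element exists. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Returns every href value found on an anchor tag, in document order. */
    static String[] links(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups is set per match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    out.add(m.group(g));
                    break;
                }
            }
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<title> Demo </title><a href=\"a.html\">A</a> "
                + "<A HREF=b.html>B <a href='c.html'";
        System.out.println(title(html));
        for (String link : links(html)) {
            System.out.println(link);
        }
    }
}
```

The href pattern does not require the anchor tag to be closed, which keeps the sketch in line with the class's stated tolerance for poorly formed HTML.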
| Field Detail |
public static final int STYLE_H1
public static final int STYLE_H2
public static final int STYLE_H3
public static final int STYLE_H4
public static final int STYLE_ANCHOR
| Constructor Detail |
public HTMLDocument(java.lang.String contents)
contents - The HTML document
| Method Detail |
public java.lang.String getIndexibleContent()
Specified by: getIndexibleContent in interface Document
Overrides: getIndexibleContent in class AbstractDocument
Returns: String value
public java.lang.String getTitle()
public java.lang.String getBodyText()
public java.util.TreeMap metaTagInfo()
Returns: TreeMap value
public java.lang.String getStylizedText(int style)
Specified by: getStylizedText in interface Document
Overrides: getStylizedText in class AbstractDocument
style - an int value
Returns: String value
public java.lang.String[] getLinks()
private java.lang.String calculateIndexibleContent()
public static void main(java.lang.String[] argv)
argv -
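metaTagInfo() is documented only as returning a TreeMap of the information stored in meta tags. The sketch below shows one way such a map could be built. It is a guess at the shape of the result, not the library's behavior: the class `MetaTagSketch`, the method `metaTags`, the lower-casing of keys, and the decision to skip meta tags that lack either a name or a content attribute are all assumptions.

```java
import java.util.Locale;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of meta tag collection; not the iglu.ir code. */
final class MetaTagSketch {
    // Attribute text of each <meta ...> tag; the closing '>' is not required.
    private static final Pattern META = Pattern.compile("(?is)<meta\\b([^>]*)");
    // name=value pairs with double-quoted, single-quoted, or unquoted values.
    private static final Pattern ATTR = Pattern.compile(
            "(?is)\\b([a-z-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    /** Maps each lower-cased meta name to its content, sorted by name. */
    static TreeMap<String, String> metaTags(String html) {
        TreeMap<String, String> info = new TreeMap<>();
        Matcher tag = META.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = firstNonNull(attr.group(2), attr.group(3), attr.group(4));
                if (key.equals("name")) {
                    name = value;
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            // Tags such as <meta charset=...> carry no name/content pair.
            if (name != null && content != null) {
                info.put(name.toLowerCase(Locale.ROOT), content);
            }
        }
        return info;
    }

    private static String firstNonNull(String a, String b, String c) {
        return a != null ? a : (b != null ? b : c);
    }

    public static void main(String[] args) {
        String html = "<META NAME=\"Keywords\" CONTENT=\"ir, search\">"
                + "<meta content='A demo' name=description><meta charset=utf-8";
        System.out.println(metaTags(html));
    }
}
```

A TreeMap keeps the keys in sorted order, so iterating over the result visits meta names alphabetically regardless of their order in the document.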