java.lang.Object
  |
  +--iglu.ir.AbstractDocument
        |
        +--iglu.ir.HTMLDocument
A document that can identify HTML tags and extract HTML information. In addition to the SimpleDocument methods, this class can find specific HTML elements when they are present. It does not assume that the HTML is well formed, so it does not fail on poorly formed markup.
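The tolerance for poorly formed HTML described above can be illustrated with a small standalone sketch. It is not the iglu.ir.HTMLDocument implementation: the class name LenientTitleSketch, the method extractTitle, and the choice to return an empty string when no title exists are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of lenient title extraction; not the iglu.ir code. */
final class LenientTitleSketch {
    // Case-insensitive, and DOTALL so a title may span several lines.
    // The closing tag is optional: if </title> is missing, the capture
    // stops at the next '<' or at the end of the input.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)(?:</title\\s*>|<|\\z)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed title text, or "" if no title tag is found (an assumption). */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        // The title tag is never closed, yet the text is still recovered.
        System.out.println(extractTitle("<HTML><TITLE>Iglu docs<body>no closing tags"));
    }
}
```

A pattern-based scan like this trades strictness for robustness: it never builds a tree, so unbalanced tags cannot make it fail.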
Field Summary | |
static int | STYLE_ANCHOR |
static int | STYLE_H1 |
static int | STYLE_H2 |
static int | STYLE_H3 |
static int | STYLE_H4 |
Fields inherited from class iglu.ir.AbstractDocument |
|
Fields inherited from interface iglu.ir.Document |
STYLE_BOLD, STYLE_DEEMPHASIZED, STYLE_EMPHSIZED, STYLE_ITALIC |
Constructor Summary | |
HTMLDocument(java.lang.String contents)
Creates a new HTMLDocument. |
Method Summary | |
private java.lang.String | calculateIndexibleContent() - Calculates the indexable content text. |
java.lang.String | getBodyText() - Returns all of the text in body tags. |
java.lang.String | getIndexibleContent() - Returns a string containing the content of this document that may be indexed. |
java.lang.String[] | getLinks() - Returns a list of all the links in the document. |
java.lang.String | getStylizedText(int style) - Returns the text that appears in the given style. |
java.lang.String | getTitle() - Returns the title, if present. |
static void | main(java.lang.String[] argv) - Takes a filename as a parameter and prints the indexable content to stdout. |
java.util.TreeMap | metaTagInfo() - Returns the information stored in the meta tags of an HTML document. |
Methods inherited from class iglu.ir.AbstractDocument |
getFullContent, numOccurs, numUniqueWords, numWords, setFullContent, setIndexibleContent, toString |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
public static final int STYLE_H1
public static final int STYLE_H2
public static final int STYLE_H3
public static final int STYLE_H4
public static final int STYLE_ANCHOR
Constructor Detail |
public HTMLDocument(java.lang.String contents)
Parameters:
contents - The HTML document
Method Detail |
public java.lang.String getIndexibleContent()
Description copied from interface: Document
Specified by:
getIndexibleContent in interface Document
Overrides:
getIndexibleContent in class AbstractDocument
Returns:
a String value
public java.lang.String getTitle()
public java.lang.String getBodyText()
public java.util.TreeMap metaTagInfo()
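As a rough illustration of what collecting meta tag information into a TreeMap might look like, here is a standalone sketch. The page does not say what the real map's keys and values are, so mapping each lower-cased name attribute to its content attribute is an assumption, as are the class name MetaTagSketch and the use of generics.

```java
import java.util.Locale;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; the real metaTagInfo() may store different keys and values. */
final class MetaTagSketch {
    // Captures the attribute portion of every <meta ...> tag.
    private static final Pattern META =
            Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Matches key="value", key='value', or an unquoted key=value.
    private static final Pattern ATTR = Pattern.compile(
            "([a-zA-Z-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Assumed mapping: lower-cased name attribute to content attribute. */
    static TreeMap<String, String> metaTagInfo(String html) {
        TreeMap<String, String> info = new TreeMap<>();
        Matcher tag = META.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(2) != null ? attr.group(2)
                        : attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (key.equals("name")) {
                    name = value.toLowerCase(Locale.ROOT);
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            // Tags without both attributes (for example charset-only tags) are skipped.
            if (name != null && content != null) {
                info.put(name, content);
            }
        }
        return info;
    }

    public static void main(String[] args) {
        System.out.println(metaTagInfo(
                "<META NAME=\"Keywords\" content='iglu, search'><meta name=author content=Anon>"));
    }
}
```

A TreeMap keeps its keys sorted, so iterating over the result visits the meta names in alphabetical order.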
Returns:
a TreeMap value
public java.lang.String getStylizedText(int style)
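One way to read this method is that each style constant selects an HTML element and the text inside matching elements is returned. The sketch below follows that reading under stated assumptions: the constant values, the style-to-tag mapping, the space-joined result, and the exception for unknown styles are all hypothetical, since this page gives none of them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; constant values and behavior are assumptions. */
final class StylizedTextSketch {
    // Placeholder values: the real constants' values are not listed on this page.
    static final int STYLE_H1 = 1;
    static final int STYLE_H2 = 2;
    static final int STYLE_H3 = 3;
    static final int STYLE_H4 = 4;
    static final int STYLE_ANCHOR = 5;

    /** Maps a style constant to the tag name assumed to carry that style. */
    static String tagFor(int style) {
        switch (style) {
            case STYLE_H1: return "h1";
            case STYLE_H2: return "h2";
            case STYLE_H3: return "h3";
            case STYLE_H4: return "h4";
            case STYLE_ANCHOR: return "a";
            default: throw new IllegalArgumentException("unknown style: " + style);
        }
    }

    /** Joins the text of every properly closed element of the given style with spaces. */
    static String getStylizedText(String html, int style) {
        String tag = tagFor(style);
        Pattern p = Pattern.compile(
                "<" + tag + "\\b[^>]*>(.*?)</" + tag + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Drop nested tags and collapse runs of whitespace.
            String text = m.group(1).replaceAll("<[^>]*>", " ")
                    .replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(getStylizedText("<h1>Intro <b>bold</b></h1><H1>Second</H1>", STYLE_H1));
    }
}
```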
Specified by:
getStylizedText in interface Document
Overrides:
getStylizedText in class AbstractDocument
Parameters:
style - an int value
Returns:
a String value
public java.lang.String[] getLinks()
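A minimal sketch of link collection follows, assuming that "links" means the href values of anchor tags. The class name LinkSketch and the regular expression are illustrative only and are not taken from the iglu.ir source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of href collection; not the iglu.ir code. */
final class LinkSketch {
    // Finds href inside an <a ...> tag; the value may be double-quoted,
    // single-quoted, or unquoted, as often happens in hand-written HTML.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href values in document order; anchors without href are ignored. */
    static String[] getLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1) != null ? m.group(1)
                    : m.group(2) != null ? m.group(2) : m.group(3);
            links.add(url);
        }
        return links.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String link : getLinks("<a href=\"a.html\">A</a> <A HREF=b.html>B")) {
            System.out.println(link);
        }
    }
}
```

The sketch does not require closing anchor tags, in keeping with the class's stated tolerance for poorly formed HTML.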
private java.lang.String calculateIndexibleContent()
public static void main(java.lang.String[] argv)
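The command-line behavior summarized for this method (read a file, print its indexable content) can be approximated by the standalone sketch below. The tag-stripping helper indexableContent is a crude stand-in and is not the library's calculateIndexibleContent(); the UTF-8 encoding and the usage message are also assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical sketch of a file-to-stdout driver; not the iglu.ir code. */
final class MainSketch {
    // Crude stand-in for indexable content: remove script and style blocks,
    // replace the remaining tags with spaces, then collapse whitespace.
    static String indexableContent(String html) {
        return html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] argv) throws IOException {
        if (argv.length != 1) {
            System.err.println("usage: java MainSketch <file.html>");
            return;
        }
        // Assumes the file is UTF-8 encoded.
        String html = new String(Files.readAllBytes(Paths.get(argv[0])), StandardCharsets.UTF_8);
        System.out.println(indexableContent(html));
    }
}
```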
Parameters:
argv - the command-line arguments; a filename is expected