iglu.ir
Class HTMLDocument

java.lang.Object
  |
  +--iglu.ir.AbstractDocument
        |
        +--iglu.ir.HTMLDocument
All Implemented Interfaces:
Document, java.io.Serializable

public class HTMLDocument
extends AbstractDocument
implements java.io.Serializable

A document that can identify HTML tags and extract HTML-specific information. In addition to the SimpleDocument methods, this class can find specific HTML elements, if they exist. It does not assume that the HTML is well formed, so it will not fail on poorly formed markup.

Since:
September 2000
Version:
0.1
See Also:
Serialized Form

Field Summary
static int STYLE_ANCHOR
           
static int STYLE_H1
           
static int STYLE_H2
           
static int STYLE_H3
           
static int STYLE_H4
           
 
Fields inherited from class iglu.ir.AbstractDocument
 
Fields inherited from interface iglu.ir.Document
STYLE_BOLD, STYLE_DEEMPHASIZED, STYLE_EMPHSIZED, STYLE_ITALIC
 
Constructor Summary
HTMLDocument(java.lang.String contents)
          Creates a new HTMLDocument.
 
Method Summary
private  java.lang.String calculateIndexibleContent()
          Calculates the indexable content text.
 java.lang.String getBodyText()
          Returns all of the text in body tags.
 java.lang.String getIndexibleContent()
          Returns a string containing the content of this document which might be indexable.
 java.lang.String[] getLinks()
          Returns a list of all the links in the document.
 java.lang.String getStylizedText(int style)
          Returns the text of various styles.
 java.lang.String getTitle()
          Returns the title, if present.
static void main(java.lang.String[] argv)
          Takes a filename as a parameter and prints the indexable content to stdout.
 java.util.TreeMap metaTagInfo()
          Returns the information stored in the meta tags of an HTML document.
 
Methods inherited from class iglu.ir.AbstractDocument
getFullContent, numOccurs, numUniqueWords, numWords, setFullContent, setIndexibleContent, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

STYLE_H1

public static final int STYLE_H1
See Also:
Constant Field Values

STYLE_H2

public static final int STYLE_H2
See Also:
Constant Field Values

STYLE_H3

public static final int STYLE_H3
See Also:
Constant Field Values

STYLE_H4

public static final int STYLE_H4
See Also:
Constant Field Values

STYLE_ANCHOR

public static final int STYLE_ANCHOR
See Also:
Constant Field Values
Constructor Detail

HTMLDocument

public HTMLDocument(java.lang.String contents)
Creates a new HTMLDocument.

Parameters:
contents - The HTML document
Method Detail

getIndexibleContent

public java.lang.String getIndexibleContent()
Description copied from interface: Document
Returns a string containing the content of this document which might be indexable. Must do the following:

Specified by:
getIndexibleContent in interface Document
Overrides:
getIndexibleContent in class AbstractDocument
Returns:
a String value

getTitle

public java.lang.String getTitle()
Returns the title, if present; otherwise returns null. If the HTML is badly formed and contains multiple title tags, only the first is returned.
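
The behaviour described above (first title wins, null when there is none) can be illustrated with a small standalone sketch. This is not the iglu.ir implementation: the class name TitleSketch, the method firstTitle, and the regex-based approach are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the iglu.ir.HTMLDocument source.
final class TitleSketch {
    // Case-insensitive and DOTALL so that <TITLE> and multi-line titles match.
    private static final Pattern TITLE = Pattern.compile(
        "<title\\b[^>]*>(.*?)</title\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the content of the first title element, or null if none is found. */
    static String firstTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTitle("<title>One</title><title>Two</title>")); // prints "One"
        System.out.println(firstTitle("<p>no title here</p>"));                 // prints "null"
    }
}
```

Callers should check the result for null before using it, since untitled documents are expected input.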


getBodyText

public java.lang.String getBodyText()
Returns all of the text in body tags. If there are no body tags, null is returned. If there is more than one body element, the contents of all of them are concatenated. Other embedded HTML tags are not removed.
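
A minimal sketch of the contract described above, assuming a regex-based scan. The names BodyTextSketch and bodyText are hypothetical, and unlike the real class this sketch only recognises body elements that have a closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the iglu.ir.HTMLDocument source.
final class BodyTextSketch {
    private static final Pattern BODY = Pattern.compile(
        "<body\\b[^>]*>(.*?)</body\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Concatenates the contents of every body element, leaving nested tags in place.
     * Returns null when no body element is found.
     */
    static String bodyText(String html) {
        Matcher m = BODY.matcher(html);
        StringBuilder out = new StringBuilder();
        boolean found = false;
        while (m.find()) {
            found = true;
            out.append(m.group(1)); // embedded tags such as <b> are kept as-is
        }
        return found ? out.toString() : null;
    }

    public static void main(String[] args) {
        System.out.println(bodyText("<body><b>a</b></body><body>c</body>")); // prints "<b>a</b>c"
    }
}
```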


metaTagInfo

public java.util.TreeMap metaTagInfo()
Returns the information stored in the meta tags of an HTML document. Assumes that the name and content values are surrounded by double quotes.

Returns:
a TreeMap value
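
The double-quote assumption can be made concrete with a short sketch. The layout of the returned map is not documented here, so mapping each meta name to its content is an assumption, as are the names MetaTagSketch and metaTags and the requirement that name appears before content.

```java
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the iglu.ir.HTMLDocument source.
final class MetaTagSketch {
    // Matches <meta ... name="..." ... content="..." ...>; both values must be double-quoted.
    private static final Pattern META = Pattern.compile(
        "<meta\\b[^>]*?\\bname\\s*=\\s*\"([^\"]*)\"[^>]*?\\bcontent\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    /** Maps each meta name to its content (assumed layout), sorted by name. */
    static TreeMap<String, String> metaTags(String html) {
        TreeMap<String, String> info = new TreeMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            info.put(m.group(1), m.group(2));
        }
        return info;
    }

    public static void main(String[] args) {
        String html = "<meta name=\"keywords\" content=\"ir, search\">"
                    + "<META NAME=\"author\" CONTENT=\"A. Writer\">";
        System.out.println(metaTags(html)); // prints "{author=A. Writer, keywords=ir, search}"
    }
}
```

A tag written with single quotes or no quotes is skipped by this pattern, which mirrors the limitation stated in the method description.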

getStylizedText

public java.lang.String getStylizedText(int style)
Returns the text of various styles. In the current implementation, STYLE_EMPHASIZED and STYLE_DEEMPHASIZED do not necessarily return the text in the order in which it appears in the document. STYLE_EMPHASIZED returns H1-4, bold, and italic text.

Specified by:
getStylizedText in interface Document
Specified by:
getStylizedText in class AbstractDocument
Parameters:
style - an int value
Returns:
a String value
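
One way such out-of-order results can arise is a scan that handles one tag type at a time. The sketch below assumes that approach purely for illustration; the source does not confirm how HTMLDocument works internally, and the names EmphasizedTextSketch, emphasizedText, and the tag list are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the iglu.ir.HTMLDocument source.
final class EmphasizedTextSketch {
    // Tags scanned one after another; the list and its order are assumptions.
    private static final String[] TAGS = {"h1", "h2", "h3", "h4", "b", "i"};

    /** Collects heading, bold, and italic text, grouped by tag rather than by position. */
    static String emphasizedText(String html) {
        StringBuilder out = new StringBuilder();
        for (String tag : TAGS) {
            Pattern p = Pattern.compile(
                "<" + tag + "\\b[^>]*>(.*?)</" + tag + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(html);
            while (m.find()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(m.group(1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The bold text comes first in the input but after the heading in the output.
        System.out.println(emphasizedText("<b>bold</b><h1>Head</h1><i>it</i>")); // prints "Head bold it"
    }
}
```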

getLinks

public java.lang.String[] getLinks()
Returns a list of all the links in the document. Not yet implemented.


calculateIndexibleContent

private java.lang.String calculateIndexibleContent()
Calculates the indexable content text.


main

public static void main(java.lang.String[] argv)
Takes a filename as a parameter and prints the indexable content to stdout. Serves as a small command-line utility as well as a test for the class.

Parameters:
argv - command-line arguments; the first element is the name of the file to read