Jericho HTML Parser at SourceForge.net

net.htmlparser.jericho
Class Source

java.lang.Object
  extended by Segment
      extended by Source
All Implemented Interfaces:
java.lang.CharSequence, java.lang.Comparable<Segment>, java.lang.Iterable<Segment>

public final class Source
extends Segment
implements java.lang.Iterable<Segment>

Represents a source HTML document.

The first step in parsing an HTML document is always to construct a Source object from the source data, which can be a String, Reader, InputStream, URLConnection or URL. Each constructor uses all the evidence available to determine the original character encoding of the data.

Once the Source object has been created, you can immediately start searching for tags or elements within the document using the tag search methods.

In certain circumstances you may be able to improve performance by calling the fullSequentialParse() method before calling any tag search methods. See the documentation of the fullSequentialParse() method for details.

Any issues encountered while parsing are logged to a Logger object. The setLogger(Logger) method can be used to explicitly set a Logger implementation for a particular Source instance; otherwise the static Config.LoggerProvider property determines how the logger is set by default for all Source instances. See the documentation of the Config.LoggerProvider property for information about how the default logging provider is determined.

Note that many of the useful functions which can be performed on the source document are defined in its superclass, Segment. The source object is itself a segment which spans the entire document.

Most of the methods defined in this class are useful for determining the elements and tags surrounding or neighbouring a particular character position in the document.

For information on how to create a modified version of this source document, see the OutputDocument class.

Source objects are not thread safe, and should therefore not be shared between multiple threads unless all access is synchronized using some mechanism external to the library.
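
One possible form of such external synchronization is a small wrapper that serializes every access to the shared object. The sketch below is plain Java and does not use the library; Guarded is a hypothetical helper name, and the wrapped object stands in for a Source instance.

```java
import java.util.function.Function;

/** Hypothetical helper (not part of the library): serializes all access to a non-thread-safe object. */
final class Guarded<T> {
    private final T target;                   // the shared, non-thread-safe object
    private final Object lock = new Object(); // private monitor, never exposed to callers

    Guarded(T target) {
        this.target = target;
    }

    /** Runs the action while holding the lock and returns its result. */
    <R> R call(Function<? super T, ? extends R> action) {
        synchronized (lock) {
            return action.apply(target);
        }
    }
}
```

Every thread then goes through call(...), so no two threads touch the wrapped object at the same time. Results returned from the action should not expose mutable parts of the wrapped object, otherwise the protection is bypassed.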

If memory usage is a major concern, consider using the StreamedSource class instead of the Source class.

See Also:
Segment, StreamedSource

Field Summary
static boolean LegacyIteratorCompatabilityMode
          Deprecated. Modify existing code to explicitly handle CharacterReference segments.
 
Constructor Summary
Source(java.lang.CharSequence text)
          Constructs a new Source object from the specified text.
Source(java.io.File file)
          Constructs a new Source object by loading the content from the specified File.
Source(java.io.InputStream inputStream)
          Constructs a new Source object by loading the content from the specified InputStream.
Source(java.io.Reader reader)
          Constructs a new Source object by loading the content from the specified Reader.
Source(java.net.URL url)
          Constructs a new Source object by loading the content from the specified URL.
Source(java.net.URLConnection urlConnection)
          Constructs a new Source object by loading the content from the specified URLConnection.
 
Method Summary
 char charAt(int index)
          Returns the character at the specified index.
 void clearCache()
          Clears the tag cache of all tags.
 Tag[] fullSequentialParse()
          Parses all of the tags in this source document sequentially from beginning to end.
 java.util.List<Element> getAllElements()
          Returns a list of all elements in this source document.
 java.util.List<StartTag> getAllStartTags()
          Returns a list of all start tags in this source document.
 java.util.List<Tag> getAllTags()
          Returns a list of all tags in this source document.
 java.lang.String getCacheDebugInfo()
          Returns a string representation of the tag cache, useful for debugging purposes.
 java.util.List<Element> getChildElements()
          Returns a list of the top-level elements in the document element hierarchy.
 int getColumn(int pos)
          Returns the column number of the specified character position in the source document.
 java.lang.String getDocumentSpecifiedEncoding()
          Returns the document encoding specified within the text of the document.
 Element getElementById(java.lang.String id)
          Returns the Element with the specified id attribute value.
 Element getEnclosingElement(int pos)
          Returns the most nested normal Element that encloses the specified position in the source document.
 Element getEnclosingElement(int pos, java.lang.String name)
          Returns the most nested normal Element with the specified name that encloses the specified position in the source document.
 Tag getEnclosingTag(int pos)
          Returns the Tag that encloses the specified position in the source document.
 Tag getEnclosingTag(int pos, TagType tagType)
          Returns the Tag of the specified type that encloses the specified position in the source document.
 java.lang.String getEncoding()
          Returns the character encoding scheme of the source byte stream used to create this object.
 java.lang.String getEncodingSpecificationInfo()
          Returns a concise description of how the encoding of the source document was determined.
 Logger getLogger()
          Returns the Logger that handles log messages.
 int getNameEnd(int pos)
          Returns the end position of the XML Name that starts at the specified position.
 java.lang.String getNewLine()
          Returns the newline character sequence used in the source document.
 CharacterReference getNextCharacterReference(int pos)
          Returns the CharacterReference beginning at or immediately following the specified position in the source document.
 Element getNextElement(int pos)
          Returns the Element beginning at or immediately following the specified position in the source document.
 Element getNextElement(int pos, java.lang.String name)
          Returns the normal Element with the specified name beginning at or immediately following the specified position in the source document.
 Element getNextElement(int pos, java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern)
          Returns the Element with the specified attribute name and value pattern beginning at or immediately following the specified position in the source document.
 Element getNextElement(int pos, java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
          Returns the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
 Element getNextElementByClass(int pos, java.lang.String className)
          Returns the Element with the specified class beginning at or immediately following the specified position in the source document.
 EndTag getNextEndTag(int pos)
          Returns the EndTag beginning at or immediately following the specified position in the source document.
 EndTag getNextEndTag(int pos, EndTagType endTagType)
          Returns the EndTag of the specified type beginning at or immediately following the specified position in the source document.
 EndTag getNextEndTag(int pos, java.lang.String name)
          Returns the normal EndTag with the specified name beginning at or immediately following the specified position in the source document.
 EndTag getNextEndTag(int pos, java.lang.String name, EndTagType endTagType)
          Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document.
 StartTag getNextStartTag(int pos)
          Returns the StartTag beginning at or immediately following the specified position in the source document.
 StartTag getNextStartTag(int pos, StartTagType startTagType)
          Returns the StartTag of the specified type beginning at or immediately following the specified position in the source document.
 StartTag getNextStartTag(int pos, java.lang.String name)
          Returns the normal StartTag with the specified name beginning at or immediately following the specified position in the source document.
 StartTag getNextStartTag(int pos, java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern)
          Returns the StartTag with the specified attribute name and value pattern beginning at or immediately following the specified position in the source document.
 StartTag getNextStartTag(int pos, java.lang.String name, StartTagType startTagType)
          Returns the StartTag with the specified name and type beginning at or immediately following the specified position in the source document.
 StartTag getNextStartTag(int pos, java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
          Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
 StartTag getNextStartTagByClass(int pos, java.lang.String className)
          Returns the StartTag with the specified class beginning at or immediately following the specified position in the source document.
 Tag getNextTag(int pos)
          Returns the Tag beginning at or immediately following the specified position in the source document.
 Tag getNextTag(int pos, TagType tagType)
          Returns the Tag of the specified type beginning at or immediately following the specified position in the source document.
 ParseText getParseText()
          Returns the parse text of this source document.
 java.lang.String getPreliminaryEncodingInfo()
          Returns the preliminary encoding of the source document together with a concise description of how it was determined.
 CharacterReference getPreviousCharacterReference(int pos)
          Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
 EndTag getPreviousEndTag(int pos)
          Returns the EndTag at or immediately preceding (or enclosing) the specified position in the source document.
 EndTag getPreviousEndTag(int pos, EndTagType endTagType)
          Returns the EndTag of the specified type at or immediately preceding (or enclosing) the specified position in the source document.
 EndTag getPreviousEndTag(int pos, java.lang.String name)
          Returns the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
 StartTag getPreviousStartTag(int pos)
          Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
 StartTag getPreviousStartTag(int pos, StartTagType startTagType)
          Returns the StartTag of the specified type at or immediately preceding (or enclosing) the specified position in the source document.
 StartTag getPreviousStartTag(int pos, java.lang.String name)
          Returns the normal StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
 StartTag getPreviousStartTag(int pos, java.lang.String name, StartTagType startTagType)
          Returns the StartTag with the specified name and type at or immediately preceding (or enclosing) the specified position in the source document.
 Tag getPreviousTag(int pos)
          Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document.
 Tag getPreviousTag(int pos, TagType tagType)
          Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document.
 int getRow(int pos)
          Returns the row number of the specified character position in the source document.
 RowColumnVector getRowColumnVector(int pos)
          Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.
 SourceFormatter getSourceFormatter()
          Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent.
 Tag getTagAt(int pos)
          Returns the Tag at the specified position in the source document.
 void ignoreWhenParsing(java.util.Collection<? extends Segment> segments)
          Causes all of the segments in the specified collection to be ignored when parsing.
 void ignoreWhenParsing(int begin, int end)
          Causes the specified range of the source text to be ignored when parsing.
 boolean isXML()
          Indicates whether the source document is likely to be XML.
 java.util.Iterator<Segment> iterator()
          Returns an iterator over every tag, character reference and plain text segment contained within the source document.
 int length()
          Returns the length of the source document.
 Attributes parseAttributes(int pos, int maxEnd)
          Parses any Attributes starting at the specified position.
 Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
          Parses any Attributes starting at the specified position.
 void setLogger(Logger logger)
          Sets the Logger that handles log messages.
 java.lang.CharSequence subSequence(int begin, int end)
          Returns a new character sequence that is a subsequence of this source document.
 java.lang.String toString()
          Returns the source text as a String.
 
Methods inherited from class Segment
compareTo, encloses, encloses, equals, getAllCharacterReferences, getAllElements, getAllElements, getAllElements, getAllElements, getAllElementsByClass, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTagsByClass, getAllTags, getBegin, getDebugInfo, getEnd, getFirstElement, getFirstElement, getFirstElement, getFirstElement, getFirstElementByClass, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTagByClass, getFormControls, getFormFields, getMaxDepthIndicator, getNodeIterator, getRenderer, getRowColumnVector, getSource, getStyleURISegments, getTextExtractor, getURIAttributes, hashCode, ignoreWhenParsing, isWhiteSpace, isWhiteSpace, parseAttributes
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

LegacyIteratorCompatabilityMode

@Deprecated
public static boolean LegacyIteratorCompatabilityMode
Deprecated. Modify existing code to explicitly handle CharacterReference segments.
Specifies whether to enable the legacy Segment.getNodeIterator() compatibility mode.

Prior to version 3.1, Segment.getNodeIterator() and Source.iterator() did not handle character references as separate segments, and they were instead included unparsed in the plain text segments. This required the use of the CharacterReference.decode(CharSequence) method to retrieve the actual text from each plain text segment.

Although it is likely that existing programs based on the previous functionality should still work without modification, this static configuration property is provided on a temporary basis to revert to the behaviour of previous versions, ensuring that existing programs function as intended without major modification.

Setting this configuration property to true restores compatibility with previous versions.

This property and compatibility mode will be removed in a future release.

Constructor Detail

Source

public Source(java.lang.CharSequence text)
Constructs a new Source object from the specified text.

Parameters:
text - the source text.

Source

public Source(java.io.Reader reader)
       throws java.io.IOException
Constructs a new Source object by loading the content from the specified Reader.

If the specified reader is an instance of InputStreamReader, the getEncoding() method of the created Source object returns the encoding from InputStreamReader.getEncoding().

Parameters:
reader - the java.io.Reader from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.

Source

public Source(java.io.InputStream inputStream)
       throws java.io.IOException
Constructs a new Source object by loading the content from the specified InputStream.

The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URLConnection) constructor, except that the first step is not possible as there is no Content-Type header to check.

Parameters:
inputStream - the java.io.InputStream from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
See Also:
getEncoding()

Source

public Source(java.io.File file)
       throws java.io.IOException
Constructs a new Source object by loading the content from the specified File.

The algorithm for detecting the character encoding of the source document from the raw bytes of the specified file is the same as that for the Source(URLConnection) constructor, except that the first step is not possible as there is no Content-Type header to check.

Parameters:
file - the java.io.File from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
See Also:
getEncoding()

Source

public Source(java.net.URL url)
       throws java.io.IOException
Constructs a new Source object by loading the content from the specified URL.

This is equivalent to Source(url.openConnection()).

Parameters:
url - the URL from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
See Also:
getEncoding()

Source

public Source(java.net.URLConnection urlConnection)
       throws java.io.IOException
Constructs a new Source object by loading the content from the specified URLConnection.

To convert the stream of bytes from the URLConnection into characters the library must determine the character encoding of the stream. This should be specified in the HTTP headers of the connection, but in many cases this information is not available and the encoding must be determined by other means.
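
As an illustration of what reading the encoding from the HTTP headers involves, the following sketch extracts the charset parameter from a Content-Type header value, tolerating a value that is enclosed in double quotes. It is a simplified stand-in written for this page, not the library's own code; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (not part of the library): reads the charset parameter of a Content-Type value. */
final class ContentTypeCharset {
    // charset=value, where the value may be wrapped in double quotes
    private static final Pattern CHARSET =
            Pattern.compile("\\bcharset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```

A caller would typically pass the result of URLConnection.getContentType() and then check the returned name with Charset.isSupported before relying on it.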

In the encoding detection algorithm detailed below, the default 8-bit encoding is Windows-1252. In the unlikely event that Windows-1252 is not a supported encoding on the host platform then ISO-8859-1 is used instead. Windows-1252 is preferred as it defines more printable characters than ISO-8859-1, specifically in the hex range 80 to 9F, while being a superset of all the other printable characters in ISO-8859-1.
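
The fallback described above can be sketched with the standard java.nio.charset.Charset class. This is an assumption about how such a choice might be coded, not the library's implementation; DefaultEightBit is a hypothetical name.

```java
import java.nio.charset.Charset;

/** Illustrative helper (not part of the library): picks the default 8-bit encoding. */
final class DefaultEightBit {
    /** Prefers Windows-1252 and falls back to ISO-8859-1 when the platform lacks it. */
    static Charset charset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : Charset.forName("ISO-8859-1");
    }

    /** Decodes a single byte value (0 to 255) with the chosen charset. */
    static String decode(int b) {
        return new String(new byte[] {(byte) b}, charset());
    }
}
```

The two encodings agree outside the range 80 to 9F, so decode(0xE9) yields the same character either way, while a byte such as 0x80 is a printable character (the euro sign) only under Windows-1252.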

The algorithm specified by HTML5 to determine the character encoding is very different to the algorithm used in this library, which follows the Unicode, HTTP, XML and HTML 4.01 specifications. The HTML5 algorithm "willfully violates" several specifications in order to maximise compatibility with the misreported encodings of existing web pages and servers.

If the algorithm used in this library is not suitable for your application then you can employ a different library or your own code to detect the encoding and construct the Source document using the Source(Reader) constructor instead.
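
A minimal sketch of that approach, using only the standard library: once the application has decided on an encoding by its own means, it wraps the byte stream in an InputStreamReader, and that reader is what would be passed to the Source(Reader) constructor. ExplicitEncoding and its methods are hypothetical names; the decode method exists only to show the effect of the chosen charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Illustrative helper (not part of the library): decodes bytes with an encoding chosen by the caller. */
final class ExplicitEncoding {
    /** Wraps the stream in a reader that decodes with the given charset. */
    static Reader open(InputStream in, String charsetName) {
        return new InputStreamReader(in, Charset.forName(charsetName));
    }

    /** Demonstration only: decodes a byte array through the reader returned by open. */
    static String decode(byte[] bytes, String charsetName) {
        try (Reader reader = open(new ByteArrayInputStream(bytes), charsetName)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n; (n = reader.read(buf)) != -1; ) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same two bytes C3 A9 decode to a single character under UTF-8 and to two characters under ISO-8859-1, which is why the choice has to be made before the text reaches the parser.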

The algorithm for detecting the character encoding of the source document is as follows:
(process termination is marked by ♦)

  1. If the HTTP headers received from the URL connection include a Content-Type header specifying a supported charset parameter, then use the encoding specified in the value of the charset parameter. ♦
    If the charset parameter is illegally enclosed in double quotes, a warning is logged and the charset specified inside the quotes is tried.
    If the specified charset is not supported on the host platform, a warning is logged and the detection algorithm continues.
  2. Read the first four bytes of the input stream.
  3. If the input stream is empty, the created source document has zero length and its getEncoding() method returns null. ♦
  4. If the input stream starts with a Unicode Byte Order Mark (BOM), then use the encoding signified by the BOM. ♦
    BOM Bytes      Encoding
    EF BB BF       UTF-8
    FF FE 00 00    UTF-32 (little-endian)
    00 00 FE FF    UTF-32 (big-endian)
    FF FE          UTF-16 (little-endian)
    FE FF          UTF-16 (big-endian)
    0E FE FF       SCSU
    2B 2F 76       UTF-7
    DD 73 66 73    UTF-EBCDIC
    FB EE 28       BOCU-1
  5. If the stream contains fewer than four bytes, then:
    1. If the stream contains either one or three bytes, then use the default 8-bit encoding. ♦
    2. If the stream starts with a zero byte, then use the encoding UTF-16BE. ♦
    3. If the second byte of the stream is zero, then use the encoding UTF-16LE. ♦
    4. Otherwise use the default 8-bit encoding. ♦
  6. Determine a preliminary encoding by examining the first four bytes of the input stream. See the getPreliminaryEncodingInfo() method for details.
  7. Read the first 2048 bytes of the input stream and decode it using the preliminary encoding to create a "preview segment". If the detected preliminary encoding is not supported on this platform, create the preview segment using the default 8-bit encoding instead (this incident is logged at warn level).
  8. Search the preview segment for an encoding specification, which should always appear at or near the top of the document.
  9. If an encoding specification is found:
    1. If the specified encoding is supported on this platform, use it. ♦
    2. If the specified encoding is not supported on this platform, use the encoding that was used to create the preview segment, which is normally the detected preliminary encoding. ♦
  10. If the document looks like XML, then use UTF-8. ♦
    Section 4.3.3 of the XML 1.0 specification states that an XML file that is not encoded in UTF-8 must contain either a UTF-16 BOM or an encoding declaration in its XML declaration. Since neither of these was detected, we can assume the encoding is UTF-8.
  11. Use the encoding that was used to create the preview segment, which is normally the detected preliminary encoding. ♦
    This is the best guess, in the absence of any explicit information about the encoding, based on the first four bytes of the stream. The HTTP protocol section 3.7.1 states that an encoding of ISO-8859-1 can be assumed if no charset parameter was included in the HTTP Content-Type header. The default 8-bit encoding normally used in this scenario is compatible with the HTTP protocol assumption.
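
Steps 4 and 5 of the algorithm can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's code: BomSketch and its methods are hypothetical, the returned strings are labels chosen for this example, and the four-byte signatures are tested first so that FF FE 00 00 is not mistaken for the two-byte FF FE mark.

```java
/** Illustrative helper (not part of the library): BOM lookup and the short-stream fallback. */
final class BomSketch {
    // Longer signatures come first so that a UTF-32 mark is not read as a UTF-16 mark.
    private static final int[][] SIGNATURES = {
        {0xFF, 0xFE, 0x00, 0x00},
        {0x00, 0x00, 0xFE, 0xFF},
        {0xDD, 0x73, 0x66, 0x73},
        {0xEF, 0xBB, 0xBF},
        {0x0E, 0xFE, 0xFF},
        {0x2B, 0x2F, 0x76},
        {0xFB, 0xEE, 0x28},
        {0xFF, 0xFE},
        {0xFE, 0xFF},
    };
    private static final String[] NAMES = {
        "UTF-32LE", "UTF-32BE", "UTF-EBCDIC", "UTF-8", "SCSU", "UTF-7", "BOCU-1", "UTF-16LE", "UTF-16BE",
    };

    /** Step 4: returns the encoding signified by a BOM at the start of head, or null if there is none. */
    static String fromBom(byte[] head) {
        for (int i = 0; i < SIGNATURES.length; i++) {
            if (startsWith(head, SIGNATURES[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] head, int[] signature) {
        if (head.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((head[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /** Step 5: fallback for a stream of one to three bytes that has no BOM. */
    static String forShortStream(byte[] all, String defaultEightBit) {
        if (all.length < 1 || all.length > 3) {
            throw new IllegalArgumentException("expected one to three bytes");
        }
        if (all.length == 1 || all.length == 3) {
            return defaultEightBit;
        }
        if (all[0] == 0) {
            return "UTF-16BE";
        }
        if (all[1] == 0) {
            return "UTF-16LE";
        }
        return defaultEightBit;
    }
}
```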

Parameters:
urlConnection - the URL connection from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
See Also:
getEncoding()

Method Detail

getDocumentSpecifiedEncoding

public java.lang.String getDocumentSpecifiedEncoding()
Returns the document encoding specified within the text of the document.

The document encoding can be specified within the document text in two ways. They are referred to generically in this library as an encoding specification, and are listed below in order of precedence:

  1. An encoding declaration within the XML declaration of an XML document, which must be present if it has an encoding other than UTF-8 or UTF-16.
    <?xml version="1.0" encoding="ISO-8859-1" ?>
  2. A META declaration, which is in the form of a META tag with attribute http-equiv="Content-Type". The encoding is specified in the charset parameter of a Content-Type HTTP header value, which is placed in the value of the meta tag's content attribute. This META declaration should appear as early as possible in the HEAD element.
    <META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
    An HTML5 character encoding declaration is also a valid alternative.
    <meta charset="utf-8">

Both of these tags must only use characters in the range U+0000 to U+007F, and in the case of the META declaration must use ASCII encoding. This, along with the fact that they must occur at or near the beginning of the document, assists in their detection and decoding without the need to know the exact encoding of the full text.
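
The two forms of encoding specification can be located with simple pattern matching, as in the sketch below. The regular expressions are deliberately simplified and are an assumption made for illustration; the library parses these constructs with its own tag parser. EncodingSpecSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (not part of the library): finds an encoding specification in document text. */
final class EncodingSpecSketch {
    // XML declaration at the very start of the document, with an encoding pseudo-attribute
    private static final Pattern XML_DECL = Pattern.compile(
            "^<\\?xml\\b[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']");
    // charset=... anywhere inside a META tag; covers both the http-equiv form and the HTML5 form
    private static final Pattern META = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the specified encoding, checking the XML declaration first, or null if none is found. */
    static String find(CharSequence preview) {
        Matcher xml = XML_DECL.matcher(preview);
        if (xml.find()) {
            return xml.group(1);
        }
        Matcher meta = META.matcher(preview);
        return meta.find() ? meta.group(1) : null;
    }
}
```

The XML declaration is checked first to follow the order of precedence given above.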

Returns:
the document encoding specified within the text of the document, or null if no encoding is specified.
See Also:
getEncoding()

getEncoding

public java.lang.String getEncoding()
Returns the character encoding scheme of the source byte stream used to create this object.

The encoding of a document defines how its characters are represented as bytes in the original byte stream. The HTTP specification section 3.4 uses the term "character set" to refer to the encoding, and the term "charset" is similarly used in Java (see the class java.nio.charset.Charset). This often causes confusion, as a modern "coded character set" such as Unicode can have several encodings, such as UTF-8, UTF-16, and UTF-32. See the Wikipedia character encoding article for an explanation of the terminology.

This method makes the best possible effort to return the name of the encoding used to decode the original source byte stream into character data. This decoding takes place in the constructor when a parameter based on a byte stream such as an InputStream or URLConnection is used to specify the source text. The documentation of the Source(InputStream) and Source(URLConnection) constructors describes how the return value of this method is determined in these cases. It is also possible in some circumstances for the encoding to be determined in the Source(Reader) constructor.

If a constructor was used that specifies the source text directly in character form (not requiring the decoding of a byte sequence) then the document itself is searched for an encoding specification. In this case, this method returns the same value as the getDocumentSpecifiedEncoding() method.

The getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.

Returns:
the character encoding scheme of the source byte stream used to create this object, or null if the encoding is not known.
See Also:
getEncodingSpecificationInfo()

getEncodingSpecificationInfo

public java.lang.String getEncodingSpecificationInfo()
Returns a concise description of how the encoding of the source document was determined.

The description is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.

Returns:
a concise description of how the encoding of the source document was determined.
See Also:
getEncoding()

getPreliminaryEncodingInfo

public java.lang.String getPreliminaryEncodingInfo()
Returns the preliminary encoding of the source document together with a concise description of how it was determined.

It is sometimes necessary for the Source(InputStream) and Source(URLConnection) constructors to search the document for an encoding specification in order to determine the exact encoding of the source byte stream.

In order to search for the document specified encoding before the exact encoding is known, a preliminary encoding is determined using the first four bytes of the input stream.

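Because the encoding specification must only use characters in the range U+0000 to U+007F, the preliminary encoding need only have the following basic properties determined:

  - Code unit size (8-bit, 16-bit or 32-bit)
  - Byte order (big-endian or little-endian) if the code unit size is 16-bit or 32-bit
  - Basic encoding of characters in the range U+0000 to U+007F (ASCII-compatible or EBCDIC-compatible)

The encodings used to represent the most commonly encountered combinations of these basic properties are:

  - ISO-8859-1: 8-bit ASCII-compatible encoding
  - Cp037: 8-bit EBCDIC-compatible encoding
  - UTF-16BE: 16-bit big-endian encoding
  - UTF-16LE: 16-bit little-endian encoding
  - UTF-32BE: 32-bit big-endian encoding
  - UTF-32LE: 32-bit little-endian encoding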

Note: all encodings with a code unit size greater than 8 bits are assumed to use an ASCII-compatible low-order byte.

In some descriptions returned by this method, and the documentation below, a pattern is used to help demonstrate the contents of the first four bytes of the stream. The patterns use the characters "00" to signify a zero byte, "XX" to signify a non-zero byte, and "??" to signify a byte that can be either zero or non-zero.

The algorithm for determining the preliminary encoding is as follows:

  1. Byte pattern "00 00..." : If the stream starts with two zero bytes, the default 32-bit big-endian encoding UTF-32BE is used.
  2. Byte pattern "00 XX..." : If the stream starts with a single zero byte, the default 16-bit big-endian encoding UTF-16BE is used.
  3. Byte pattern "XX ?? 00 00..." : If the third and fourth bytes of the stream are zero, the default 32-bit little-endian encoding UTF-32LE is used.
  4. Byte pattern "XX 00..." or "XX ?? XX 00..." : If the second or fourth byte of the stream is zero, the default 16-bit little-endian encoding UTF-16LE is used.
  5. Byte pattern "XX XX 00 XX..." : If the third byte of the stream is zero, the default 16-bit big-endian encoding UTF-16BE is used (assumes the first character is > U+00FF).
  6. Byte pattern "4C XX XX XX..." : If the first four bytes are consistent with the EBCDIC encoding of an XML declaration ("<?xm") or a document type declaration ("<!DO"), or any other string starting with the EBCDIC character '<' followed by three non-ASCII characters (8th bit set), which is consistent with EBCDIC alphanumeric characters, the default EBCDIC-compatible encoding Cp037 is used.
  7. Byte pattern "XX XX XX XX..." : Otherwise, if all of the first four bytes of the stream are non-zero, the default 8-bit ASCII-compatible encoding ISO-8859-1 is used.
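
The seven rules above translate almost directly into a chain of tests on the first four bytes. The following is a sketch written for illustration, not the library's implementation; PreliminaryEncodingSketch is a hypothetical name, and the EBCDIC byte values for "?xm" and "!DO" assume the Cp037 code page.

```java
/** Illustrative helper (not part of the library): applies the byte-pattern rules in order. */
final class PreliminaryEncodingSketch {
    /** Returns the preliminary encoding suggested by the first four bytes of a stream. */
    static String guess(byte[] b) {
        if (b.length < 4) {
            throw new IllegalArgumentException("need at least four bytes");
        }
        boolean z0 = b[0] == 0, z1 = b[1] == 0, z2 = b[2] == 0, z3 = b[3] == 0;
        if (z0 && z1) return "UTF-32BE";         // 00 00 ...
        if (z0) return "UTF-16BE";               // 00 XX ...
        if (z2 && z3) return "UTF-32LE";         // XX ?? 00 00
        if (z1 || z3) return "UTF-16LE";         // XX 00 ... or XX ?? XX 00
        if (z2) return "UTF-16BE";               // XX XX 00 XX
        if (looksLikeEbcdic(b)) return "Cp037";  // 4C XX XX XX
        return "ISO-8859-1";                     // XX XX XX XX
    }

    private static boolean looksLikeEbcdic(byte[] b) {
        if ((b[0] & 0xFF) != 0x4C) {
            return false;                        // 4C is '<' in EBCDIC
        }
        if (matches(b, 0x6F, 0xA7, 0x94) || matches(b, 0x5A, 0xC4, 0xD6)) {
            return true;                         // "<?xm" or "<!DO" (assumed Cp037 values)
        }
        return b[1] < 0 && b[2] < 0 && b[3] < 0; // 8th bit set in all three following bytes
    }

    private static boolean matches(byte[] b, int b1, int b2, int b3) {
        return (b[1] & 0xFF) == b1 && (b[2] & 0xFF) == b2 && (b[3] & 0xFF) == b3;
    }
}
```

The order of the tests matters: the 32-bit patterns must be ruled out before the 16-bit ones, because a 32-bit stream also satisfies the looser 16-bit conditions.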

If it was not necessary to search for a document specified encoding when determining the encoding of this source document from a byte stream, this method returns null.

See the documentation of the Source(InputStream) and Source(URLConnection) constructors for more detailed information about when the detection of a preliminary encoding is required.

The description returned by this method is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.

Returns:
the preliminary encoding of the source document together with a concise description of how it was determined, or null if no preliminary encoding was required.
See Also:
getEncoding()

isXML

public boolean isXML()
Indicates whether the source document is likely to be XML.

The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.

The algorithm is as follows:

  1. If the document begins with an XML declaration, it is an XML document.
  2. If the document contains a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
  3. If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
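
A rough equivalent of this heuristic on plain text is shown below. It is a sketch under simplifying assumptions (the XML declaration test is a prefix check and the document type declaration is matched with a regular expression), not the library's code; XmlHeuristicSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

/** Illustrative helper (not part of the library): inexpensive "is this probably XML" check. */
final class XmlHeuristicSketch {
    // A document type declaration whose text mentions "xhtml"
    private static final Pattern XHTML_DOCTYPE =
            Pattern.compile("<!DOCTYPE\\b[^>]*xhtml", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeXml(String text) {
        if (text.startsWith("<?xml")) {
            return true;                          // rule 1: begins with an XML declaration
        }
        return XHTML_DOCTYPE.matcher(text).find(); // rule 2, otherwise rule 3 (plain HTML)
    }
}
```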

Returns:
true if the source document is likely to be XML, otherwise false.

getNewLine

public java.lang.String getNewLine()
Returns the newline character sequence used in the source document.

If the document does not contain any newline characters, this method returns null.

The three possible return values (aside from null) are "\n", "\r\n" and "\r".

Returns:
the newline character sequence used in the source document, or null if none is present.
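
The three possible values can be told apart by inspecting the first line break in the text. The sketch below assumes that the first line break encountered decides the result, which is an assumption made for illustration and not a statement about the library's implementation; NewLineSketch is a hypothetical name.

```java
/** Illustrative helper (not part of the library): reports the first newline sequence in a text. */
final class NewLineSketch {
    /** Returns "\n", "\r\n" or "\r" according to the first line break found, or null if there is none. */
    static String detect(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean followedByLf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return followedByLf ? "\r\n" : "\r";
            }
        }
        return null;
    }
}
```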

getRow

public int getRow(int pos)
Returns the row number of the specified character position in the source document.

Parameters:
pos - the position in the source document.
Returns:
the row number of the specified character position in the source document.
Throws:
java.lang.IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
See Also:
getColumn(int pos), getRowColumnVector(int pos)

getColumn

public int getColumn(int pos)
Returns the column number of the specified character position in the source document.

Parameters:
pos - the position in the source document.
Returns:
the column number of the specified character position in the source document.
Throws:
java.lang.IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
See Also:
getRow(int pos), getRowColumnVector(int pos)

getRowColumnVector

public RowColumnVector getRowColumnVector(int pos)
Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.

Parameters:
pos - the position in the source document.
Returns:
a RowColumnVector object representing the row and column number of the specified character position in the source document.
Throws:
java.lang.IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
See Also:
getRow(int pos), getColumn(int pos)

toString

public java.lang.String toString()
Returns the source text as a String.

Specified by:
toString in interface java.lang.CharSequence
Overrides:
toString in class Segment
Returns:
the source text as a String.

fullSequentialParse

public Tag[] fullSequentialParse()
Parses all of the tags in this source document sequentially from beginning to end.

Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.

Calling the getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the Source object performs a full sequential parse automatically. There are, however, still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed, but none of the above-mentioned methods are used, or they are called only after one or more other tag search methods have been called.

If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.

By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position.

Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed; otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.

When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags with certain tag types. The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need parsing this is an extremely beneficial trade-off.
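
To illustrate why a sequential scan makes the valid position check easy, here is a heavily simplified toy scanner. It is not the parser's algorithm: it knows nothing about tag types, comments or server tags, the names are hypothetical, and the only rule it models is that a '<' found inside an already recognised tag (for example inside a quoted attribute value) does not start a new tag.

```java
import java.util.ArrayList;
import java.util.List;

final class SequentialTagScanSketch {
    /** Returns the begin positions of tag candidates that are not inside another tag. */
    static List<Integer> validTagStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int begin = text.indexOf('<', pos);
            if (begin < 0) break;
            int end = findTagEnd(text, begin + 1);
            if (end < 0) break; // unterminated tag: stop scanning
            starts.add(begin);
            // Resume after this tag, so anything enclosed in it is never considered.
            pos = end + 1;
        }
        return starts;
    }

    /** Finds the closing '>' of a tag, skipping any '>' inside a quoted attribute value. */
    private static int findTagEnd(String text, int from) {
        char quote = 0;
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }
}
```

Because the scan always resumes after the end of the previous tag, no separate check of preceding tags is needed. A search that starts at an arbitrary position has no such context, which is the situation the compromise technique in parse on demand mode has to handle.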

The documentation of the TagType.isValidPosition(Source, int pos, int[] fullSequentialParseData) method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.

Calling this method a second or subsequent time has no effect.

This method returns the same list of tags as the Source.getAllTags() method, but as an array instead of a list.

If this method is called after any of the tag search methods are called, the cache is cleared of any previously found tags before being restocked via the full sequential parse, and the following message is logged at INFO level: "Full sequential parse clearing all tags from cache. Consider calling Source.fullSequentialParse() manually immediately after construction of Source."

This means that if you still have references to tags or elements from before the full sequential parse, they will not be the same objects as those that are returned by tag search methods after the full sequential parse, which can cause confusion if you are allocating user data to tags. It is also significant if the Segment.ignoreWhenParsing() method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.

See also the Tag class documentation for more general details about how tags are parsed.

Returns:
an array of all tags in this source document.

iterator

public java.util.Iterator<Segment> iterator()
Returns an iterator over every tag, character reference and plain text segment contained within the source document.

Plain text is defined as all text that is not part of a Tag or CharacterReference.

This results in a sequential walk-through of the entire source document. The end position of each segment should correspond with the begin position of the subsequent segment, unless any of the tags are enclosed by other tags. This could happen if there are server tags present in the document, or in rare circumstances where the document type declaration contains markup declarations.

Character references that are found inside tags, such as those present inside attribute values, are not included as separate iterated segments.

This method is implemented by simply calling the Segment.getNodeIterator() method of the Segment superclass.

Prior to version 3.1, character references were not handled as separate segments, and were instead included unparsed in the plain text segments. This required the use of the CharacterReference.decode(CharSequence) method to retrieve the actual text from each plain text segment. Although existing programs based on the previous functionality are likely to work without modification, the static configuration property LegacyIteratorCompatabilityMode is provided on a temporary basis to revert to the behaviour of previous versions, ensuring that existing programs function as intended without major modification.

Example:

The following code demonstrates the typical (implied) usage of this method through the Iterable interface to make an exact copy of the document from reader to writer (assuming no server tags are present):

 Source source=new Source(reader);
 for (Segment segment : source) {
   if (segment instanceof Tag) {
     Tag tag=(Tag)segment;
     // HANDLE TAG
     // Uncomment the following line to ensure each tag is valid XML:
     // writer.write(tag.tidy()); continue;
   } else if (segment instanceof CharacterReference) {
     CharacterReference characterReference=(CharacterReference)segment;
     // HANDLE CHARACTER REFERENCE
     // Uncomment the following line to decode all character references instead of copying them verbatim:
     // characterReference.appendCharTo(writer); continue;
   } else {
     // HANDLE PLAIN TEXT
   }
   // unless specific handling has prevented getting to here, simply output the segment as is:
   writer.write(segment.toString());
 }

Specified by:
iterator in interface java.lang.Iterable<Segment>
Returns:
an iterator over every tag, character reference and plain text segment contained within the source document.

getChildElements

public java.util.List<Element> getChildElements()
Returns a list of the top-level elements in the document element hierarchy.

The objects in the list are all of type Element.

The term top-level element refers to an element that is not nested within any other element in the document.

The term document element hierarchy refers to the hierarchy of elements that make up this source document. The source document itself is not considered to be part of the hierarchy, meaning there is typically more than one top-level element. Even when the source represents an entire HTML document, the document type declaration and/or an XML declaration often exist as top-level elements along with the HTML element itself.

The Element.getChildElements() method can be used to get the children of the top-level elements, with recursive use providing a means to visit every element in the document hierarchy.
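
The recursive visit described above can be sketched generically. The helper below is hypothetical and not part of the library; it takes the function that returns the children of a node as a parameter, so that it can be exercised without the library on the classpath.

```java
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;

final class HierarchyWalkSketch {
    /**
     * Visits every node depth-first in document order, passing each node and
     * its depth (0 for top-level nodes) to the visitor.
     */
    static <T> void visit(List<T> nodes,
                          Function<T, List<T>> childrenOf,
                          int depth,
                          BiConsumer<T, Integer> visitor) {
        for (T node : nodes) {
            visitor.accept(node, depth);
            visit(childrenOf.apply(node), childrenOf, depth + 1, visitor);
        }
    }
}
```

With a Source object this could be called as HierarchyWalkSketch.visit(source.getChildElements(), Element::getChildElements, 0, (element, depth) -> System.out.println(depth + " " + element.getName())).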

The document element hierarchy differs from that of the Document Object Model in that it is only a representation of the elements that are physically present in the source text. Unlike the DOM, it does not include any "implied" HTML elements such as TBODY if they are not present in the source text.

Elements formed from server tags are not included in the hierarchy at all.

Structural errors in this source document such as overlapping elements are reported in the log. When elements are found to overlap, the position of the start tag determines the location of the element in the hierarchy.

Calling this method on the Source object performs a full sequential parse automatically.

A visual representation of the document element hierarchy can be obtained by calling:
getSourceFormatter().setIndentAllElements(true).setCollapseWhiteSpace(true).setTidyTags(true).toString()

Overrides:
getChildElements in class Segment
Returns:
a list of the top-level elements in the document element hierarchy, guaranteed not null.
See Also:
Element.getParentElement(), Element.getChildElements(), Element.getDepth()

getSourceFormatter

public SourceFormatter getSourceFormatter()
Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent.

The output format can be configured by setting any number of properties on the returned SourceFormatter instance before obtaining its output.

To create a SourceFormatter instance based on a Segment rather than an entire Source document, use new SourceFormatter(segment) instead.

Returns:
an instance of SourceFormatter based on this source document.

getAllTags

public java.util.List<Tag> getAllTags()
Returns a list of all tags in this source document.

Calling this method on the Source object performs a full sequential parse automatically.

See the Tag class documentation for more details about the behaviour of this method.

Overrides:
getAllTags in class Segment
Returns:
a list of all tags in this source document.

getAllStartTags

public java.util.List<StartTag> getAllStartTags()
Returns a list of all start tags in this source document.

Calling this method on the Source object performs a full sequential parse automatically.

See the Tag class documentation for more details about the behaviour of this method.

Overrides:
getAllStartTags in class Segment
Returns:
a list of all start tags in this source document.

getAllElements

public java.util.List<Element> getAllElements()
Returns a list of all elements in this source document.

Calling this method on the Source object performs a full sequential parse automatically.

The elements returned correspond exactly with the start tags returned by the getAllStartTags() method.

Overrides:
getAllElements in class Segment
Returns:
a list of all elements in this source document.

getElementById

public Element getElementById(java.lang.String id)
Returns the Element with the specified id attribute value.

This simulates the script method getElementById defined in DOM HTML level 1.

This is equivalent to getFirstElement("id",id,true).

A well-formed HTML document should have no more than one element with any given id attribute value.

Parameters:
id - the id attribute value (case sensitive) to search for, must not be null.
Returns:
the Element with the specified id attribute value, or null if no such element exists.

getTagAt

public final Tag getTagAt(int pos)
Returns the Tag at the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

This method also returns unregistered tags.

Parameters:
pos - the position in the source document, may be out of bounds.
Returns:
the Tag at the specified position in the source document, or null if no tag exists at the specified position or it is out of bounds.

getPreviousTag

public Tag getPreviousTag(int pos)
Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the Tag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousTag

public Tag getPreviousTag(int pos,
                          TagType tagType)
Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
tagType - the TagType to search for.
Returns:
the Tag of the specified type beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextTag

public Tag getNextTag(int pos)
Returns the Tag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Use Tag.getNextTag() to get the tag immediately following another tag.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the Tag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextTag

public Tag getNextTag(int pos,
                      TagType tagType)
Returns the Tag of the specified type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
tagType - the TagType to search for.
Returns:
the Tag of the specified type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getEnclosingTag

public Tag getEnclosingTag(int pos)
Returns the Tag that encloses the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document, may be out of bounds.
Returns:
the Tag that encloses the specified position in the source document, or null if the position is not within a tag or is out of bounds.

getEnclosingTag

public Tag getEnclosingTag(int pos,
                           TagType tagType)
Returns the Tag of the specified type that encloses the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document, may be out of bounds.
tagType - the TagType to search for.
Returns:
the Tag of the specified type that encloses the specified position in the source document, or null if the position is not within a tag of the specified type or is out of bounds.

getNextElement

public Element getNextElement(int pos)
Returns the Element beginning at or immediately following the specified position in the source document.

This is equivalent to getNextStartTag(pos).getElement(), assuming the result is not null.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the Element beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextElement

public Element getNextElement(int pos,
                              java.lang.String name)
Returns the normal Element with the specified name beginning at or immediately following the specified position in the source document.

This is equivalent to getNextStartTag(pos,name).getElement(), assuming the result is not null.

Specifying a null argument to the name parameter is equivalent to getNextElement(pos).

Specifying an argument to the name parameter that ends in a colon (:) searches for all elements in the specified XML namespace.
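
The name rules described above (a null name matches anything, a trailing colon selects a whole namespace) can be illustrated with a small predicate. This is a sketch with hypothetical names; comparing names case-insensitively is an assumption of the sketch, and the handling of unregistered tags is not modelled.

```java
import java.util.Locale;

final class NameMatchSketch {
    /** Returns true if elementName satisfies the search name under the rules above. */
    static boolean matches(String elementName, String searchName) {
        // A null search name places no restriction on the element name.
        if (searchName == null) return true;
        // Assumption: names are compared case-insensitively.
        String element = elementName.toLowerCase(Locale.ROOT);
        String search = searchName.toLowerCase(Locale.ROOT);
        // A trailing colon selects every name with that namespace prefix.
        if (search.endsWith(":")) return element.startsWith(search);
        return element.equals(search);
    }
}
```

Under these rules a search for "o:" matches an element named "o:p", while a search for "p" does not.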

This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the element to search for.
Returns:
the normal Element with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextElement

public Element getNextElement(int pos,
                              java.lang.String attributeName,
                              java.lang.String value,
                              boolean valueCaseSensitive)
Returns the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

This is equivalent to getNextStartTag(pos,attributeName,value,valueCaseSensitive).getElement(), assuming the result is not null.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
See Also:
getNextElement(int pos, String attributeName, Pattern valueRegexPattern)

getNextElement

public Element getNextElement(int pos,
                              java.lang.String attributeName,
                              java.util.regex.Pattern valueRegexPattern)
Returns the Element with the specified attribute name and value pattern beginning at or immediately following the specified position in the source document.

Specifying a null argument to the valueRegexPattern parameter performs the search on the attribute name only, without regard to the attribute value. This will also match an attribute that has no value at all.
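
The effect of a null pattern can be illustrated with a small predicate applied to an attribute that is already known to be present. The names are hypothetical, and the use of whole-value matching (Matcher.matches() rather than Matcher.find()) is an assumption of this sketch.

```java
import java.util.regex.Pattern;

final class AttributeValueMatchSketch {
    /**
     * Decides whether a present attribute satisfies the pattern.
     * attributeValue is null when the attribute has no value at all.
     */
    static boolean matches(String attributeValue, Pattern valueRegexPattern) {
        // A null pattern means the search is on the attribute name only.
        if (valueRegexPattern == null) return true;
        // A pattern cannot match an attribute that has no value.
        if (attributeValue == null) return false;
        // Assumption: the pattern must match the entire value.
        return valueRegexPattern.matcher(attributeValue).matches();
    }
}
```

To match only part of a value under this assumption, the pattern itself has to allow for the surrounding text, for example ".*item-\d+.*".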

This is equivalent to getNextStartTag(pos,attributeName,valueRegexPattern).getElement(), assuming the result is not null.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
Returns:
the Element with the specified attribute name and value pattern beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
See Also:
getNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive)

getNextElementByClass

public Element getNextElementByClass(int pos,
                                     java.lang.String className)
Returns the Element with the specified class beginning at or immediately following the specified position in the source document.

This matches an element with a class attribute that contains the specified class name, either as an exact match or where the specified class name is one of multiple class names separated by white space in the attribute value.
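
The matching rule described above amounts to splitting the class attribute value on white space and comparing each token exactly. A minimal sketch, with hypothetical names and not taken from the library source:

```java
final class ClassMatchSketch {
    /** Returns true if className is one of the white-space-separated names in the attribute value. */
    static boolean hasClass(String classAttributeValue, String className) {
        if (classAttributeValue == null) return false; // no class attribute value
        for (String token : classAttributeValue.trim().split("\\s+")) {
            // Case-sensitive comparison of whole tokens, not substrings.
            if (token.equals(className)) return true;
        }
        return false;
    }
}
```

A substring test such as String.contains would be wrong here, because it would report "active" as present in the value "menu-active".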

This is equivalent to getNextStartTagByClass(pos,className).getElement(), assuming the result is not null.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
className - the class name (case sensitive) to search for, must not be null.
Returns:
the Element with the specified class beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousStartTag

public StartTag getPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the StartTag at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousStartTag

public StartTag getPreviousStartTag(int pos,
                                    StartTagType startTagType)
Returns the StartTag of the specified type at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

This is exactly equivalent to (StartTag)getPreviousTag(pos,startTagType), but can be used to avoid the explicit cast to a StartTag object.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
startTagType - the StartTagType to search for.
Returns:
the StartTag of the specified type at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousStartTag

public StartTag getPreviousStartTag(int pos,
                                    java.lang.String name)
Returns the normal StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the name parameter is equivalent to getPreviousStartTag(pos).

This method also returns unregistered tags if the specified name is not a valid XML tag name.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the start tag to search for.
Returns:
the normal StartTag with the specified name at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousStartTag

public StartTag getPreviousStartTag(int pos,
                                    java.lang.String name,
                                    StartTagType startTagType)
Returns the StartTag with the specified name and type at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying StartTagType.NORMAL as the argument to the startTagType parameter is equivalent to getPreviousStartTag(pos,name).

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the start tag to search for, may be null.
startTagType - the type of the start tag to search for, must not be null.
Returns:
the StartTag with the specified name and type at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextStartTag

public StartTag getNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the StartTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextStartTag

public StartTag getNextStartTag(int pos,
                                StartTagType startTagType)
Returns the StartTag of the specified type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

This is exactly equivalent to (StartTag)getNextTag(pos,startTagType), but can be used to avoid the explicit cast to a StartTag object.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
startTagType - the StartTagType to search for.
Returns:
the StartTag of the specified type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextStartTag

public StartTag getNextStartTag(int pos,
                                java.lang.String name)
Returns the normal StartTag with the specified name beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the name parameter is equivalent to getNextStartTag(pos).

Specifying an argument to the name parameter that ends in a colon (:) searches for all start tags in the specified XML namespace.

This method also returns unregistered tags if the specified name is not a valid XML tag name.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the start tag to search for, may be null.
Returns:
the normal StartTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextStartTag

public StartTag getNextStartTag(int pos,
                                java.lang.String name,
                                StartTagType startTagType)
Returns the StartTag with the specified name and type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying StartTagType.NORMAL as the argument to the startTagType parameter is equivalent to getNextStartTag(pos,name).

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the start tag to search for, may be null.
startTagType - the type of the start tag to search for, must not be null.
Returns:
the StartTag with the specified name and type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextStartTag

public StartTag getNextStartTag(int pos,
                                java.lang.String attributeName,
                                java.lang.String value,
                                boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
See Also:
getNextStartTag(int pos, String attributeName, Pattern valueRegexPattern)

getNextStartTag

public StartTag getNextStartTag(int pos,
                                java.lang.String attributeName,
                                java.util.regex.Pattern valueRegexPattern)
Returns the StartTag with the specified attribute name and value pattern beginning at or immediately following the specified position in the source document.

Specifying a null argument to the valueRegexPattern parameter performs the search on the attribute name only, without regard to the attribute value. This will also match an attribute that has no value at all.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
Returns:
the StartTag with the specified attribute name and value pattern beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
See Also:
getNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)

getNextStartTagByClass

public StartTag getNextStartTagByClass(int pos,
                                       java.lang.String className)
Returns the StartTag with the specified class beginning at or immediately following the specified position in the source document.

This matches a start tag with a class attribute that contains the specified class name, either as an exact match or where the specified class name is one of multiple class names separated by white space in the attribute value.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
className - the class name (case sensitive) to search for, must not be null.
Returns:
the StartTag with the specified class beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousEndTag

public EndTag getPreviousEndTag(int pos)
Returns the EndTag at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the EndTag at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousEndTag

public EndTag getPreviousEndTag(int pos,
                                EndTagType endTagType)
Returns the EndTag of the specified type at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

This is exactly equivalent to (EndTag)getPreviousTag(pos,endTagType), but can be used to avoid the explicit cast to an EndTag object.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
endTagType - the EndTagType to search for.
Returns:
the EndTag of the specified type at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousEndTag

public EndTag getPreviousEndTag(int pos,
                                java.lang.String name)
Returns the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the end tag to search for, must not be null.
Returns:
the normal EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextEndTag

public EndTag getNextEndTag(int pos)
Returns the EndTag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the EndTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.
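
The "at or immediately preceding (or enclosing)" and "at or immediately following" wording used by these search methods can be modelled with a sorted set of tag begin positions. The following is a minimal JDK-only sketch, not the library's implementation: the class TagSearchSketch, its methods, and the bounds rule (0 <= pos < document length) are all assumptions made for illustration.

```java
import java.util.TreeSet;

// Illustrative model only: tag begin positions stand in for real tag objects.
final class TagSearchSketch {
    private final TreeSet<Integer> tagBegins = new TreeSet<>();
    private final int documentLength;

    TagSearchSketch(int documentLength, int... begins) {
        this.documentLength = documentLength;
        for (int b : begins) {
            tagBegins.add(b);
        }
    }

    // Assumption: a position is in bounds when 0 <= pos < documentLength.
    private boolean outOfBounds(int pos) {
        return pos < 0 || pos >= documentLength;
    }

    /** Begin position of the tag at or immediately preceding pos, or null if none. */
    Integer previous(int pos) {
        // A tag that encloses pos begins before pos, so floor() also finds it.
        return outOfBounds(pos) ? null : tagBegins.floor(pos);
    }

    /** Begin position of the tag at or immediately following pos, or null if none. */
    Integer next(int pos) {
        return outOfBounds(pos) ? null : tagBegins.ceiling(pos);
    }
}
```

With tags beginning at positions 10 and 30 in a 40-character document, a search from position 12 finds the tag at 10 when searching backwards and the tag at 30 when searching forwards.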

getNextEndTag

public EndTag getNextEndTag(int pos,
                            EndTagType endTagType)
Returns the EndTag of the specified type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

This is exactly equivalent to (EndTag)getNextTag(pos,endTagType), but can be used to avoid the explicit cast to an EndTag object.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
endTagType - the EndTagType to search for.
Returns:
the EndTag of the specified type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextEndTag

public EndTag getNextEndTag(int pos,
                            java.lang.String name)
Returns the normal EndTag with the specified name beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the end tag to search for, must not be null.
Returns:
the normal EndTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextEndTag

public EndTag getNextEndTag(int pos,
                            java.lang.String name,
                            EndTagType endTagType)
Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
name - the name of the end tag to search for, must not be null.
endTagType - the type of the end tag to search for, must not be null.
Returns:
the EndTag with the specified name and type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getEnclosingElement

public Element getEnclosingElement(int pos)
Returns the most nested normal Element that encloses the specified position in the source document.

The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.

See the Tag class documentation for more details about the behaviour of this method.

Parameters:
pos - the position in the source document, may be out of bounds.
Returns:
the most nested normal Element that encloses the specified position in the source document, or null if the position is not within an element or is out of bounds.
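
To illustrate what "most nested" means here, the sketch below picks the innermost of several spans that enclose a position. It is a JDK-only model under stated assumptions, not the library's code: Span and mostNested are hypothetical names, spans are treated as half-open ranges [begin, end), and among enclosing spans the one that begins latest is taken to be the innermost.

```java
import java.util.List;

final class EnclosingSketch {
    /** A half-open range [begin, end) standing in for an element. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        // Assumption: a span encloses pos when begin <= pos < end.
        boolean encloses(int pos) {
            return begin <= pos && pos < end;
        }
    }

    /** Returns the innermost span that encloses pos, or null if none does. */
    static Span mostNested(List<Span> spans, int pos) {
        Span best = null;
        for (Span s : spans) {
            if (!s.encloses(pos)) {
                continue;
            }
            // For properly nested spans, the latest begin position is the innermost.
            if (best == null || s.begin > best.begin) {
                best = s;
            }
        }
        return best;
    }
}
```

Given spans 0-30, 5-20 and 22-28, position 10 resolves to the span 5-20, position 25 to the span 22-28, and position 40 to null.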

getEnclosingElement

public Element getEnclosingElement(int pos,
                                   java.lang.String name)
Returns the most nested normal Element with the specified name that encloses the specified position in the source document.

The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.

See the Tag class documentation for more details about the behaviour of this method.

This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.

Parameters:
pos - the position in the source document, may be out of bounds.
name - the name of the element to search for.
Returns:
the most nested normal Element with the specified name that encloses the specified position in the source document, or null if none exists or the specified position is out of bounds.

getPreviousCharacterReference

public CharacterReference getPreviousCharacterReference(int pos)
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNextCharacterReference

public CharacterReference getNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.

Parameters:
pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.

getNameEnd

public int getNameEnd(int pos)
Returns the end position of the XML Name that starts at the specified position.

This implementation first checks that the character at the specified position is a valid XML Name start character as defined by the Tag.isXMLNameStartChar(char) method. If this is not the case, the value -1 is returned.

Once the first character has been checked, subsequent characters are checked using the Tag.isXMLNameChar(char) method until one is found that is not a valid XML Name character or the end of the document is reached. This position is then returned.

Parameters:
pos - the position in the source document of the first character of the XML Name.
Returns:
the end position of the XML Name that starts at the specified position.
Throws:
java.lang.IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
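
The two-step scan described above can be sketched as follows. This is a JDK-only illustration rather than the library's code: NameEndSketch is a hypothetical class, and its isNameStartChar and isNameChar predicates are simplified stand-ins for Tag.isXMLNameStartChar(char) and Tag.isXMLNameChar(char), which may accept a different set of characters.

```java
final class NameEndSketch {
    // Simplified stand-in for Tag.isXMLNameStartChar(char) (assumption).
    static boolean isNameStartChar(char ch) {
        return Character.isLetter(ch) || ch == '_' || ch == ':';
    }

    // Simplified stand-in for Tag.isXMLNameChar(char) (assumption).
    static boolean isNameChar(char ch) {
        return isNameStartChar(ch) || Character.isDigit(ch) || ch == '-' || ch == '.';
    }

    /** Returns the end of the name starting at pos, or -1 if no name starts there. */
    static int nameEnd(String text, int pos) {
        // String.charAt throws an IndexOutOfBoundsException if pos is outside the text.
        if (!isNameStartChar(text.charAt(pos))) {
            return -1;
        }
        int end = pos + 1;
        // Advance until a non-name character or the end of the text is reached.
        while (end < text.length() && isNameChar(text.charAt(end))) {
            end++;
        }
        return end;
    }
}
```

In this sketch the returned value is the index of the first character after the name, so it can be used directly as the exclusive end argument of subSequence.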

parseAttributes

public Attributes parseAttributes(int pos,
                                  int maxEnd)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

The returned Attributes segment always begins at pos, and ends at the end of the last attribute before either maxEnd or the first occurrence of "/>" or ">" outside of a quoted attribute value, whichever comes first.

Only returns null if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors.

This is equivalent to parseAttributes(pos,maxEnd,Attributes.getDefaultMaxErrorCount()).

Parameters:
pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
See Also:
StartTag.getAttributes(), Segment.parseAttributes()
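
The rule for where an attribute list stops (maxEnd, or the first "/>" or ">" outside a quoted value, whichever comes first) can be sketched with a small quote-aware scan. This is a JDK-only illustration under assumptions, not the library's parser: AttributeLimitSketch and scanLimit are hypothetical names, pos is assumed to lie within the text, and the value returned is only the upper limit of the scan, not the end of the last attribute.

```java
final class AttributeLimitSketch {
    /**
     * Returns the position at which scanning for attributes must stop:
     * the first ">" or "/>" outside a quoted value, or maxEnd, whichever comes first.
     * A maxEnd of -1 means there is no maximum.
     */
    static int scanLimit(String text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0; // 0 means the scan is not inside a quoted value
        for (int i = pos; i < limit; i++) {
            char ch = text.charAt(i);
            if (quote != 0) {
                if (ch == quote) {
                    quote = 0; // closing quote
                }
            } else if (ch == '"' || ch == '\'') {
                quote = ch; // opening quote
            } else if (ch == '>') {
                return i;
            } else if (ch == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;
            }
        }
        return limit;
    }
}
```

Note that a ">" inside a quoted value, as in a="x>y", does not end the scan in this sketch.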

parseAttributes

public Attributes parseAttributes(int pos,
                                  int maxEnd,
                                  int maxErrorCount)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

Only returns null if the segment contains a major syntactical error or more than the specified number of minor syntactical errors.

The maxErrorCount argument overrides the default maximum error count.

See parseAttributes(int pos, int maxEnd) for more information.

Parameters:
pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
maxErrorCount - the maximum number of minor errors allowed while parsing.
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
See Also:
StartTag.getAttributes(), parseAttributes(int pos, int maxEnd)

ignoreWhenParsing

public void ignoreWhenParsing(int begin,
                              int end)
Causes the specified range of the source text to be ignored when parsing.

See the documentation of the Segment.ignoreWhenParsing() method for more information.

Parameters:
begin - the beginning character position in the source text.
end - the end character position in the source text.

ignoreWhenParsing

public void ignoreWhenParsing(java.util.Collection<? extends Segment> segments)
Causes all of the segments in the specified collection to be ignored when parsing.

This is equivalent to calling Segment.ignoreWhenParsing() on each segment in the collection.


setLogger

public void setLogger(Logger logger)
Sets the Logger that handles log messages.

Specifying a null argument disables logging completely for operations performed on this Source object.

A logger instance is created automatically for each Source object using the LoggerProvider specified by the static Config.LoggerProvider property. The name used for all automatically created logger instances is "net.htmlparser.jericho".

Use of this method with a non-null argument is therefore not usually necessary, unless specifying an instance of WriterLogger or a user-defined Logger implementation.

Parameters:
logger - the logger that will handle log messages, or null to disable logging.
See Also:
Config.LoggerProvider

getLogger

public Logger getLogger()
Returns the Logger that handles log messages.

A logger instance is created automatically for each Source object using the LoggerProvider specified by the static Config.LoggerProvider property. This can be overridden by calling the setLogger(Logger) method. The name used for all automatically created logger instances is "net.htmlparser.jericho".

Returns:
the Logger that handles log messages, or null if logging is disabled.

clearCache

public void clearCache()
Clears the tag cache of all tags.

This method may be useful after calling the Segment.ignoreWhenParsing() method so that any tags previously found within the ignored segments will no longer be returned by the tag search methods.


getCacheDebugInfo

public java.lang.String getCacheDebugInfo()
Returns a string representation of the tag cache, useful for debugging purposes.

Returns:
a string representation of the tag cache, useful for debugging purposes.

getParseText

public final ParseText getParseText()
Returns the parse text of this source document.

This method is normally only of interest to users who wish to create custom tag types.

The parse text is defined as the entire text of the source document in lower case, with all ignored segments replaced by space characters.

Returns:
the parse text of this source document.
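
The definition of the parse text given above can be illustrated with a short JDK-only sketch. It is not the library's ParseText implementation: ParseTextSketch and its parseText method are hypothetical, ignored segments are passed in as {begin, end} pairs with an exclusive end, and characters are lower-cased one at a time so that every position in the result matches the same position in the source.

```java
import java.util.Arrays;

final class ParseTextSketch {
    /**
     * Builds a lower-case copy of the source in which every ignored range
     * is overwritten with spaces. Each range is {begin, end}, end exclusive.
     */
    static String parseText(String source, int[][] ignoredRanges) {
        char[] chars = new char[source.length()];
        for (int i = 0; i < chars.length; i++) {
            // Per-character lower-casing keeps the length unchanged.
            chars[i] = Character.toLowerCase(source.charAt(i));
        }
        for (int[] range : ignoredRanges) {
            Arrays.fill(chars, range[0], range[1], ' ');
        }
        return new String(chars);
    }
}
```

Because the result has the same length as the source, a match position found in the parse text can be used unchanged as a position in the source document.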

subSequence

public final java.lang.CharSequence subSequence(int begin,
                                                int end)
Returns a new character sequence that is a subsequence of this source document.

Specified by:
subSequence in interface java.lang.CharSequence
Overrides:
subSequence in class Segment
Parameters:
begin - the begin position, inclusive.
end - the end position, exclusive.
Returns:
a new character sequence that is a subsequence of this source document.

charAt

public final char charAt(int index)
Description copied from class: Segment
Returns the character at the specified index.

This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length().

However because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for an invalid argument value.

Specified by:
charAt in interface java.lang.CharSequence
Overrides:
charAt in class Segment
Parameters:
index - the index of the character.
Returns:
the character at the specified index.

length

public final int length()
Returns the length of the source document.

Specified by:
length in interface java.lang.CharSequence
Overrides:
length in class Segment
Returns:
the length of the source document.

