public final class StreamedSource extends java.lang.Object implements java.lang.Iterable<Segment>, java.io.Closeable
This class provides a means, via the iterator()
method, of sequentially parsing every tag, character reference
and plain text segment contained within the source document using a minimum amount of memory.
In contrast, the standard Source
class stores the entire source text in memory and caches every tag parsed,
resulting in memory problems when attempting to parse very large files.
The iterator
parses and returns each segment as the source text is streamed in.
Previous segments are discarded for garbage collection.
Source documents up to 2GB in size can be processed, a limit imposed by the Java language because of its use of the int
data type to index string operations.
There is however a significant trade-off in functionality when using the StreamedSource
class as opposed to the Source
class.
The Tag.getElement()
method is not supported on tags that are returned by the iterator, nor are any methods that use the Element
class in any way.
The Segment.getSource()
method is also not supported.
Most of the methods and constructors in this class mirror similarly named methods in the Source
class where the same functionality is available.
See the description of the iterator()
method for a typical usage example of this class.
In contrast to a Source
object, the Reader
or InputStream
specified in the constructor or created implicitly by the constructor
remains open for the life of the StreamedSource
object. If the stream is created internally, it is automatically closed
when the end of the stream is reached or the StreamedSource
object is finalized.
However, a Reader or InputStream that is specified directly in a constructor is never closed automatically, as it cannot be assumed
that the application has no further use for it. In this case it is the user's responsibility to ensure it is closed.
Explicitly calling the close()
method on the StreamedSource
object ensures that all resources used by it are closed, regardless of whether
they were created internally or supplied externally.
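The ownership rule described above can be modelled with standard library classes only. The following is a minimal sketch, not the library's implementation: `Holder`, `TrackingReader` and `endOfStreamReached()` are hypothetical names introduced purely for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class StreamOwnershipSketch {

    /** A StringReader that records whether close() has been called. */
    static final class TrackingReader extends StringReader {
        boolean closed = false;
        TrackingReader(String text) { super(text); }
        @Override public void close() { closed = true; super.close(); }
    }

    /** Hypothetical holder that models the ownership rule; not the library's code. */
    static final class Holder implements Closeable {
        private final Reader reader;
        private final boolean createdInternally;

        Holder(Reader reader, boolean createdInternally) {
            this.reader = reader;
            this.createdInternally = createdInternally;
        }

        /** Models reaching the end of the stream: only internally created streams are closed. */
        void endOfStreamReached() throws IOException {
            if (createdInternally) reader.close();
        }

        /** An explicit close releases the stream regardless of who created it. */
        @Override public void close() throws IOException {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingReader supplied = new TrackingReader("<p>text</p>");
        Holder holder = new Holder(supplied, false);
        holder.endOfStreamReached();
        System.out.println("closed after end of stream: " + supplied.closed);
        holder.close();
        System.out.println("closed after explicit close: " + supplied.closed);
    }
}
```

Because StreamedSource implements java.io.Closeable, it can also be declared in a try-with-resources statement so that close() is always called.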
The functionality provided by StreamedSource
is similar to a StAX parser,
but with some important benefits:
The following table summarises the differences between the StreamedSource, StAX and SAX interfaces.
Note that some of the available features are documented as optional and may not be supported by all implementations of StAX and SAX.
Feature | StreamedSource | StAX | SAX |
---|---|---|---|
Parse XML | ● | ● | ● |
Parse entities without DTD | ● | | |
Automatically validate XML | ● | ● | |
Parse HTML | ● | | |
Tolerant of syntax or nesting errors | ● | | |
Provide begin and end character positions of each event1 | ● | ○ | |
Provide source text of each event | ● | | |
Handle server tag events | ● | | |
Handle XML declaration event | ● | | |
Handle comment events | ● | ● | ● |
Handle CDATA section events | ● | ● | ● |
Handle document type declaration event | ● | ● | ● |
Handle character reference events | ● | | |
Allow chunking of plain text | ● | ● | ● |
Allow chunking of comment text | | | |
Allow chunking of CDATA section text | ● | | |
Allow specification of maximum buffer size | ● | | |
Note that the OutputDocument class cannot be used to create a modified version of a streamed source document.
Instead, the output document must be constructed manually from the segments provided by the iterator.
StreamedSource
objects are not thread safe.
Constructor | Description |
---|---|
StreamedSource(java.lang.CharSequence text) | Constructs a new StreamedSource object from the specified text. |
StreamedSource(java.io.InputStream inputStream) | Constructs a new StreamedSource object by loading the content from the specified InputStream. |
StreamedSource(java.io.Reader reader) | Constructs a new StreamedSource object by loading the content from the specified Reader. |
StreamedSource(java.net.URL url) | Constructs a new StreamedSource object by loading the content from the specified URL. |
StreamedSource(java.net.URLConnection urlConnection) | Constructs a new StreamedSource object by loading the content from the specified URLConnection. |
Modifier and Type | Method | Description |
---|---|---|
void | close() | Closes the underlying Reader or InputStream and releases any system resources associated with it. |
int | getBufferSize() | Returns the current size of the internal character buffer. |
Segment | getCurrentSegment() | Returns the current Segment from the iterator(). |
java.nio.CharBuffer | getCurrentSegmentCharBuffer() | Returns a CharBuffer containing the source text of the current segment. |
java.lang.String | getEncoding() | Returns the character encoding scheme of the source byte stream used to create this object. |
java.lang.String | getEncodingSpecificationInfo() | Returns a concise description of how the encoding of the source document was determined. |
Logger | getLogger() | Returns the Logger that handles log messages. |
java.lang.String | getPreliminaryEncodingInfo() | Returns the preliminary encoding of the source document together with a concise description of how it was determined. |
boolean | isXML() | Indicates whether the source document is likely to be XML. |
java.util.Iterator<Segment> | iterator() | Returns an iterator over every tag, character reference and plain text segment contained within the source document. |
StreamedSource | setBuffer(char[] buffer) | Specifies an existing character array to use for buffering the incoming character stream. |
StreamedSource | setCoalescing(boolean coalescing) | Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator. |
void | setLogger(Logger logger) | Sets the Logger that handles log messages. |
java.lang.String | toString() | Returns a string representation of the object as generated by the default Object.toString() implementation. |
public StreamedSource(java.io.Reader reader) throws java.io.IOException
Constructs a new StreamedSource object by loading the content from the specified Reader.
If the specified reader is an instance of InputStreamReader, the getEncoding() method of the
created StreamedSource object returns the encoding from InputStreamReader.getEncoding().
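As a rough illustration of what InputStreamReader.getEncoding() reports, the following standard-library sketch reads the name back from a reader created with a known charset. The class and method names are hypothetical. The returned name may be a historical alias rather than the canonical charset name, so the sketch resolves it back to a Charset before comparing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderEncodingSketch {

    /** Returns the encoding name reported by an InputStreamReader over the given bytes. */
    static String reportedEncoding(byte[] bytes, Charset charset) throws IOException {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            // The name is read before the reader is closed; it may be a historical alias.
            return reader.getEncoding();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        String name = reportedEncoding(bytes, StandardCharsets.UTF_8);
        // Resolve the reported name back to a Charset to compare reliably.
        System.out.println(Charset.forName(name).equals(StandardCharsets.UTF_8));
    }
}
```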
Parameters:
reader - the java.io.Reader from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
public StreamedSource(java.io.InputStream inputStream) throws java.io.IOException
Constructs a new StreamedSource object by loading the content from the specified InputStream.
The algorithm for detecting the character encoding of the source document from the raw bytes
of the specified input stream is the same as that for the Source(URLConnection)
constructor of the Source
class,
except that the first step is not possible as there is no
Content-Type header to check.
If the specified InputStream
does not support the mark
method, the algorithm that determines the encoding may have to wrap it
in a BufferedInputStream
in order to look ahead at the encoding meta data.
This extra layer of buffering will then remain in place for the life of the StreamedSource
, possibly impacting memory usage and/or degrading performance.
It is always preferable to use the StreamedSource(Reader)
constructor if the encoding is known in advance.
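The look-ahead described above relies on the standard mark/reset mechanism. The following is a minimal sketch using only java.io classes. It is not the library's detection algorithm, and the helper names `ensureMarkSupported` and `peek` are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MarkSupportSketch {

    /** Wraps the stream in a BufferedInputStream only when it cannot mark/reset by itself. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n leading bytes and then rewinds the stream to where it was. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(head, total, n - total);
            if (read < 0) break;  // end of stream reached before n bytes
            total += read;
        }
        in.reset();
        return Arrays.copyOf(head, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "<?xml version=\"1.0\"?><a/>".getBytes(StandardCharsets.US_ASCII);
        // Stand-in for a stream without mark support, such as a file or socket stream.
        InputStream raw = new FilterInputStream(new ByteArrayInputStream(bytes)) {
            @Override public boolean markSupported() { return false; }
        };
        InputStream in = ensureMarkSupported(raw);
        System.out.println(new String(peek(in, 5), StandardCharsets.US_ASCII));
        System.out.println((char) in.read());  // the stream still starts at the first byte
    }
}
```

When the encoding is already known, wrapping the stream in an InputStreamReader with that charset and passing it to the StreamedSource(Reader) constructor avoids this extra buffering layer.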
Parameters:
inputStream - the java.io.InputStream from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
See Also:
getEncoding()
public StreamedSource(java.net.URL url) throws java.io.IOException
Constructs a new StreamedSource object by loading the content from the specified URL.
This is equivalent to StreamedSource(url.openConnection()).
Parameters:
url - the URL from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
See Also:
getEncoding()
public StreamedSource(java.net.URLConnection urlConnection) throws java.io.IOException
Constructs a new StreamedSource object by loading the content from the specified URLConnection.
The algorithm for detecting the character encoding of the source document is identical to that described in the
Source(URLConnection)
constructor of the Source
class.
The algorithm that determines the encoding may have to wrap the input stream in a BufferedInputStream
in order to look ahead
at the encoding meta data if the encoding is not specified in the HTTP headers.
This extra layer of buffering will then remain in place for the life of the StreamedSource
, possibly impacting memory usage and/or degrading performance.
It is always preferable to use the StreamedSource(Reader)
constructor if the encoding is known in advance.
Parameters:
urlConnection - the URL connection from which to load the source text.
Throws:
java.io.IOException - if an I/O error occurs.
See Also:
getEncoding()
public StreamedSource(java.lang.CharSequence text)
Constructs a new StreamedSource object from the specified text.
Although the CharSequence argument of this constructor apparently contradicts the notion of streaming in the source text,
it can still provide benefits over the equivalent use of the standard Source class.
Firstly, using the StreamedSource
class to iterate the nodes of an in-memory CharSequence
source document still requires much less memory
than the equivalent operation using the standard Source
class.
Secondly, the specified CharSequence
object could possibly implement its own paging mechanism to minimise memory usage.
If the specified CharSequence
is mutable, its state must not be modified while the StreamedSource
is in use.
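To make the idea of a paging CharSequence more concrete, here is a minimal sketch of one possible shape. It is an illustration only: `PagedCharSequence` is a hypothetical class, and the pages are held in memory here, where a real implementation would load and evict them on demand.

```java
import java.util.ArrayList;
import java.util.List;

final class PagedCharSequence implements CharSequence {
    private final List<String> pages = new ArrayList<>();  // stand-in for pages loaded on demand
    private final int pageSize;
    private final int length;

    PagedCharSequence(String text, int pageSize) {
        if (pageSize <= 0) throw new IllegalArgumentException("pageSize must be positive");
        this.pageSize = pageSize;
        this.length = text.length();
        for (int i = 0; i < text.length(); i += pageSize) {
            pages.add(text.substring(i, Math.min(text.length(), i + pageSize)));
        }
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException("index " + index);
        // A real implementation would fetch (and possibly evict) the page here.
        return pages.get(index / pageSize).charAt(index % pageSize);
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) throw new IndexOutOfBoundsException();
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) sb.append(charAt(i));
        return sb;
    }

    @Override public String toString() { return subSequence(0, length).toString(); }

    public static void main(String[] args) {
        CharSequence text = new PagedCharSequence("<p>Hello, paged world</p>", 4);
        System.out.println(text.subSequence(3, 8));  // prints "Hello"
    }
}
```

An object like this could be passed to the StreamedSource(CharSequence) constructor, provided its content does not change while the StreamedSource is in use.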
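Parameters:
text - the source text.
public StreamedSource setBuffer(char[] buffer)
Specifies an existing character array to use for buffering the incoming character stream.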
The specified buffer is fixed for the life of the StreamedSource
object, in contrast to the default buffer which can be automatically replaced
by a larger buffer as needed.
This means that if a tag (including a comment or CDATA section) is
encountered that is larger than the specified buffer, an unrecoverable BufferOverflowException
is thrown.
This exception is also thrown if coalescing
has been enabled and a plain text segment is encountered
that is larger than the specified buffer.
In general this method should only be used if there needs to be an absolute maximum memory limit imposed on the parser, where that requirement is more important than the ability to parse any source document successfully.
This method can only be called before the iterator()
method has been called.
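The trade-off of a fixed buffer can be sketched with java.nio.CharBuffer, which throws java.nio.BufferOverflowException when content does not fit. This is an analogy rather than the parser's own code: `store` is a hypothetical helper, and it is an assumption that the exception named above is the java.nio one.

```java
import java.nio.BufferOverflowException;
import java.nio.CharBuffer;

final class FixedBufferSketch {

    /** Copies a token into a fixed-capacity buffer; fails if the token does not fit. */
    static int store(char[] fixedBuffer, String token) {
        CharBuffer buffer = CharBuffer.wrap(fixedBuffer);
        buffer.put(token);         // throws BufferOverflowException when the token is too long
        return buffer.position();  // number of chars written
    }

    public static void main(String[] args) {
        char[] fixed = new char[16];
        System.out.println(store(fixed, "<p class='a'>"));
        try {
            store(fixed, "<!-- a comment longer than sixteen chars -->");
        } catch (BufferOverflowException e) {
            System.out.println("token larger than the fixed buffer");
        }
    }
}
```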
Parameters:
buffer - an existing character array to use for buffering the incoming character stream, must not be null.
Returns:
this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
Throws:
java.lang.IllegalStateException - if the iterator() method has already been called.
public StreamedSource setCoalescing(boolean coalescing)
Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator.
If this property is set to the default value of false
,
and a section of plain text is encountered in the document that is larger than the current buffer size,
the text is chunked into multiple consecutive plain text segments in order to minimise memory usage.
If this property is set to true
then chunking is disabled, ensuring that consecutive plain text segments are never generated,
but instead forcing the internal buffer to expand to fit the largest section of plain text.
Note that CharacterReference
segments are always handled separately from plain text, regardless of the value of this property.
For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments
in order to handle character references, so there is usually no advantage in coalescing plain text segments.
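The difference between chunked and coalesced plain text can be sketched independently of the parser. The following is a simplified model, not the library's algorithm: `segments` and `reassemble` are hypothetical helpers, and the chunk boundaries of the real iterator may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkingSketch {

    /** Splits text into pieces of at most bufferSize chars unless coalescing is requested. */
    static List<String> segments(String plainText, int bufferSize, boolean coalescing) {
        if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be positive");
        List<String> result = new ArrayList<>();
        if (coalescing || plainText.length() <= bufferSize) {
            result.add(plainText);  // a single segment; the buffer would have to grow to fit it
            return result;
        }
        for (int i = 0; i < plainText.length(); i += bufferSize) {
            result.add(plainText.substring(i, Math.min(plainText.length(), i + bufferSize)));
        }
        return result;
    }

    /** A consumer that tolerates chunking simply appends every piece it receives. */
    static String reassemble(List<String> pieces) {
        StringBuilder sb = new StringBuilder();
        for (String piece : pieces) sb.append(piece);
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "abcdefghij";
        System.out.println(segments(text, 4, false));
        System.out.println(segments(text, 4, true));
        System.out.println(reassemble(segments(text, 4, false)).equals(text));
    }
}
```

A consumer written like `reassemble` works whether or not coalescing is enabled, which is why the default setting is usually sufficient.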
Parameters:
coalescing - the new value of the coalescing property.
Returns:
this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
Throws:
java.lang.IllegalStateException - if the iterator() method has already been called.
public void close() throws java.io.IOException
Closes the underlying Reader or InputStream and releases any system resources associated with it.
If the stream is already closed then invoking this method has no effect.
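Specified by:
close in interface java.io.Closeable
Specified by:
close in interface java.lang.AutoCloseable
Throws:
java.io.IOException - if an I/O error occurs.
public java.lang.String getEncoding()
Returns the character encoding scheme of the source byte stream used to create this object.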
This method works in essentially the same way as the Source.getEncoding()
method.
If the byte stream used to create this object does not support the mark
method, the algorithm that determines the encoding may have to wrap it
in a BufferedInputStream
in order to look ahead at the encoding meta data.
This extra layer of buffering will then remain in place for the life of the StreamedSource
, possibly impacting memory usage and/or degrading performance.
It is always preferable to use the StreamedSource(Reader)
constructor if the encoding is known in advance.
The getEncodingSpecificationInfo()
method returns a simple description of how the value of this method was determined.
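Returns:
the character encoding scheme of the source byte stream used to create this object, or null if the encoding is not known.
See Also:
getEncodingSpecificationInfo()
public java.lang.String getEncodingSpecificationInfo()
Returns a concise description of how the encoding of the source document was determined.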
The description is intended for informational purposes only. It is not guaranteed to have any particular format and can not be reliably parsed.
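See Also:
getEncoding()
public java.lang.String getPreliminaryEncodingInfo()
Returns the preliminary encoding of the source document together with a concise description of how it was determined.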
This method works in essentially the same way as the Source.getPreliminaryEncodingInfo()
method.
The description returned by this method is intended for informational purposes only. It is not guaranteed to have any particular format and can not be reliably parsed.
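Returns:
the preliminary encoding of the source document together with a concise description of how it was determined, or null if no preliminary encoding was required.
See Also:
getEncoding()
public java.util.Iterator<Segment> iterator()
Returns an iterator over every tag, character reference and plain text segment contained within the source document.
Plain text is defined as all text that is not part of a Tag or CharacterReference.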
This results in a sequential walk-through of the entire source document. The end position of each segment should correspond with the begin position of the subsequent segment, unless any of the tags are enclosed by other tags. This could happen if there are server tags present in the document, or in rare circumstances where the document type declaration contains markup declarations.
Each segment generated by the iterator is parsed as the source text is streamed in. Previous segments are discarded for garbage collection.
If a section of plain text is encountered in the document that is larger than the current buffer size,
the text is chunked into multiple consecutive plain text segments in order to minimise memory usage.
Setting the Coalescing
property to true
disables chunking, ensuring that consecutive plain text segments are never generated,
but instead forcing the internal buffer to expand to fit the largest section of plain text.
Note that CharacterReference
segments are always handled separately from plain text, regardless of whether coalescing
is enabled. For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments
in order to handle character references, so there is usually no advantage in coalescing plain text segments.
Character references that are found inside tags, such as those present inside attribute values, do not generate separate segments from the iterator.
This method may only be called once on any particular StreamedSource
instance.
The following code demonstrates the typical (implied) usage of this method through the Iterable
interface
to make an exact copy of the document from reader
to writer
(assuming no server tags are present):
StreamedSource streamedSource=new StreamedSource(reader);
for (Segment segment : streamedSource) {
  if (segment instanceof Tag) {
    Tag tag=(Tag)segment;
    // HANDLE TAG
    // Uncomment the following line to ensure each tag is valid XML:
    // writer.write(tag.tidy()); continue;
  } else if (segment instanceof CharacterReference) {
    CharacterReference characterReference=(CharacterReference)segment;
    // HANDLE CHARACTER REFERENCE
    // Uncomment the following line to decode all character references instead of copying them verbatim:
    // characterReference.appendCharTo(writer); continue;
  } else {
    // HANDLE PLAIN TEXT
  }
  // unless specific handling has prevented getting to here, simply output the segment as is:
  writer.write(segment.toString());
}
Note that the last line writer.write(segment.toString())
in the above code can be replaced with the following for improved performance:
CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());
The following code demonstrates how to process the plain text content of a specific element, in this case to print the content of every paragraph element:
StreamedSource streamedSource=new StreamedSource(reader);
StringBuilder sb=new StringBuilder();
boolean insideParagraphElement=false;
for (Segment segment : streamedSource) {
  if (segment instanceof Tag) {
    Tag tag=(Tag)segment;
    if (tag.getName().equals("p")) {
      if (tag instanceof StartTag) {
        insideParagraphElement=true;
        sb.setLength(0);
      } else {
        // tag instanceof EndTag
        insideParagraphElement=false;
        System.out.println(sb.toString());
      }
    }
  } else if (insideParagraphElement) {
    if (segment instanceof CharacterReference) {
      ((CharacterReference)segment).appendCharTo(sb);
    } else {
      sb.append(segment);
    }
  }
}
Specified by:
iterator in interface java.lang.Iterable<Segment>
public Segment getCurrentSegment()
Returns the current Segment from the iterator().
This is defined as the last Segment
returned from the iterator's next()
method.
This method returns null
if the iterator's next()
method has never been called, or its
hasNext()
method has returned the value false
.
Returns:
the current Segment from the iterator().
public java.nio.CharBuffer getCurrentSegmentCharBuffer()
Returns a CharBuffer containing the source text of the current segment.
The returned CharBuffer
provides a window into the internal char[]
buffer including the position and length that spans the
current segment.
For example, the following code writes the source text of the current segment to writer
:
CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());
This may provide a performance benefit over the standard way of accessing the source text of the current segment,
which is to use the CharSequence
interface of the segment directly, or to call Segment.toString()
.
Because this CharBuffer
is a direct window into the internal buffer of the StreamedSource
, the contents of the
CharBuffer.array()
must not be modified, and the array is only guaranteed to hold the segment source text until the
iterator's hasNext()
or next()
method is next called.
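The window semantics can be tried out with a plain CharBuffer that wraps part of a char array. This is a minimal sketch using only the standard library: the array stands in for the parser's internal buffer, and `writeWindow` is a hypothetical helper. CharBuffer.length() is equivalent to remaining().

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.CharBuffer;

final class CharBufferWindowSketch {

    /** Writes the chars spanned by the buffer straight from its backing array. */
    static void writeWindow(CharBuffer window, Writer writer) throws IOException {
        // arrayOffset() is 0 for buffers created with CharBuffer.wrap, but adding it keeps
        // the call correct for sliced buffers as well.
        writer.write(window.array(), window.arrayOffset() + window.position(), window.remaining());
    }

    public static void main(String[] args) throws IOException {
        char[] internal = "<p>Hello</p>".toCharArray();       // stand-in for the internal buffer
        CharBuffer window = CharBuffer.wrap(internal, 3, 5);   // position 3, spanning five chars
        StringWriter out = new StringWriter();
        writeWindow(window, out);
        System.out.println(out);  // prints "Hello"
    }
}
```

No characters are copied into an intermediate String, which is where the potential performance benefit comes from. The backing array must be treated as read-only.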
Returns:
a CharBuffer containing the source text of the current segment.
public boolean isXML()
Indicates whether the source document is likely to be XML.
The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.
The algorithm is as follows:
... "xhtml", it is an XHTML document, and hence also an XML document.
This method can only be called after the iterator()
method has been called.
Returns:
true if the source document is likely to be XML, otherwise false.
Throws:
java.lang.IllegalStateException - if the iterator() method has not yet been called.
public void setLogger(Logger logger)
Sets the Logger that handles log messages.
Specifying a null
argument disables logging completely for operations performed on this StreamedSource
object.
A logger instance is created automatically for each StreamedSource
object in the same way as is described in the
Source.setLogger(Logger)
method.
Parameters:
logger - the logger that will handle log messages, or null to disable logging.
See Also:
Config.LoggerProvider
public Logger getLogger()
Returns the Logger that handles log messages.
A logger instance is created automatically for each StreamedSource
object using the LoggerProvider
specified by the static Config.LoggerProvider
property.
This can be overridden by calling the setLogger(Logger)
method.
The name used for all automatically created logger instances is "net.htmlparser.jericho".
Returns:
the Logger that handles log messages, or null if logging is disabled.
public int getBufferSize()
Returns the current size of the internal character buffer.
This information is generally useful only for investigating memory and performance issues.
public java.lang.String toString()
Returns a string representation of the object as generated by the default Object.toString() implementation.
In contrast to the Source.toString()
implementation, it is generally not possible for this method to return the entire source text.
Overrides:
toString in class java.lang.Object
Returns:
a string representation of the object as generated by the default Object.toString() implementation.