public class TextExtractor extends java.lang.Object implements CharStreamSource
Extracts the textual content from HTML markup.
The output is ideal for feeding into a text search engine such as Apache Lucene, especially when the IncludeAttributes property has been set to true.
Use one of the following methods to obtain the output:
writeTo(Writer)
appendTo(Appendable)
toString()
The process removes all of the tags and decodes the result, collapsing all white space.
A space character is included in the output where a normal tag is present in the source, unless the tag belongs to an inline-level element.
An exception to this is the BR element, which is also converted to a space despite being an inline-level element.
Text inside SCRIPT and STYLE elements contained within this segment is ignored.
Setting the ExcludeNonHTMLElements property results in the exclusion of any content within a non-HTML element.
See the excludeElement(StartTag) method for details on how to implement a more complex mechanism to determine whether the content of each Element is to be excluded from the output.
All tags that are not normal tags, such as server tags, comments etc., are removed from the output without adding white space to the output.
Note that segments on which the Segment.ignoreWhenParsing() method has been called are treated as text rather than markup, resulting in their inclusion in the output.
To remove specific segments before extracting the text, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment to be removed.
Then create a new source document using new Source(outputDocument.toString()) and perform the text extraction on this new source object.
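The removal workflow described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the class name RemoveBeforeExtract and the helper extractWithout are hypothetical, and the sketch assumes the Jericho library (package net.htmlparser.jericho) is on the classpath and that the elements being removed are not nested inside one another.

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

// Hypothetical helper class, not part of the library.
final class RemoveBeforeExtract {
    // Removes every element with the given name, then extracts the text of what is left.
    // Assumption: the matching elements do not overlap (no nesting of the same element name).
    static String extractWithout(String html, String elementName) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);
        for (Element element : source.getAllElements(elementName)) {
            outputDocument.remove(element);
        }
        // Re-parse the modified markup and extract the text from the new source object.
        Source stripped = new Source(outputDocument.toString());
        return stripped.getTextExtractor().toString();
    }

    public static void main(String[] args) {
        String html = "<p>Keep this</p><div>Drop this</div><p>And this</p>";
        System.out.println(extractWithout(html, "div"));
    }
}
```

Re-parsing the output of the OutputDocument keeps the extraction step unaware of the removed segments, at the cost of a second parse.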
Extracting the text from an entire Source object performs a full sequential parse automatically.
To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the Renderer class instead.
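A minimal usage sketch, assuming the Jericho library (package net.htmlparser.jericho) is on the classpath. The class name ExtractTextExample and the helper extract are hypothetical. The property setters return the TextExtractor instance, so they are chained here in a single statement.

```java
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

// Hypothetical example class, not part of the library.
final class ExtractTextExample {
    // Extracts the text of an entire document, optionally including attribute values.
    static String extract(String html, boolean includeAttributes) {
        Source source = new Source(html);
        TextExtractor textExtractor = new TextExtractor(source)
            .setIncludeAttributes(includeAttributes)
            .setConvertNonBreakingSpaces(true);
        return textExtractor.toString();
    }

    public static void main(String[] args) {
        String html = "<div><b>O</b>ne</div><div title=\"Two\"><b>Th</b><script>//a script </script>ree</div>";
        System.out.println(extract(html, true));
    }
}
```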
Example: the source segment
<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>
produces the text "One Two Three".
Constructor and Description |
---|
TextExtractor(Segment segment): Constructs a new TextExtractor based on the specified Segment. |
Modifier and Type | Method and Description |
---|---|
void | appendTo(java.lang.Appendable appendable): Appends the output to the specified Appendable object. |
boolean | excludeElement(StartTag startTag): Indicates whether the text inside the Element of the specified start tag should be excluded from the output. |
boolean | getConvertNonBreakingSpaces(): Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces. |
long | getEstimatedMaximumOutputLength(): Returns the estimated maximum number of characters in the output, or -1 if no estimate is available. |
boolean | getExcludeNonHTMLElements(): Indicates whether the content of non-HTML elements is excluded from the output. |
boolean | getIncludeAttributes(): Indicates whether any attribute values are included in the output. |
boolean | includeAttribute(StartTag startTag, Attribute attribute): Indicates whether the value of the specified Attribute in the specified StartTag is included in the output. |
TextExtractor | setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces): Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces. |
TextExtractor | setExcludeNonHTMLElements(boolean excludeNonHTMLElements): Sets whether the content of non-HTML elements is excluded from the output. |
TextExtractor | setIncludeAttributes(boolean includeAttributes): Sets whether any attribute values are included in the output. |
java.lang.String | toString(): Returns the output as a string. |
void | writeTo(java.io.Writer writer): Writes the output to the specified Writer. |
public TextExtractor(Segment segment)
Constructs a new TextExtractor based on the specified Segment.
Parameters:
segment - the segment from which the text will be extracted.
See Also:
Segment.getTextExtractor()
public void writeTo(java.io.Writer writer) throws java.io.IOException
Description copied from interface: CharStreamSource
Writes the output to the specified Writer.
Specified by:
writeTo in interface CharStreamSource
Parameters:
writer - the destination java.io.Writer for the output.
Throws:
java.io.IOException - if an I/O exception occurs.
public void appendTo(java.lang.Appendable appendable) throws java.io.IOException
Description copied from interface: CharStreamSource
Appends the output to the specified Appendable object.
Specified by:
appendTo in interface CharStreamSource
Parameters:
appendable - the destination java.lang.Appendable object for the output.
Throws:
java.io.IOException - if an I/O exception occurs.
public long getEstimatedMaximumOutputLength()
Description copied from interface: CharStreamSource
Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuilder capacity.
There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.
Although implementations of this method should never return a value less than -1, users of this method must not assume that this will always be the case. Standard practice is to interpret any negative value as meaning that no estimate is available.
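The advice above on interpreting the estimate can be sketched with the standard library alone. The class OutputCapacity, its constants and its methods are hypothetical names introduced for illustration. The upper bound on the initial capacity is an assumption of this sketch, not something the library prescribes.

```java
// Hypothetical helper, not part of the library.
final class OutputCapacity {
    // Capacity used when no estimate is available (the StringBuilder default).
    static final int DEFAULT_CAPACITY = 16;
    // Assumed upper bound, so that a very large estimate does not force a huge allocation.
    static final int MAX_INITIAL_CAPACITY = 1 << 20;

    // Converts a value such as the one returned by getEstimatedMaximumOutputLength()
    // into an initial StringBuilder capacity. Any negative value means "no estimate".
    static int initialCapacity(long estimate) {
        if (estimate < 0) {
            return DEFAULT_CAPACITY;
        }
        return (int) Math.min(estimate, (long) MAX_INITIAL_CAPACITY);
    }

    // Creates a builder sized from the estimate. The output may still exceed it.
    static StringBuilder newBuilder(long estimate) {
        return new StringBuilder(initialCapacity(estimate));
    }
}
```

With a TextExtractor, the builder returned by OutputCapacity.newBuilder(textExtractor.getEstimatedMaximumOutputLength()) could then be passed to appendTo(Appendable). The builder still grows as needed if the estimate turns out to be too small.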
Specified by:
getEstimatedMaximumOutputLength in interface CharStreamSource
Returns:
the estimated maximum number of characters in the output, or -1 if no estimate is available.
public java.lang.String toString()
Description copied from interface: CharStreamSource
Returns the output as a string.
Specified by:
toString in interface CharStreamSource
Overrides:
toString in class java.lang.Object
public TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.
The default value is that of the static Config.ConvertNonBreakingSpaces property at the time the TextExtractor is instantiated.
Parameters:
convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getConvertNonBreakingSpaces()
public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.
See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.
Returns:
true if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise false.
public TextExtractor setIncludeAttributes(boolean includeAttributes)
Sets whether any attribute values are included in the output.
If the value of this property is true, then each attribute still has to match the conditions implemented in the includeAttribute(StartTag,Attribute) method in order for its value to be included in the output.
The default value is false.
Parameters:
includeAttributes - specifies whether any attribute values are included in the output.
Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getIncludeAttributes()
public boolean getIncludeAttributes()
Indicates whether any attribute values are included in the output.
See the setIncludeAttributes(boolean) method for a full description of this property.
Returns:
true if any attribute values are included in the output, otherwise false.
public boolean includeAttribute(StartTag startTag, Attribute attribute)
Indicates whether the value of the specified Attribute in the specified StartTag is included in the output.
This method is ignored if the IncludeAttributes property is set to false, in which case no attribute values are included in the output.
If the IncludeAttributes property is set to true, every attribute of every start tag encountered in the segment is checked using this method to determine whether the value of the attribute should be included in the output.
The default implementation of this method returns true if the name of the specified attribute is one of title, alt, label, summary, content*, or href, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each attribute.
* The value of a content attribute is only included if a name attribute is also present in the specified start tag, as the content attribute of a META tag only contains human readable text if the name attribute is used as opposed to an http-equiv attribute.
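Example: to include only the values of title and alt attributes:
// Requires java.util.Arrays, java.util.HashSet and java.util.Set.
final Set<String> includeAttributeNames=new HashSet<String>(Arrays.asList("title","alt"));
TextExtractor textExtractor=new TextExtractor(segment) {
    // Include an attribute value only if the attribute's key is in the set.
    @Override public boolean includeAttribute(StartTag startTag, Attribute attribute) {
        return includeAttributeNames.contains(attribute.getKey());
    }
};
// includeAttribute is only consulted when the IncludeAttributes property is true.
textExtractor.setIncludeAttributes(true);
String extractedText=textExtractor.toString();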
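The default rule described above, including the special case for content, can be modelled with the standard library alone. This is a standalone sketch and not the library's source: the class DefaultAttributeRule and its method include are hypothetical, plain strings stand in for the StartTag and Attribute types, and the attribute names of the start tag are assumed to be supplied in lower case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the documented default rule, not part of the library.
final class DefaultAttributeRule {
    // Attribute names whose values are included unconditionally.
    private static final Set<String> ALWAYS_INCLUDED =
        new HashSet<>(Arrays.asList("title", "alt", "label", "summary", "href"));

    // attributeName: the attribute being checked.
    // startTagAttributeNames: lower-case names of all attributes in the same start tag (assumption).
    static boolean include(String attributeName, Set<String> startTagAttributeNames) {
        String key = attributeName.toLowerCase(Locale.ROOT);
        if (ALWAYS_INCLUDED.contains(key)) {
            return true;
        }
        // A content value is only included when the start tag also has a name attribute.
        if (key.equals("content")) {
            return startTagAttributeNames.contains("name");
        }
        return false;
    }
}
```

Under this model, the content attribute of a META tag that carries http-equiv but no name attribute is rejected, while one that carries a name attribute is accepted.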
Parameters:
startTag - the start tag containing the specified attribute.
attribute - the attribute to check for inclusion.
Returns:
true if the value of the specified attribute should be included in the output, otherwise false.
public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
Sets whether the content of non-HTML elements is excluded from the output.
The default value is false, meaning that content from all elements meeting the other criteria is included.
Parameters:
excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output.
Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getExcludeNonHTMLElements()
public boolean getExcludeNonHTMLElements()
Indicates whether the content of non-HTML elements is excluded from the output.
See the setExcludeNonHTMLElements(boolean) method for a full description of this property.
Returns:
true if the content of non-HTML elements is excluded from the output, otherwise false.
public boolean excludeElement(StartTag startTag)
Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its associated element should be excluded from the output.
The default implementation of this method is to always return false, so that every element is included, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.
All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements.
Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.
Example: to extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":
TextExtractor textExtractor=new TextExtractor(segment) {
    // Exclude the element when its class attribute value is "NotIndexed" (case-insensitive).
    @Override public boolean excludeElement(StartTag startTag) {
        return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
    }
};
String extractedText=textExtractor.toString();
Parameters:
startTag - the start tag of the element to check for inclusion.
Returns:
true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false.