public class Renderer extends java.lang.Object implements CharStreamSource
This provides a human readable version of the segment content that is modelled on the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails.
The output using default settings complies with the "text/plain; format=flowed" (DelSp=No) protocol described in RFC3676.
Many properties are available to customise the output, possibly the most significant of which is MaxLineLength.
See the individual property descriptions for details.
Use one of the following methods to obtain the output: writeTo(Writer), appendTo(Appendable) or toString().
The rendering of some constructs, especially tables, is very rudimentary. No attempt is made to render nested tables properly, except to ensure that all of the text content is included in the output.
Rendering an entire Source object performs a full sequential parse automatically.
Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.
To extract pure text without any rendering of the markup, use the TextExtractor class instead.
Constructor and Description |
---|
Renderer(Segment segment): Constructs a new Renderer based on the specified Segment. |
Modifier and Type | Method and Description |
---|---|
void | appendTo(java.lang.Appendable appendable): Appends the output to the specified Appendable object. |
int | getBlockIndentSize(): Returns the size of the indent to be used for anything other than LI elements. |
boolean | getConvertNonBreakingSpaces(): Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces. |
boolean | getDecorateFontStyles(): Indicates whether decoration characters are to be included around the content of some font style elements and phrase elements. |
static int | getDefaultBottomMargin(java.lang.String htmlElementName): Returns the default bottom margin of an HTML block element with the specified name. |
static int | getDefaultTopMargin(java.lang.String htmlElementName): Returns the default top margin of an HTML block element with the specified name. |
long | getEstimatedMaximumOutputLength(): Returns the estimated maximum number of characters in the output, or -1 if no estimate is available. |
int | getHRLineLength(): Returns the length of a horizontal line. |
boolean | getIncludeAlternateText(): Indicates whether the alternate text of a tag that has an alt attribute is included in the output. |
boolean | getIncludeFirstElementTopMargin(): Indicates whether the top margin of the first element is rendered. |
boolean | getIncludeHyperlinkURLs(): Indicates whether hyperlink URLs are included in the output. |
char[] | getListBullets(): Returns the bullet characters to use for list items inside UL elements. |
int | getListIndentSize(): Returns the size of the indent to be used for LI elements. |
int | getMaxLineLength(): Returns the column at which lines are to be wrapped. |
java.lang.String | getNewLine(): Returns the string to be used to represent a newline in the output. |
java.lang.String | getTableCellSeparator(): Returns the string that is to separate table cells. |
static boolean | isDefaultIndent(java.lang.String htmlElementName): Returns the default value of whether an HTML block element of the specified name is indented. |
java.lang.String | renderAlternateText(StartTag startTag): Renders the alternate text of the specified start tag. |
java.lang.String | renderHyperlinkURL(StartTag startTag): Renders the hyperlink URL from the specified StartTag. |
Renderer | setBlockIndentSize(int blockIndentSize): Sets the size of the indent to be used for anything other than LI elements. |
Renderer | setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces): Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces. |
Renderer | setDecorateFontStyles(boolean decorateFontStyles): Sets whether decoration characters are to be included around the content of some font style elements and phrase elements. |
static void | setDefaultBottomMargin(java.lang.String htmlElementName, int bottomMargin): Sets the default bottom margin of an HTML block element with the specified name. |
static void | setDefaultIndent(java.lang.String htmlElementName, boolean indent): Sets the default value of whether an HTML block element of the specified name is indented. |
static void | setDefaultTopMargin(java.lang.String htmlElementName, int topMargin): Sets the default top margin of an HTML block element with the specified name. |
Renderer | setHRLineLength(int hrLineLength): Sets the length of a horizontal line. |
Renderer | setIncludeAlternateText(boolean includeAlternateText): Sets whether the alternate text of a tag that has an alt attribute is included in the output. |
Renderer | setIncludeFirstElementTopMargin(boolean includeFirstElementTopMargin): Sets whether the top margin of the first element is rendered. |
Renderer | setIncludeHyperlinkURLs(boolean includeHyperlinkURLs): Sets whether hyperlink URLs are included in the output. |
Renderer | setListBullets(char[] listBullets): Sets the bullet characters to use for list items inside UL elements. |
Renderer | setListIndentSize(int listIndentSize): Sets the size of the indent to be used for LI elements. |
Renderer | setMaxLineLength(int maxLineLength): Sets the column at which lines are to be wrapped. |
Renderer | setNewLine(java.lang.String newLine): Sets the string to be used to represent a newline in the output. |
Renderer | setTableCellSeparator(java.lang.String tableCellSeparator): Sets the string that is to separate table cells. |
java.lang.String | toString(): Returns the output as a string. |
void | writeTo(java.io.Writer writer): Writes the output to the specified Writer. |
public Renderer(Segment segment)
Constructs a new Renderer based on the specified Segment.
Parameters: segment - the segment containing the HTML to be rendered.
See Also: Segment.getRenderer()
public void writeTo(java.io.Writer writer) throws java.io.IOException
Writes the output to the specified Writer.
Specified by: writeTo in interface CharStreamSource
Parameters: writer - the destination java.io.Writer for the output.
Throws: java.io.IOException - if an I/O exception occurs.
public void appendTo(java.lang.Appendable appendable) throws java.io.IOException
Appends the output to the specified Appendable object.
Specified by: appendTo in interface CharStreamSource
Parameters: appendable - the destination java.lang.Appendable object for the output.
Throws: java.io.IOException - if an I/O exception occurs.
public long getEstimatedMaximumOutputLength()
Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuilder capacity. There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.
Although implementations of this method should never return a value less than -1, users of this method must not assume that this will always be the case. Standard practice is to interpret any negative value as meaning that no estimate is available.
Specified by: getEstimatedMaximumOutputLength in interface CharStreamSource
Returns: the estimated maximum number of characters in the output, or -1 if no estimate is available.
public java.lang.String toString()
Returns the output as a string.
Specified by: toString in interface CharStreamSource
Overrides: toString in class java.lang.Object
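The advice given for getEstimatedMaximumOutputLength() (treat any negative value as "no estimate", and use the figure only as a sizing hint) can be sketched without the library as follows. The class name, the constants and the capacity cap are assumptions made for this illustration and are not part of the Renderer API.

```java
final class OutputBufferSketch {
    // Hypothetical cap so that a very large estimate does not pre-allocate a huge buffer.
    static final int MAX_INITIAL_CAPACITY = 1 << 20;
    // Capacity used when no estimate is available.
    static final int DEFAULT_CAPACITY = 16;

    /** Maps an estimate (possibly negative, possibly very large) to a safe initial capacity. */
    static int initialCapacity(long estimate) {
        if (estimate < 0) return DEFAULT_CAPACITY; // any negative value means "no estimate"
        return (int) Math.min(estimate, MAX_INITIAL_CAPACITY);
    }

    /** Creates a builder sized from the estimate; it still grows if the output turns out longer. */
    static StringBuilder newBuffer(long estimate) {
        return new StringBuilder(initialCapacity(estimate));
    }
}
```

A buffer created this way could then be passed to appendTo(Appendable), since StringBuilder implements Appendable.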
public Renderer setMaxLineLength(int maxLineLength)
Sets the column at which lines are to be wrapped.
Lines that would otherwise exceed this length are wrapped onto a new line at a word boundary.
Setting this property automatically sets the HRLineLength property to MaxLineLength - 4.
Setting this property to zero disables line wrapping completely, and leaves the value of HRLineLength unchanged.
A line may still exceed this length if it consists of a single word, where the length of the word plus the line indent exceeds the maximum length. In this case the line is wrapped immediately after the end of the word.
The default value is 76, which reflects the maximum line length for sending email data specified in RFC2049 section 3.5.
Parameters: maxLineLength - the column at which lines are to be wrapped.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getMaxLineLength()
public int getMaxLineLength()
Returns the column at which lines are to be wrapped.
See the setMaxLineLength(int) method for a full description of this property.
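The wrapping rules described for MaxLineLength can be illustrated with a minimal greedy word-wrap sketch. This is not the library's implementation: it ignores line indents, assumes words are separated by whitespace, and assumes a line may be exactly maxLineLength characters long. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class WrapSketch {
    /** Greedily wraps whitespace-separated words; a maxLineLength of 0 disables wrapping. */
    static List<String> wrap(String text, int maxLineLength) {
        List<String> lines = new ArrayList<>();
        if (maxLineLength == 0) { // wrapping disabled: keep the text on one line
            lines.add(text);
            return lines;
        }
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Start a new line if appending this word would exceed the limit.
            if (line.length() > 0 && line.length() + 1 + word.length() > maxLineLength) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) line.append(' ');
            line.append(word); // a single over-long word is kept whole on its own line
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines;
    }
}
```

Note how a word longer than the limit still produces a line that exceeds it, matching the single-word exception described above.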
public Renderer setHRLineLength(int hrLineLength)
Sets the length of a horizontal line.
The length determines the number of hyphen characters used to render HR elements.
Setting this property to 0 disables rendering of the line, although the HR element is still treated as a block boundary.
This property is set automatically to MaxLineLength - 4 when the MaxLineLength property is set.
The default value is 72.
Parameters: hrLineLength - the length of a horizontal line.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getHRLineLength()
public int getHRLineLength()
Returns the length of a horizontal line.
See the setHRLineLength(int) method for a full description of this property.
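The coupling between MaxLineLength and HRLineLength, together with the chained-setter style, can be modelled in a few lines. This is a stand-alone sketch with a hypothetical class name, not the Renderer class itself. The documentation does not say what happens for negative values, so the sketch only applies the "minus 4" rule to positive values.

```java
final class LineSettingsSketch {
    private int maxLineLength = 76; // documented default
    private int hrLineLength = 72;  // documented default

    LineSettingsSketch setMaxLineLength(int maxLineLength) {
        this.maxLineLength = maxLineLength;
        // Zero disables wrapping and leaves HRLineLength unchanged.
        // Assumption: negative values are treated like zero for this purpose.
        if (maxLineLength > 0) hrLineLength = maxLineLength - 4;
        return this; // returning this allows setters to be chained
    }

    LineSettingsSketch setHRLineLength(int hrLineLength) {
        this.hrLineLength = hrLineLength;
        return this;
    }

    int getMaxLineLength() { return maxLineLength; }

    int getHRLineLength() { return hrLineLength; }

    /** Renders an HR element as a run of hyphens; empty when the length is 0. */
    String horizontalLine() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hrLineLength; i++) sb.append('-');
        return sb.toString();
    }
}
```

Because setting MaxLineLength overwrites HRLineLength, a custom HR length has to be set after the maximum line length, for example `setMaxLineLength(40).setHRLineLength(10)`.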
public Renderer setNewLine(java.lang.String newLine)
Sets the string to be used to represent a newline in the output.
The default value is "\r\n" (CR+LF) regardless of the platform on which the library is running. This is so that the default configuration produces valid MIME text/plain output, which mandates the use of CR+LF for line breaks.
Specifying a null argument causes the output to use the same new line string as is used in the source document, which is determined via the Source.getNewLine() method.
If the source document does not contain any new lines, a "best guess" is made by either taking the new line string of a previously parsed document, or using the value from the static Config.NewLine property.
Parameters: newLine - the string to be used to represent a newline in the output, may be null.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getNewLine()
public java.lang.String getNewLine()
Returns the string to be used to represent a newline in the output.
See the setNewLine(String) method for a full description of this property.
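The fallback order described for a null NewLine value can be summarised as a small resolution function. This is only a sketch of the documented order: the class, the method names and the single `fallback` parameter (standing in for the library's "best guess") are assumptions.

```java
final class NewLineSketch {
    static final String DEFAULT_NEW_LINE = "\r\n"; // documented default (CR+LF)

    /**
     * Resolves the newline string to use in the output.
     * configured    - the value passed to the setter, may be null
     * sourceNewLine - the newline found in the source document, or null if it has none
     * fallback      - hypothetical stand-in for the library's "best guess"
     */
    static String resolve(String configured, String sourceNewLine, String fallback) {
        if (configured != null) return configured; // an explicit setting always wins
        return sourceNewLine != null ? sourceNewLine : fallback;
    }

    /** Joins already rendered lines with the resolved newline string. */
    static String joinLines(String newLine, String... lines) {
        return String.join(newLine, lines);
    }
}
```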
public Renderer setIncludeHyperlinkURLs(boolean includeHyperlinkURLs)
Sets whether hyperlink URLs are included in the output.
The default value is true.
When this property is true, the URL of each hyperlink is included in the output as determined by the implementation of the renderHyperlinkURL(StartTag) method.
Assuming the default implementation of renderHyperlinkURL(StartTag), when this property is true, the following HTML:
<a href="http://jericho.htmlparser.net/">Jericho HTML Parser</a>
produces the following output:
Jericho HTML Parser <http://jericho.htmlparser.net/>
Parameters: includeHyperlinkURLs - specifies whether hyperlink URLs are included in the output.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getIncludeHyperlinkURLs()
public boolean getIncludeHyperlinkURLs()
Indicates whether hyperlink URLs are included in the output.
See the setIncludeHyperlinkURLs(boolean) method for a full description of this property.
Returns: true if hyperlink URLs are included in the output, otherwise false.
public java.lang.String renderHyperlinkURL(StartTag startTag)
Renders the hyperlink URL from the specified StartTag.
A return value of null indicates that the hyperlink URL should not be rendered at all.
The default implementation of this method returns null if the href attribute of the specified start tag starts with "javascript:", is a relative or invalid URI, or is missing completely. In all other cases it returns the value of the href attribute enclosed in angle brackets.
See the documentation of the setIncludeHyperlinkURLs(boolean) method for an example of how a hyperlink is rendered by the default implementation.
This method can be overridden in a subclass to customise the rendering of hyperlink URLs. Rendering of hyperlink URLs can be disabled completely without overriding this method by setting the IncludeHyperlinkURLs property to false.
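The following example overrides the method to render hyperlink URLs without the enclosing angle brackets:

    // Requires: import java.net.URI; import java.net.URISyntaxException;
    Renderer renderer = new Renderer(segment) {
        @Override
        public String renderHyperlinkURL(StartTag startTag) {
            String href = startTag.getAttributeValue("href");
            // Skip missing links and javascript: pseudo-links.
            if (href == null || href.startsWith("javascript:")) return null;
            try {
                URI uri = new URI(href);
                if (!uri.isAbsolute()) return null; // skip relative URIs
            } catch (URISyntaxException ex) {
                return null; // skip invalid URIs
            }
            return href; // no angle brackets
        }
    };
    String renderedSegment = renderer.toString();
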
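The documented rules of the default implementation can also be expressed as a library-independent function that operates on the href value alone. This is a sketch of the rules as described in the text, not the library's source. The class and method names, and the way the label and URL are joined in `renderLink`, are assumptions based on the example output shown for setIncludeHyperlinkURLs(boolean).

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HyperlinkUrlSketch {
    /** Returns the href enclosed in angle brackets, or null if it should not be rendered. */
    static String render(String href) {
        if (href == null || href.startsWith("javascript:")) return null;
        try {
            if (!new URI(href).isAbsolute()) return null; // relative URI
        } catch (URISyntaxException ex) {
            return null; // invalid URI
        }
        return '<' + href + '>';
    }

    /** Appends the rendered URL to the link text when there is one. */
    static String renderLink(String label, String href) {
        String url = render(href);
        return url == null ? label : label + ' ' + url;
    }
}
```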
Parameters: startTag - the start tag of the hyperlink element, must not be null.
Returns: the rendered hyperlink URL from the specified StartTag, or null if the hyperlink URL should not be rendered.
public Renderer setIncludeAlternateText(boolean includeAlternateText)
Sets whether the alternate text of a tag that has an alt attribute is included in the output.
The default value is true. Note that this is not consistent with common email clients such as Mozilla Thunderbird, which do not render alternate text at all, even when a tag specifies alternate text.
When this property is true, the alternate text is included in the output as determined by the implementation of the renderAlternateText(StartTag) method.
Assuming the default implementation of renderAlternateText(StartTag), when this property is true, the following HTML:
<img src="smiley.png" alt="smiley face" />
produces the following output:
[smiley face]
Parameters: includeAlternateText - specifies whether the alternate text of a tag that has an alt attribute is included in the output.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getIncludeAlternateText()
public boolean getIncludeAlternateText()
Indicates whether the alternate text of a tag that has an alt attribute is included in the output.
See the setIncludeAlternateText(boolean) method for a full description of this property.
Returns: true if the alternate text of a tag that has an alt attribute is included in the output, otherwise false.
public java.lang.String renderAlternateText(StartTag startTag)
Renders the alternate text of the specified start tag.
A return value of null indicates that the alternate text is not to be rendered at all.
The default implementation of this method returns null if the alt attribute of the specified start tag is missing or empty, or if the specified start tag is from an AREA element. In all other cases it returns the value of the alt attribute enclosed in square brackets […].
See the documentation of the setIncludeAlternateText(boolean) method for an example of how alternate text is rendered by the default implementation.
This method can be overridden in a subclass to customise the rendering of alternate text. Rendering of alternate text can be disabled completely without overriding this method by setting the IncludeAlternateText property to false.
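The following example overrides the method to enclose alternate text in guillemets instead of square brackets:

    Renderer renderer = new Renderer(segment) {
        @Override
        public String renderAlternateText(StartTag startTag) {
            // Alternate text of AREA elements is never rendered.
            if (HTMLElementName.AREA.equals(startTag.getName())) return null;
            String alt = startTag.getAttributeValue("alt");
            // Skip a missing or empty alt attribute.
            if (alt == null || alt.length() == 0) return null;
            return '«' + alt + '»';
        }
    };
    String renderedSegment = renderer.toString();
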
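For comparison, the documented default rules can be written as a plain function of the element name and the alt value. This is a sketch of the rules as described in the text rather than the library's source, and the class and method names are hypothetical.

```java
final class AlternateTextSketch {
    /** Returns the alt text in square brackets, or null if it should not be rendered. */
    static String render(String elementName, String alt) {
        if ("area".equalsIgnoreCase(elementName)) return null; // AREA elements are skipped
        if (alt == null || alt.isEmpty()) return null;         // missing or empty alt attribute
        return '[' + alt + ']';
    }
}
```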
Parameters: startTag - the start tag containing an alt attribute, must not be null.
Returns: the rendered alternate text, or null if the alternate text should not be rendered.
public Renderer setDecorateFontStyles(boolean decorateFontStyles)
Sets whether decoration characters are to be included around the content of some font style elements and phrase elements.
The default value is false.
Below is a table summarising the decorated elements.
Elements | Character | Example Output |
---|---|---|
B and STRONG | * | *bold text* |
I and EM | / | /italic text/ |
U | _ | _underlined text_ |
CODE | \| | \|code\| |
Parameters: decorateFontStyles - specifies whether decoration characters are to be included around the content of some font style elements.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getDecorateFontStyles()
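The decoration table above amounts to a lookup from element name to a character that is placed on both sides of the content. A minimal sketch, with a hypothetical class name and an assumed case-insensitive lookup:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class FontStyleDecorationSketch {
    // Element name (lower case) to decoration character, as listed in the table.
    private static final Map<String, Character> DECORATION = new HashMap<>();
    static {
        DECORATION.put("b", '*');
        DECORATION.put("strong", '*');
        DECORATION.put("i", '/');
        DECORATION.put("em", '/');
        DECORATION.put("u", '_');
        DECORATION.put("code", '|');
    }

    /** Wraps the content in the element's decoration character when decoration is enabled. */
    static String decorate(String elementName, String content, boolean decorateFontStyles) {
        Character c = decorateFontStyles
                ? DECORATION.get(elementName.toLowerCase(Locale.ROOT))
                : null;
        return c == null ? content : c + content + c;
    }
}
```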
public boolean getDecorateFontStyles()
Indicates whether decoration characters are to be included around the content of some font style elements and phrase elements.
See the setDecorateFontStyles(boolean) method for a full description of this property.
Returns: true if decoration characters are to be included around the content of some font style elements, otherwise false.
public Renderer setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.
The default value is that of the static Config.ConvertNonBreakingSpaces property at the time the Renderer is instantiated.
Parameters: convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getConvertNonBreakingSpaces()
public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.
See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.
Returns: true if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise false.
public Renderer setBlockIndentSize(int blockIndentSize)
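Sets the size of the indent to be used for anything other than LI elements.
At present this applies to BLOCKQUOTE and DD elements.
The default value is 4.
Parameters: blockIndentSize - the size of the indent.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getBlockIndentSize()
public int getBlockIndentSize()
Returns the size of the indent to be used for anything other than LI elements.
See the setBlockIndentSize(int) method for a full description of this property.
Returns: the size of the indent to be used for anything other than LI elements.
public Renderer setListIndentSize(int listIndentSize)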
Sets the size of the indent to be used for LI elements.
The default value is 6.
This applies to LI elements inside both UL and OL elements. The bullet or number of the list item is included as part of the indent.
Parameters: listIndentSize - the size of the indent.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getListIndentSize()
public int getListIndentSize()
Returns the size of the indent to be used for LI elements.
See the setListIndentSize(int) method for a full description of this property.
Returns: the size of the indent to be used for LI elements.
public Renderer setListBullets(char[] listBullets)
Sets the bullet characters to use for list items inside UL elements.
The values in the default array are *, o, + and #.
If the nesting of rendered lists goes deeper than the length of this array, the bullet characters start repeating from the first in the array.
WARNING: If any of the characters in the default array are modified, this will affect all other instances of this class using the default array.
Parameters: listBullets - an array of characters to be used as bullets, must have at least one entry.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getListBullets()
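The repeating bullet sequence and the warning about the shared default array can be illustrated as follows. This is a stand-alone sketch: the class, the zero-based `depth` parameter and the helper names are assumptions, and the modulo lookup is simply one way to obtain the documented repetition.

```java
final class ListBulletSketch {
    // Same characters as the documented default array.
    private static final char[] DEFAULT_BULLETS = {'*', 'o', '+', '#'};

    /** Returns a private copy so that callers cannot modify the shared defaults. */
    static char[] defaults() {
        return DEFAULT_BULLETS.clone();
    }

    /** Picks the bullet for a zero-based nesting depth, repeating from the start when needed. */
    static char bulletFor(char[] bullets, int depth) {
        return bullets[depth % bullets.length];
    }

    /** Builds a customised array from a copy, leaving the shared defaults untouched. */
    static char[] customisedCopy() {
        char[] copy = DEFAULT_BULLETS.clone();
        copy[0] = '-';
        return copy;
    }
}
```

Copying the array returned by getListBullets() before changing it, and passing the copy to setListBullets(char[]), avoids the side effect described in the warning.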
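public char[] getListBullets()
Returns the bullet characters to use for list items inside UL elements.
See the setListBullets(char[]) method for a full description of this property.
Returns: the bullet characters to use for list items inside UL elements.
public Renderer setIncludeFirstElementTopMargin(boolean includeFirstElementTopMargin)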
Sets whether the top margin of the first element is rendered.
The default value is false.
If this property is set to true, then the source "<h1>Heading</h1>" would be rendered as "\r\n\r\nHeading", assuming all other default settings. If this property is false, then the same source would be rendered as "Heading".
Note that the bottom margin of the last element is never rendered.
Parameters: includeFirstElementTopMargin - specifies whether the top margin of the first element is rendered.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getIncludeFirstElementTopMargin()
public boolean getIncludeFirstElementTopMargin()
Indicates whether the top margin of the first element is rendered.
See the setIncludeFirstElementTopMargin(boolean) method for a full description of this property.
Returns: true if the top margin of the first element is rendered, otherwise false.
public Renderer setTableCellSeparator(java.lang.String tableCellSeparator)
Sets the string that is to separate table cells.
The default value is " \t" (a space followed by a tab).
Parameters: tableCellSeparator - the string that is to separate table cells.
Returns: this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
See Also: getTableCellSeparator()
public java.lang.String getTableCellSeparator()
Returns the string that is to separate table cells.
See the setTableCellSeparator(String) method for a full description of this property.
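As a rough illustration of the TableCellSeparator property, the sketch below joins the text of the cells in one row with the separator. It assumes the separator is placed only between cells, which the documentation implies but does not state in detail. The class and method names are hypothetical.

```java
final class TableRowSketch {
    static final String DEFAULT_SEPARATOR = " \t"; // documented default: space followed by tab

    /** Joins the text of the cells in one table row using the given separator. */
    static String renderRow(String separator, String... cells) {
        return String.join(separator, cells);
    }
}
```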
public static void setDefaultTopMargin(java.lang.String htmlElementName, int topMargin)
Sets the default top margin of an HTML block element with the specified name.
The top margin is the number of blank lines that are to be inserted above the rendered block.
As this is a static method, the setting affects all instances of the Renderer class.
The htmlElementName argument must be one of the following: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIR, DIV, DT, FIELDSET, FORM, H1, H2, H3, H4, H5, H6, HR, LEGEND, LI, MENU, OL, P, PRE, TR, UL
Parameters: htmlElementName - (required) the case insensitive name of a supported HTML block element.
topMargin - the new top margin of the specified element.
Throws: java.lang.UnsupportedOperationException - if an unsupported element name is specified.
public static int getDefaultTopMargin(java.lang.String htmlElementName)
Returns the default top margin of an HTML block element with the specified name.
See the setDefaultTopMargin(String htmlElementName, int topMargin) method for a full description of this property.
Parameters: htmlElementName - (required) the case insensitive name of a supported HTML block element.
Throws: java.lang.UnsupportedOperationException - if an unsupported element name is specified.
public static void setDefaultBottomMargin(java.lang.String htmlElementName, int bottomMargin)
Sets the default bottom margin of an HTML block element with the specified name.
The bottom margin is the number of blank lines that are to be inserted below the rendered block.
As this is a static method, the setting affects all instances of the Renderer class.
The htmlElementName argument must be one of the following: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIR, DIV, DT, FIELDSET, FORM, H1, H2, H3, H4, H5, H6, HR, LEGEND, LI, MENU, OL, P, PRE, TR, UL
Parameters: htmlElementName - (required) the case insensitive name of a supported HTML block element.
bottomMargin - the new bottom margin of the specified element.
Throws: java.lang.UnsupportedOperationException - if an unsupported element name is specified.
public static int getDefaultBottomMargin(java.lang.String htmlElementName)
Returns the default bottom margin of an HTML block element with the specified name.
See the setDefaultBottomMargin(String htmlElementName, int bottomMargin) method for a full description of this property.
Parameters: htmlElementName - (required) the case insensitive name of a supported HTML block element.
Throws: java.lang.UnsupportedOperationException - if an unsupported element name is specified.
public static void setDefaultIndent(java.lang.String htmlElementName, boolean indent)
Sets the default value of whether an HTML block element of the specified name is indented.
As this is a static method, the setting affects all instances of the Renderer class.
The htmlElementName argument must be one of the following: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIR, DIV, DT, FIELDSET, FORM, H1, H2, H3, H4, H5, H6, HR, LEGEND, MENU, OL, P, PRE, TR, UL
Parameters: htmlElementName - (required) the case insensitive name of a supported HTML block element.
indent - whether the specified element is indented.
Throws: java.lang.UnsupportedOperationException - if an unsupported element name is specified.
public static boolean isDefaultIndent(java.lang.String htmlElementName)
Returns the default value of whether an HTML block element of the specified name is indented.
See the setDefaultIndent(String htmlElementName, boolean indent) method for a full description of this property.
Parameters: htmlElementName - (required) the case insensitive name of a supported HTML block element.
Throws: java.lang.UnsupportedOperationException - if an unsupported element name is specified.
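The contract shared by the static default-setting methods (case insensitive element names, a fixed set of supported block elements, and an UnsupportedOperationException for anything else) can be sketched as a small registry. Only the top margin is modelled here. The class name is hypothetical, and the value 0 returned for elements that have not been set is a placeholder, since the library's actual default margins are not listed in this document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class BlockDefaultsSketch {
    // Supported names, taken from the list documented for setDefaultTopMargin.
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "address", "blockquote", "caption", "center", "dd", "dir", "div", "dt",
            "fieldset", "form", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "legend",
            "li", "menu", "ol", "p", "pre", "tr", "ul"));

    // Static state: a change is visible to every user of this class.
    private static final Map<String, Integer> TOP_MARGIN = new HashMap<>();

    /** Normalises the name to lower case and rejects unsupported elements. */
    private static String key(String htmlElementName) {
        String k = htmlElementName.toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(k)) {
            throw new UnsupportedOperationException("Unsupported element: " + htmlElementName);
        }
        return k;
    }

    static void setDefaultTopMargin(String htmlElementName, int topMargin) {
        TOP_MARGIN.put(key(htmlElementName), topMargin);
    }

    static int getDefaultTopMargin(String htmlElementName) {
        // 0 is a placeholder default for this sketch only.
        return TOP_MARGIN.getOrDefault(key(htmlElementName), 0);
    }
}
```

Because the real methods are static as well, changing a default in one part of an application affects rendering everywhere else in the same class loader.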