public abstract class CharacterReference extends Segment
Represents either a CharacterEntityReference or a NumericCharacterReference.
This class, together with its subclasses, contains static methods to perform most required operations without having to instantiate an object.
Instances of this class are useful when the positions of character references in a source document are required, or to replace the found character references with customised text.
CharacterReference instances can be obtained using methods such as parse(CharSequence) and Segment.getAllCharacterReferences().
Modifier and Type | Field and Description |
---|---|
static int | INVALID_CODE_POINT: Represents an invalid unicode code point. |
Modifier and Type | Method and Description |
---|---|
void | appendCharTo(java.lang.Appendable appendable): Appends the character represented by this character reference to the specified appendable object. |
static java.lang.String | decode(java.lang.CharSequence encodedText): Decodes the specified HTML encoded text into normal text. |
static java.lang.String | decode(java.lang.CharSequence encodedText, boolean insideAttributeValue): Decodes the specified HTML encoded text into normal text. |
static java.lang.String | decodeCollapseWhiteSpace(java.lang.CharSequence text): Decodes the specified text after collapsing its white space. |
static java.lang.String | encode(char ch): Encodes the specified character into a character reference if required. |
static java.lang.String | encode(java.lang.CharSequence unencodedText): Encodes the specified text, escaping certain characters into character references. |
static java.lang.String | encode(java.lang.CharSequence unencodedText, boolean insideAttributeValue): Encodes the specified text, escaping certain characters into character references. |
static java.lang.String | encodeWithWhiteSpaceFormatting(java.lang.CharSequence unencodedText): Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup. |
char | getChar(): Returns the character represented by this character reference. |
abstract java.lang.String | getCharacterReferenceString(): Returns the encoded form of this character reference. |
static java.lang.String | getCharacterReferenceString(int codePoint): Returns the encoded form of the specified unicode code point. |
int | getCodePoint(): Returns the unicode code point represented by this character reference. |
static int | getCodePointFromCharacterReferenceString(java.lang.CharSequence characterReferenceText): Parses a single encoded character reference text into a unicode code point. |
java.lang.String | getDecimalCharacterReferenceString(): Returns the decimal encoded form of this character reference. |
static java.lang.String | getDecimalCharacterReferenceString(int codePoint): Returns the decimal encoded form of the specified unicode code point. |
static java.io.Writer | getEncodingFilterWriter(java.io.Writer writer): Returns a Writer that encodes all text before passing it through to the specified Writer. |
java.lang.String | getHexadecimalCharacterReferenceString(): Returns the hexadecimal encoded form of this character reference. |
static java.lang.String | getHexadecimalCharacterReferenceString(int codePoint): Returns the hexadecimal encoded form of the specified unicode code point. |
java.lang.String | getUnicodeText(): Returns the unicode code point of this character reference in U+ notation. |
static java.lang.String | getUnicodeText(int codePoint): Returns the specified unicode code point in U+ notation. |
boolean | isTerminated(): Indicates whether this character reference is terminated by a semicolon (;). |
static CharacterReference | parse(java.lang.CharSequence characterReferenceText): Parses a single encoded character reference text into a CharacterReference object. |
static java.lang.String | reencode(java.lang.CharSequence encodedText): Re-encodes the specified text, equivalent to decoding and then encoding again. |
static boolean | requiresEncoding(char ch): Deprecated. Use Config.CurrentCharacterReferenceEncodingBehaviour instead. |
Methods inherited from class Segment:
charAt, compareTo, encloses, encloses, equals, getAllCharacterReferences, getAllElements, getAllElements, getAllElements, getAllElements, getAllElements, getAllElementsByClass, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTagsByClass, getAllTags, getAllTags, getBegin, getChildElements, getDebugInfo, getEnd, getFirstElement, getFirstElement, getFirstElement, getFirstElement, getFirstElementByClass, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTagByClass, getFormControls, getFormFields, getMaxDepthIndicator, getNodeIterator, getRenderer, getRowColumnVector, getSource, getStyleURISegments, getTextExtractor, getURIAttributes, hashCode, ignoreWhenParsing, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString
public static final int INVALID_CODE_POINT
This can be the result of parsing a numeric character reference outside of the valid unicode range of 0x000000-0x10FFFF, or any other invalid character reference.
public int getCodePoint()
Returns the unicode code point represented by this character reference.
See Also:
appendCharTo(Appendable)
public char getChar()
Returns the character represented by this character reference.
If this character reference represents a unicode supplementary code point, any bits outside of the least significant 16 bits of the code point are truncated, yielding an incorrect result.
To ensure that the character is correctly appended to an Appendable object such as a Writer, use the code:
characterReference.appendCharTo(appendable)
instead of:
appendable.append(characterReference.getChar())
See Also:
appendCharTo(Appendable), getCodePoint()
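The truncation described for getChar() can be reproduced with the standard library alone. The following is a minimal sketch that does not call this library; the class and method names are hypothetical, and the cast is assumed to mirror the narrowing to 16 bits described above.

```java
final class SupplementaryCharDemo {
    /** Keeps only the least significant 16 bits of the code point (lossy for supplementary code points). */
    static char truncateToChar(int codePoint) {
        return (char) codePoint;
    }

    /** Converts a code point to text, producing a surrogate pair when one is needed. */
    static String toText(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a supplementary code point (above U+FFFF)
        System.out.println(Integer.toHexString(truncateToChar(codePoint))); // prints f600
        System.out.println(toText(codePoint).length());                     // prints 2
        System.out.println(toText(codePoint).codePointAt(0) == codePoint);  // prints true
    }
}
```

The single char obtained by truncation no longer identifies the original code point, which is why appending via the code point is the safer route.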
public final void appendCharTo(java.lang.Appendable appendable) throws java.io.IOException
Appends the character represented by this character reference to the specified appendable object.
If this character is a unicode supplementary character, then both the UTF-16 high/low surrogate char values of the character are appended, as described in the Unicode character representations section of the java.lang.Character class.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then calling this method on a non-breaking space character reference (&nbsp;) results in a normal space being appended.
Parameters:
appendable - the object to append this character reference to.
Throws:
java.io.IOException
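As an illustration of the behaviour described for appendCharTo, here is a rough standalone sketch. It is not the library's implementation: the class, the helper names, and the boolean flag standing in for Config.ConvertNonBreakingSpaces are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class AppendCodePointSketch {
    static final int NBSP = 0x00A0; // non-breaking space

    /** Appends a code point, emitting both surrogates for supplementary characters. */
    static void appendCodePoint(Appendable out, int codePoint, boolean convertNonBreakingSpaces)
            throws IOException {
        if (convertNonBreakingSpaces && codePoint == NBSP) {
            out.append(' '); // non-breaking space becomes a normal space
        } else if (Character.isSupplementaryCodePoint(codePoint)) {
            out.append(Character.highSurrogate(codePoint));
            out.append(Character.lowSurrogate(codePoint));
        } else {
            out.append((char) codePoint);
        }
    }

    /** Convenience wrapper for in-memory use; a StringBuilder never actually throws IOException. */
    static String render(int codePoint, boolean convertNonBreakingSpaces) {
        StringBuilder sb = new StringBuilder();
        try {
            appendCodePoint(sb, codePoint, convertNonBreakingSpaces);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

For example, `AppendCodePointSketch.render(0x00A0, true)` yields a normal space, while passing `false` keeps the non-breaking space.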
public boolean isTerminated()
Indicates whether this character reference is terminated by a semicolon (;).
Conversely, this library defines an unterminated character reference as one which does not end with a semicolon.
The SGML specification allows unterminated character references in some circumstances, and because the HTML 4.01 specification states simply that "authors may use SGML character references", it follows that they are also valid in HTML documents, although their use is strongly discouraged.
Unterminated character references are not allowed in XHTML documents.
Returns:
true if this character reference is terminated by a semicolon, otherwise false.
See Also:
decode(CharSequence encodedText, boolean insideAttributeValue)
public static java.lang.String encode(java.lang.CharSequence unencodedText)
Encodes the specified text, escaping certain characters into character references.
This is equivalent to encode(unencodedText,true).
Parameters:
unencodedText - the text to encode.
public static java.lang.String encode(java.lang.CharSequence unencodedText, boolean insideAttributeValue)
Encodes the specified text, escaping certain characters into character references.
The Config.CurrentCharacterReferenceEncodingBehaviour setting determines which characters are encoded.
For characters that are to be encoded, the CharacterEntityReference is used if possible, otherwise a NumericCharacterReference is used.
The only exception to this is an apostrophe (U+0027), which is encoded as the numeric character reference "&#39;" rather than its character entity reference "&apos;", as this entity is not defined for use in HTML. See the comments in the CharacterEntityReference class for more information.
Specifying a value of true as an argument to the insideAttributeValue parameter ensures that double quote characters (") are encoded. If a value of false is specified, the default behaviour is to leave them unencoded.
To encode text using only numeric character references, use the NumericCharacterReference.encode(CharSequence) method instead.
Parameters:
unencodedText - the text to encode.
insideAttributeValue - specifies whether the output must be valid inside a quoted attribute value.
See Also:
decode(CharSequence)
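To make the role of insideAttributeValue concrete, the sketch below hand-rolls a tiny encoder using only the standard library. It is an illustration under stated assumptions, not the library's algorithm: the set of encoded characters (&, <, >, apostrophe and, inside attribute values, the double quote) is assumed for the example, whereas the real set is governed by Config.CurrentCharacterReferenceEncodingBehaviour.

```java
final class EncodeSketch {
    /** Encodes a small, assumed set of characters; double quotes only inside attribute values. */
    static String encode(CharSequence text, boolean insideAttributeValue) {
        StringBuilder sb = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '&':  sb.append("&amp;"); break;
                case '<':  sb.append("&lt;");  break;
                case '>':  sb.append("&gt;");  break;
                case '\'': sb.append("&#39;"); break; // numeric form, since &apos; is not defined for HTML
                case '"':
                    if (insideAttributeValue) sb.append("&quot;"); else sb.append(ch);
                    break;
                default:   sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, `EncodeSketch.encode("say \"hi\"", true)` escapes the quotes, while the same call with `false` leaves them untouched.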
public static java.lang.String encode(char ch)
Encodes the specified character into a character reference if required.
The encoding of the character follows the same rules as for each character in the encode(CharSequence unencodedText, boolean insideAttributeValue) method, with insideAttributeValue set to true.
Parameters:
ch - the character to encode.
public static java.lang.String encodeWithWhiteSpaceFormatting(java.lang.CharSequence unencodedText)
Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup.
This performs the same encoding as encode(CharSequence,false), but also performs the following conversions:
- Line breaks are converted to "<br />". CR/LF pairs are treated as a single line break.
- Multiple consecutive spaces are converted so that every second space is converted to "&nbsp;" while ensuring the last is always a normal space.
The conversion of multiple consecutive spaces to alternating space/non-breaking-space allows the correct number of spaces to be rendered, but also allows the line to wrap in the middle of it.
Note that zero-width spaces (U+200B) are converted to the numeric character reference "&#x200B;" through the normal encoding process, but IE6 does not render them properly either encoded or unencoded.
There is no method provided to reverse this encoding.
Parameters:
unencodedText - the text to encode.
See Also:
encode(CharSequence)
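The white space conversions can be sketched in isolation as follows. This is one possible reading of the rules, not the library's code: the class name is hypothetical, tabs and ordinary character encoding are deliberately left out, and the alternation is assumed to be counted from the end of each run so that the last space is always a normal one.

```java
final class WhiteSpaceFormattingSketch {
    /** Converts line breaks to "<br />" and alternates spaces with "&nbsp;" in runs of spaces. */
    static String format(CharSequence text) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        int n = text.length();
        while (i < n) {
            char ch = text.charAt(i);
            if (ch == '\r' || ch == '\n') {
                sb.append("<br />");
                // A CR immediately followed by LF counts as a single line break.
                if (ch == '\r' && i + 1 < n && text.charAt(i + 1) == '\n') i++;
                i++;
            } else if (ch == ' ') {
                int start = i;
                while (i < n && text.charAt(i) == ' ') i++;
                int run = i - start;
                for (int k = 0; k < run; k++) {
                    // Counting from the end of the run keeps the final space a normal one.
                    boolean normalSpace = (run - 1 - k) % 2 == 0;
                    sb.append(normalSpace ? " " : "&nbsp;");
                }
            } else {
                sb.append(ch);
                i++;
            }
        }
        return sb.toString();
    }
}
```

Because every other space stays a normal space, a user agent can still wrap the line inside the run while rendering the full number of spaces.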
public static java.lang.String decode(java.lang.CharSequence encodedText)
Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references are converted to their respective characters.
This is equivalent to decode(encodedText,false).
Unterminated character references are dealt with according to the rules for text outside of attribute values in the current compatibility mode.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then all non-breaking space (&nbsp;) character entity references are converted to normal spaces.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case, some browsers also recognise them in a case-insensitive way. For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
Parameters:
encodedText - the text to decode.
See Also:
encode(CharSequence)
public static java.lang.String decode(java.lang.CharSequence encodedText, boolean insideAttributeValue)
Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references are converted to their respective characters.
Unterminated character references are dealt with according to the value of the insideAttributeValue parameter and the current compatibility mode.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then all non-breaking space (&nbsp;) character entity references are converted to normal spaces.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case, some browsers also recognise them in a case-insensitive way. For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
Parameters:
encodedText - the text to decode.
insideAttributeValue - specifies whether the encoded text is inside an attribute value.
See Also:
decode(CharSequence), encode(CharSequence)
public static java.lang.String decodeCollapseWhiteSpace(java.lang.CharSequence text)
Decodes the specified text after collapsing its white space.
All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.
The result is how the text would normally be rendered by a user agent, assuming it does not contain any tags.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then all non-breaking space (&nbsp;) character entity references are converted to normal spaces.
For consistency with the rendered output of most user agents these converted spaces are not treated as white space, so they are not collapsed and not trimmed.
Unterminated character references are dealt with according to the rules for text outside of attribute values in the current compatibility mode.
See the discussion of the insideAttributeValue parameter of the decode(CharSequence, boolean insideAttributeValue) method for a more detailed explanation of this topic.
Parameters:
text - the source text
See Also:
FormControl.getPredefinedValues()
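The collapsing step on its own might look like the sketch below. It covers only white space handling, not decoding, and it assumes that white space means space, tab, line feed, form feed and carriage return; the class and method names are hypothetical.

```java
final class CollapseWhiteSpaceSketch {
    /** Assumed definition of white space for this sketch; U+00A0 is deliberately excluded. */
    static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\f' || ch == '\r';
    }

    /** Trims leading and trailing white space and collapses internal runs to a single space. */
    static String collapse(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isWhiteSpace(ch)) {
                // Only remember a space once some output exists, which drops leading white space.
                pendingSpace = sb.length() > 0;
            } else {
                if (pendingSpace) sb.append(' ');
                pendingSpace = false;
                sb.append(ch);
            }
        }
        // A pending space at the end is never written, which drops trailing white space.
        return sb.toString();
    }
}
```

Since U+00A0 is not in the assumed white space set, non-breaking spaces pass through untouched, in the spirit of the note above about converted spaces not being collapsed or trimmed.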
public static java.lang.String reencode(java.lang.CharSequence encodedText)
Re-encodes the specified text, equivalent to decoding and then encoding again.
This process ensures that the specified encoded text does not contain any remaining unencoded characters.
IMPLEMENTATION NOTE: At present this method simply calls the decode method followed by the encode method, both with insideAttributeValue set to true.
Parameters:
encodedText - the text to re-encode.
public abstract java.lang.String getCharacterReferenceString()
Returns the encoded form of this character reference.
The exact behaviour of this method depends on the class of this object.
See the CharacterEntityReference.getCharacterReferenceString() and NumericCharacterReference.getCharacterReferenceString() methods for more details.
Examples:
CharacterReference.parse("&gt;").getCharacterReferenceString() returns "&gt;"
CharacterReference.parse("&#x3e;").getCharacterReferenceString() returns "&#x3e;"
See Also:
getCharacterReferenceString(int codePoint), getDecimalCharacterReferenceString()
public static java.lang.String getCharacterReferenceString(int codePoint)
Returns the encoded form of the specified unicode code point.
This method returns the character entity reference encoded form of the unicode code point if one exists, otherwise it returns the decimal character reference encoded form.
The only exception to this is an apostrophe (U+0027), which is encoded as the numeric character reference "&#39;" instead of its character entity reference "&apos;".
Examples:
CharacterReference.getCharacterReferenceString(62) returns "&gt;"
CharacterReference.getCharacterReferenceString('>') returns "&gt;"
CharacterReference.getCharacterReferenceString('☺') returns "&#9786;"
Parameters:
codePoint - the unicode code point to encode.
See Also:
getHexadecimalCharacterReferenceString(int codePoint)
public java.lang.String getDecimalCharacterReferenceString()
Returns the decimal encoded form of this character reference.
This is equivalent to getDecimalCharacterReferenceString(getCodePoint()).
Example:
CharacterReference.parse("&gt;").getDecimalCharacterReferenceString() returns "&#62;"
See Also:
getCharacterReferenceString(), getHexadecimalCharacterReferenceString()
public static java.lang.String getDecimalCharacterReferenceString(int codePoint)
Returns the decimal encoded form of the specified unicode code point.
Example:
CharacterReference.getDecimalCharacterReferenceString('>') returns "&#62;"
Parameters:
codePoint - the unicode code point to encode.
See Also:
getCharacterReferenceString(int codePoint), getHexadecimalCharacterReferenceString(int codePoint)
public java.lang.String getHexadecimalCharacterReferenceString()
Returns the hexadecimal encoded form of this character reference.
This is equivalent to getHexadecimalCharacterReferenceString(getCodePoint()).
Example:
CharacterReference.parse("&gt;").getHexadecimalCharacterReferenceString() returns "&#x3e;"
See Also:
getCharacterReferenceString(), getDecimalCharacterReferenceString()
public static java.lang.String getHexadecimalCharacterReferenceString(int codePoint)
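Returns the hexadecimal encoded form of the specified unicode code point.
Example:
CharacterReference.getHexadecimalCharacterReferenceString('>') returns "&#x3e;"
Parameters:
codePoint - the unicode code point to encode.
See Also:
getCharacterReferenceString(int codePoint), getDecimalCharacterReferenceString(int codePoint)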
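The decimal and hexadecimal forms are simple to derive from a code point. The sketch below shows one way to build them with the standard library; the class and method names are hypothetical, and the lower-case hexadecimal digits are an assumption chosen for the example.

```java
final class NumericReferenceFormatSketch {
    /** Builds the decimal form, e.g. 62 becomes "&#62;". */
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Builds the hexadecimal form with lower-case digits, e.g. 62 becomes "&#x3e;". */
    static String hexadecimal(int codePoint) {
        return "&#x" + Integer.toString(codePoint, 16) + ";";
    }

    public static void main(String[] args) {
        System.out.println(decimal('>'));     // prints &#62;
        System.out.println(hexadecimal('>')); // prints &#x3e;
        System.out.println(decimal(0x263A));  // prints &#9786;
    }
}
```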
public java.lang.String getUnicodeText()
Returns the unicode code point of this character reference in U+ notation.
This is equivalent to getUnicodeText(getCodePoint()).
Example:
CharacterReference.parse("&gt;").getUnicodeText() returns "U+003E"
See Also:
getUnicodeText(int codePoint)
public static java.lang.String getUnicodeText(int codePoint)
Returns the specified unicode code point in U+ notation.
Example:
CharacterReference.getUnicodeText('>') returns "U+003E"
Parameters:
codePoint - the unicode code point.
public static CharacterReference parse(java.lang.CharSequence characterReferenceText)
Parses a single encoded character reference text into a CharacterReference object.
The character reference must be at the start of the given text, but may contain other characters at the end.
The getEnd() method can be used on the resulting object to determine at which character position the character reference ended.
If the text does not represent a valid character reference, this method returns null.
Unterminated character references are always accepted, regardless of the settings in the current compatibility mode.
To decode all character references in a given text, use the decode(CharSequence) method instead.
Example:
CharacterReference.parse("&gt;").getChar() returns '>'
Parameters:
characterReferenceText - the text containing a single encoded character reference.
Returns:
a CharacterReference object representing the specified text, or null if the text does not represent a valid character reference.
See Also:
decode(CharSequence)
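The parsing rules above (reference at the start of the text, trailing characters allowed, unterminated references accepted) can be illustrated with a small standalone sketch. It handles numeric character references only and is not the library's parser; the class name and the INVALID constant are hypothetical stand-ins.

```java
final class NumericReferenceParseSketch {
    static final int INVALID = -1; // hypothetical marker for "not a valid reference"

    /** Returns the code point of a numeric character reference at the start of text, or INVALID. */
    static int codePointAtStart(CharSequence text) {
        int n = text.length();
        if (n < 3 || text.charAt(0) != '&' || text.charAt(1) != '#') return INVALID;
        int i = 2;
        int radix = 10;
        if (text.charAt(i) == 'x' || text.charAt(i) == 'X') {
            radix = 16;
            i++;
        }
        int value = 0;
        int digits = 0;
        while (i < n) {
            char c = text.charAt(i);
            int d = c < 128 ? Character.digit(c, radix) : -1; // ASCII digits only
            if (d < 0) break; // a semicolon or any other character ends the reference
            value = value * radix + d;
            if (value > Character.MAX_CODE_POINT) return INVALID; // beyond 0x10FFFF
            digits++;
            i++;
        }
        return digits == 0 ? INVALID : value;
    }
}
```

With this sketch, both `"&#62;"` and the unterminated `"&#62 more text"` yield 62, while `"&#x110000;"` is rejected as out of range.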
public static int getCodePointFromCharacterReferenceString(java.lang.CharSequence characterReferenceText)
Parses a single encoded character reference text into a unicode code point.
The character reference must be at the start of the given text, but may contain other characters at the end.
If the text does not represent a valid character reference, this method returns INVALID_CODE_POINT.
This is equivalent to parse(characterReferenceText).getCodePoint(), except that it returns INVALID_CODE_POINT if an invalid character reference is specified instead of throwing a NullPointerException.
Example:
CharacterReference.getCodePointFromCharacterReferenceString("&gt;") returns 62
Parameters:
characterReferenceText - the text containing a single encoded character reference.
Returns:
the unicode code point represented by the specified text, or INVALID_CODE_POINT if the text does not represent a valid character reference.
@Deprecated public static final boolean requiresEncoding(char ch)
Deprecated. Use Config.CurrentCharacterReferenceEncodingBehaviour instead.
public static java.io.Writer getEncodingFilterWriter(java.io.Writer writer)
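Returns a Writer that encodes all text before passing it through to the specified Writer.
Parameters:
writer - the destination for the encoded text
Returns:
a Writer that encodes all text before passing it through to the specified Writer.
See Also:
encode(CharSequence unencodedText)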
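The idea of an encoding filter can be sketched with java.io.FilterWriter. The class below is a hypothetical stand-in, not the Writer returned by getEncodingFilterWriter, and the handful of characters it escapes is an assumption made to keep the example short.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class EncodingFilterWriterSketch extends FilterWriter {
    EncodingFilterWriterSketch(Writer out) {
        super(out);
    }

    /** Encodes a single character before passing it to the wrapped Writer. */
    @Override
    public void write(int c) throws IOException {
        switch (c) {
            case '&': out.write("&amp;");  break;
            case '<': out.write("&lt;");   break;
            case '>': out.write("&gt;");   break;
            case '"': out.write("&quot;"); break;
            default:  out.write(c);
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) write(cbuf[i]);
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) write(str.charAt(i));
    }

    /** Convenience helper that runs text through the filter into an in-memory buffer. */
    static String encodeViaWriter(String text) {
        StringWriter target = new StringWriter();
        try (Writer w = new EncodingFilterWriterSketch(target)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.toString();
    }
}
```

Wrapping the destination this way lets callers write unencoded text incrementally without building the whole encoded string in memory first.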