public abstract class TagType
extends java.lang.Object
This class is the root abstract class common to all tag types, and contains methods to register and deregister tag types as well as various methods to aid in their implementation.
Every tag type is represented by a singleton instance of a class that must be a subclass of either StartTagType or EndTagType. These two abstract classes, the only direct descendants of this class, represent the two major classifications under which every tag type exists.
Because all TagType instances must be singletons, the '==' operator can be used to test for a particular tag type instead of the equals(Object) method.
The term predefined tag type refers to any of the tag types defined in this library, including both standard and extended tag types.
The term standard tag type refers to any of the tag types represented by instances in static fields of the StartTagType and EndTagType subclasses.
Standard tag types are registered by default, and define the tags most commonly found in HTML documents.
The term extended tag type refers to any predefined tag type that is not a standard tag type.
The PHPTagTypes and MasonTagTypes classes contain extended tag types related to their respective server platforms.
The tag types defined within them must be registered by the user before they are recognised by the parser.
The term custom tag type refers to any user-defined tag type, or any tag type that is not a predefined tag type.
The tag recognition process of the parser gives each tag type a precedence level, which is primarily determined by the length of its start delimiter. A tag type with a more specific start delimiter is chosen in preference to one with a less specific start delimiter, assuming they both share the same prefix. If two tag types have exactly the same start delimiter, the one which was registered later has the higher precedence.
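The precedence rule described above can be modelled in a few lines. This is only an illustrative sketch and not the library's implementation: PrecedenceSketch, Entry, register and candidatesAt are hypothetical names, and the model assumes that candidates are simply the registered types whose start delimiter matches the text at the given position, ordered by delimiter length and then by registration order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the precedence rule; not part of the library.
final class PrecedenceSketch {
    /** Hypothetical stand-in for a registered tag type. */
    static final class Entry {
        final String name;
        final String startDelimiter;
        final int registrationOrder;

        Entry(String name, String startDelimiter, int registrationOrder) {
            this.name = name;
            this.startDelimiter = startDelimiter;
            this.registrationOrder = registrationOrder;
        }
    }

    private final List<Entry> registered = new ArrayList<>();

    /** Records a tag type; later registrations get a higher registration order. */
    void register(String name, String startDelimiter) {
        registered.add(new Entry(name, startDelimiter, registered.size()));
    }

    /** Returns the names of the matching tag types at pos, highest precedence first. */
    List<String> candidatesAt(String text, int pos) {
        List<Entry> matches = new ArrayList<>();
        for (Entry e : registered) {
            // Start delimiters are lower case, so compare ignoring case.
            if (text.regionMatches(true, pos, e.startDelimiter, 0, e.startDelimiter.length())) {
                matches.add(e);
            }
        }
        matches.sort((a, b) -> {
            // Longer (more specific) start delimiters come first.
            int byLength = Integer.compare(b.startDelimiter.length(), a.startDelimiter.length());
            // Identical delimiters: the later registration wins.
            return byLength != 0 ? byLength : Integer.compare(b.registrationOrder, a.registrationOrder);
        });
        List<String> names = new ArrayList<>();
        for (Entry e : matches) {
            names.add(e.name);
        }
        return names;
    }

    public static void main(String[] args) {
        PrecedenceSketch sketch = new PrecedenceSketch();
        sketch.register("NORMAL", "<");
        sketch.register("MARKUP_DECLARATION", "<!");
        sketch.register("COMMENT", "<!--");
        sketch.register("CUSTOM_COMMENT", "<!--");
        System.out.println(sketch.candidatesAt("<!-- note -->", 0));
    }
}
```

In this model a custom type registered with the same start delimiter as the comment type is tried before it, and the less specific types are only reached if the more specific ones reject the text.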
The two special tag types StartTagType.UNREGISTERED and EndTagType.UNREGISTERED represent tags that do not match the syntax of any other tag type. They have the lowest precedence of all the tag types. The Tag.isUnregistered() method provides a detailed explanation of unregistered tags.
See the documentation of the tag parsing process for more information on how each tag is identified by the parser.
Note that the standard HTML element names do not represent different tag types. All standard HTML tags have a tag type of StartTagType.NORMAL or EndTagType.NORMAL, and are also referred to as normal tags.
Apart from the registration related methods, all of the methods in this class and its subclasses relate to the implementation of custom tag types and are not relevant to the majority of users who just use the predefined tag types.
For performance reasons, this library only allows tag types that start with a '<' character.
The character following this defines the immediate subclass of the tag type.
An EndTagType always has a slash ('/') as the second character, while a StartTagType has any character other than a slash as the second character.
This definition means that tag types which are not intuitively classified as either start tag types or end tag types (such as an HTML comment) are mostly classified as start tag types.
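The classification rule can be expressed as a small check on the start delimiter. The sketch below is illustrative only: TagTypeKindSketch, Kind and classify are hypothetical names that do not exist in the library, and a start delimiter consisting of just '<' is assumed to count as a start tag type.

```java
// Hypothetical helper illustrating the classification rule; not part of the library.
final class TagTypeKindSketch {
    enum Kind { START_TAG_TYPE, END_TAG_TYPE }

    /** Classifies a tag type by the second character of its start delimiter. */
    static Kind classify(String startDelimiter) {
        if (startDelimiter.isEmpty() || startDelimiter.charAt(0) != '<') {
            throw new IllegalArgumentException("start delimiter must begin with '<'");
        }
        // A slash in second position marks an end tag type; anything else is a start tag type.
        boolean slashSecond = startDelimiter.length() > 1 && startDelimiter.charAt(1) == '/';
        return slashSecond ? Kind.END_TAG_TYPE : Kind.START_TAG_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(classify("<!--")); // an HTML comment is classified as a start tag type
        System.out.println(classify("</"));
    }
}
```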
Every method in this and the StartTagType and EndTagType abstract classes can be categorised as one of the following:
Modifier and Type | Method and Description |
---|---|
protected abstract Tag | constructTagAt(Source source, int pos): Constructs a tag of this type at the specified position in the specified source document if it matches all of the required features. |
void | deregister(): Deregisters this tag type. |
java.lang.String | getClosingDelimiter(): Returns the character sequence that marks the end of the tag. |
java.lang.String | getDescription(): Returns a description of this tag type useful for debugging purposes. |
protected java.lang.String | getNamePrefix(): Returns the name prefix required by this tag type. |
static java.util.List<TagType> | getRegisteredTagTypes(): Returns a list of all the currently registered tag types in order of lowest to highest precedence. |
java.lang.String | getStartDelimiter(): Returns the character sequence that marks the start of the tag. |
static TagType[] | getTagTypesIgnoringEnclosedMarkup(): Returns an array of all the tag types inside which the parser ignores all non-server tags in parse on demand mode. |
boolean | isServerTag(): Indicates whether this tag type represents a server tag. |
protected boolean | isValidPosition(Source source, int pos, int[] fullSequentialParseData): Indicates whether a tag of this type is valid in the specified position of the specified source document. |
void | register(): Registers this tag type for recognition by the parser. |
static void | setTagTypesIgnoringEnclosedMarkup(TagType[] tagTypes): Sets the tag types inside which the parser ignores all non-server tags. |
protected boolean | tagEncloses(Source source, int pos): Indicates whether a tag of this type encloses the specified position of the specified source document. |
java.lang.String | toString(): Returns a string representation of this object useful for debugging purposes. |
public final void register()
The order of registration affects the precedence of the tag type when a potential tag is being parsed.
See also: deregister()
public final void deregister()
See also: register()
public static final java.util.List<TagType> getRegisteredTagTypes()
public final java.lang.String getDescription()
public final java.lang.String getStartDelimiter()
The character sequence must be all in lower case.
The first character in this property must be '<'.
This is a deliberate limitation of the system which is necessary to retain reasonable performance.
The second character in this property must be '/' if the implementing class is an EndTagType.
It must not be '/' if the implementing class is a StartTagType.
Tag Type | Start Delimiter |
---|---|
StartTagType.UNREGISTERED | < |
StartTagType.NORMAL | < |
StartTagType.COMMENT | <!-- |
StartTagType.XML_DECLARATION | <?xml |
StartTagType.XML_PROCESSING_INSTRUCTION | <? |
StartTagType.DOCTYPE_DECLARATION | <!doctype |
StartTagType.MARKUP_DECLARATION | <! |
StartTagType.CDATA_SECTION | <![cdata[ |
StartTagType.SERVER_COMMON | <% |
EndTagType.UNREGISTERED | </ |
EndTagType.NORMAL | </ |
public final java.lang.String getClosingDelimiter()
The character sequence must be all in lower case.
In a StartTag of a type that has attributes, characters appearing inside a quoted attribute value are ignored when determining the location of the closing delimiter.
Note that the optional '/' character preceding the closing '>' in an empty-element tag is not considered part of the end delimiter.
This property must define the closing delimiter common to all instances of the tag type.
public final boolean isServerTag()
Server tags are typically parsed by some process on the web server and substituted with other text or markup before delivery to the user agent. This parser therefore handles them differently to non-server tags in that they can occur at any position in the document without regard for the HTML document structure. As a result they can occur anywhere inside any other tag, although a non-server tag cannot theoretically occur inside a server tag.
The documentation of the tag parsing process explains in detail how the value of this property affects the recognition of server tags, as well as how the presence of server tags affects the recognition of non-server tags in and around them.
Most XML-style server tags cannot be represented as a distinct tag type because they are generally indistinguishable from non-server XML tags.
See the Segment.ignoreWhenParsing() method for information about how to prevent such server tags from interfering with the proper parsing of the rest of the document.
Tag Type | Is Server Tag |
---|---|
StartTagType.UNREGISTERED | false |
StartTagType.NORMAL | false |
StartTagType.COMMENT | false |
StartTagType.XML_DECLARATION | false |
StartTagType.XML_PROCESSING_INSTRUCTION | false |
StartTagType.DOCTYPE_DECLARATION | false |
StartTagType.MARKUP_DECLARATION | false |
StartTagType.CDATA_SECTION | false |
StartTagType.SERVER_COMMON | true |
EndTagType.UNREGISTERED | false |
EndTagType.NORMAL | false |
Returns: true if this tag type represents a server tag, otherwise false.
protected final java.lang.String getNamePrefix()
This string is identical to the start delimiter, except that it does not include the initial "<" or "</" characters that always prefix the start delimiter of a StartTagType or EndTagType respectively.
The name of a tag of this type may or may not include extra characters after the prefix.
This is determined by properties such as StartTagType.isNameAfterPrefixRequired() or EndTagTypeGenericImplementation.isStatic().
Tag Type | Name Prefix |
---|---|
StartTagType.UNREGISTERED | (empty string) |
StartTagType.NORMAL | (empty string) |
StartTagType.COMMENT | !-- |
StartTagType.XML_DECLARATION | ?xml |
StartTagType.XML_PROCESSING_INSTRUCTION | ? |
StartTagType.DOCTYPE_DECLARATION | !doctype |
StartTagType.MARKUP_DECLARATION | ! |
StartTagType.CDATA_SECTION | ![cdata[ |
StartTagType.SERVER_COMMON | % |
EndTagType.UNREGISTERED | (empty string) |
EndTagType.NORMAL | (empty string) |
See also: getStartDelimiter()
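The relationship between the start delimiter and the name prefix can be sketched as a simple string operation. This is a minimal illustration of the relationship described for getNamePrefix(), not the library's own code; NamePrefixSketch and namePrefix are hypothetical names.

```java
// Hypothetical helper deriving a name prefix from a start delimiter; not part of the library.
final class NamePrefixSketch {
    /** Strips the leading "</" (end tag types) or "<" (start tag types) from a start delimiter. */
    static String namePrefix(String startDelimiter) {
        if (startDelimiter.startsWith("</")) {
            return startDelimiter.substring(2);
        }
        if (startDelimiter.startsWith("<")) {
            return startDelimiter.substring(1);
        }
        throw new IllegalArgumentException("start delimiter must begin with '<'");
    }

    public static void main(String[] args) {
        for (String delimiter : new String[] {"<", "<!--", "<?xml", "<![cdata[", "<%", "</"}) {
            System.out.println("'" + delimiter + "' -> '" + namePrefix(delimiter) + "'");
        }
    }
}
```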
protected boolean isValidPosition(Source source, int pos, int[] fullSequentialParseData)
This method is called immediately before constructTagAt(Source, int pos) to do a preliminary check on the validity of a tag of this type in the specified position.
This check is not performed as part of the constructTagAt(Source, int pos) call because the same validation is used for all the standard tag types, and is likely to be sufficient for all custom tag types.
Having this check separated into a different method helps to isolate common code from the code that is unique to each tag type.
A server tag is valid in any position except inside a server-side comment, but a non-server tag is not valid inside any other tag, nor inside elements with implicit CDATA content such as SCRIPT and STYLE elements.
The common implementation of this method behaves differently depending upon whether or not a full sequential parse is being performed.
For server tags it simply checks that the position is not enclosed by a server-side comment if a full sequential parse is not being performed. If a full sequential parse is being performed, it always returns true for server tags as the parser automatically skips over all positions enclosed by server-side comments, so this method is only called in positions where a server tag is always valid.
When this method is called for non-server tags during a full sequential parse, the fullSequentialParseData argument contains information allowing the exact theoretical check to be performed, rejecting a non-server tag if it is inside any other tag.
See below for further information about the fullSequentialParseData parameter.
When this method is called in parse on demand mode (not during a full sequential parse, fullSequentialParseData==null), practical constraints prevent the exact theoretical check from being carried out, and non-server tags are only rejected if they are found inside HTML comments or CDATA sections.
This behaviour is configurable by manipulating the static TagTypesIgnoringEnclosedMarkup array to determine which tag types cannot contain non-server tags in parse on demand mode.
The documentation of this property contains a more detailed analysis of the subject, detailing some potential problems with this approach and explaining why only the comment and CDATA section tag types are included by default.
See the documentation of the tag parsing process for more information about how this method fits into the whole tag parsing process.
This method can be overridden in custom tag types if the default implementation is unsuitable.
The fullSequentialParseData parameter:
This parameter is used to discard non-server tags that are found inside other tags or inside SCRIPT elements.
In the current version of this library, the fullSequentialParseData argument is either null (in parse on demand mode) or an integer array containing only a single entry (if a full sequential parse is being performed).
The integer contained in the array is the maximum position in the document at which the end of a tag has been found, indicating that no non-server tags should be recognised before that position. If no tags have yet been encountered, the value of this integer is zero.
If the last tag encountered was the start tag of a SCRIPT element, the value of this integer is Integer.MAX_VALUE, indicating that no other non-server elements should be recognised until the end tag of the SCRIPT element is found.
The HTML 4 DTD defines script element content as a special type of CDATA. The XHTML DTD changed it to PCDATA, meaning that HTML elements should be parsed inside script elements if they are not escaped by comments or an explicit CDATA section. The HTML 5 parsing rules reversed this again, making it closer to the original HTML 4 rules. Because this parser is designed to facilitate parsing HTML rather than XHTML, it treats script element content as implicit CDATA, consistent with HTML 4 and HTML 5.
According to the HTML 4.01 specification section 6.2, the first occurrence of the character sequence "</" terminates the special handling of CDATA within SCRIPT and STYLE elements.
This library however only terminates the CDATA handling of SCRIPT element content when the character sequence "</script" is detected, in line with the behaviour of the major browsers and with HTML 5 script element parsing rules.
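The difference between the two termination rules can be seen by locating the end of some script content under each of them. The following is a sketch under stated assumptions rather than the parser's actual code: ScriptEndSketch and its two methods are hypothetical names, and matching "</script" without regard to case is an assumption made here because HTML element names are case-insensitive.

```java
// Hypothetical comparison of the two termination rules; not part of the library.
final class ScriptEndSketch {
    /** HTML 4.01 section 6.2 rule: the first "</" ends the CDATA handling. */
    static int endPerHtml401(String text, int from) {
        return text.indexOf("</", from);
    }

    /** Rule described above: only "</script" ends the CDATA handling (case is ignored here by assumption). */
    static int endPerScriptRule(String text, int from) {
        String needle = "</script";
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1; // no terminating sequence found
    }

    public static void main(String[] args) {
        String content = "document.write(\"</b>\");</SCRIPT>";
        // Under the HTML 4.01 rule the "</b>" inside the string literal already ends the content.
        System.out.println(endPerHtml401(content, 0));
        // Under the "</script" rule the content runs up to the real end tag.
        System.out.println(endPerScriptRule(content, 0));
    }
}
```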
Note that the implicit treatment of SCRIPT element content as CDATA also prevents the recognition of comments and explicit CDATA sections inside script elements.
All major browsers used to recognise comments inside script elements regardless, which is relevant if the script element contains a JavaScript string literal "<script", which would terminate the script element unless it was enclosed in a comment.
Versions 3.0 to 3.2 of this parser therefore also recognised comments inside script elements in a full sequential parse to maintain compatibility with the major browsers, but the latest versions of Gecko and WebKit browsers now correctly ignore comments inside script elements, so as of version 3.3 this parser has also reverted to the correct behaviour.
Although STYLE elements should theoretically be treated in the same way as SCRIPT elements, the syntax of Cascading Style Sheets (CSS) does not contain any constructs that could be misinterpreted as HTML tags, so there is virtually no need to perform any special checks in this case.
IMPLEMENTATION NOTE: The rationale behind using an integer array to hold this value, rather than a scalar int value, is to emulate passing the parameter by reference.
This value needs to be shared amongst several internal methods during the full sequential parse process, and any one of those methods needs to be able to modify the value and pass it back to the calling method.
This would normally be implemented by passing the parameter by reference, but because Java does not support this language construct, a container for a mutable integer must be passed instead.
Because the standard Java library does not provide a class for holding a single mutable integer (the java.lang.Integer class is immutable), the easiest container to use, without creating a class especially for this purpose, is an integer array.
The use of an array does not imply any intention to use more than a single array entry in subsequent versions.
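The single-entry array technique can be illustrated with a short sketch. The class and method names below (MutableIntSketch, recordTagEnd, isInsideEarlierTag) are hypothetical and only model the idea described above: one method updates the shared value through the array, and another reads it to decide whether a position lies before the recorded end of a tag.

```java
// Hypothetical illustration of emulating pass-by-reference with a one-element int array.
final class MutableIntSketch {
    /** Records the end position of a tag by writing into the shared one-element array. */
    static void recordTagEnd(int[] maxEndHolder, int tagEnd) {
        if (tagEnd > maxEndHolder[0]) {
            maxEndHolder[0] = tagEnd; // the caller sees this change because the array is shared
        }
    }

    /** A position before the recorded maximum end lies inside a tag that was already found. */
    static boolean isInsideEarlierTag(int[] maxEndHolder, int pos) {
        return pos < maxEndHolder[0];
    }

    public static void main(String[] args) {
        int[] holder = {0}; // zero: no tags encountered yet
        System.out.println(isInsideEarlierTag(holder, 0));

        recordTagEnd(holder, 12); // a tag ending at position 12 was found
        System.out.println(isInsideEarlierTag(holder, 5));
        System.out.println(isInsideEarlierTag(holder, 12));

        holder[0] = Integer.MAX_VALUE; // models the state after a SCRIPT start tag
        System.out.println(isInsideEarlierTag(holder, 1000));
    }
}
```

Passing a plain int instead would give each method its own copy, so an update made in one method would be invisible to its caller.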
Parameters:
source - the Source document.
pos - the character position in the source document to check.
fullSequentialParseData - an integer array containing data allowing this method to implement a better algorithm when a full sequential parse is being performed, or null in parse on demand mode.
Returns: true if a tag of this type is valid in the specified position of the specified source document, otherwise false.
public static final TagType[] getTagTypesIgnoringEnclosedMarkup()
The tag types returned by this property (referred to in the following paragraphs as the "listed types") default to StartTagType.COMMENT and StartTagType.CDATA_SECTION.
This property is used by the default implementation of the isValidPosition method in parse on demand mode.
It is not used at all during a full sequential parse.
In the default implementation of the isValidPosition method, in parse on demand mode, every new non-server tag found by the parser (referred to as a "new tag") undergoes a check to see whether it is enclosed by a tag of one of the listed types.
This includes new tags of the listed types themselves if they are non-server tags.
The recursive nature of this check means that all tags of the listed types occurring before the new tag must be found by the parser before it can determine whether the new tag should be ignored.
To mitigate any performance issues arising from this process, the listed types are given special treatment in the tag cache.
This dramatically decreases the time taken to search on these tag types, so adding a tag type to this array that is easily recognised and occurs infrequently only results in a small degradation in overall performance.
A special exception to the algorithm described above applies to COMMENT tags.
The default implementation of the isValidPosition method does not check whether a COMMENT tag is inside another COMMENT tag.
The only syntactically valid way that could occur is a construct like <!-- ... <!-->, as the characters '--' should not occur inside a comment.
This construct does actually occur in practice in a Microsoft downlevel-revealed conditional comment tag (<!--[if ... ]><!-->), which may cause problems if documents containing these tags are searched using parse on demand mode.
The only way to resolve this issue in all instances would be for the default implementation of the isValidPosition method to recursively check every COMMENT tag back to the start of the document, which results in far worse problems such as potential stack overflows in large documents containing lots of comments and much worse performance in general.
If the presence of such nested comments is an issue in your case, the best solution is to perform a full sequential parse.
Theoretically, non-server tags appearing inside any other tag should be ignored, which is how the parser behaves during a full sequential parse.
Server tags in particular very often contain other "tags" that should not be recognised as tags by the parser.
If this behaviour is required in parse on demand mode, the tag type of each server tag that might be found in the source documents can be added to this property using the static setTagTypesIgnoringEnclosedMarkup(TagType[]) method.
For example, the following command would prevent non-server tags from being recognised inside standard PHP tags,
as well as the default comment and CDATA section tags:
TagType.setTagTypesIgnoringEnclosedMarkup(new TagType[] {PHPTagTypes.PHP_STANDARD, StartTagType.COMMENT, StartTagType.CDATA_SECTION});
The only situation where a non-server tag can legitimately contain a sequence of characters that resembles a tag is within an attribute value.
The HTML 4.01 specification section 5.3.2 specifically allows the presence of '<' and '>' characters within attribute values.
A common occurrence of this is in event attributes containing scripts, such as the onclick attribute.
There is no way of preventing such "tags" from being recognised in parse on demand mode, as adding StartTagType.NORMAL to this property as a listed type would be far too inefficient.
Performing a full sequential parse of the source document prevents these attribute values from being recognised as tags, but can be very expensive if only a few tags in the document need to be parsed.
The penalty of not parsing every tag in the document is that the exactness of this check is compromised, but in practical terms the difference is inconsequential.
The default listed types of comments and CDATA sections yield sensible results in the vast majority of practical applications with only a minor impact on performance.
In XHTML, '<' and '>' characters must be represented in attribute values as character references (see the XML 1.0 specification section 3.1), so the situation should never arise that a tag is found inside another tag unless one of them is a server tag.
public static final void setTagTypesIgnoringEnclosedMarkup(TagType[] tagTypes)
See getTagTypesIgnoringEnclosedMarkup() for the documentation of this property.
Parameters:
tagTypes - an array of tag types.
protected abstract Tag constructTagAt(Source source, int pos)
The implementation of this method must check that the text at the specified position meets all of the criteria of this tag type, including such checks as the presence of the correct or well formed closing delimiter, name, attributes, end tag, or any other distinguishing features.
It can be assumed that the specified position starts with the start delimiter of this tag type, and that all other tag types with higher precedence (if any) have already been rejected as candidates.
Tag types with lower precedence will be considered if this method returns null.
This method is only called after a successful check of the tag's position, i.e. isValidPosition(source,pos,fullSequentialParseData)==true.
The StartTagTypeGenericImplementation and EndTagTypeGenericImplementation subclasses provide default implementations of this method that allow the use of much simpler properties and implementation assistance methods to carry out the required functions.
Parameters:
source - the Source document.
pos - the position in the source document.
Returns: the constructed tag, or null if the text at the specified position does not meet the criteria.
protected final boolean tagEncloses(Source source, int pos)
This is logically equivalent to source.getEnclosingTag(pos,this)!=null, but is safe to use within other implementation methods without the risk of causing an infinite recursion.
This method is called from the default implementation of the isValidPosition(Source, int pos, int[] fullSequentialParseData) method.
Parameters:
source - the Source document.
pos - the character position in the source document to check.
Returns: true if a tag of this type encloses the specified position of the specified source document, otherwise false.
public java.lang.String toString()
Overrides: toString in class java.lang.Object