Jericho HTML Parser
Release Notes
3.3 (2012-10-31)
- Bug Fixes:
- [3581664] CharacterReference.decode() does not decode entities
containing digits - &frac12; &frac14; &frac34; &sup1; &sup2; &sup3;
&there4;
- [3311286] SourceCompactor does not respect TEXTAREA
- [3519131] Renderer output incorrect when constructed with an
Element object.
- [3538829] Renderer output of font decoration on block boundaries
incorrect.
- Segment.getAllStartTags(name) and Segment.getFirstElement(name)
do not work if the argument contains upper case characters.
- The end delimiter of a common server tag inside an escaped server
tag is falsely recognised as the end delimiter of the escaped tag.
- CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
- [3427073] Segment.getStyleURISegments() now includes style element
content as well as style attribute values.
- [3427927] Segment.getURIAttributes() now includes the archive
attributes of object and applet elements.
- Comments no longer recognised inside script elements during full
sequential parse. Previously they were recognised for compatibility
with major browsers but modern browser behaviour has changed.
- Changed the log level of all parsing errors from INFO to ERROR, and
the log level of the Source.fullSequentialParse() advisory message
from WARN to INFO. The previous levels gave the advisory message a
higher severity than the parsing errors, preventing logging systems
from hiding the advisory message while showing parsing errors.
Character encoding warnings remain unchanged at WARN level.
- Changed the behaviour of the Renderer.renderHyperlinkURL(StartTag)
method so that relative URLs are not rendered.
- Changed the behaviour of the Renderer so that hyperlink element
content is not rendered if it is the same as the hyperlink URL,
ignoring any http:// prefix or / suffix.
- EndTag.tidy() now removes whitespace before the closing bracket.
- Added Source(File) constructor.
- Added OutputDocument.getSegment() method.
- Added OutputDocument.remove(int begin, int end) method.
- Added Renderer.setHRLineLength() method.
- Added RenderToText.jsp webapp sample.
- Added Segment.getRowColumnVector() method.
- Encoding detection now ignores common encodings specified in meta tags
that have a code unit size incompatible with the preliminary encoding.
- Upgraded to the following logger APIs:
slf4j-api-1.7.2, log4j-1.2.17
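
The reasoning behind the 3.3 log level change can be illustrated with a threshold filter. The following is a minimal sketch in plain Java, not Jericho's logging code: the LogLevelDemo class, its Level enum and the visible() helper are hypothetical names used only for illustration, and the sketch assumes a logging system that shows every message at or above a configured threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are part of Jericho.
class LogLevelDemo {
    // Severity order assumed here: INFO < WARN < ERROR.
    enum Level { INFO, WARN, ERROR }

    // Returns the labels of the messages a threshold-based logger would show.
    static List<String> visible(Level threshold, Level parseErrorLevel, Level advisoryLevel) {
        List<String> shown = new ArrayList<>();
        if (parseErrorLevel.compareTo(threshold) >= 0) shown.add("parsing error");
        if (advisoryLevel.compareTo(threshold) >= 0) shown.add("advisory");
        return shown;
    }

    public static void main(String[] args) {
        // Before 3.3: parsing errors at INFO, advisory message at WARN.
        // Any threshold low enough to show parsing errors also shows the advisory.
        System.out.println(visible(Level.INFO, Level.INFO, Level.WARN));
        // From 3.3: parsing errors at ERROR, advisory message at INFO.
        // A WARN threshold keeps the parsing errors and hides the advisory.
        System.out.println(visible(Level.WARN, Level.ERROR, Level.INFO));
    }
}
```

With the old levels no threshold shows parsing errors without the advisory message; with the new levels a WARN or ERROR threshold does.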
3.2 (2011-01-30)
- Bug Fixes:
- [2826979] IllegalCharsetNameException thrown when illegal encoding
specified in the document.
- [2837434] Potential multithreading bug in Source.getNewLine()
- [3036182] NullPointerException when run with stringent java.policy
- TextExtractor did not include any attribute values.
- All unterminated character references were decoded regardless of the
configuration settings (bug introduced in 3.1).
- Renderer class -
under
resulted in new line.
- SourceFormatter did not handle TEXTAREA elements correctly.
- No exceptions thrown if invalid charset is specified by server or in
source document.
- Byte order mark character was included in the source document.
- HTML5 elements added to HTMLElementName and HTMLElements classes.
- Detects HTML5 character encoding declaration.
- Uses Windows-1252 as the default 8-bit encoding when available instead
of the subset encoding ISO-8859-1.
- Added Renderer.setIncludeAlternateText(boolean) method.
- Added Renderer.renderAlternateText(StartTag) method.
- Added Renderer.setIncludeFirstElementTopMargin(boolean) method.
- Added Renderer.setDefaultTopMargin(String,int) static method.
- Added Renderer.setDefaultBottomMargin(String,int) static method.
- Added Renderer.setDefaultIndent(String,boolean) static method.
- Renderer now evaluates inline styles for top, bottom and left margins.
- Added Attribute.getStartTag() method.
- Added Segment.getURIAttributes() method.
- Added Segment.getStyleURISegments() method.
- Added deregister() methods to the extended tag type classes.
- Added MicrosoftConditionalCommentTagTypes class.
- Added StartTagType.SERVER_COMMON_COMMENT tag type.
- SourceFormatter now inlines DOCTYPE tags.
- Added Segment.getMaxDepthIndicator() method.
- Added static Config.IsHTMLEmptyElementTagRecognised parameter.
- Deprecated MicrosoftTagTypes class.
- Upgraded to the following logger APIs:
slf4j-api-1.6.1, log4j-1.2.16
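
The practical effect of preferring Windows-1252 over ISO-8859-1 shows up in the byte range 0x80 to 0x9F. The following is a minimal sketch using only the standard java.nio.charset API; it is not Jericho's encoding detection code, and the DefaultEightBitCharsetDemo class and its methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not Jericho's encoding detection code.
class DefaultEightBitCharsetDemo {
    // Prefer Windows-1252 when the runtime provides it, otherwise fall back to ISO-8859-1.
    static Charset defaultEightBitCharset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    // Decodes a single byte with the given charset.
    static String decodeByte(int value, Charset charset) {
        return new String(new byte[] {(byte) value}, charset);
    }

    public static void main(String[] args) {
        Charset chosen = defaultEightBitCharset();
        System.out.println("Chosen charset: " + chosen.name());
        // Byte 0x80 is the euro sign in Windows-1252 but a C1 control character in ISO-8859-1.
        System.out.printf("0x80 decodes to U+%04X%n", (int) decodeByte(0x80, chosen).charAt(0));
        // Bytes 0xA0 to 0xFF decode identically in both encodings.
        System.out.printf("0xE9 decodes to U+%04X%n", (int) decodeByte(0xE9, chosen).charAt(0));
    }
}
```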
3.1 (2009-06-11)
- Bug Fixes:
- [2793556] Infinite loop on Segment.getAllStartTags()
- Infinite loop on Segment.getAllElements()
- Segment.getFirst* methods returned segments outside the bounding
segment.
- Segment.getAllElements methods did not return all enclosed elements
in some circumstances.
- Fixed documentation errors in Segment.getAllElements methods.
- Added StreamedSource class.
- CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
- Changed ParseText from class to interface.
- Segment.getNodeIterator() now returns character references as
separate nodes.
- Added tag search methods based on attribute value regular expressions.
- Added tag search methods based on HTML class attribute.
- Added static Source.LegacyNodeIteratorCompatabilityMode property
temporarily to restore Segment.getNodeIterator() functionality to
that of previous versions.
- Removed char[] based search methods in ParseText.
- Added CharacterReference.appendCharTo(Appendable) method.
- Added OutputDocument(Segment) constructor.
- Added StreamedSourceCopy sample program.
3.0 (2009-04-09)
- Requires runtime Java 5 or later
- Bug Fixes:
- Character references representing unicode supplementary characters
were not decoded correctly to UTF-16 code unit pairs.
- [2188446] Element.getDepth() and Element.getParentElement()
returned incorrect results if called in parse on demand mode.
- Comments are now recognised inside correctly.
- RenderToText does not handle whitespace after
correctly.
- Resetting to invalid mark exception during encoding detection.
- INPUT elements of type "button" and "reset" incorrectly
interpreted as form controls of type FormControlType.TEXT.
- Valid end tags containing white space rejected.
- Elements inside tags
- [1576991] Bug in ConvertStyleSheets sample program
- [1597587] various NPEs in findFormFields()
- [1599700] Segment.findAllStartTags(attributeName...) infinite loop
- Overlapping elements resulted in some elements being listed as a
child of more than one parent element.
- OutputDocument.writeTo(Writer) closed the writer.
- Server tags no longer interfere with parsing of start tag attributes.
- Added Renderer class and Segment.getRenderer() method.
- Added TextExtractor class and Segment.getTextExtractor() method.
- Deprecated Segment.extractText methods.
- Added SourceFormatter class and Source.getSourceFormatter() method.
- Deprecated Source.indent method.
- Added Logger interface along with the related LoggerProvider
interface and BasicLoggerProvider and WriterLogger classes.
- Added Source.setLogger(Logger) and Source.getLogger() methods.
- Deprecated Source.setLogWriter(Writer) and Source.getLogWriter()
methods.
- Added Source.findNextElement(int pos, String attributeName,
String value, boolean valueCaseSensitive) method.
- Added Segment.findAllElements(String attributeName, String value,
boolean valueCaseSensitive) method.
- Calling the ignoreWhenParsing methods on overlapping segments no
longer results in an OverlappingOutputSegmentsException.
- Added CharacterReference.getEncodingFilterWriter(Writer) method.
- Added CharacterReference.encode(char) method.
- Added Source.getNewLine() method.
- Added static Config.NewLine parameter.
- All text output now uses Config.NewLine instead of hard-coded '\n'.
- Source.fullSequentialParse() method no longer parses the source again
if it has already been called.
- Some methods that require the parsing of the entire source now call
Source.fullSequentialParse() automatically.
- Some changes to the output of various getDebugInfo() methods.
- Added categorised class list in javadoc.
- Removed all methods/constants deprecated in 2.0.
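
The 3.0 fix for supplementary characters concerns code points above U+FFFF, which need two UTF-16 code units in a Java String. The following is a minimal sketch using only the standard Character API; it is not Jericho's CharacterReference implementation, and the SupplementaryCharDemo class and decodeNumericReference() helper are hypothetical names chosen for illustration.

```java
// Hypothetical illustration only; this is not Jericho's CharacterReference implementation.
class SupplementaryCharDemo {
    // Decodes a numeric character reference such as "&#x1D11E;" or "&#119070;".
    // Returns null if the text is not a well-formed numeric reference.
    static String decodeNumericReference(String ref) {
        if (ref.length() < 4 || !ref.startsWith("&#") || !ref.endsWith(";")) return null;
        String body = ref.substring(2, ref.length() - 1);
        boolean hex = body.startsWith("x") || body.startsWith("X");
        try {
            int codePoint = hex ? Integer.parseInt(body.substring(1), 16) : Integer.parseInt(body);
            if (!Character.isValidCodePoint(codePoint)) return null;
            // Character.toChars returns one char for a BMP code point and a
            // high/low surrogate pair for a supplementary code point.
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String clef = decodeNumericReference("&#x1D11E;"); // U+1D11E MUSICAL SYMBOL G CLEF
        System.out.println("UTF-16 code units: " + clef.length());
        System.out.printf("high=%04X low=%04X%n", (int) clef.charAt(0), (int) clef.charAt(1));
    }
}
```

Casting the code point directly to a single char would truncate it, which is the kind of error a decoder has to avoid for supplementary characters.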
2.3 (2006-09-11)
- Bug Fixes:
- [1510438] NullPointerException in Source.indent.
- [1511480] Incorrect detection of non-html element with nested
empty-element tag of same name.
- [1547562] Fault in caching mechanism.
- Source.fullSequentialParse() sometimes resulted in unregistered
tags being returned in tag searches.
- Invalid empty-element tags whose name is in either of the sets
HTMLElements.getEndTagOptionalElementNames() or
HTMLElements.getEndTagRequiredElementNames() were rejected by the
parser if the slash immediately follows the tag name.
- StartTag.tidy() only included a slash before the closing delimiter
of the tag if the tag name was in the set of
HTMLElements.getEndTagForbiddenElementNames(). It now includes the
slash for all tag names not in getEndTagOptionalElementNames().
- Source.fullSequentialParse() now clears the cache automatically
instead of throwing an IllegalStateException if the cache is not
empty.
- Changes to behaviour of Source.indent:
- preserves indenting in SCRIPT elements, server elements,
HTML comments and CDATA sections.
- keeps SCRIPT elements, HTML comments, XML declarations,
XML processing instructions and markup declarations inline.
- Minor documentation improvements.
2.2 (2006-06-20)
- Bug Fixes:
- Fault in caching mechanism resulted in missed tags in rare
circumstances. (SubCache.findNextTag method)
- [1407179] Segment.extractText() threw NullPointerException if
the last character position was part of a tag.
- Segment.extractText() now converts some tags to whitespace and
ignores text inside SCRIPT and STYLE elements.
- Added Segment.extractText(boolean includeAttributes) option.
- Added Source.fullSequentialParse() method.
- Added CharStreamSource interface for dealing with char output.
- Added Source.indent(String indentText, boolean tidyTags,
boolean collapseWhiteSpace, boolean indentAllElements) method.
- Added Segment.getChildElements() method.
- Added Element.getParentElement() method.
- Added Element.getDepth() method.
- Named tag search methods now only return unregistered tags if the
specified name is not a valid XML tag name.
- Changed Attributes.DefaultMaxErrorCount system default from 1 to 2.
- Added EndTag.getElement() method.
- Added Tag.getElement() abstract method.
- Added Tag.getNameSegment() method.
- Added Tag.getUserData() and Tag.setUserData(Object) methods.
- Added Tag.findNextTag() method.
- Added Tag.findPreviousTag() method.
- Added Tag.tidy() and Tag.tidy(boolean toXHTML) methods.
- Added and renamed many methods in OutputDocument class to make the
interface more intuitive.
- Added HTMLElements.getNestingForbiddenElementNames() method.
- Illegally nested elements with required end tags now terminate at
start of illegally nested start tag, avoiding possible stack overflow
in the common case of multiple unterminated elements.
- Tag search methods called with a pos argument that is out of range
now return null or empty results rather than throwing an exception.
- Renamed output(Writer) method in OutputSegment to writeTo(Writer).
- Deprecated Tag.regenerateHTML() method.
- Deprecated Source.getNextTagIterator() method.
- Deprecated AttributesOutputSegment class.
- Deprecated StringOutputSegment class.
- Removed BlankOutputSegment class from public API.
- Removed CharOutputSegment class from public API.
- Removed IOutputSegment which was deprecated in 2.0.
2.1 (2005-12-24)
- Added Source(InputStream) constructor.
- Added Source(Reader) constructor.
- Added Source(URL) constructor.
- Added Source.getEncoding() method.
- Added Source.getEncodingSpecificationInfo() method.
- Added Source.isXML() method.
- Added Source.findNextElement(pos) method.
- Added Source.findNextElement(pos,name) method.
- Added Segment.extractText() method.
- Added StartTag.getAttributeValue(attributeName) method.
- Added Element.getAttributeValue(attributeName) method.
- Added ExtractText and SourceEncoding sample programs.
2.0 (2005-11-10)
- Complete rewrite of the parsing engine to allow the encapsulation of
different tag types into the new TagType class.
- Requires Java 1.4 or later.
- All programs written for previous versions of the library will have
to be recompiled with the new version, regardless of whether any
changes are required. This is because several methods, including the
Source constructor, now expect a CharSequence as an argument instead
of a String.
- Changes that could require modifications to existing programs:
- The toString() method of Segment and all subclasses now returns the
source text of the segment instead of a string useful for debugging
purposes. This change was necessary because Segment now
implements CharSequence.
- For consistency, the toString() methods of all IOutputSegment
implementations now return the output string instead of a string
useful for debugging purposes.
- The return type of the OutputDocument.getSourceText() method is now
CharSequence instead of String.
- Character references in Attribute.getValue() are now decoded
- StartTag.isEmptyElementTag() no longer checks whether the end tag
is required.
- Element.getContent() now returns zero-length segment instead of null
in case of an empty element.
- FormField.getPredefinedValues() now returns an empty collection
instead of null if the form field has no predefined values.
- Segment.findAllStartTags() now returns server tags that are found
inside other tags.
- Attributes segment now ends immediately after the last attribute
instead of immediately before the end-of-tag delimiter.
- Modified Segment.isWhiteSpace(char) to match HTML specification
- CharacterReference.encode(CharSequence) no longer encodes
apostrophes by default
- Tags of type SERVER_COMMON now always have the name "%" regardless
of whether an identifier immediately follows it.
- Modified and enhanced aspects of StartTag searches relating to
special tags
- P elements are now terminated by TABLE elements.
See the HTMLElementName.P documentation for more information.
- Removed public fields in Attribute class that were deprecated in 1.2
- Removed Source.getSourceTextLowerCase() method deprecated in 1.3
- Removed Source.findEnd(int pos, SpecialTag) method which was
accidentally added as a public method in 1.4
- Deprecated numerous methods (details in javadoc)
- Deprecated IOutputSegment interface and replaced with OutputSegment
- Improved caching system
- Added recognition of markup declarations
- Added recognition of CDATA sections
- Added recognition of SGML marked sections
- Doctype declarations containing markup declarations now supported
- Segment class now implements CharSequence and Comparable
- Added getDebugInfo() to Segment and all subclasses to replace the
previous functionality of the toString() method
- OutputSegment interface now implements CharSequence
- Added getDebugInfo() to the OutputSegment interface to replace the
previous functionality of the toString() method
- Attributes class now implements List
- FormFields class now implements Collection
- Added HTMLElementName interface and HTMLElements class
- Added RowColumnVector class and associated methods in Source class
- Added FormControl class
- Added various methods to the FormField, FormFields and OutputDocument
classes related to FormControl objects and the manipulation and output
of form submission values.
- Added Config and related classes
- Added TagType class and subclasses
- Added various tag search methods to the Source and Segment classes
including searches by TagType, attribute values, and other criteria.
- Added AttributesOutputSegment class
- Added Util class
- Added OverlappingOutputSegmentsException class
- Added many other methods to existing classes
- Documentation improvements
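
The 2.0 change to toString() follows from the CharSequence contract, which requires toString() to return exactly the characters of the sequence. The following is a minimal sketch of that constraint in plain Java; TextSpan is a hypothetical stand-in and not Jericho's Segment class, and its getDebugInfo() output format is an assumption made for illustration.

```java
// Hypothetical illustration only; TextSpan is a stand-in, not Jericho's Segment class.
class TextSpan implements CharSequence {
    private final String source;
    private final int begin;
    private final int end;

    TextSpan(String source, int begin, int end) {
        if (begin < 0 || end > source.length() || begin > end) {
            throw new IndexOutOfBoundsException("invalid span " + begin + "-" + end);
        }
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    @Override public int length() { return end - begin; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length()) throw new IndexOutOfBoundsException("index " + index);
        return source.charAt(begin + index);
    }

    @Override public CharSequence subSequence(int start, int finish) {
        if (start < 0 || finish > length() || start > finish) {
            throw new IndexOutOfBoundsException("invalid range " + start + "-" + finish);
        }
        return new TextSpan(source, begin + start, begin + finish);
    }

    // The CharSequence contract requires toString() to return the sequence's characters.
    @Override public String toString() { return source.substring(begin, end); }

    // Debugging output therefore moves to a separate method.
    String getDebugInfo() { return "(" + begin + "-" + end + ")"; }

    public static void main(String[] args) {
        TextSpan span = new TextSpan("<b>bold</b>", 3, 7);
        // APIs that accept a CharSequence see the span's source text.
        System.out.println(new StringBuilder("text=").append(span));
        System.out.println("debug=" + span.getDebugInfo());
    }
}
```

Code that relied on the old toString() output for diagnostics would switch to the debug method, which is the role getDebugInfo() plays in these release notes.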
1.4.1 (2005-11-10)
- Bug Fixes:
- [1065861] Named StartTag search did not find a tag immediately
following a comment
- Unnamed StartTag search did not find a comment if the search starts
at the first character of the comment
- Character references in FormField.getPredefinedValues() items were
not decoded
- FormControlType.SELECT_SINGLE.allowsMultipleValues() returned false
instead of the correct value of true, resulting in the same
incorrect value from FormField.allowMultipleValues() when multiple
SELECT_SINGLE controls with the same name were present in the form
1.4 (2004-09-02)
- Added CharacterEntityReference and NumericCharacterReference classes
- Added CharOutputSegment class
- Attributes allow whitespace around '=' sign
- Added convenience method Element.getAttributes()
- Some documentation improvements
1.3 (2004-07-25)
- Deprecated Source.getSourceTextLowerCase()
- Added ignoreWhenParsing methods to Source and Segment classes
(See sample called JSPTest)
- Added parseAttributes methods to Source, Segment and StartTag classes
- Added ability to search for tags in a specified namespace
- Added BlankOutputSegment class
- Fixed bug relating to HTML comments with alphabetic characters
immediately following the opening