Jericho HTML Parser
Release Notes
3.4 (2015-10-24)
- Bug Fixes:
- [62] Fixed GC performance problem in StreamedSource.
- [71] Renderer.setHRLineLength(0) doesn't completely disable
rendering of HR element.
- [72] Fixed performance problem in Attributes.
- [80] Fixed position discarded exception in StreamedSource.
- [81] Limited left margin in Renderer based on MaxLineLength.
- Little-endian BOM encoding detection broken.
- HTML5 elements with forbidden end tags weren't present in
HTMLElements.getEndTagForbiddenElementNames()
- CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
- Changed default character reference encoding behaviour.
(see Config.DEFAULT_CHARACTER_REFERENCE_ENCODING_BEHAVIOUR)
- Changed the ordering of OutputSegments for more intuitive
behaviour, but still backward compatible with the old API contract.
- Added Apache License as an option for licensing.
- Added Config.CurrentCharacterReferenceEncodingBehaviour parameter.
- Performance improvements in name and attribute based searches after
full sequential parse.
- Performance improvement in CharacterReference.decode methods.
- Added LoggerProvider.getSourceLogger() for better performance of
highly concurrent applications.
- Performance improvement in StreamedSource by avoiding exception
at end of stream.
- Compiling to target Java 1.7 instead of Java 1.5.
(source code is however still compatible with Java 1.6)
- Removed all raw type references from source code.
- Improved documentation of TagType.isValidPosition to include mention
of potential problems with Microsoft downlevel-revealed conditional
comment tags.
- INPUT elements missing a name attribute no longer result in an
error message being logged.
- INPUT elements with type attribute values of date, datetime,
datetime-local, month, time, week, number, range, email, url, search,
tel, and color are now recognised as text controls without warnings
appearing in the log.
- HTMLSanitiser.stripInvalidMarkup sample now removes content from
correctly.
- RenderToText does not handle whitespace after
correctly.
- Resetting to invalid mark exception during encoding detection.
- INPUT elements of type "button" and "reset" incorrectly
interpreted as form controls of type FormControlType.TEXT.
- Valid end tags containing white space rejected.
- Elements inside tags
- [1576991] Bug in ConvertStyleSheets sample program
- [1597587] various NPEs in findFormFields()
- [1599700] Segment.findAllStartTags(attributeName...) infinite loop
- Overlapping elements resulted in some elements being listed as a
child of more than one parent element.
- OutputDocument.writeTo(Writer) closed the writer.
- Server tags no longer interfere with parsing of start tag attributes.
- Added Renderer class and Segment.getRenderer() method.
- Added TextExtractor class and Segment.getTextExtractor() method.
- Deprecated segment.extractText methods.
- Added SourceFormatter class and Source.getSourceFormatter() method.
- Deprecated Source.indent method.
- Added Logger interface along with the related LoggerProvider
interface and BasicLoggerProvider and WriterLogger classes.
- Added Source.setLogger(Logger) and Source.getLogger() methods.
- Deprecated Source.setLogWriter(Writer) and Source.getLogWriter()
methods.
- Added Source.findNextElement(int pos, String attributeName,
String value, boolean valueCaseSensitive) method.
- Added Segment.findAllElements(String attributeName, String value,
boolean valueCaseSensitive) method.
- Calling the ignoreWhenParsing methods on overlapping segments no
longer results in an OverlappingOutputSegmentsException.
- Added CharacterReference.getEncodingFilterWriter(Writer) method.
- Added CharacterReference.encode(char) method.
- Added Source.getNewLine() method.
- Added static Config.NewLine parameter.
- All text output now uses Config.NewLine instead of hard-coded '\n'.
- Source.fullSequentialParse() method no longer parses the source again
if it has already been called.
- Some methods that require the parsing of the entire source now call
Source.fullSequentialParse() automatically.
- Some changes to the output of various getDebugInfo() methods.
- Added categorised class list in javadoc.
- Removed all methods/constants deprecated in 2.0.
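
One of the bug fixes listed earlier in these notes concerns little-endian
BOM encoding detection. The sketch below is not the library's detection
code; it is a minimal stand-alone illustration of one pitfall such detection
has to avoid. The class and method names (BomSniffer, detect) are
hypothetical. The point it shows: the UTF-32LE byte order mark (FF FE 00 00)
begins with the same two bytes as the UTF-16LE mark (FF FE), so the longer
marks should be tested first.

```java
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the encoding name implied by a leading BOM, or null if none is found. */
    static String detect(byte[] data) {
        // Test the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
        // otherwise FF FE 00 00 would be reported as UTF-16LE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Compares the first bytes of data, as unsigned values, with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A UTF-16LE stream whose first character after the BOM is U+0000 would be
reported as UTF-32LE by this sketch; that ambiguity is inherent in the byte
patterns and is ignored here.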
2.3 (2006-09-11)
- Bug Fixes:
- [1510438] NullPointerException in Source.indent.
- [1511480] Incorrect detection of non-html element with nested
empty-element tag of same name.
- [1547562] Fault in caching mechanism.
- Source.fullSequentialParse() sometimes resulted in unregistered
tags being returned in tag searches.
- Invalid Empty-element tags whose name is in either of the sets
HTMLElements.getEndTagOptionalElementNames() or
HTMLElements.getEndTagRequiredElementNames() were rejected by the
parser if the slash immediately follows the tag name.
- StartTag.tidy() only included a slash before the closing delimiter
of the tag if the tag name was in the set of
HTMLElements.getEndTagForbiddenElementNames(). It now includes the
slash for all tag names not in getEndTagOptionalElementNames().
- Source.fullSequentialParse() now clears the cache automatically
instead of throwing an IllegalStateException if the cache is not
empty.
- Changes to behaviour of Source.indent:
- preserves indenting in SCRIPT elements, server elements,
HTML comments and CDATA sections.
- keeps SCRIPT elements, HTML comments, XML declarations,
XML processing instructions and markup declarations inline.
- Minor documentation improvements.
2.2 (2006-06-20)
- Bug Fixes:
- Fault in caching mechanism resulted in missed tags in rare
circumstances. (SubCache.findNextTag method)
- [1407179] Segment.extractText() threw NullPointerException if
the last character position was part of a tag.
- Segment.extractText() now converts some tags to whitespace and
ignores text inside SCRIPT and STYLE elements.
- Added Segment.extractText(boolean includeAttributes) option.
- Added Source.fullSequentialParse() method.
- Added CharStreamSource interface for dealing with char output.
- Added Source.indent(String indentText, boolean tidyTags,
boolean collapseWhiteSpace, boolean indentAllElements) method.
- Added Segment.getChildElements() method.
- Added Element.getParentElement() method.
- Added Element.getDepth() method.
- Named tag search methods now only return unregistered tags if the
specified name is not a valid XML tag name.
- Changed Attributes.DefaultMaxErrorCount system default from 1 to 2.
- Added EndTag.getElement() method.
- Added Tag.getElement() abstract method.
- Added Tag.getNameSegment() method.
- Added Tag.getUserData() and Tag.setUserData(Object) methods.
- Added Tag.findNextTag() method.
- Added Tag.findPreviousTag() method.
- Added Tag.tidy() and Tag.tidy(boolean toXHTML) methods.
- Added and renamed many methods in OutputDocument class to make the
interface more intuitive.
- Added HTMLElements.getNestingForbiddenElementNames() method.
- Illegally nested elements with required end tags now terminate at
start of illegally nested start tag, avoiding possible stack overflow
in the common case of multiple unterminated elements.
- Tag search methods called with a pos argument that is out of range
now return null or empty results rather than throwing an exception.
- Renamed output(Writer) method in OutputSegment to writeTo(Writer).
- Deprecated Tag.regenerateHTML() method.
- Deprecated Source.getNextTagIterator() method.
- Deprecated AttributesOutputSegment class.
- Deprecated StringOutputSegment class.
- Removed BlankOutputSegment class from public API.
- Removed CharOutputSegment class from public API.
- Removed IOutputSegment which was deprecated in 2.0.
2.1 (2005-12-24)
- Added Source(InputStream) constructor.
- Added Source(Reader) constructor.
- Added Source(URL) constructor.
- Added Source.getEncoding() method.
- Added Source.getEncodingSpecificationInfo() method.
- Added Source.isXML() method.
- Added Source.findNextElement(pos) method.
- Added Source.findNextElement(pos,name) method.
- Added Segment.extractText() method.
- Added StartTag.getAttributeValue(attributeName) method.
- Added Element.getAttributeValue(attributeName) method.
- Added ExtractText and SourceEncoding sample programs.
2.0 (2005-11-10)
- Complete rewrite of the parsing engine to allow the encapsulation of
different tag types into the new TagType class.
- Requires Java 1.4 or later.
- All programs written for previous versions of the library will have
to be recompiled with the new version, regardless of whether any
changes are required. This is because several methods, including the
Source constructor, now expect a CharSequence as an argument instead
of a String.
- Changes that could require modifications to existing programs:
- The toString() method of Segment and all subclasses now returns the
source text of the segment instead of a string useful for debugging
purposes. This change was necessary because Segment now
implements CharSequence.
- For consistency, the toString() methods of all IOutputSegment
implementations now return the output string instead of a string
useful for debugging purposes.
- The return type of the OutputDocument.getSourceText() method is now
CharSequence instead of String.
- Character references in Attribute.getValue() are now decoded
- StartTag.isEmptyElementTag() no longer checks whether the end tag
is required.
- Element.getContent() now returns zero-length segment instead of null
in case of an empty element.
- FormField.getPredefinedValues() now returns an empty collection
instead of null if the form field has no predefined values.
- Segment.findAllStartTags() now returns server tags that are found
inside other tags.
- Attributes segment now ends immediately after the last attribute
instead of immediately before the end-of-tag delimiter.
- Modified Segment.isWhiteSpace(char) to match HTML specification
- CharacterReference.encode(CharSequence) no longer encodes
apostrophes by default
- Tags of type SERVER_COMMON now always have the name "%" regardless
of whether an identifier immediately follows it.
- Modified and enhanced aspects of StartTag searches relating to
special tags
- P elements are now terminated by TABLE elements.
See the HTMLElementName.P documentation for more information.
- Removed public fields in Attribute class that were deprecated in 1.2
- Removed Source.getSourceTextLowerCase() method deprecated in 1.3
- Removed Source.findEnd(int pos, SpecialTag) method which was
accidentally added as a public method in 1.4
- Deprecated numerous methods (details in javadoc)
- Deprecated IOutputSegment interface and replaced with OutputSegment
- Improved caching system
- Added recognition of markup declarations
- Added recognition of CDATA sections
- Added recognition of SGML marked sections
- Doctype declarations containing markup declarations now supported
- Segment class now implements CharSequence and Comparable
- Added getDebugInfo() to Segment and all subclasses to replace the
previous functionality of the toString() method
- OutputSegment interface now implements CharSequence
- Added getDebugInfo() to the OutputSegment interface to replace the
previous functionality of the toString() method
- Attributes class now implements List
- FormFields class now implements Collection
- Added HTMLElementName interface and HTMLElements class
- Added RowColumnVector class and associated methods in Source class
- Added FormControl class
- Added various methods to the FormField, FormFields and OutputDocument
classes related to FormControl objects and the manipulation and output
of form submission values.
- Added Config and related classes
- Added TagType class and subclasses
- Added various tag search methods to the Source and Segment classes
including searches by TagType, attribute values, and other criteria.
- Added AttributesOutputSegment class
- Added Util class
- Added OverlappingOutputSegmentsException class
- Added many other methods to existing classes
- Documentation improvements
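
The 2.0 notes above mention that Segment.isWhiteSpace(char) was modified to
match the HTML specification. As an illustration only, the sketch below
tests for the white space characters named in section 9.1 of the HTML 4.01
specification (space, tab, form feed, zero-width space, plus the line break
characters CR and LF). The class name HtmlWhiteSpace is hypothetical, and the
code is an assumption about what "matching the specification" means, not a
copy of the library's method.

```java
final class HtmlWhiteSpace {
    private HtmlWhiteSpace() {}

    /** Returns true for the white space characters listed in HTML 4.01 section 9.1. */
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':       // space (U+0020)
            case '\t':      // tab (U+0009)
            case '\n':      // line feed (U+000A)
            case '\f':      // form feed (U+000C)
            case '\r':      // carriage return (U+000D)
            case '\u200B':  // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }
}
```

This set differs from java.lang.Character.isWhitespace(char), which for
example accepts the vertical tab (U+000B), a character the HTML 4.01 list
does not include.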
1.4.1 (2005-11-10)
- Bug Fixes:
- [1065861] Named StartTag search did not find a tag immediately
following a comment
- Unnamed StartTag search did not find a comment if the search starts
at the first character of the comment
- Character references in FormField.getPredefinedValues() items were
not decoded
- FormControlType.SELECT_SINGLE.allowsMultipleValues() returned false
instead of the correct value of true, resulting in the same
incorrect value from FormField.allowsMultipleValues() when multiple
SELECT_SINGLE controls with the same name were present in the form
1.4 (2004-09-02)
- Added CharacterEntityReference and NumericCharacterReference classes
- Added CharOutputSegment class
- Attributes allow whitespace around '=' sign
- Added convenience method Element.getAttributes()
- Some documentation improvements
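
The 1.4 item about attributes allowing whitespace around the '=' sign can be
pictured with a small stand-alone sketch. It uses a regular expression rather
than the library's parser, handles only double-quoted values, and the names
AttributeSketch and parse are hypothetical. It simply shows input such as
href = "a.html" being accepted alongside the compact form title="x".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeSketch {
    private AttributeSketch() {}

    // Attribute name, optional white space, '=', optional white space,
    // then a double-quoted value (other quoting styles are ignored here).
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_:][\\w:.-]*)\\s*=\\s*\"([^\"]*)\"");

    /** Collects name/value pairs in the order they appear in the text. */
    static Map<String, String> parse(String attributeText) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher matcher = ATTRIBUTE.matcher(attributeText);
        while (matcher.find()) {
            result.put(matcher.group(1), matcher.group(2));
        }
        return result;
    }
}
```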
1.3 (2004-07-25)
- Deprecated Source.getSourceTextLowerCase()
- Added ignoreWhenParsing methods to Source and Segment classes
(See sample called JSPTest)
- Added parseAttributes methods to Source, Segment and StartTag classes
- Added ability to search for tags in a specified namespace
- Added BlankOutputSegment class
- Fixed bug relating to HTML comments with alphabetic characters
immediately following the opening