Jericho HTML Parser Release Notes

3.4 (2015-10-24)
- Bug Fixes:
  - [62] Fixed GC performance problem in StreamedSource.
  - [71] Renderer.setHRLineLength(0) doesn't completely disable rendering of HR element.
  - [72] Fixed performance problem in Attributes.
  - [80] Fixed position discarded exception in StreamedSource.
  - [81] Limited left margin in Renderer based on MaxLineLength.
  - Little-endian BOM encoding detection broken.
  - HTML5 elements with forbidden end tags weren't present in HTMLElements.getEndTagForbiddenElementNames()
- CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
  - Changed default character reference encoding behaviour. (see Config.DEFAULT_CHARACTER_REFERENCE_ENCODING_BEHAVIOUR)
  - Changed the ordering of OutputSegments for more intuitive behaviour, but still backward compatible with the old API contract.
- Added Apache License as an option for licensing.
- Added Config.CurrentCharacterReferenceEncodingBehaviour parameter.
- Performance improvements in name and attribute based searches after full sequential parse.
- Performance improvement in CharacterReference.decode methods.
- Added LoggerProvider.getSourceLogger() for better performance of highly concurrent applications.
- Performance improvement in StreamedSource by avoiding exception at end of stream.
- Compiling to target Java 1.7 instead of Java 1.5. (source code is however still compatible with Java 1.6)
- Removed all raw type references from source code.
- Improved documentation of TagType.isValidPosition to include mention of potential problems with Microsoft downlevel-revealed conditional comment tags.
- INPUT elements missing a name attribute no longer result in an error message being logged.
- INPUT elements with type attribute values of date, datetime, datetime-local, month, time, week, number, range, email, url, search, tel, and color are now recognised as text controls without warnings appearing in the log.
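
The notes above do not say what was wrong with little-endian BOM detection. As general background only, the following is a minimal sketch of byte order mark sniffing using nothing but the Java standard library. The class name BomSniffer and its methods are hypothetical and are not part of Jericho. It shows one well-known pitfall: the UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE), so the longer pattern has to be tested first.

```java
final class BomSniffer {

    /** Returns the encoding name implied by a leading BOM, or null if none is present. */
    static String detect(byte[] data) {
        // The four-byte UTF-32 marks are tested before the two-byte UTF-16 marks,
        // because FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Compares the first bytes of data, treated as unsigned values, with prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // FF FE followed by the letter 'h' encoded as UTF-16LE (68 00).
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0};
        System.out.println(detect(utf16le)); // prints UTF-16LE
    }
}
```

A document whose first character after a UTF-16LE mark is U+0000 would be reported as UTF-32LE by this sketch; that ambiguity is inherent in BOM sniffing and is conventionally resolved in favour of UTF-32.
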
- HTMLSanitiser.stripInvalidMarkup sample now removes content from tags
- [1576991] Bug in ConvertStyleSheets sample program
- [1597587] various NPEs in findFormFields()
- [1599700] Segment.findAllStartTags(attributeName...) infinite loop
- Overlapping elements resulted in some elements being listed as a child of more than one parent element.
- OutputDocument.writeTo(Writer) closed the writer.
- Server tags no longer interfere with parsing of start tag attributes.
- Added Renderer class and Segment.getRenderer() method.
- Added TextExtractor class and Segment.getTextExtractor() method.
- Deprecated Segment.extractText methods.
- Added SourceFormatter class and Source.getSourceFormatter() method.
- Deprecated Source.indent method.
- Added Logger interface along with the related LoggerProvider interface and BasicLoggerProvider and WriterLogger classes.
- Added Source.setLogger(Logger) and Source.getLogger() methods.
- Deprecated Source.setLogWriter(Writer) and Source.getLogWriter() methods.
- Added Source.findNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive) method.
- Added Segment.findAllElements(String attributeName, String value, boolean valueCaseSensitive) method.
- Calling the ignoreWhenParsing methods on overlapping segments no longer results in an OverlappingOutputSegmentsException.
- Added CharacterReference.getEncodingFilterWriter(Writer) method.
- Added CharacterReference.encode(char) method.
- Added Source.getNewLine() method.
- Added static Config.NewLine parameter.
- All text output now uses Config.NewLine instead of hard-coded '\n'.
- Source.fullSequentialParse() method no longer parses the source again if it has already been called.
- Some methods that require the parsing of the entire source now call Source.fullSequentialParse() automatically.
- Some changes to the output of various getDebugInfo() methods.
- Added categorised class list in javadoc.
- Removed all methods/constants deprecated in 2.0.
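
The list above mentions CharacterReference.getEncodingFilterWriter(Writer) without describing how it works. As a rough illustration of the general idea of a Writer that encodes markup-significant characters as they pass through, here is a minimal sketch built on java.io.FilterWriter. EscapingWriter and escape are hypothetical names, not Jericho classes or methods, and only four characters are handled here; the set of characters Jericho encodes is not stated in these notes.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class EscapingWriter extends FilterWriter {

    EscapingWriter(Writer out) {
        super(out);
    }

    /** Writes one character, replacing it with a character reference where needed. */
    @Override
    public void write(int c) throws IOException {
        switch (c) {
            case '<': out.write("&lt;"); break;
            case '>': out.write("&gt;"); break;
            case '&': out.write("&amp;"); break;
            case '"': out.write("&quot;"); break;
            default: out.write(c);
        }
    }

    // The bulk write methods are routed through write(int) so nothing bypasses the encoding.
    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) write(buf[i]);
    }

    @Override
    public void write(String s, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) write(s.charAt(i));
    }

    /** Convenience helper (hypothetical): encodes a whole string in memory. */
    static String escape(String text) {
        StringWriter sink = new StringWriter();
        try (Writer w = new EscapingWriter(sink)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c")); // prints a &lt; b &amp; c
    }
}
```

Wrapping an existing Writer this way lets a caller stream text through the encoder without building the whole encoded string first.
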

2.3 (2006-09-11)
- Bug Fixes:
  - [1510438] NullPointerException in Source.indent.
  - [1511480] Incorrect detection of non-html element with nested empty-element tag of same name.
  - [1547562] Fault in caching mechanism.
  - Source.fullSequentialParse() sometimes resulted in unregistered tags being returned in tag searches.
  - Invalid Empty-element tags whose name is in either of the sets HTMLElements.getEndTagOptionalElementNames() or HTMLElements.getEndTagRequiredElementNames() were rejected by the parser if the slash immediately follows the tag name.
- StartTag.tidy() only included a slash before the closing delimiter of the tag if the tag name was in the set of HTMLElements.getEndTagForbiddenElementNames(). It now includes the slash for all tag names not in getEndTagOptionalElementNames().
- Source.fullSequentialParse() now clears the cache automatically instead of throwing an IllegalStateException if the cache is not empty.
- Changes to behaviour of Source.indent:
  - preserves indenting in SCRIPT elements, server elements, HTML comments and CDATA sections.
  - keeps SCRIPT elements, HTML comments, XML declarations, XML processing instructions and markup declarations inline.
- Minor documentation improvements.

2.2 (2006-06-20)
- Bug Fixes:
  - Fault in caching mechanism resulted in missed tags in rare circumstances. (SubCache.findNextTag method)
  - [1407179] Segment.extractText() threw NullPointerException if the last character position was part of a tag.
- Segment.extractText() now converts some tags to whitespace and ignores text inside SCRIPT and STYLE elements.
- Added Segment.extractText(boolean includeAttributes) option.
- Added Source.fullSequentialParse() method.
- Added CharStreamSource interface for dealing with char output.
- Added Source.indent(String indentText, boolean tidyTags, boolean collapseWhiteSpace, boolean indentAllElements) method.
- Added Segment.getChildElements() method.
- Added Element.getParentElement() method.
- Added Element.getDepth() method.
- Named tag search methods now only return unregistered tags if the specified name is not a valid XML tag name.
- Changed Attributes.DefaultMaxErrorCount system default from 1 to 2.
- Added EndTag.getElement() method.
- Added Tag.getElement() abstract method.
- Added Tag.getNameSegment() method.
- Added Tag.getUserData() and Tag.setUserData(Object) methods.
- Added Tag.findNextTag() method.
- Added Tag.findPreviousTag() method.
- Added Tag.tidy() and Tag.tidy(boolean toXHTML) methods.
- Added and renamed many methods in OutputDocument class to make the interface more intuitive.
- Added HTMLElements.getNestingForbiddenElementNames() method.
- Illegally nested elements with required end tags now terminate at start of illegally nested start tag, avoiding possible stack overflow in the common case of multiple unterminated elements.
- Tag search methods called with a pos argument that is out of range now return null or empty results rather than throwing an exception.
- Renamed output(Writer) method in OutputSegment to writeTo(Writer).
- Deprecated Tag.regenerateHTML() method.
- Deprecated Source.getNextTagIterator() method.
- Deprecated AttributesOutputSegment class.
- Deprecated StringOutputSegment class.
- Removed BlankOutputSegment class from public API.
- Removed CharOutputSegment class from public API.
- Removed IOutputSegment which was deprecated in 2.0.

2.1 (2005-12-24)
- Added Source(InputStream) constructor.
- Added Source(Reader) constructor.
- Added Source(URL) constructor.
- Added Source.getEncoding() method.
- Added Source.getEncodingSpecificationInfo() method.
- Added Source.isXML() method.
- Added Source.findNextElement(pos) method.
- Added Source.findNextElement(pos,name) method.
- Added Segment.extractText() method.
- Added StartTag.getAttributeValue(attributeName) method.
- Added Element.getAttributeValue(attributeName) method.
- Added ExtractText and SourceEncoding sample programs.

2.0 (2005-11-10)
- Complete rewrite of the parsing engine to allow the encapsulation of different tag types into the new TagType class.
- Requires Java 1.4 or later.
- All programs written for previous versions of the library will have to be recompiled with the new version, regardless of whether any changes are required. This is because several methods, including the Source constructor, now expect a CharSequence as an argument instead of a String.
- Changes that could require modifications to existing programs:
  - The toString() method of Segment and all subclasses now returns the source text of the segment instead of a string useful for debugging purposes. This change was necessary because Segment now implements CharSequence.
  - For consistency, the toString() methods of all IOutputSegment implementations now return the output string instead of a string useful for debugging purposes.
  - The return type of the OutputDocument.getSourceText() method is now CharSequence instead of String.
  - Character references in Attribute.getValue() are now decoded
  - StartTag.isEmptyElementTag() no longer checks whether the end tag is required.
  - Element.getContent() now returns zero-length segment instead of null in case of an empty element.
  - FormField.getPredefinedValues() now returns an empty collection instead of null if the form field has no predefined values.
  - Segment.findAllStartTags() now returns server tags that are found inside other tags.
  - Attributes segment now ends immediately after the last attribute instead of immediately before the end-of-tag delimiter.
  - Modified Segment.isWhiteSpace(char) to match HTML specification
  - CharacterReference.encode(CharSequence) no longer encodes apostrophes by default
  - Tags of type SERVER_COMMON now always have the name "%" regardless of whether an identifier immediately follows it.
  - Modified and enhanced aspects of StartTag searches relating to special tags
  - P elements are now terminated by TABLE elements.
    See the HTMLElementName.P documentation for more information.
  - removed public fields in Attribute class that were deprecated in 1.2
  - removed Source.getSourceTextLowerCase() method deprecated in 1.3
  - removed Source.findEnd(int pos, SpecialTag) method which was accidentally added as a public method in 1.4
- Deprecated numerous methods (details in javadoc)
- Deprecated IOutputSegment interface and replaced with OutputSegment
- Improved caching system
- Added recognition of markup declarations
- Added recognition of CDATA sections
- Added recognition of SGML marked sections
- Doctype declarations containing markup declarations now supported
- Segment class now implements CharSequence and Comparable
- Added getDebugInfo() to Segment and all subclasses to replace the previous functionality of the toString() method
- OutputSegment interface now implements CharSequence
- Added getDebugInfo() to the OutputSegment interface to replace the previous functionality of the toString() method
- Attributes class now implements List
- FormFields class now implements Collection
- Added HTMLElementName interface and HTMLElements class
- Added RowColumnVector class and associated methods in Source class
- Added FormControl class
- Added various methods to the FormField, FormFields and OutputDocument classes related to FormControl objects and the manipulation and output of form submission values.
- Added Config and related classes
- Added TagType class and subclasses
- Added various tag search methods to the Source and Segment classes including searches by TagType, attribute values, and other criteria.
- Added AttributesOutputSegment class
- Added Util class
- Added OverlappingOutputSegmentsException class
- Added many other methods to existing classes
- Documentation improvements

1.4.1 (2005-11-10)
- Bug Fixes:
  - [1065861] Named StartTag search did not find a tag immediately following a comment
  - Unnamed StartTag search did not find a comment if the search starts at the first character of the comment
  - Character references in FormField.getPredefinedValues() items were not decoded
  - FormControlType.SELECT_SINGLE.allowsMultipleValues() returned false instead of the correct value of true, resulting in the same incorrect value from FormField.allowMultipleValues() when multiple SELECT_SINGLE controls with the same name were present in the form

1.4 (2004-09-02)
- Added CharacterEntityReference and NumericCharacterReference classes
- Added CharOutputSegment class
- Attributes allow whitespace around '=' sign
- Added convenience method Element.getAttributes()
- Some documentation improvements

1.3 (2004-07-25)
- Deprecated Source.getSourceTextLowerCase()
- Added ignoreWhenParsing methods to Source and Segment classes (See sample called JSPTest)
- Added parseAttributes methods to Source, Segment and StartTag classes
- Added ability to search for tags in a specified namespace
- Added BlankOutputSegment class
- Fixed bug relating to HTML comments with alphabetic characters immediately following the opening