Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
It is an open source library released under the Eclipse Public License (EPL), GNU Lesser General Public License (LGPL), and Apache Licence. You are therefore free to use it in commercial applications subject to the terms detailed in any one of these licence documents.
The javadocs provide comprehensive documentation of the entire API, as well as being a very useful reference on aspects of HTML and XML in general.
Visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/ for downloads and support.
Release notes for each version can be found in a file called release.txt in the project root directory.
The library distinguishes itself from other HTML parsers with the following major features:
StreamedSource class, which allows memory efficient processing of large files using an event iterator. This is essentially a StAX alternative with the ability to process HTML and non-validating XML, as well as several other features not available in other streaming parsers.
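The idea of an event iterator that reproduces its input verbatim can be illustrated without the library. The following is a minimal sketch using only the Java standard library; it is not the Jericho API or its implementation, and the class name SegmentStreamSketch and its crude tag/text splitting rule are assumptions made purely for illustration. It reads one character at a time, so memory use depends on the largest segment rather than on the size of the document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Toy stand-in for a streaming segment iterator; NOT the Jericho API. */
final class SegmentStreamSketch implements Iterator<String> {
    private final Reader reader;
    private int next; // one character of lookahead, -1 at end of input

    SegmentStreamSketch(Reader reader) {
        this.reader = reader;
        this.next = read();
    }

    private int read() {
        try {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return next != -1;
    }

    @Override
    public String next() {
        if (next == -1) throw new NoSuchElementException();
        StringBuilder segment = new StringBuilder();
        if (next == '<') {
            // Tag-like segment: consume up to and including the closing '>'.
            // An unterminated tag simply runs to the end of the input.
            while (next != -1) {
                char c = (char) next;
                segment.append(c);
                next = read();
                if (c == '>') break;
            }
        } else {
            // Text segment: consume up to (but not including) the next '<'.
            while (next != -1 && next != '<') {
                segment.append((char) next);
                next = read();
            }
        }
        return segment.toString();
    }

    /** Copies the input by concatenating every segment, unchanged. */
    static String copy(String html) {
        StringBuilder out = new StringBuilder();
        Iterator<String> segments = new SegmentStreamSketch(new StringReader(html));
        while (segments.hasNext()) out.append(segments.next());
        return out.toString();
    }

    public static void main(String[] args) {
        // Mix of normal markup, a server-style tag, and an unterminated tag.
        String html = "<p>Hello <b>world</p><% server tag %> trailing <oops";
        System.out.println(copy(html).equals(html)); // prints "true"
    }
}
```

Because every character ends up in exactly one segment, badly formed input such as the unterminated tag above still round-trips unchanged. A real parser classifies segments far more carefully than this sketch does.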
The samples/console directory in the download package contains sample programs for performing common tasks and demonstrating the functionality of the library.
The .bat files can be run directly on an MS-Windows operating system, or the following syntax can be used on a UNIX-based operating system from the samples/console directory:
java -classpath classes:../../dist/jericho-html-x.x.jar ProgramName
where x.x is the current release number and ProgramName is the name of the sample program to run.
The following sample programs are available:
DisplayAllElements.java | Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML. |
FindSpecificTags.java | Demonstrates how to search for tags with a specified name, in a specified namespace, or special tags such as document type declarations, XML declarations, XML processing instructions, common server tags, PHP tags, Mason tags, and HTML comments. |
ExtractText.java | Demonstrates the use of the TextExtractor class that extracts all of the text from a document, as well as the title, description, keywords and links. |
RenderToText.java | Demonstrates the use of the Renderer class that performs a simple text rendering of HTML markup, similar to the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails. |
HTMLSanitiser.java | Demonstrates how to sanitise HTML containing unwanted or invalid tags into clean HTML. A unit test class for this functionality is also available. |
StreamedSourceCopy.java | Demonstrates the use of the StreamedSource class by iterating through the parsed segments of a source document and creating an exact copy of it. |
FormControlDisplayCharacteristics.java | Demonstrates setting the display characteristics of individual form controls. This allows a control to be disabled, removed, or replaced with a plain text representation of its value (display value). The new document is written to a file called NewForm.html. |
FormFieldCSVOutput.java | Demonstrates the use of the FormFields.getColumnValues(Map) method to store form data in a .CSV file, automatically creating separate columns for fields that can contain multiple values (such as checkboxes). The output is written to a file called FormData.csv. |
FormFieldList.java | Demonstrates the use of the Segment.findFormFields() method to list all form fields and their associated controls in a document. |
FormFieldSetValues.java | Demonstrates setting the values of form controls, which is best done via the FormFields object. The new document is written to a file called NewForm.html. |
FormatSource.java | Demonstrates the use of the SourceFormatter class that formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent. Also known as a "source beautifier". |
CompactSource.java | Demonstrates the use of the SourceCompactor class that compacts HTML source by removing all unnecessary white space. |
Encoding.java | Demonstrates the use of the EncodingDetector class and how to determine the encoding of a source document. |
SplitLongLines.java | Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines. |
ConvertStyleSheets.java | Demonstrates how to detect all external style sheets and place them inline into the document. |
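The FormFieldCSVOutput entry above mentions separate columns for fields that can contain multiple values. One way such a layout could work is sketched below with the Java standard library only. This is a hypothetical illustration, not the library's implementation: the class CsvColumnsSketch, the "field.value" label format, and the "true"/"false" cell values are all assumptions, and the library's actual column layout may differ. CSV quoting is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical column layout, not the library's code. */
final class CsvColumnsSketch {

    /**
     * fields maps each field name to its predefined values. An empty list
     * marks a free-text field, which gets a single column; any other field
     * gets one column per predefined value.
     */
    static List<String> columnLabels(Map<String, List<String>> fields) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            if (field.getValue().isEmpty()) {
                labels.add(field.getKey());
            } else {
                for (String value : field.getValue()) {
                    labels.add(field.getKey() + "." + value);
                }
            }
        }
        return labels;
    }

    /** Builds one row, in the same column order as columnLabels. */
    static List<String> columnValues(Map<String, List<String>> fields,
                                     Map<String, List<String>> submitted) {
        List<String> row = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            List<String> given =
                submitted.getOrDefault(field.getKey(), Collections.<String>emptyList());
            if (field.getValue().isEmpty()) {
                // Free-text field: the first submitted value, or an empty cell.
                row.add(given.isEmpty() ? "" : given.get(0));
            } else {
                // Multi-valued field: one true/false cell per predefined value.
                for (String value : field.getValue()) {
                    row.add(String.valueOf(given.contains(value)));
                }
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("name", Collections.<String>emptyList());
        fields.put("colour", Arrays.asList("red", "green", "blue"));

        Map<String, List<String>> submitted = new HashMap<>();
        submitted.put("name", Arrays.asList("Ann"));
        submitted.put("colour", Arrays.asList("red", "blue"));

        System.out.println(String.join(",", columnLabels(fields)));
        // prints: name,colour.red,colour.green,colour.blue
        System.out.println(String.join(",", columnValues(fields, submitted)));
        // prints: Ann,true,false,true
    }
}
```

Giving each predefined value its own column keeps every row the same width regardless of how many checkboxes were ticked, which is what makes the data straightforward to load into a spreadsheet.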
The build and sample files are implemented as DOS .bat files only.
This package was originally written in the latter half of 2002. At that time I evaluated 6 other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.
Sponsors: WebVenture.com.au | Corporate Translations | Taking Care of Trees |