How can I efficiently parse HTML with Java?

JavaHtmlParsingHtml ParsingWeb Scraping

Java Problem Overview


I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.

Now, I want to separate both the tasks.

I want to use a light HTML parser because it takes much time in HtmlUnit to first load a page, then get the source and then parse it.

I want to know which HTML parser can parse HTML efficiently. I need

  1. Speed
  2. Ease to locate any HtmlElement by its "id" or "name" or "tag type".

It would be ok for me if it doesn't clean the dirty HTML code. I don't need to clean any HTML source. I just need an easiest way to move across HtmlElements and harvest data from them.

Java Solutions


Solution 1 - Java

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Solution 2 - Java

The best I've seen so far is HtmlCleaner:

> HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.

With HtmlCleaner you can locate any element using XPath.

For other html parsers see this SO question.

Solution 3 - Java

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. It is the parser used in Mozilla from 2010-05-03

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAmitView Question on Stackoverflow
Solution 1 - JavaJonathan HedleyView Answer on Stackoverflow
Solution 2 - JavatangensView Answer on Stackoverflow
Solution 3 - JavaMs2gerView Answer on Stackoverflow