Web scraping with Java
Java Problem Overview
I can't find a good Java-based web scraping API. The site I need to scrape doesn't provide an API either. I want to iterate over all of its pages using a page ID and extract the HTML titles and other content from their DOM trees.
Are there ways other than web scraping?
Java Solutions
Solution 1 - Java
jsoup
Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is jsoup.
You can navigate the page using the DOM if you know the page structure; see http://jsoup.org/cookbook/extracting-data/dom-navigation
It's a good library and I've used it in my recent projects.
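As a rough illustration of the jsoup approach applied to the question (looping over page IDs and reading each title), here is a minimal sketch. It assumes jsoup is on the classpath; the URL template, the page ID range, the `h1` selector, and the user-agent string are all placeholders invented for the example, not details of the real site.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupTitleScraper {
    // Hypothetical URL scheme; replace with the real site's page URLs.
    private static final String URL_TEMPLATE = "https://example.com/page?pageID=%d";

    public static void main(String[] args) throws InterruptedException {
        for (int pageId = 1; pageId <= 5; pageId++) {
            String url = String.format(URL_TEMPLATE, pageId);
            try {
                // Fetch and parse the page in one step.
                Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (compatible; example-scraper)")
                        .timeout(10_000) // milliseconds
                        .get();
                System.out.println(pageId + ": " + doc.title());

                // CSS selectors reach other parts of the DOM tree.
                Element heading = doc.selectFirst("h1");
                if (heading != null) {
                    System.out.println("  h1: " + heading.text());
                }
            } catch (IOException e) {
                // Covers network failures and HTTP error statuses.
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
            Thread.sleep(1_000); // pause between requests to stay polite
        }
    }
}
```

Note that jsoup only parses the HTML the server returns; it does not run JavaScript, so pages that build their content in the browser need one of the browser-based tools mentioned in the other answers.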
Solution 2 - Java
Your best bet is to use Selenium WebDriver, since it:
- Provides visual feedback to the coder (you can watch your scraping in action and see where it stops).
- Is accurate and consistent, as it directly controls the browser you use.
- Is slow. It doesn't hit web pages as fast as HtmlUnit does, but sometimes you don't want to hit them too fast.
HtmlUnit is fast but is horrible at handling JavaScript and AJAX.
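The point about not hitting pages too fast applies whichever tool does the fetching. The following plain-JDK sketch shows one way to space requests out when iterating over page IDs; the class name, the URL template, and the one-second interval are assumptions made for the example, and the actual fetch is left as a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

class PoliteCrawlPlan {
    /** Builds one URL per page ID in [firstId, lastId] from a String.format template. */
    static List<String> pageUrls(String template, int firstId, int lastId) {
        List<String> urls = new ArrayList<>();
        for (int id = firstId; id <= lastId; id++) {
            urls.add(String.format(template, id));
        }
        return urls;
    }

    /** Milliseconds still to wait so that requests are at least minIntervalMs apart. */
    static long remainingWaitMs(long lastRequestMs, long nowMs, long minIntervalMs) {
        return Math.max(0, minIntervalMs - (nowMs - lastRequestMs));
    }

    public static void main(String[] args) throws InterruptedException {
        long minIntervalMs = 1_000; // assumed politeness interval
        long lastRequestMs = 0;     // no request made yet, so the first wait is zero
        // Hypothetical URL template; the real site's scheme will differ.
        for (String url : pageUrls("https://example.com/page?pageID=%d", 1, 3)) {
            Thread.sleep(remainingWaitMs(lastRequestMs, System.currentTimeMillis(), minIntervalMs));
            lastRequestMs = System.currentTimeMillis();
            System.out.println("Would fetch " + url); // the real fetch goes here
        }
    }
}
```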
Solution 3 - Java
HtmlUnit can be used for web scraping; it supports loading pages and filling in and submitting forms. I have used it in my project, and it is a good Java library for web scraping.
Solution 4 - Java
mechanize for Java would be a good fit for this, and as Wadjy Essam mentioned, it uses jsoup for the HTML. mechanize is a stateful HTTP/HTML client that supports navigation, form submissions, and page scraping.
http://gistlabs.com/software/mechanize-for-java/ (GitHub: https://github.com/GistLabs/mechanize)
Solution 5 - Java
There is also Jaunt Java Web Scraping & JSON Querying - http://jaunt-api.com
Solution 6 - Java
You might look into jwht-scraper!
This is a complete scraping framework that has all the features a developer could expect from a web scraper:
- Proxy support
- Warning Sign Support to detect captchas and more
- Complex link following features
- Multithreading
- Various scraping delays when required
- Rotating User-Agent
- Request auto-retry and HTTP redirection support
- HTTP headers, cookies and more support
- GET and POST support
- Annotation Configuration
- Detailed Scraping Metrics
- Async handling of the scraper client
- jwht-htmltopojo fully featured framework to map HTML to POJO
- Custom Input Format handling and built in JSON -> POJO mapping
- Full Exception Handling Control
- Detailed Logging with log4j
- POJO injection
- Custom processing hooks
- Easy to use and well documented API
It works with the jwht-htmltopojo library (https://github.com/whimtrip/jwht-htmltopojo), which itself uses jsoup, mentioned by several other people here.
Together they will help you build scrapers that map HTML directly to POJOs and handle the classic scraping problems in a matter of minutes!
Hope this might help some people here!
Disclaimer: I am the one who developed it, so feel free to send me your remarks!
Solution 7 - Java
Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.
Solution 8 - Java
If you wish to automate scraping of a large number of pages or a large amount of data, then you could try Gotz ETL.
It is completely model-driven, like a real ETL tool. The data structure, task workflow, and pages to scrape are defined with a set of XML definition files, and no coding is required. Queries can be written using either selectors with jsoup or XPath with HtmlUnit.
Solution 9 - Java
For tasks of this type I usually use crawler4j + jsoup.
With crawler4j I download the pages from a domain; you can specify which URLs to visit with a regular expression.
With jsoup, I parse the HTML data that crawler4j found and downloaded.
Normally you can also download data with jsoup, but crawler4j makes it easier to find links. Another advantage of using crawler4j is that it is multithreaded and you can configure the number of concurrent threads.
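The two ideas in this answer, filtering URLs with a regular expression and fetching with a configurable number of threads, can be illustrated without crawler4j itself. The sketch below uses only the JDK and is not crawler4j's API; the class name, the patterns, the example URLs, and the pool size of 4 are assumptions chosen for the example.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class UrlFilterSketch {
    // Hypothetical rules: stay on example.com and skip static or binary files.
    static final Pattern ALLOWED = Pattern.compile("^https://example\\.com/.*");
    static final Pattern SKIPPED = Pattern.compile(
            ".*\\.(css|js|png|jpe?g|gif|pdf|zip)$", Pattern.CASE_INSENSITIVE);

    /** Decides whether a discovered link is worth downloading. */
    static boolean shouldVisit(String url) {
        return ALLOWED.matcher(url).matches() && !SKIPPED.matcher(url).matches();
    }

    /** Keeps only the URLs that pass shouldVisit, preserving their order. */
    static List<String> filter(List<String> urls) {
        return urls.stream().filter(UrlFilterSketch::shouldVisit).collect(Collectors.toList());
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> found = List.of(
                "https://example.com/page?pageID=1",
                "https://example.com/logo.png",
                "https://other.example.org/page");

        // The pool size plays the role of the "number of concurrent threads" setting.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String url : filter(found)) {
            // Placeholder task; a real crawler would download and parse the page here.
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " would fetch " + url));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With these example rules, only the first of the three URLs is handed to the pool; the image is skipped by extension and the third link is on a different host.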
Solution 10 - Java
Normally I use Selenium, which is software for test automation. You can control a browser through a WebDriver, so you will not have problems with JavaScript, and it is usually harder to detect if you use a full browser. Headless browsers are more easily identified.
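For comparison, a minimal sketch of driving a full (non-headless) browser with Selenium WebDriver might look like the following. It assumes the Selenium Java bindings are on the classpath and that Chrome with a matching driver is available on the machine; the URL and the page ID range are placeholders.

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

class SeleniumTitleSketch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // opens a visible Chrome window
        try {
            for (int pageId = 1; pageId <= 3; pageId++) {
                // Hypothetical URL scheme; replace with the real site's page URLs.
                driver.get("https://example.com/page?pageID=" + pageId);
                // The title is read from the page as loaded by the browser.
                System.out.println(pageId + ": " + driver.getTitle());
            }
        } finally {
            driver.quit(); // always close the browser, even if a page fails
        }
    }
}
```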