Sometimes in a developer’s life there is no clean API available to gather information from a web application: no SOAP, no XML-RPC and no REST, just a website hiding the information we’re looking for somewhere in its DOM hierarchy. Then the only solution is screenscraping.
Screenscraping always leaves me with a bad feeling, but luckily there is a tool that makes this job at least a bit easier for a developer: jsoup to the rescue!
Prerequisites
Nothing special here, just a JDK and good ole’ Maven.
Creating a new Project
First we need a new Maven project …
- Create a new Maven project using your IDE or on the console with mvn archetype:generate
- We need just one dependency for jsoup. With it added, my pom.xml looks like this:

    <?xml version="1.0"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.hascode.samples</groupId>
        <artifactId>jsoup-example</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <dependencies>
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.6.1</version>
            </dependency>
        </dependencies>
    </project>
Screenscraping a Website
In the following example, we’re going to fetch the content of www.hascode.com and parse its title, the heading of the current article and some of the metadata available.
- That’s what my screenscraping class looks like:

    package com.hascode.samples.jsoup;

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class WebScraper {
        public static void main(final String[] args) throws IOException {
            // fetch the page, sending a user agent and using a 6 second timeout
            Document doc = Jsoup.connect("https://www.hascode.com/")
                    .userAgent("Mozilla").timeout(6000).get();

            // parsing the page's title
            String title = doc.title();
            System.out.println("The title of www.hascode.com is: " + title);

            // parsing the latest article's heading
            Elements heading = doc.select("h2 > a");
            System.out.println("The latest article is: " + heading.text());
            System.out.println("The article's URL is: " + heading.attr("href"));

            // parsing the article's metadata (creation date and author)
            Elements editorial = doc.select("div.BlockContent-body small");
            System.out.println("The article was created: " + editorial.text());
        }
    }

- Running the class, we’re going to see the following output:

    The title of www.hascode.com is: hasCode.com
    The latest article is: Contract-First Web-Services using JAX-WS, JAX-B, Maven and Eclipse
    The article's URL is: https://www.hascode.com/2011/08/contract-first-web-services-using-jax-ws-jax-b-maven-and-eclipse/
    The article was created: August 23rd, 2011 by micha kops
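One thing to keep in mind: a selector like h2 > a may match more than one link, and Elements.text() then returns the combined text of all matches. The following is only a minimal sketch of how to pick just the first match and cope with a missing element. The class name, the helper firstLinkText and the sample markup are made up for illustration, and the example parses an inline string instead of fetching the live page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstHeadingExample {

    // Hypothetical markup standing in for a downloaded page
    static final String SAMPLE_HTML = "<html><head><title>Sample Blog</title></head><body>"
            + "<h2><a href=\"/first-post/\">First post</a></h2>"
            + "<h2><a href=\"/second-post/\">Second post</a></h2>"
            + "</body></html>";

    // Returns the text of the first link matching the selector,
    // or a fallback string if nothing matched
    static String firstLinkText(final Document doc, final String selector) {
        Element link = doc.select(selector).first(); // null if there is no match
        return link != null ? link.text() : "(no match)";
    }

    public static void main(final String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        // text() on the whole selection combines the text of all matches
        System.out.println("All matches combined: " + doc.select("h2 > a").text());
        System.out.println("First match only: " + firstLinkText(doc, "h2 > a"));
        System.out.println("Missing element: " + firstLinkText(doc, "h3 > a"));
    }
}
```

Checking for null before reading from the first element avoids a NullPointerException when the site changes its markup, which is probably the most common way a screenscraper breaks.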
Parsing HTML Fragments
Sometimes all we get from an API is a single fragment of HTML code. That’s no problem with jsoup.
- The HTML fragment parser:

    package com.hascode.samples.jsoup;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FragmentParser {
        public static void main(final String[] args) {
            String htmlFragment = "<div class=\"breadcrumb\">";
            htmlFragment += "<ul><li><a href=\"/\">Home</a></li>";
            htmlFragment += "<li><a href=\"#cat1\">Category 1</a></li>";
            htmlFragment += "</ul></div>";

            // the fragment is parsed into the body of a new document
            Document doc = Jsoup.parseBodyFragment(htmlFragment);
            Element div = doc.body().select("div").first();
            Element a1 = div.select("ul a").first();
            Element a2 = div.select("ul a").get(1);

            System.out.println(String.format("The div has the class '%s'",
                    div.attr("class")));
            System.out.println(String.format(
                    "The first link in the breadcrumb has the text '%s' and links to '%s'.",
                    a1.text(), a1.attr("href")));
            System.out.println(String.format(
                    "The second link in the breadcrumb has the text '%s' and links to '%s'",
                    a2.text(), a2.attr("href")));
        }
    }

- And the output it produces:

    The div has the class 'breadcrumb'
    The first link in the breadcrumb has the text 'Home' and links to '/'.
    The second link in the breadcrumb has the text 'Category 1' and links to '#cat1'
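The links in this fragment are relative. If we know where the fragment came from, jsoup can resolve them for us: parseBodyFragment also accepts a base URI, and the abs: attribute prefix returns the resolved URL. Here is a small sketch of that idea. The class name, the helper absoluteLinks and the choice of https://www.hascode.com/ as the fragment's origin are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinkExample {

    // Collects the absolute URL of every link in the fragment,
    // resolved against the given base URI
    static List<String> absoluteLinks(final String htmlFragment, final String baseUri) {
        Document doc = Jsoup.parseBodyFragment(htmlFragment, baseUri);
        List<String> urls = new ArrayList<String>();
        for (Element a : doc.select("a[href]")) {
            urls.add(a.attr("abs:href")); // "abs:" resolves against the base URI
        }
        return urls;
    }

    public static void main(final String[] args) {
        String htmlFragment = "<div class=\"breadcrumb\">"
                + "<ul><li><a href=\"/\">Home</a></li>"
                + "<li><a href=\"#cat1\">Category 1</a></li>"
                + "</ul></div>";
        // Assumed origin of the fragment
        for (String url : absoluteLinks(htmlFragment, "https://www.hascode.com/")) {
            System.out.println(url);
        }
    }
}
```

Without a base URI, a.attr("abs:href") returns an empty string for relative links, so it is worth passing one whenever the origin of the fragment is known.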
Tutorial Sources
I have put the source from this tutorial in my GitHub repository. Download it there or check it out using Git:
git clone https://github.com/hascode/hascode-tutorials.git