Screenscraping made easy using jsoup and Maven

Sometimes in a developer’s life there is no clean API available to gather information from a web application .. no SOAP, no XML-RPC and no REST .. just a website hiding the information we’re looking for somewhere in its DOM hierarchy – so the only solution is screenscraping.

Screenscraping always leaves me with a bad feeling – but luckily there is a tool that makes this job at least a bit easier for a developer .. jsoup to the rescue!

Prerequisites

Nothing special here .. just a JDK and good ole’ Maven ..

Creating a new Project

First we need a new Maven project …

Create a new Maven project using your IDE or from the console with mvn archetype:generate

We need just one dependency, jsoup. With it added, my pom.xml looks like this:

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.hascode.samples</groupId>
  <artifactId>jsoup-example</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.6.1</version>
    </dependency>
  </dependencies>
</project>

Screenscraping a Website

In the following example, we’re going to fetch the content of www.hascode.com and parse its title, the heading of the current article and some of the available metadata …

That’s what my screenscraping class looks like

package com.hascode.samples.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class WebScraper {
    public static void main(final String[] args) throws IOException {
        // fetch and parse the page, sending a user agent and using a 6 second timeout
        Document doc = Jsoup.connect("https://www.hascode.com/")
                .userAgent("Mozilla").timeout(6000).get();

        // parsing the page's title
        String title = doc.title();
        System.out.println("The title of www.hascode.com is: " + title);

        // parsing the latest article's heading
        Elements heading = doc.select("h2 > a");
        System.out.println("The latest article is: " + heading.text());
        System.out.println("The article's URL is: " + heading.attr("href"));

        // parsing the article's metadata (creation date and author)
        Elements editorial = doc.select("div.BlockContent-body small");
        System.out.println("The article was created: " + editorial.text());
    }
}

Running the class, we see the following output:

The title of www.hascode.com is: hasCode.com
The latest article is: Contract-First Web-Services using JAX-WS, JAX-B, Maven and Eclipse
The article's URL is: https://www.hascode.com/2011/08/contract-first-web-services-using-jax-ws-jax-b-maven-and-eclipse/
The article was created: August 23rd, 2011 by micha kops
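Selectors like the ones above depend on the markup of a live page, so it can help to try them against a string first. The following is only a minimal sketch under assumptions: the class name OfflineSelectorSketch, the helper firstText and the sample markup are made up for illustration and are not the real hascode.com HTML. It shows that text() on a selection joins the text of all matches, while first() gives us a single element (or null if nothing matched).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OfflineSelectorSketch {
    // Hypothetical markup for illustration only, not the real page.
    static final String SAMPLE_HTML =
            "<html><head><title>Sample Blog</title></head><body>"
            + "<h2><a href=\"/first-post/\">First post</a></h2>"
            + "<h2><a href=\"/second-post/\">Second post</a></h2>"
            + "</body></html>";

    /**
     * Returns the text of the first element matching the CSS query,
     * or the given fallback if the query matches nothing.
     */
    static String firstText(final Document doc, final String cssQuery, final String fallback) {
        Element first = doc.select(cssQuery).first(); // null when there is no match
        return first != null ? first.text() : fallback;
    }

    public static void main(final String[] args) {
        // parse from a string instead of connecting to a website
        Document doc = Jsoup.parse(SAMPLE_HTML);

        System.out.println("Title: " + doc.title());
        // text() on the whole selection combines the text of every matched link
        System.out.println("All headings: " + doc.select("h2 > a").text());
        // first() narrows the selection down to one element
        System.out.println("First heading: " + firstText(doc, "h2 > a", "(none)"));
        // a selector that matches nothing falls back instead of failing
        System.out.println("Missing element: " + firstText(doc, "div.nope small", "(none)"));
    }
}
```

If a website changes its layout, a guard like firstText at least makes the missing element visible instead of silently printing an empty string.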

Parsing HTML Fragments

Sometimes we get a single fragment of HTML code from an API .. no problem with jsoup …

The HTML fragment parser

package com.hascode.samples.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FragmentParser {
    public static void main(final String[] args) throws IOException {
        // build a small HTML fragment containing a breadcrumb navigation
        String htmlFragment = "<div class=\"breadcrumb\">";
        htmlFragment += "<ul><li><a href=\"/\">Home</a></li>";
        htmlFragment += "<li><a href=\"#cat1\">Category 1</a></li>";
        htmlFragment += "</ul></div>";

        // jsoup wraps the fragment in a document, so we select from its body
        Document doc = Jsoup.parseBodyFragment(htmlFragment);
        Element div = doc.body().select("div").first();
        Element a1 = div.select("ul a").first();
        Element a2 = div.select("ul a").get(1);

        System.out.println(String.format("The div has the class '%s'",
                div.attr("class")));
        System.out.println(String.format(
                "The first link in the breadcrumb has the text '%s' and links to '%s'.",
                a1.text(), a1.attr("href")));
        System.out.println(String.format(
                "The second link in the breadcrumb has the text '%s' and links to '%s'",
                a2.text(), a2.attr("href")));
    }
}

And this is the output it produces:

The div has the class 'breadcrumb'
The first link in the breadcrumb has the text 'Home' and links to '/'.
The second link in the breadcrumb has the text 'Category 1' and links to '#cat1'
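The links in the fragment are relative ('/' and '#cat1'), which is often not what we want when storing scraped URLs. As a rough sketch: jsoup can resolve them if we pass a base URI to parseBodyFragment and read the attribute with absUrl. The class name AbsoluteLinkSketch, the helper absoluteLinks and the base URI used here are assumptions for illustration, since a fragment by itself does not tell us where it came from.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinkSketch {
    /**
     * Parses the fragment relative to the given base URI and returns
     * the absolute URL of every link that has an href attribute.
     */
    static List<String> absoluteLinks(final String htmlFragment, final String baseUri) {
        // the second argument tells jsoup how to resolve relative URLs
        Document doc = Jsoup.parseBodyFragment(htmlFragment, baseUri);
        List<String> links = new ArrayList<String>();
        for (Element a : doc.select("a[href]")) {
            // absUrl resolves the attribute value against the base URI
            links.add(a.absUrl("href"));
        }
        return links;
    }

    public static void main(final String[] args) {
        String htmlFragment = "<div class=\"breadcrumb\">"
                + "<ul><li><a href=\"/\">Home</a></li>"
                + "<li><a href=\"#cat1\">Category 1</a></li>"
                + "</ul></div>";
        // assumed base URI, chosen only for this example
        for (String link : absoluteLinks(htmlFragment, "https://www.hascode.com/")) {
            System.out.println(link);
        }
    }
}
```

Without a base URI, absUrl has nothing to resolve against and returns an empty string for relative links, so attr("href") remains the way to read the raw value.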

Tutorial Sources

I have put the source from this tutorial in my GitHub repository. Download it there or clone it using Git:

git clone https://github.com/hascode/hascode-tutorials.git
