
Sometimes in a developer’s life there is no clean API available to gather information from a web application .. no SOAP, no XML-RPC and no REST .. just a website hiding the information we’re looking for somewhere in its DOM hierarchy – so the only solution is screenscraping.

Screenscraping always leaves me with a bad feeling – but luckily there is a tool that makes this job at least a bit easier for a developer .. jsoup to the rescue!

Prerequisites

Nothing special here .. just a JDK and good ole’ Maven ..

Creating a new Project

First we need a new Maven project …

  • Create a new Maven project using your IDE or on the console with mvn archetype:generate

  • We need just one dependency, jsoup. After adding it, my pom.xml looks like this

    <?xml version="1.0"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.hascode.samples</groupId>
      <artifactId>jsoup-example</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <dependencies>
        <dependency>
          <groupId>org.jsoup</groupId>
          <artifactId>jsoup</artifactId>
          <version>1.6.1</version>
        </dependency>
      </dependencies>
    </project>

Screenscraping a Website

In the following example, we’re going to fetch the content of www.hascode.com and parse its title, the heading of the current article and some metadata available …

  • That’s what my screenscraping class looks like

    package com.hascode.samples.jsoup;
    
    import java.io.IOException;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class WebScraper {
        public static void main(final String[] args) throws IOException {
            // fetch the page, identifying as "Mozilla" with a 6 second timeout
            Document doc = Jsoup.connect("https://www.hascode.com/")
                    .userAgent("Mozilla").timeout(6000).get();

            // the page's title
            String title = doc.title();
            System.out.println("The title of www.hascode.com is: " + title);

            // the latest article's heading
            Elements heading = doc.select("h2 > a");
            System.out.println("The latest article is: " + heading.text());
            System.out.println("The article's URL is: " + heading.attr("href"));

            // the article's metadata (date and author)
            Elements editorial = doc.select("div.BlockContent-body small");
            System.out.println("The article was created: " + editorial.text());
        }
    }

  • Running the class we’re going to see the following output

    The title of www.hascode.com is: hasCode.com
    The latest article is: Contract-First Web-Services using JAX-WS, JAX-B, Maven and Eclipse
    The article's URL is: https://www.hascode.com/2011/08/contract-first-web-services-using-jax-ws-jax-b-maven-and-eclipse/
    The article was created: August 23rd, 2011 by micha kops
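
One thing to keep in mind with the class above: Elements.text() returns the combined text of all matched elements, and Elements.attr() returns the value from the first match that has the attribute. If the selector matches several headings, it is usually clearer to iterate over them. The following is only a minimal sketch of that idea .. the class name HeadingLister, the sample markup and the article paths are made up for illustration, and the HTML is parsed from a string so the example runs without network access. Passing a base URI to Jsoup.parse allows absUrl to resolve relative links.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the tutorial sources.
public class HeadingLister {

    // Returns one "text -> absolute URL" entry per link matching "h2 > a".
    public static List<String> listHeadings(String html, String baseUri) {
        // the base URI is needed so that relative hrefs can be resolved
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("h2 > a")) {
            result.add(link.text() + " -> " + link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // made-up sample markup standing in for a fetched page
        String html = "<h2><a href=\"/2011/08/first/\">First</a></h2>"
                + "<h2><a href=\"/2011/07/second/\">Second</a></h2>";
        for (String line : listHeadings(html, "https://www.hascode.com/")) {
            System.out.println(line);
        }
    }
}
```

For a live page, the Document returned by Jsoup.connect(...).get() already carries the page's URL as its base URI, so the same loop should work on it without changes.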

Parsing HTML Fragments

Sometimes we get a single fragment of HTML code from an API .. no problem with jsoup …

  • The HTML fragment parser

    package com.hascode.samples.jsoup;
    
    import java.io.IOException;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    public class FragmentParser {
        public static void main(final String[] args) throws IOException {
            // build the HTML fragment to parse
            String htmlFragment = "<div class=\"breadcrumb\">";
            htmlFragment += "<ul><li><a href=\"/\">Home</a></li>";
            htmlFragment += "<li><a href=\"#cat1\">Category 1</a></li>";
            htmlFragment += "</ul></div>";

            // jsoup wraps the fragment in a full document, so we start at the body
            Document doc = Jsoup.parseBodyFragment(htmlFragment);
            Element div = doc.body().select("div").first();
            Element a1 = div.select("ul a").first();
            Element a2 = div.select("ul a").get(1);

            System.out.println(String.format("The div has the class '%s'",
                    div.attr("class")));
            System.out.println(String.format(
                    "The first link in the breadcrumb has the text '%s' and links to '%s'.",
                    a1.text(), a1.attr("href")));
            System.out.println(String.format(
                    "The second link in the breadcrumb has the text '%s' and links to '%s'",
                    a2.text(), a2.attr("href")));
        }
    }

  • And this is the output it produces

    The div has the class 'breadcrumb'
    The first link in the breadcrumb has the text 'Home' and links to '/'.
    The second link in the breadcrumb has the text 'Category 1' and links to '#cat1'
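
The parser above picks the links with first() and get(1), which only works when the fragment really contains at least two links: first() returns null when nothing matches, and get(1) throws an IndexOutOfBoundsException. Here is a small sketch of a more forgiving variant that simply loops over whatever links are there. The class name BreadcrumbTrail and the " > " separator are my own assumptions for the sake of the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of the tutorial sources.
public class BreadcrumbTrail {

    // Joins the text of all breadcrumb links, e.g. "Home > Category 1".
    // Returns an empty string if the fragment contains no breadcrumb links.
    public static String trail(String htmlFragment) {
        Document doc = Jsoup.parseBodyFragment(htmlFragment);
        Elements links = doc.body().select("div.breadcrumb ul a");
        StringBuilder sb = new StringBuilder();
        for (Element link : links) {
            if (sb.length() > 0) {
                sb.append(" > ");
            }
            sb.append(link.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String htmlFragment = "<div class=\"breadcrumb\">"
                + "<ul><li><a href=\"/\">Home</a></li>"
                + "<li><a href=\"#cat1\">Category 1</a></li>"
                + "</ul></div>";
        System.out.println(trail(htmlFragment));
    }
}
```

Because the loop never indexes into the result, a fragment with one link, ten links or none at all is handled the same way.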

Tutorial Sources

I have put the sources from this tutorial into my GitHub repository. Download them there or clone the repository using Git:

    git clone https://github.com/hascode/hascode-tutorials.git