Extractomatic

Extractomatic is a simple API to detect and remove surplus clutter (such as adverts, headers and footers) around the main content of a web page. It uses the Boilerpipe Java library by Christian Kohlschütter.

Usage

There is one method, /extract, which responds to GET requests and returns JSON. Pass the URL of the page whose content you wish to extract in the url parameter.

For example, a GET request to http://extractomatic.tomtaylor.co.uk/extract?url=http://www.nothingtoseehere.net/2010/01/moomin_world_naantali_1.html returns:

{
  "response" : {
    "title" : "Nothing To See Here: Moomin World, Naantali",
    "content" : "Moomin World, Naantali\nTove Jansson\u2019s Moomins, created by... (more content)",
    "source" : "http://www.nothingtoseehere.net/2010/01/moomin_world_naantali_1.html"
  },
  "status" : "success"
}
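
A minimal Python sketch of making that request, assuming only what is described above: a GET to /extract with the page address in the url parameter. The helper names build_extract_url and fetch_extract are hypothetical, the url value is percent-encoded as a precaution, and the host may not be reachable.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example request above.
BASE_URL = "http://extractomatic.tomtaylor.co.uk/extract"


def build_extract_url(page_url, base_url=BASE_URL):
    """Return the /extract request URL with page_url percent-encoded."""
    return base_url + "?" + urlencode({"url": page_url})


def fetch_extract(page_url, timeout=10):
    """GET the endpoint and decode the JSON body (performs a network call)."""
    with urlopen(build_extract_url(page_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

On success, fetch_extract(...)["response"]["content"] would hold the extracted text, matching the JSON shown above.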

Boilerpipe provides a number of different extraction algorithms, which vary in effectiveness depending on the style of content you're interested in.

To change the algorithm, pass the optional mode parameter. The mode defaults to default, unsurprisingly.

Mode               Description
article            Uses ArticleExtractor: A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor.
article_sentences  Uses ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles.
min_k_words        Uses KeepEverythingWithMinKWordsExtractor: A full-text extractor which keeps every text block containing at least k words.
default            Uses DefaultExtractor: A quite generic full-text extractor.
largest            Uses LargestContentExtractor: A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than the DefaultExtractor, but usually worse than ArticleExtractor.
num_words          Uses NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
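
The mode names in the table above can be checked on the client before a request is sent. A rough sketch in Python: the MODES set simply mirrors the table, and build_url_with_mode is a hypothetical helper, not part of Extractomatic.

```python
from urllib.parse import urlencode

# Mode names copied from the table above.
MODES = {"article", "article_sentences", "min_k_words",
         "default", "largest", "num_words"}


def build_url_with_mode(page_url, mode="default",
                        base_url="http://extractomatic.tomtaylor.co.uk/extract"):
    """Build an /extract URL, rejecting modes the table does not list."""
    if mode not in MODES:
        raise ValueError("unknown extractor mode: %r" % mode)
    # Both parameters are percent-encoded by urlencode.
    return base_url + "?" + urlencode({"url": page_url, "mode": mode})
```

Rejecting an unlisted mode locally avoids a round trip that would presumably end in an error response.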

The title is always the HTML title of the page.

Errors

If an error occurs, you will get a response that looks something like:

{
  "error" : {
    "message" : "Malformed URL",
    "code" : 100
  },
  "status" : "error"
}

The different errors you might see are:

Code  Description
100   Malformed URL
101   Could not fetch the URL
102   Unknown extractor mode
103   No URL requested
99    Unknown error
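
One way a client might turn these error responses into exceptions, sketched in Python. ExtractomaticError and parse_response are hypothetical names, and falling back to code 99 when the error object is incomplete is an assumption rather than documented behaviour.

```python
import json


class ExtractomaticError(Exception):
    """Raised when the body has status "error" (hypothetical class)."""

    def __init__(self, code, message):
        super().__init__("%s (code %s)" % (message, code))
        self.code = code
        self.message = message


def parse_response(body):
    """Return the "response" object from a JSON body, or raise on an error."""
    data = json.loads(body)
    if data.get("status") == "error":
        err = data.get("error", {})
        # Assumption: treat a missing code or message as the generic error 99.
        raise ExtractomaticError(err.get("code", 99),
                                 err.get("message", "Unknown error"))
    return data["response"]
```

Callers can then branch on the code attribute, for example retrying on 101 but not on 100.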

More Info

Extractomatic is hosted on Google App Engine, and the source code is available on GitHub.

Credits

Thanks to Christian Kohlschütter for releasing Boilerpipe under the Apache License 2.0.