Extractomatic is a simple API that detects and removes surplus clutter (such as adverts, headers and footers) around the main content of a web page. It uses the Boilerpipe Java library, by Christian Kohlschütter.
There is one method at /extract, which responds to GET requests and returns JSON. Pass the URL of the content you wish to extract in the url parameter.
For example, a GET request to http://extractomatic.tomtaylor.co.uk/extract?url=http://www.nothingtoseehere.net/2010/01/moomin_world_naantali_1.html returns:
{
  "response" : {
    "title" : "Nothing To See Here: Moomin World, Naantali",
    "content" : "Moomin World, Naantali\nTove Jansson\u2019s Moomins, created by... (more content)",
    "source" : "http://www.nothingtoseehere.net/2010/01/moomin_world_naantali_1.html"
  },
  "status" : "success"
}
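The request above can be sketched in Python using only the standard library. The helper name `build_extract_url` is hypothetical, and percent-encoding the target URL is a precaution rather than something this page states the API requires.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request above.
EXTRACT_ENDPOINT = "http://extractomatic.tomtaylor.co.uk/extract"


def build_extract_url(page_url, endpoint=EXTRACT_ENDPOINT):
    """Return the GET URL that asks /extract to process `page_url`.

    Hypothetical helper, not part of Extractomatic itself.
    """
    # urlencode percent-encodes the target URL, so any '?' or '&' it
    # contains cannot be mistaken for part of the outer query string.
    return endpoint + "?" + urlencode({"url": page_url})


if __name__ == "__main__":
    print(build_extract_url(
        "http://www.nothingtoseehere.net/2010/01/moomin_world_naantali_1.html"
    ))
```

The resulting string can be passed to any HTTP client, and the body it returns decoded as JSON.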
Boilerpipe provides a number of different extraction algorithms, which vary in effectiveness depending on the style of content you're interested in.
To change the algorithm, pass the optional mode parameter. The mode defaults to default, unsurprisingly.
| Mode | Description |
|---|---|
| article | Uses ArticleExtractor: A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. |
| article_sentences | Uses ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles. |
| min_k_words | Uses KeepEverythingWithMinKWordsExtractor: A full-text extractor which keeps every text block that contains at least k words. |
| default | Uses DefaultExtractor: A quite generic full-text extractor. |
| largest | Uses LargestContentExtractor: A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than the DefaultExtractor, but usually worse than ArticleExtractor. |
| num_words | Uses NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block). |
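As a rough illustration of selecting a mode, the sketch below builds the query string with an optional `mode` value. The mode names are copied from the table above; the helper `build_query` and the client-side check are assumptions made for this example, since the service itself reports an unknown mode through an error response.

```python
from urllib.parse import urlencode

# Mode names copied from the table above.
KNOWN_MODES = {
    "article",
    "article_sentences",
    "min_k_words",
    "default",
    "largest",
    "num_words",
}


def build_query(page_url, mode=None):
    """Return the query string for /extract (hypothetical helper)."""
    # Rejecting unknown modes locally is an assumption of this sketch;
    # it only saves a round trip to the server.
    if mode is not None and mode not in KNOWN_MODES:
        raise ValueError("unknown extractor mode: %r" % (mode,))
    params = {"url": page_url}
    # Leaving mode out lets the server fall back to its default.
    if mode is not None:
        params["mode"] = mode
    return urlencode(params)
```

For instance, `build_query("http://example.com/", "article")` asks for ArticleExtractor, while omitting the second argument leaves the choice to the server.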
The title is always the HTML title of the page.
If an error occurs, you will get a response that looks something like:
{
  "error" : {
    "message" : "Malformed URL",
    "code" : 100
  },
  "status" : "error"
}
The different errors you might see are:
| Code | Description |
|---|---|
| 100 | Malformed URL |
| 101 | Could not fetch the URL |
| 102 | Unknown extractor mode |
| 103 | No URL requested |
| 99 | Unknown error |
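One way a client might handle both response shapes is sketched below. The field names and error codes come from the examples and table above; the exception class `ExtractomaticError` and the function `parse_extract_response` are hypothetical names introduced for illustration.

```python
import json

# Codes and descriptions copied from the table above.
ERROR_DESCRIPTIONS = {
    100: "Malformed URL",
    101: "Could not fetch the URL",
    102: "Unknown extractor mode",
    103: "No URL requested",
    99: "Unknown error",
}


class ExtractomaticError(Exception):
    """Raised when the API reports an error (hypothetical class)."""

    def __init__(self, code, message):
        super().__init__("%s: %s" % (code, message))
        self.code = code
        self.message = message


def parse_extract_response(body):
    """Return the "response" object, or raise ExtractomaticError."""
    data = json.loads(body)
    if data.get("status") == "success":
        return data["response"]
    error = data.get("error", {})
    # Treating a missing code as 99 is an assumption of this sketch.
    code = error.get("code", 99)
    message = error.get("message") or ERROR_DESCRIPTIONS.get(code, "Unknown error")
    raise ExtractomaticError(code, message)
```

Callers can then branch on `exc.code`, for example retrying on 101 but not on 100.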
Extractomatic is hosted on Google App Engine, and the source code is available on GitHub.
Thanks to Christian Kohlschütter for releasing Boilerpipe under the Apache License 2.0.