Unknown reporter writes

This is a follow-up to the [Extract relevant text](http://sourceforge.net/p/recollfirefox/tickets/4/) feature request for the Firefox extension. I figure it would be more useful within Recoll itself.

>This plugin sends the entire page to Recoll for indexing, even irrelevant content such as ads, sidebars, etc., which clutters the search results.
>You probably know more about this than I do, but I found these links useful. There are lots of open-source libraries providing this functionality. None is available for JavaScript so far, however.
>
>* [Evaluating Text Extraction Algorithms](https://web.archive.org/web/20130627025152/http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms)
>* [Overview: Extracting article text from HTML documents](https://web.archive.org/web/20130623005602/http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/)
>* [List of resources: Article text extraction from HTML documents](https://web.archive.org/web/20130622025806/http://tomazkovacic.com/blog/56/list-of-resources-article-text-extraction-from-html-documents/)
>* [Feature-wise Comparison of HTML Article Text Extractors](https://web.archive.org/web/20130623072055/http://tomazkovacic.com/blog/98/feature-wise-comparison-of-html-article-text-extractors/)

The Python goose-extractor looks promising - it can also deduce tags. I assume filters and input handlers are the same, so I’ll just follow the directions at [Writing a document input handler](http://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.PROGRAM.FILTERS.html).

If it seems to work well, would you be interested in adding support for HTML text extractors?

medoc writes

Yes, filters and input handlers are the same thing; I now prefer the latter term.

And yes, an HTML handler which would reduce the clutter would be interesting, especially for the web history I guess (but also probably for saved web pages). Don’t hesitate to ask for help if you have any trouble with the interface.

orbisvicis writes

I’ve evaluated (not rigorously):

  • closed-source: diffbot, alchemyAPI

  • open-source: libextract (eatih), several python ports of readability, goose, boilerpipe, dragnet

I can’t stress enough how impressed I am with diffbot. Conversely alchemyAPI seems to be the worst extractor - either that or I haven’t used the online demo appropriately.

The two best open-source extractors are boilerpipe and dragnet. Dragnet has slightly lower recall but compensates with slightly higher (weighted) accuracy. I’ve picked boilerpipe because it is an established project that is well packaged among Linux distributions, while dragnet is new and has hard-to-meet requirements. Here’s an initial implementation of a [boilerpipe html content extractor for recoll](https://bitbucket.org/snippets/orbisvicis/y8grb).
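For readers unfamiliar with the terms, recall and precision of an extractor can be estimated at the word level against a hand-made reference text. The following is only a minimal sketch and not the method used in the evaluation above: the class `TokenOverlap` and its method names are hypothetical, the reference text is assumed to be prepared by hand, and precision stands in for the "(weighted) accuracy" mentioned, which this thread does not define.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of boilerpipe, dragnet or Recoll.
final class TokenOverlap {
    private TokenOverlap() {}

    /** Counts lower-cased word tokens, splitting on anything that is not a letter or digit. */
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> result = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            Integer current = result.get(token);
            result.put(token, current == null ? 1 : current + 1);
        }
        return result;
    }

    /** Total number of tokens represented by a count map. */
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /** Number of tokens shared by both maps, counting each word at most min(a, b) times. */
    static int overlap(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                shared += Math.min(e.getValue(), other);
            }
        }
        return shared;
    }

    /** Fraction of the reference words that survive in the extracted text. */
    static double recall(String reference, String extracted) {
        Map<String, Integer> ref = counts(reference);
        int refTotal = total(ref);
        return refTotal == 0 ? 0.0 : (double) overlap(ref, counts(extracted)) / refTotal;
    }

    /** Fraction of the extracted words that also occur in the reference text. */
    static double precision(String reference, String extracted) {
        Map<String, Integer> ext = counts(extracted);
        int extTotal = total(ext);
        return extTotal == 0 ? 0.0 : (double) overlap(counts(reference), ext) / extTotal;
    }
}
```

In these terms, low recall means article text was lost, while low precision means clutter such as menus or ads was kept.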

orbisvicis writes

I also looked into semi-standardized keyword extraction (i.e., against Wikipedia) but found nothing promising. Maui requires wikipedia-miner (website down, poor documentation, high system configuration cost: mysql, etc.), and dbpedia-spotlight requires ~6GB of memory even when using the disk-backed store. Do you have any ideas?

medoc writes

About the HTML handler: I’d like to give it a try, but you’ll have to bear with a non-Java person:

  • The Makefile is complaining about a missing .manifest file

  • If I remove the dependency, the compiler complains about missing modules, e.g. org.apache.commons.cli, org.jsoup, etc.

Would you be able to come up with a list of packages to install (on a common Linux distribution: your choice of Debian, Ubuntu, Fedora, or Suse in a pinch)?

orbisvicis writes

Dependencies:

  1. ubuntu

    • libboilerpipe-java

    • libnekohtml-java

    • xerces-j2

    • libjsoup-java

    • libjuniversalchardet-java

    • libcommons-cli-java

  2. fedora

    • boilerpipe

    • nekohtml

    • xerces-j2

    • jsoup

    • juniversalchardet

    • apache-commons-cli

Files:

  • BoilerpipeHandler.java

  • BoilerpipeHandler.manifest (this was misnamed as BoilerpipeHander.txt in the snippet, now fixed)

  • Makefile

Edit Files:

  • If required, change the path of the installed jars in the Makefile and the manifest file. I’m new to this myself, so there’s probably a better way of doing things (maybe ant?).

Running:

  • java -jar BoilerpipeHandler.jar -e DefaultExtractor -o output.html --html input.html

medoc writes

What platform are you building this on (distribution and version)? Apparently String.join(), at least, requires Java 8, which is nowhere to be found on Ubuntu Trusty as far as I can see.
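For anyone stuck on an older JDK, String.join() can be replaced by a small helper that compiles on Java 7. This is only a sketch: the class name `JoinCompat` is hypothetical and is not part of the BoilerpipeHandler snippet, and the call sites in the handler would have to be changed by hand.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical Java 7 compatible stand-in for String.join(); not part of BoilerpipeHandler.java.
final class JoinCompat {
    private JoinCompat() {}

    /** Concatenates the parts, placing the delimiter between consecutive elements. */
    static String join(CharSequence delimiter, Iterable<? extends CharSequence> parts) {
        StringBuilder sb = new StringBuilder();
        Iterator<? extends CharSequence> it = parts.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append(delimiter); // only between elements, never after the last one
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Joins two extractor names mentioned in this thread.
        System.out.println(join(", ", Arrays.asList("DefaultExtractor", "ArticleExtractor")));
    }
}
```

Whether the rest of the handler builds on Java 7 after this change has not been checked here.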

medoc writes

I finally got it working on Fedora, and tried it on a few nytimes pages.

While it does remove a lot of clutter, it also almost always cuts the first part of an article, which is really not acceptable at all.

I think that it’s much better to have to deal with a bit of clutter (just hit next in the search tool until you get to what you want) than to risk missing data!

orbisvicis writes

I’m glad you got it working. I just finished adding more features to the command-line interface, synchronizing it with boilerpipe’s upstream, but I would rather not add workarounds for older Java versions. I wish I had implemented this in Python via JPype - it would have taken significantly less time and fewer lines of code.

Anyway, I’m on Fedora 23 using OpenJDK 8 (a bit too late, sorry). Unfortunately the new changes require the pre-2.0 version of boilerpipe, which is only available via git, and Maven to build. After checkout, run mvn clean package install in the root directory, then move nekohtml-relocated/target/nekohtml-relocated-1.9.13.jar and boilerpipe-common/target/boilerpipe-common-2.0-SNAPSHOT.jar to the build directory. That’s it - business as usual from there.

> it also almost always cuts the first part of an article

Are you sure? That sounds very serious. What do you consider the first part of an article? Do you mind sharing a link? I’ve tested [How a Medical Mystery in Brazil Led Doctors to Zika](http://www.nytimes.com/2016/02/07/health/zika-virus-brazil-how-it-spread-explained.html?_r=0): the DefaultExtractor cuts the byline, a sentence near the top, and a sentence near the middle, while the ArticleExtractor only cuts the byline. (In both cases the author signature near the bottom is preserved.)

The new interface has some features to make this type of comparison easier:

java -jar BoilerpipeHandler.jar -e DefaultExtractor -o output.html --highlight
java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --highlight

It uses boilerpipe’s built-in HTML parser, so the encoding issues are not representative of the final result, such as the output of:

java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --markup --wrap
java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --wrap

There is also a new feature that extracts contained images:

java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --images --wrap

edit:

I did intend this for indexing and previewing, but if you feel it cuts too much content, it might still be useful for previewing only.

medoc writes

And the extracted text begins at:

There were also reports of people trapped inside razed vegetable

Which is apparently the first paragraph after the first photo, when I look at it with scripts blocked. With scripts enabled it’s the first paragraph after the "Show full article" box.

The text begins at:

While luxury goods may seem a relatively minor issue, experts on North

Which does not seem to be a special location on the page, so I guess that the above notes about image and button were a fluke.

In both cases the pages were saved as "Web page complete" from Firefox; maybe the save method makes a difference? But the saved HTML text does look complete.

By the way, I’ll be completely off-line next week, and only sporadically online during the whole month. Please don’t read my infrequent answers as a lack of interest in this; I am interested.

medoc writes

The text extractor tests seem to show that they miss useful text. This is quite unacceptable: it’s better to have a little too much noise than to miss information.
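One way to make "missing text" measurable is to locate where the extracted text begins inside the full page text and report how much precedes it. The sketch below is only an illustration under stated assumptions: both inputs are assumed to already be plain text (for example the output of the regular HTML handler and of the extractor), and the class `LeadingCutCheck`, its methods and the probe length are hypothetical names and values, not part of Recoll or boilerpipe.

```java
// Hypothetical check for illustration; not part of Recoll or BoilerpipeHandler.
final class LeadingCutCheck {
    private LeadingCutCheck() {}

    /** Collapses whitespace runs so that line wrapping does not affect matching. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /**
     * Returns how many characters of the normalized full text precede the start
     * of the extracted text, or -1 if that start cannot be found (or is empty).
     * Only the first probeLength characters of the extracted text are searched for.
     */
    static int leadingCharsCut(String fullText, String extracted, int probeLength) {
        String full = normalize(fullText);
        String ext = normalize(extracted);
        if (ext.isEmpty()) {
            return -1;
        }
        String probe = ext.substring(0, Math.min(probeLength, ext.length()));
        return full.indexOf(probe);
    }

    public static void main(String[] args) {
        String full = "By A. Reporter. First paragraph of the story. Second paragraph follows.";
        String extracted = "Second   paragraph\nfollows.";
        int cut = leadingCharsCut(full, extracted, 40);
        System.out.println("Characters dropped before the extracted text: " + cut);
    }
}
```

A result of 0 means nothing was cut at the top; a large value relative to the page length points at the kind of loss reported above. Text dropped from the middle of an article would need a different check.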