Unknown reporter writes
This is a follow-up to the [Extract relevant text](http://sourceforge.net/p/recollfirefox/tickets/4/) feature request for the Firefox extension. I figure it would be more useful within Recoll itself.
>This plugin sends the entire page to Recoll for indexing, even irrelevant content such as ads, sidebars, etc., which clutters the search results.
>You probably know more about this than me, but I found these links useful. There are lots of open-source libraries providing this functionality. None is available for JavaScript yet, however.
>
>* [Evaluating Text Extraction Algorithms](https://web.archive.org/web/20130627025152/http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms)
>* [Overview: Extracting article text from HTML documents](https://web.archive.org/web/20130623005602/http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/)
>* [List of resources: Article text extraction from HTML documents](https://web.archive.org/web/20130622025806/http://tomazkovacic.com/blog/56/list-of-resources-article-text-extraction-from-html-documents/)
>* [Feature-wise Comparison of HTML Article Text Extractors](https://web.archive.org/web/20130623072055/http://tomazkovacic.com/blog/98/feature-wise-comparison-of-html-article-text-extractors/)
Python goose-extractor looks promising - it can also deduce tags. I assume filters and input handlers are the same, so I’ll just follow the directions at [Writing a document input handler](http://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.PROGRAM.FILTERS.html).
If it seems to work well, would you be interested in adding support for HTML text extractors?
medoc writes
Yes, filters and input handlers are the same thing; I now prefer the latter term.
And yes, an HTML handler which would reduce the clutter would be interesting, especially for the web history I guess (but also probably for saved web pages). Don’t hesitate to ask for help if you have any trouble with the interface.
orbisvicis writes
I’ve evaluated (not rigorously):
- closed-source: diffbot, alchemyAPI
- open-source: libextract (eatih), several python ports of readability, goose, boilerpipe, dragnet
I can’t stress enough how impressed I am with diffbot. Conversely, alchemyAPI seems to be the worst extractor - either that or I haven’t used the online demo appropriately.
The two best open-source extractors are boilerpipe and dragnet. Dragnet has slightly lower recall but compensates with slightly higher (weighted) accuracy. I’ve picked boilerpipe because it is an established project that is well packaged across Linux distributions, while dragnet is new and has hard-to-meet requirements. Here’s an initial implementation of a [boilerpipe HTML content extractor for recoll](https://bitbucket.org/snippets/orbisvicis/y8grb).
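For readers unfamiliar with the recall trade-off mentioned above, here is a minimal sketch of how one might score an extractor against a hand-cleaned reference text. It is not the method used in the evaluation above (which was explicitly not rigorous). The class `ExtractionScore` and its methods are hypothetical names, and token-set precision is used here as a simple stand-in for the "weighted accuracy" figure, which is an assumption.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical scoring helper; not part of boilerpipe, dragnet, or recoll. */
final class ExtractionScore {
    private ExtractionScore() {}

    /** Lower-cased set of word tokens in the text. */
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Fraction of reference tokens that survived extraction (low = content was cut). */
    static double recall(String reference, String extracted) {
        Set<String> ref = tokens(reference);
        if (ref.isEmpty()) {
            return 1.0; // nothing to miss
        }
        Set<String> hit = new HashSet<>(ref);
        hit.retainAll(tokens(extracted));
        return (double) hit.size() / ref.size();
    }

    /** Fraction of extracted tokens that belong to the reference (low = clutter was kept). */
    static double precision(String reference, String extracted) {
        Set<String> ext = tokens(extracted);
        if (ext.isEmpty()) {
            return 0.0; // nothing was extracted
        }
        Set<String> hit = new HashSet<>(ext);
        hit.retainAll(tokens(reference));
        return (double) hit.size() / ext.size();
    }
}
```

An extractor that drops article text lowers recall, while one that keeps ads and sidebars lowers precision.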
orbisvicis writes
I also looked into semi-standardized keyword extraction (i.e., against Wikipedia) but found nothing promising. Maui requires wikipedia-miner (website down, poor documentation, high system configuration cost: mysql, etc.), and dbpedia-spotlight requires ~6GB memory even when using the disk-backed store. Do you have any ideas?
medoc writes
About the HTML handler: I’d like to give it a try, but you’ll have to bear with a non-Java person:
- The Makefile is complaining about a missing .manifest file
- If I remove the dependency, the compiler complains about missing modules, e.g. org.apache.commons.cli, org.jsoup, etc.

Would you be able to come up with a list of packages to install (on a common Linux distribution, your choice of Debian, Ubuntu, Fedora, or Suse in a pinch)?
orbisvicis writes
Dependencies:
- Ubuntu
  - libboilerpipe-java
  - libnekohtml-java
  - xerces-j2
  - libjsoup-java
  - libjuniversalchardet-java
  - libcommons-cli-java
- Fedora
  - boilerpipe
  - nekohtml
  - xerces-j2
  - jsoup
  - juniversalchardet
  - apache-commons-cli

Files:
- BoilerpipeHandler.java
- BoilerpipeHandler.manifest (this was misnamed as BoilerpipeHander.txt in the snippet, now fixed)
- Makefile

Edit Files:
- If required, change the path of the installed jars in the Makefile and the manifest file. I’m new to this myself, so there’s probably a better way of doing things (maybe ant?).

Running:
- java -jar BoilerpipeHandler.jar -e DefaultExtractor -o output.html --html input.html
medoc writes
What platform are you building this on (distribution and version)? Apparently String.join(), at least, is Java 8, and Java 8 is nowhere to be found on Ubuntu Trusty as far as I can see.
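For anyone stuck on an older JDK, a pre-Java-8 stand-in for String.join() could look roughly like the sketch below. This is only an illustration, not code from the handler: the class name `JoinCompat` is made up, and the other Java 8 features the handler may use are not covered.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper, not part of BoilerpipeHandler: replaces String.join() on Java 7. */
final class JoinCompat {
    private JoinCompat() {}

    /** Concatenates the parts, placing the delimiter between consecutive elements. */
    static String join(String delimiter, Iterable<String> parts) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = parts.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append(delimiter);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("DefaultExtractor", "ArticleExtractor");
        // Prints: DefaultExtractor, ArticleExtractor
        System.out.println(join(", ", names));
    }
}
```

Calls of the form String.join(sep, list) would then become JoinCompat.join(sep, list).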
medoc writes
I finally got it working on Fedora, and tried it on a few nytimes pages.
While it does remove a lot of clutter, it also almost always cuts the first part of an article, which is really not acceptable at all.
I think that it’s much better to have to deal with a bit of clutter (just hit next in the search tool until you get to what you want) than to risk missing data!
orbisvicis writes
I’m glad you got it working. I just finished adding more features to the command-line interface - synchronizing it with boilerpipe’s upstream - but I would rather not add workarounds for older Java versions. I wish I had implemented this in python via jpype - it would have taken significantly less time and fewer lines of code.
Anyway, I’m on Fedora 23 using openjdk 8 (a bit too late, sorry). Unfortunately the new changes require the pre-2.0 version of boilerpipe - only available via git - and maven to build. After checkout, run mvn clean package install in the root directory, then move nekohtml-relocated/target/nekohtml-relocated-1.9.13.jar and boilerpipe-common/target/boilerpipe-common-2.0-SNAPSHOT.jar to the build directory. That’s it - business as usual from then.
> it also almost always cuts the first part of an article
Are you sure? That sounds very serious. What do you consider the first part of an article? Do you mind sharing a link? I’ve tested [How a Medical Mystery in Brazil Led Doctors to Zika](http://www.nytimes.com/2016/02/07/health/zika-virus-brazil-how-it-spread-explained.html?_r=0): the DefaultExtractor cuts the byline, a sentence near the top, and a sentence near the middle, while the ArticleExtractor only cuts the byline (in both cases the author signature near the bottom is preserved).
The new interface has some features to make this type of comparison easier:
java -jar BoilerpipeHandler.jar -e DefaultExtractor -o output.html --highlight
java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --highlight
It uses boilerpipe’s built-in HTML parser, so the encoding issues are not representative of the final result, which is produced by commands such as:
java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --markup --wrap
java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --wrap
There is also a new feature that extracts contained images:
java -jar BoilerpipeHandler.jar -e ArticleExtractor -o output.html --images --wrap
edit:
I did intend this for indexing and previewing, but if you feel it cuts too much content, it might still be useful for previewing only.
medoc writes
I tried it on http://mobile.nytimes.com/2016/02/06/world/asia/taiwan-mobilizes-army-to-search-rubble-after-earthquake.html?_r=0
And the extracted text begins at:
There were also reports of people trapped inside razed vegetable
Which is apparently the first paragraph after the first photo, when I look at it with scripts blocked. With scripts enabled it’s the first paragraph after the "Show full article" box.
On the 2nd page I tried: http://www.nytimes.com/2016/02/06/world/asia/north-korea-china-sanctions-luxury.html?_r=0
The text begins at:
While luxury goods may seem a relatively minor issue, experts on North
Which does not seem to be a special location on the page, so I guess that the above notes about image and button were a fluke.
In both cases the pages were saved as "Web page complete" from Firefox; maybe the save method makes a difference? But the saved HTML text does indeed look complete.
By the way, I’ll be completely off-line next week, and very sporadically online during the whole month. Don’t think that I’m not interested in this just because I can only answer infrequently; I am.
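To make this kind of check repeatable across more pages, something like the following sketch could report how far into the full page text the extracted text starts. It assumes that both texts are already available as plain strings (for example the article body copied by hand and the extractor output with markup removed). `LeadCutCheck` and `leadOffset` are made-up names, not part of boilerpipe or recoll.

```java
/** Hypothetical check, not part of boilerpipe or recoll. */
final class LeadCutCheck {
    private LeadCutCheck() {}

    /** Collapses runs of whitespace so that line wrapping does not affect matching. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /**
     * Returns the offset in the normalized full text at which the extracted text begins,
     * i.e. the number of leading characters that were dropped. Returns -1 if the start
     * of the extracted text cannot be found in the full text.
     */
    static int leadOffset(String fullText, String extractedText, int probeLength) {
        String full = normalize(fullText);
        String extracted = normalize(extractedText);
        if (extracted.isEmpty() || probeLength <= 0) {
            return -1;
        }
        // Only the first probeLength characters are matched, so later cuts do not matter.
        String probe = extracted.substring(0, Math.min(probeLength, extracted.length()));
        return full.indexOf(probe);
    }
}
```

An offset of 0 means the lead paragraph survived; a large positive offset corresponds to the behaviour described above, where the output starts several paragraphs in.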
medoc writes
The tests seem to show that the text extractors miss useful text. This is quite unacceptable: it’s better to have a little too much noise than to miss information.