Unknown reporter writes
The catppt command silently fails to extract text from ppt files, so ppt file contents are not indexed by Recoll. I am unsure how to fix catppt, but a bug has been open on the catdoc tracker since 2009 and is still not fixed. I’m reporting this here simply to raise awareness; perhaps someone more knowledgeable than me (and more active than upstream) will take an interest and fix it. I am running Ubuntu 12.04, if that matters, and would be happy to provide ppt files on request (none of mine work). I wonder if there are certain files that do work, but I haven’t found any.
medoc writes
Hi,
Catppt does work on the sample files I have, but they are probably older ones. I have modified the filter to use unoconv when catppt seems to fail.
Unoconv seems to work ok on your sample file, but it is extremely slow.
To test:
- Install unoconv (it is a small utility which uses libre/openoffice to convert documents).
- Install the new filter. Instructions and link on this page: http://www.lesbonscomptes.com/recoll/filters/filters.html
Then you should run a full indexing (recollindex -z or reset index from the GUI file menu).
As unoconv is so slow and catppt does work for many files, the filter decides whether to use unoconv by testing the number of lines in the catppt output.
If the output has fewer than 5 lines, it uses unoconv instead.
It would be very useful if you could try catppt on a number of your files and verify that the output always has fewer than 5 lines when it is incorrect.
Else I’ll have to find another test.
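For readers following along, the line-count test described above could look roughly like the sketch below. This is not the actual rclppt code: the function name `output_too_short` is made up for illustration, and the default threshold of 5 is simply the value mentioned in this thread.

```shell
# Hypothetical helper, not taken from the real filter.
# Reads text on stdin and succeeds (exit status 0) when it has fewer
# than $1 lines (default 5), i.e. when catppt output looks too short.
output_too_short() {
    threshold=${1:-5}
    # wc -l counts newline characters; tr strips the padding that
    # some wc implementations add around the number.
    nlines=$(wc -l | tr -d ' ')
    [ "$nlines" -lt "$threshold" ]
}
```

A filter could then run something like `catppt "$infile" | output_too_short` and switch to unoconv when it succeeds. To survey a set of files, printing the result of `catppt "$f" | wc -l` for each one shows whether failed extractions really stay under the threshold.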
pharmville writes
Me again. Thanks for finding a solution so quickly. I will look into catppt with more powerpoint files soon. But first, let me submit a fix for the new filter.
I discovered that your new filter fails 100% of the time when lines < 5, simply because rclpdf is looking for the input file in the wrong place.
The unoconv -o command line switch specifies an output directory, not an output file. As far as I know, it is not possible to choose the output filename; unoconv simply defaults to the original filename with .pdf at the end instead of .ppt.
Therefore, I changed this:
#!bash
unoconv -f pdf -o $unopdf "$infile"
`dirname $0`/rclpdf $unopdf
to this:
#!bash
unoconv -f pdf -o $unopdf "$infile"
`dirname $0`/rclpdf "$unopdf/${infile%.*}.pdf"
I tested it again to make sure I didn’t screw anything up. It seems to work now.
medoc writes
Hi,
The output name issue is weird. Of course I did test this thing (at least once :) ) before sending it out!
I guess that this is a question of different versions. What versions of Unoconv and open/libreoffice are you using? Mine definitely accepts a file name as parameter to -o
The other strange thing is that the $unopdf variable is computed a few lines above, and should correspond to a pdf file name inside the temporary directory. Maybe unoconv actually computes the parent directory name from this.
I really need to test with your version so that we can have a script which will work in both cases. As far as I can see, your approach would not work with my unoconv version, and mine clearly does not work with yours.
pharmville writes
#!bash
unoconv --version
unoconv 0.4
Written by Dag Wieers <dag@wieers.com>
Homepage at http://dag.wieers.com/home-made/unoconv/
platform posix/linux2
python 2.7.3 (default, Sep 26 2013, 20:08:41)
[GCC 4.6.3]
build revision $Rev$
pharmville writes
The latest version of unoconv on Ubuntu LTS is 0.4. I decided to manually update to 0.6, but unfortunately it isn’t working:
unoconv: RuntimeException during export phase: Office probably died. attributes typeName and/or value of uno.Enum are not strings
Maybe I need to update LibreOffice or python. Luckily, I found this:
Under the changelog for 0.5:
Change to how -o/--output/--outputpath works (can now output to filenames too)
So it looks like 0.4 and below needs what I did, and 0.5 and up require your version of the filter.
Edit: I agree, it is weird that 0.4 and below create a directory named foo.pdf instead of a file.
medoc writes
Ok, I’ve changed my version to use a directory as parameter to -o and use your file name conversion. This should work with both 0.4 and 0.5 (I checked 0.5). Could you please download it and try it with 0.4? If it works, the problem is solved. I’d attach it but it seems I can’t, so it’s here:
medoc writes
Hold on, does not work with absolute paths. Fixing it.
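A note on why absolute paths break the earlier one-liner: `${infile%.*}` only strips the extension and keeps any directory part, so an input such as /home/user/talk.ppt yields a path under the temporary directory that still contains /home/user. The sketch below shows one way to derive the expected PDF name from the base name only. It is an illustration, not the code in the actual filter; the function name `expected_pdf_path` is hypothetical, and it assumes (as reported above for 0.4) that unoconv writes a file named after the input into the directory given to -o.

```shell
# Hypothetical helper, not taken from the real filter.
# $1: output directory passed to unoconv -o
# $2: path of the original input file (relative or absolute)
# Prints the path where the converted PDF is expected to appear.
expected_pdf_path() {
    outdir=$1
    infile=$2
    # Drop the directory part first, so absolute input paths work.
    base=$(basename "$infile")
    # Then replace the last extension with .pdf.
    printf '%s/%s.pdf\n' "$outdir" "${base%.*}"
}
```

With this helper, the second line of the snippet discussed earlier would pass `"$(expected_pdf_path "$unopdf" "$infile")"` to rclpdf instead of building the path inline.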
medoc writes
Should be ok now.
Also, recollindex was executing catppt directly instead of using the filter. You need to add the following to $HOME/.recoll/mimeconf:
#!shell
[index]
application/vnd.ms-powerpoint = exec rclppt
pharmville writes
Fantastic work! Your new rclppt works perfectly with 0.4 over here. I think this part of the issue is solved.
Okay I will finally post what you originally asked for. I tested several more ppt files. I chose 3 of them to share as examples.
https://dl.dropboxusercontent.com/u/59110371/Pituitary%20Disorders.ppt https://dl.dropboxusercontent.com/u/59110371/Pituitary%20Disorders%20catppt%20output.txt
https://dl.dropboxusercontent.com/u/59110371/Thyroid%20Disorders.ppt https://dl.dropboxusercontent.com/u/59110371/Thyroid%20Disorders%20catppt%20output.txt
https://dl.dropboxusercontent.com/u/59110371/Overview%20of%20Diabetes.ppt https://dl.dropboxusercontent.com/u/59110371/Overview%20of%20Diabetes%20catppt%20output.txt
One is a file that catppt seems to extract perfectly (Pituitary Disorders.ppt), another is a file that fails to extract but is greater than 5 lines (Thyroid Disorders.ppt), and another is a file that "partially" works (Overview of Diabetes.ppt). I say partially because it extracts some of the text, but not all of it. Also, it repeats the same sections again and again, even though they only appear once in the slideshow.
After trying more files, it appears that designing an accurate test for catppt failure will be quite difficult. In my case I’ll probably just edit the filter to use unoconv exclusively. Since I’m not a student anymore, I rarely acquire new ppt files, so once I get through the initial lengthy indexing, it will be worth the extra accuracy for me.
Even though I have a fix for myself, I will still follow this issue and offer any help that I can, if needed.
medoc writes
Thanks a lot for helping with this. I’ll harvest a few dozen ppts on the internets and try to see if there is a format version or something which I could use to decide which tool to use. Else, and based on your experience, it will be unoconv exclusively, as catppt clearly can’t be trusted even to fail reliably :)
medoc writes
If you are still around, and interested, this is to let you know that Recoll now has a new ppt filter, which is both reasonably fast and thorough. More info here: http://www.recoll.org/filters/filters.html
pharmville writes
Thank you for the update. The best search tool keeps getting better ;)
I finished testing this thoroughly. It is extremely fast and appears to contain all the powerpoint text each time (including text from the "notes" sections). I think it is time for me to consider the issue solved.
medoc writes
Thanks for testing this! I wonder why nobody previously complained about the PPT filter… Anyway, as you write, issue closed.
Cheers,
jf