ForestFairy writes
The Big Three formats for which I would be delighted to see support are:
cwk webarchive pwi
If you want to use my solution for these, here it is. To convert these three to plain text (rather crudely), the wfscan script seems to function. To install it, download http://futuramerlin.com/scanner-install , and run it as root. Then, run scanner-setup (not as root). Then, execute wfscan [file-name]. The output file will be : ~/.Wreathe-File-Scanner/output.scdat . Note that the script has some major bugs and holes in it at the moment, so it is best used as a fallback for when other things do not work. Also, I will probably be updating it soon to fix a bug … erm, perhaps «complete failure» would be more accurate than «bug» … that prevents it processing files with spaces in the path :-) (That update will probably be version 1.1 of http://futuramerlin.com/wfscan ; to update it, simply download and run http://futuramerlin.com/scanner-install as root again).
The following would also be nice to have (in the format [supporting software] : [formats]) :
libarchive : tar, cpio, dvd/cd images
unace : ace
unrar-free, unrar, or rar : rar
p7zip-full : Zip64, cab, arj, rar, rpm, iso (cd/dvd images), deb
Archive::Ar (libarchive-ar-perl) : ar
kgb : kgb
lha : lzh/lha
pacpl : Wider variety of formats than python-mutagen (if it can be used to extract metadata? I am not particularly familiar with this software)
fontforge : Font metadata
Also, support for m4r (MPEG-4 ringtone) metadata would be nice.
Thank you for considering these suggestions!
medoc writes
Hello,
This is a serious bunch of new formats. I am not familiar with the first three (cwk, webarchive, pwi). Could you please be a little more specific about what they are? I think there are several formats named "webarchive", one of which (the Konqueror one) is already supported in the latest version (see the "filters" section on the download page: http://www.recoll.org/download.html).
About the other formats, maybe in time :) As I gather that you can program, maybe you could take a look at the Recoll manual: there is a section about how to program filters, and this is actually quite easy. In particular, the new Python filters, which can make use of any Python module, are quite powerful. Take a look and don’t hesitate to come back with questions if needed. Having a quick look at one of the existing filters should help a lot (e.g. rclwar).
Cheers, jf
ForestFairy writes
CWK is used by Apple’s AppleWorks program; it can be a word processing document, a spreadsheet, a presentation, a «drawing», or a «painting». Webarchive is Safari’s format for saving webpages (IIRC it’s XML-based, but it’s not in the list of supported XML formats; I do not have one immediately accessible…). PWI is created by Note Taker from Windows CE 2.11. I do not really know how to program well; rather, I teach myself enough programming to be able to write what I want to write. Usually it comes out disastrously :-) Because Linux identifies all of them as application/octet-stream, I have created the following association: application/octet-stream = wfscan-recoll
Unfortunately, that results in every file of that MIME type being handled that way, which is far too general. Is there any method to use extensions rather than MIME types for that assignment? Thanks.
P. S. Is it normal for Recoll to take a few days to complete one index pass, and for the index folder to occupy ~20 gigs?
P. P. S. I have another bug report regarding Recoll getting stuck on files that I will submit soon.
P. P. P. S. Thank you very much for this excellent search program! :-) I was using Google Desktop but I switched to Recoll because Google Desktop was unsatisfactory.
medoc writes
Suffix to mime associations: recoll has its own file for this. You can add associations in $HOME/.recoll/mimemap. Take a look at: http://www.lesbonscomptes.com/recoll/usermanual/rcl.install.config.html#RCL.INSTALL.CONFIG.MIMEMAP
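As a rough illustration of the suffix approach, the fragment below shows what the two user configuration files might contain. The MIME type names are made up for this example (none of the three formats is assumed to have a registered type), and wfscan-recoll is simply the handler name used earlier in this thread; the exact handler syntax should be checked against the manual for the installed Recoll version.

```
# ~/.recoll/mimemap : map file suffixes to MIME types
# (the application/x-... names are invented placeholders)
.cwk = application/x-appleworks-example
.webarchive = application/x-webarchive-example
.pwi = application/x-pwi-example

# ~/.recoll/mimeconf : say which command converts each type
[index]
application/x-appleworks-example = exec wfscan-recoll
application/x-webarchive-example = exec wfscan-recoll
application/x-pwi-example = exec wfscan-recoll
```

With associations like these, only files carrying the three suffixes would go through the script, instead of everything detected as application/octet-stream.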
I’ll take a look at the three formats. What is your script based on? Are there existing modules to turn these formats into plain text or HTML?
About indexing time: a few days seems a bit extreme, but obviously this depends on the amount of data to be indexed and the speed of the machine. The index size is usually about the same as the amount of text in the indexed content. There is a small page about indexing performance: http://www.lesbonscomptes.com/recoll/perfs.html
I’m glad that Recoll is useful for you despite all the issues that you seem to have!
ForestFairy writes
Suffix to mime : Great, thanks.
It’s a simple perl script that I wrote, teaching myself perl in the process :-D therefore it is not perfect, but at least it gets it done. What it does is it finds ascii letters in the file and extracts them (with some extra work to remove SGML tags, or things that look like them). Therefore, it works for most types of file, but it also gets lots of trash from binary parts of those formats. :-P
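For readers who want to see the idea in code, here is a minimal Python sketch of the approach described above: keep runs of printable ASCII and drop anything that looks like an SGML tag. It is not the wfscan script itself (that one is written in Perl); the function name, the minimum run length of 4, and the tag pattern are all assumptions made for the illustration.

```python
import re

# Anything of the form <...> is treated as a tag. This is deliberately
# crude, like the approach described in the post.
TAG = re.compile(r"<[^<>]*>")


def extract_ascii_text(data, min_len=4):
    """Return runs of printable ASCII found in `data` (bytes), one per line.

    `min_len` is an assumed threshold: shorter runs are usually noise
    from the binary parts of the file.
    """
    # Printable ASCII is the byte range from space (0x20) to tilde (0x7e).
    run_pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    lines = []
    for match in run_pattern.finditer(data):
        run = match.group().decode("ascii")
        run = TAG.sub(" ", run)      # remove tag-like sequences
        run = " ".join(run.split())  # collapse repeated whitespace
        if run:
            lines.append(run)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = b"\x00\x01Hello <b>world</b>\x00\xff\x02ab\x00Recoll index\x07"
    print(extract_ascii_text(sample))
```

As the post says, this kind of extraction works on almost any file type but will also pick up junk strings from binary sections; raising the minimum run length reduces the junk at the cost of losing short words.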
I have maybe 1 TB of data to index, almost certainly <75 GB once it’s all converted to plain text, so 20 GB seems quite reasonable.
I hope you don’t mind me submitting all these bug reports and requests :-D ; I’m merely attempting to help make it easier to use. Sorry ;-)
mroark writes
Whoops, sorry, I didn’t mean to change the status. That was not intentional. I have attached an rclrar script here to handle rar files. It depends on the Python rarfile module: http://pypi.python.org/pypi/rarfile/2.2
medoc writes
Some more of the formats are now supported; the others are really too obscure.