biomimetics writes

I have upgraded recoll to 1.19.2 (xapian 1.2.15) on openSUSE 12.3

since then only plain text files are indexed.

my recoll.conf is:

cat recoll.conf
# The system-wide configuration files for recoll are located in:
# /usr/local/share/recoll/examples
# The default configuration files are commented, you should take a look
# at them for an explanation of what can be set (you could also take a look
# at the manual instead).
# Values set in this file will override the system-wide values for the file
# with the same name in the central directory. The syntax for setting
# values is identical.

unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae fifi\
flfl
topdirs = /home/fs/OptimalAS/OSData/JUKEB/ /home/fs/OptimalAS/OSData/W\
ORK/ /tmp/loeschmich_recoll_test/
indexstemminglanguages = english german german2

on the local path /tmp/loeschmich_recoll_test/ I added an xml file which contains:

cat 0016CDE6.xml
<WfProtokoll Modellname="Validator 1.25de" Prozessname="Validator 4768" Zeitstempel="22.05.2013 09:08:11" Betreff="">
<Verlauf>
<Zeile Vorgangsschritt="StartActivity" Zeitpunkt="22.05.2013 09:08:04" Bearbeiter="SYSTEM" Information="22.05.2013 09:08:04 --- starting StartActivity|StartActivity"/>
<Zeile Vorgangsschritt="StartActivity" Zeitpunkt="22.05.2013 09:08:11" Bearbeiter="SYSTEM" Information="WF finished OK, OS-Import successful"/>
<Zeile Vorgangsschritt="StartActivity" Zeitpunkt="22.05.2013 09:08:11" Bearbeiter="SYSTEM" Information="22.05.2013 09:08:11 --- ending StartActivity|StartActivity"/>
<Zeile Vorgangsschritt="EndActivity" Zeitpunkt="22.05.2013 09:08:11" Bearbeiter="SYSTEM" Information="22.05.2013 09:08:11 --- starting EndActivity|StartActivity"/>
</Verlauf>
</WfProtokoll>

now I call

recollindex -i 0016CDE6.xml
:3:recollindex.cpp:402:recollindex: changing current directory to [/tmp]
:3:recollindex.cpp:423:recollindex: starting up
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 310 mS
:3:../utils/workqueue.h:215:DbUpd: tasks 0 nowakes 0 wsleeps 1 csleeps 0
:3:../utils/workqueue.h:215:Internfile: tasks 0 nowakes 0 wsleeps 4 csleeps 0
:3:../utils/workqueue.h:215:Split: tasks 0 nowakes 0 wsleeps 2 csleeps 0

recoll -t StartActivity
Recoll query: (startactivity:(wqf=11))
0 results

if I add a text file:

cat recoll.txt
recoll is great for indexing

and add it via the indexer I get

recollindex -i recoll.txt
:3:recollindex.cpp:402:recollindex: changing current directory to [/tmp]
:3:recollindex.cpp:423:recollindex: starting up
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 224 mS
:3:../utils/workqueue.h:215:DbUpd: tasks 0 nowakes 0 wsleeps 1 csleeps 0
:3:../utils/workqueue.h:215:Internfile: tasks 0 nowakes 0 wsleeps 4 csleeps 0
:3:../utils/workqueue.h:215:Split: tasks 0 nowakes 0 wsleeps 2 csleeps 0

and the expected result

recoll -t "recoll"
Recoll query: (recoll:(wqf=11))
1 results
inode/directory [file:///tmp/loeschmich_recoll_test] [loeschmich_recoll_test] 4096 bytes

any ideas why only text files are being indexed?

best regards and many thanks for this great program!

robin

medoc writes

Hi,

It’s difficult to be sure without more verbose messages, but maybe the problem is that all the content in the XML file is given by attributes. These will not be indexed by the default recoll XML filter. You could try adding the following to ~/.recoll/mimeconf:

[index]
application/xml = internal text/plain
text/xml = internal text/plain

Then the XML data will be indexed as plain text and the attribute data should show up.

Don’t forget to erase the old index data (recollindex -e filename) before reindexing the file.
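To illustrate the point about attributes, here is a minimal Python sketch that separates element text from attribute values in a shortened copy of the XML file posted above. It is only an illustration under stated assumptions: the helper name `text_and_attributes` is hypothetical, and the code says nothing about how the recoll XML filter is actually implemented.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML posted earlier in the thread.
SAMPLE = """<WfProtokoll Modellname="Validator 1.25de" Prozessname="Validator 4768">
<Verlauf>
<Zeile Vorgangsschritt="StartActivity" Bearbeiter="SYSTEM" Information="WF finished OK, OS-Import successful"/>
</Verlauf>
</WfProtokoll>"""


def text_and_attributes(xml_source):
    """Return (words found in element text, words found in attribute values)."""
    root = ET.fromstring(xml_source)
    text_words = []
    attr_words = []
    for elem in root.iter():
        # Text directly inside the element and text following its closing tag.
        for chunk in (elem.text, elem.tail):
            if chunk:
                text_words.extend(chunk.split())
        for value in elem.attrib.values():
            attr_words.extend(value.split())
    return text_words, attr_words


text_words, attr_words = text_and_attributes(SAMPLE)
print("words in element text:", text_words)
print("words in attributes:", attr_words)
```

For this sample the element text contains no words at all, so a filter that only looks at element text would have nothing to index, while "StartActivity" appears among the attribute values.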

biomimetics writes

I don’t think it is a specific problem of the xml file, as before the upgrade all types were indexed properly (pdf, mail, xml, ….)

I just now tested with a pdf (which contains the word Produkte and which can be extracted as a string from okular pdf reader) but get the same result:

recollindex -i 00141BCC.pdf
:3:recollindex.cpp:402:recollindex: changing current directory to [/tmp]
:3:recollindex.cpp:423:recollindex: starting up
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 318 mS
:3:../utils/workqueue.h:215:DbUpd: tasks 0 nowakes 0 wsleeps 1 csleeps 0
:3:../utils/workqueue.h:215:Internfile: tasks 0 nowakes 0 wsleeps 4 csleeps 0
:3:../utils/workqueue.h:215:Split: tasks 0 nowakes 0 wsleeps 2 csleeps 0
limo@rs:/tmp/loeschmich_recoll_test > recoll -t Produkte
Recoll query: (produkte:(wqf=11))
0 results

is there any better way I can provide you with information? (by the way are you working for a company/university)

best regards and many thanks for responding so quickly

robin

medoc writes

Ok, sorry about the XML false track, please set the verbosity to 6 and try the indexing again

biomimetics writes

I have not set loglevel to 6. As a fresh trial I added recoll_user_manual.pdf via recollindex -i recoll_user_manual.pdf and searched for the term manual with recoll -t "manual", but got zero results.

If you look through the indexing logs there is a "skipping" in there for the manual’s pdf

many thanks for your help

robin

recollindex -i recoll_user_manual.pdf
:4:../common/rclconfig.cpp:394:RclConfig::initThrConf: autoconf requested
:4:../utils/execmd.cpp:260:ExecCmd::startExec: (0|1) sh {-c} {egrep processor /proc/cpuinfo | wc -l}
:5:../utils/netcon.cpp:242:Netcon::selectloop: fd 4 has 0x0 mask, erasing
:5:../utils/execmd.cpp:499:ExecCmd::doexec: selectloop returned 0
:4:../utils/execmd.cpp:604:ExecCmd::wait: got status 0x0
:4:../common/rclconfig.cpp:446:RclConfig::initThrConf: chosen config (ql,nt): (2, 4) (2, 2) (2, 1)
:5:../common/rclinit.cpp:158:rclinit: multi-threaded execution: do not use vfork
:3:recollindex.cpp:402:recollindex: changing current directory to [/tmp]
:3:recollindex.cpp:423:recollindex: starting up
:4:../utils/execmd.cpp:604:ExecCmd::wait: got status 0x0
:4:../rcldb/rcldb.cpp:623:Db::open: m_isopen 0 m_iswritable 0 mode 1
:5:../rcldb/stoplist.cpp:36:StopList::StopList: file_to_string(/home/limo/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
:4:../rcldb/rcldb.cpp:214:RclDb:: threads: haveWriteQ 1, wqlen 2 wqts 1
:4:../rcldb/rcldb.cpp:658:Db::open: lastdocid: 1554513
:4:../index/fsindexer.cpp:142:FsIndexer: threads: haveIQ 1 iql 2 iqts 4 haveSQ 1 sql 2 sqts 2
:4:../index/fsindexer.cpp:307:FsIndexer::indexFiles
:4:../index/fsindexer.cpp:267:FsIndexer::indexFiles: skipping [/tmp/recoll_user_manual.pdf] (ntd)
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:4:../index/fsindexer.cpp:373:Indexfiles: purging orphans
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS
:4:../index/fsindexer.cpp:385:FsIndexer::indexFiles: done
:4:../rcldb/rcldb.cpp:720:Db::i_close(0): m_isopen 1 m_iswritable 1
:4:../rcldb/rcldb.cpp:731:Rcl::Db:close: xapian will close. May take some time
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 759 mS
:4:../utils/workqueue.h:192:setTerminateAndWait:DbUpd
:4:../utils/workqueue.h:312:WorkQueue:ok:DbUpd: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 1
:4:../utils/workqueue.h:312:WorkQueue:ok:DbUpd: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 1
:4:../utils/workqueue.h:291:workerExit:DbUpd
:3:../utils/workqueue.h:215:DbUpd: tasks 0 nowakes 0 wsleeps 1 csleeps 0
:4:../utils/workqueue.h:234:setTerminateAndWait:DbUpd done
:4:../rcldb/rcldb.cpp:738:Rcl::Db:close() xapian close done.
:4:../internfile/mimehandler.cpp:128:clearMimeHandlerCache()
:4:../utils/workqueue.h:192:setTerminateAndWait:Internfile
:4:../utils/workqueue.h:312:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 4
:4:../utils/workqueue.h:312:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 4
:4:../utils/workqueue.h:4:../utils/workqueue.h:312::291:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 4
:4:../utils/workqueue.hworkerExit:Internfile :312:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 4
:4:../utils/workqueue.h:291:workerExit:Internfile
:4:../utils/workqueue.h:312:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 1 m_worker_threads size 4
:4:../utils/workqueue.h:312:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 1 m_worker_threads size 4
:4:../utils/workqueue.h:291:workerExit:Internfile
:4:../utils/workqueue.h:312:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 2 m_worker_threads size 4
:4:../utils/workqueue.h:312:WorkQueue:ok:Internfile: not ok m_ok 0 m_workers_exited 2 m_worker_threads size 4
:4:../utils/workqueue.h:291:workerExit:Internfile
:3:../utils/workqueue.h:215:Internfile: tasks 0 nowakes 0 wsleeps 4 csleeps 0
:4:../utils/workqueue.h:234:setTerminateAndWait:Internfile done
:5:../index/fsindexer.cpp:155:FsIndexer: internfile wrkr status: 1 (1->ok)
:4:../utils/workqueue.h:192:setTerminateAndWait:Split
:4:../utils/workqueue.h:312:WorkQueue:ok:Split: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 2
:4:../utils/workqueue.h:312:WorkQueue:ok:Split: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 2
:4:../utils/workqueue.h:291:workerExit:Split
:4:../utils/workqueue.h:312:WorkQueue:ok:Split: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 2
:4:../utils/workqueue.h:312:WorkQueue:ok:Split: not ok m_ok 0 m_workers_exited 0 m_worker_threads size 2
:4:../utils/workqueue.h:291:workerExit:Split
:3:../utils/workqueue.h:215:Split: tasks 0 nowakes 0 wsleeps 2 csleeps 0
:4:../utils/workqueue.h:234:setTerminateAndWait:Split done
:5:../index/fsindexer.cpp:160:FsIndexer: dbupd worker status: 1 (1->ok)
:4:../rcldb/rcldb.cpp:601:Db::Db: isopen 0 m_iswritable 0
:4:../rcldb/rcldb.cpp:720:Db::i_close(1): m_isopen 0 m_iswritable 0

limo@rs:/tmp/loeschmich_recoll_test > recoll -t "manual"
:4:../common/rclconfig.cpp:394:RclConfig::initThrConf: autoconf requested
:4:../utils/execmd.cpp:260:ExecCmd::startExec: (0|1) sh {-c} {egrep processor /proc/cpuinfo | wc -l}
:5:../utils/netcon.cpp:242:Netcon::selectloop: fd 3 has 0x0 mask, erasing
:5:../utils/execmd.cpp:499:ExecCmd::doexec: selectloop returned 0
:4:../utils/execmd.cpp:604:ExecCmd::wait: got status 0x0
:4:../common/rclconfig.cpp:446:RclConfig::initThrConf: chosen config (ql,nt): (2, 4) (2, 2) (2, 1)
:5:../common/rclinit.cpp:158:rclinit: multi-threaded execution: do not use vfork
:4:../rcldb/rcldb.cpp:623:Db::open: m_isopen 0 m_iswritable 0 mode 0
:5:../rcldb/stoplist.cpp:36:StopList::StopList: file_to_string(/home/limo/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
:5:../query/wasatorcl.cpp:165:wasaQueryToRcl: clause modifiers 0x0
:5:../query/wasatorcl.cpp:179:wasaQueryToRcl: leaf clause [:manual] slack 0 excl 0
:4:../rcldb/rclquery.cpp:176:Query::setQuery:
:4:../rcldb/searchdata.cpp:176:SearchData::toNativeQuery: stemlang [english]
:4:../rcldb/searchdata.cpp:894:StringToXapianQ:pUS:: qstr [manual] fld [] mods 0x0 slack 0 near 0
:5:../rcldb/searchdata.cpp:914:strToXapianQ: phrase/word: [manual]
:5:../rcldb/searchdata.cpp:951:strToXapianQ: termcount: 1
:5:../rcldb/searchdata.cpp:702:StringToXapianQ::processSimpleSpan: [manual] mods 0x0
:5:../rcldb/searchdata.cpp:545:expandTerm: mods 0x0 fld [] trm [manual] lang [english]
:5:../rcldb/rclterms.cpp:181:Db::TermMatch: typ stem diacsens 0 casesens 0 lang [english] term [manual] max 10000 field [] stripped 1 init res.size 0
:4:../rcldb/synfamily.cpp:148:XapCompSynFamMbr::synExpand([:Stm:english:]): term [manual] root [manual]
:4:../rcldb/synfamily.cpp:180:XapCompSynFamMbr::synExpand([:Stm:english:]): term [manual] -> [manual]
:4:../rcldb/rclterms.cpp:274:ExpTerm: stem exp-> manual
:4:../rcldb/synfamily.cpp:148:XapCompSynFamMbr::synExpand([:DCa:all:]): term [manual] root [manual]
:4:../rcldb/synfamily.cpp:180:XapCompSynFamMbr::synExpand([:DCa:all:]): term [manual] -> [manual]
:4:../rcldb/rclterms.cpp:288:ExpandTerm:TM: lexp: manual
:4:../rcldb/searchdata.cpp:658:ExpandTerm: final: manual
:5:../rcldb/searchdata.cpp:165:SearchData::clausesToQuery: got 11 clauses
:4:../rcldb/rclquery.cpp:242:Query::SetQuery: Q: (manual:(wqf=11))
:4:../rcldb/rclquery.cpp:357:Query::getResCnt: 0 mS
Recoll query: (manual:(wqf=11))
0 results
:4:../rcldb/rclquery.cpp:386:Fetching for first 0, count 50
:4:../rcldb/rclquery.cpp:397:enquire->get_mset: got empty result
:5:../rcldb/searchdata.cpp:420:SearchData::erase
:4:../rcldb/rcldb.cpp:601:Db::Db: isopen 1 m_iswritable 0
:4:../rcldb/rcldb.cpp:720:Db::i_close(1): m_isopen 1 m_iswritable 0

biomimetics writes

I meant to say: I have now set loglevel to 6!

medoc writes

I had understood :)

The "skipping" message is because the file is not part of the topdirs hierarchy. Please re-run the test with a file inside the indexed area. Be careful to use the same path that’s in recoll.conf: symbolic links can sometimes mess things up if you use a relative path and recoll has to compute the absolute one.

To pre-empt the obvious remark: I do agree that a suitable error message would be a good thing in this case!
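The kind of check described above can be sketched in Python. This is a rough illustration only, not recoll's actual logic: the function name `is_under_topdirs` and its `cwd` parameter are hypothetical, and the sketch assumes POSIX-style paths. It resolves a possibly relative path against a working directory, follows symbolic links, and tests whether the result lies below one of the configured topdirs.

```python
import os


def is_under_topdirs(path, topdirs, cwd=None):
    """Return True if `path`, resolved against `cwd`, lies below a topdirs entry.

    Hypothetical helper: symbolic links are resolved on both sides so that
    a link pointing into (or out of) the indexed area does not confuse
    the comparison.
    """
    base = cwd if cwd is not None else os.getcwd()
    absolute = os.path.realpath(os.path.join(base, path))
    for top in topdirs:
        top_real = os.path.realpath(top)
        # commonpath() equals the topdir only when `absolute` is inside it.
        if os.path.commonpath([absolute, top_real]) == top_real:
            return True
    return False


topdirs = ["/tmp/loeschmich_recoll_test/"]
# Same relative argument, two different working directories.
print(is_under_topdirs("recoll_user_manual.pdf", topdirs, cwd="/tmp"))
print(is_under_topdirs("recoll_user_manual.pdf", topdirs,
                       cwd="/tmp/loeschmich_recoll_test"))
```

With the working directory set to /tmp the file resolves to /tmp/recoll_user_manual.pdf, which is outside topdirs; with the working directory set to the indexed folder the same argument resolves to a path inside it.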

biomimetics writes

could you give me a hint on where I go wrong. If I look at my recoll.conf the directory I had the manual’s pdf in is /tmp/loeschmich_recoll_test which is also in recoll.conf

cat recoll.conf
# The system-wide configuration files for recoll are located in:
# /usr/local/share/recoll/examples
# The default configuration files are commented, you should take a look
# at them for an explanation of what can be set (you could also take a look
# at the manual instead).
# Values set in this file will override the system-wide values for the file
# with the same name in the central directory. The syntax for setting
# values is identical.

unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae fifi\
flfl
topdirs = /home/fs/OptimalAS/OSData/JUKEB/ /home/fs/OptimalAS/OSData/W\
ORK/ /tmp/loeschmich_recoll_test/
indexstemminglanguages = english german german2

medoc writes

I don’t know exactly how this happens because you give a relative path on the command line, but the file that recoll seems to be trying to index is in /tmp:

fsindexer.cpp:267:FsIndexer::indexFiles: skipping [/tmp/recoll_user_manual.pdf]  (ntd)
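When the log is as long as the one posted above, lines like this are easy to miss. A small Python sketch for pulling them out is shown below. It assumes the message format seen in this thread ("FsIndexer::indexFiles: skipping [path] (reason)"), which may differ in other recoll versions; `skipped_files` is a hypothetical helper name.

```python
import re

# Pattern modelled on the single "skipping" line quoted in this thread.
SKIP_RE = re.compile(
    r"FsIndexer::indexFiles: skipping \[(?P<path>[^\]]+)\]\s+\((?P<reason>\w+)\)"
)


def skipped_files(log_text):
    """Return a list of (path, reason) pairs for every skipped file in the log."""
    return [(m.group("path"), m.group("reason")) for m in SKIP_RE.finditer(log_text)]


LOG = """:4:../index/fsindexer.cpp:307:FsIndexer::indexFiles
:4:../index/fsindexer.cpp:267:FsIndexer::indexFiles: skipping [/tmp/recoll_user_manual.pdf] (ntd)
:3:../rcldb/rcldb.cpp:1437:Db::waitUpdIdle: total xapian work 0 mS"""

print(skipped_files(LOG))  # [('/tmp/recoll_user_manual.pdf', 'ntd')]
```

Comparing the reported path with the entries in topdirs then shows directly whether the file was resolved to the location that was intended.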

biomimetics writes

I thought I am only giving absolute paths? the only thing I was doing when I did recollindex -i is changing manually to that directory?

should I maybe restart from scratch so we know exactly what is wrong. For that would I have to remove the xapian db or run recollindex -z? I would then create a subfolder in tmp as the only folder to be indexed with a couple of xml/pdf/txt files in it.

Please let me know what the best way to proceed would be.

thanks a lot for your help

robin

medoc writes

Please try:

recollindex -e /tmp/loeschmich_recoll_test/recoll_user_manual.pdf
recollindex -i /tmp/loeschmich_recoll_test/recoll_user_manual.pdf

and save the output from the second command.

medoc writes

And also, can we please switch to email: jf at dockes.org

Bitbucket is doing strange formatting and is not convenient for this kind of debugging. I’ll be on a plane this afternoon, but I should be able to continue this tomorrow.

jf

biomimetics writes

I removed my .recoll directory completely and now everything seems to be working. the xml might be indexed but doesn’t return results which may be due to the fact that all information is inside <info a info b info c > tags. I now index some remote files which are mounted and will let you know tomorrow if everything works fine now. best regards and many thanks for your help

robin

ps: I removed the xapiandb as suggested after upgrading, which may not have been the only adaptation one has to do to get a properly working update.

medoc writes

Hi,

I have found a bug in the recollindex command line args processing: using relative paths as arguments did not work at all with 1.19.3 (they were converted relative to /tmp). This caused a good part of the problems you saw.

This will be fixed in 1.19.4 which will shortly be released.
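One plausible mechanism for this kind of bug, consistent with the "changing current directory to [/tmp]" log line but not confirmed against the recollindex source, is that relative arguments are made absolute only after the process has changed its working directory. The Python sketch below reproduces that effect with a temporary directory standing in for /tmp; all names in it are illustrative.

```python
import os
import tempfile


def resolve_args_early(args):
    """Make command-line paths absolute while still in the caller's directory."""
    return [os.path.abspath(a) for a in args]


start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as work:
    indexed = os.path.join(work, "indexed")  # stand-in for the indexed folder
    os.mkdir(indexed)
    os.chdir(indexed)                        # the user runs the command from here
    try:
        args = ["recoll_user_manual.pdf"]
        early = resolve_args_early(args)     # resolved before the directory change
        os.chdir(work)                       # stand-in for the change to /tmp
        late = [os.path.abspath(a) for a in args]  # resolved too late
    finally:
        os.chdir(start_dir)                  # restore before the temp dir is removed

print("resolved before chdir:", early[0])
print("resolved after chdir: ", late[0])
```

The first path ends in indexed/recoll_user_manual.pdf, while the second points one level up, which mirrors the /tmp/recoll_user_manual.pdf path seen in the "skipping" message. Passing absolute paths on the command line avoids the problem regardless of where the resolution happens.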

On a more general note, there is very little incompatibility between 1.18 and 1.19 indexes and configuration. I regularly switch back and forth without reindexing while performing tests, and I don’t see any issues (there could be some search issues in some marginal cases).