Timo_Lee writes

I try to search for all plain text files under a directory. There is no particular keyword I want to search for, but just the file type and directory. Here is my command:

$ recoll -t -q mime:text/plain dir:"/windows-d/science/math/combine/Topological group"

Recoll query: ((((group:(wqf=11) OR groups) FILTER Ttext/plain) FILTER (XP PHRASE 6 XPwindows-d PHRASE 6 XPscience PHRASE 6 XPmath PHRASE 6 XPcombine PHRASE 6 XPTopological)))

0 results

Although it found nothing, there are plain text files under the directory. So I wonder what went wrong? Note that there are spaces in the directory path name.

My OS is Ubuntu 12.04, and recoll was installed from its software center and its version is "Recoll 1.16.2 + Xapian 1.2.8".

Thanks and happy holidays!

medoc writes

You need to escape the double-quotes from the shell because the query parser needs them to see that "group" is part of the dir clause, not another term. So please try adding single-quotes before and after the double ones (or backslash them):

recoll -t -q mime:text/plain dir:'"/windows-d/science/math/combine/Topological group"'

or

recoll -t -q mime:text/plain dir:\"/windows-d/science/math/combine/Topological group\"
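To see what the shell actually hands over in each case, here is a minimal sketch. show_args is a hypothetical helper, not part of recoll; it only prints each argument it receives, one per line, in brackets.

```shell
# Hypothetical helper: print every argument on its own line, in brackets,
# so that it is visible what survives the shell's quote processing.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Plain double quotes: the shell removes them, so the program receives
# a single argument with no quotes left in it.
show_args dir:"/windows-d/science/math/combine/Topological group"

# Single quotes around the double quotes: the double quotes are passed through.
show_args dir:'"/windows-d/science/math/combine/Topological group"'

# Backslash-escaped double quotes: the quotes are passed through, but the
# unprotected space makes the shell split the text into two arguments.
show_args dir:\"/windows-d/science/math/combine/Topological group\"
```

In the first call the quotes never reach the program, which matches the "group OR groups" term seen in the query above.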

Timo_Lee writes

Thanks! But it still doesn’t find any result.

medoc writes

I just tried with recoll 1.16.2 and the exact same path that you are using (/windows-d etc., cut and pasted from the above response) and it works for me. I’m a bit at my wit’s end here. Please post the Xapian request that recollq shows so that I can have a look again. Then I guess that you’ll have to dump the whole document db with xadump (xadump -D for all values of docid from 1 to max), and grep in there to find what you’re looking for.

Timo_Lee writes

I guess the output of recoll is the Xapian request:

$ recoll -t -q mime:text/plain dir:\"/windows-d/science/math/combine/Topological group\"

Recoll query: 0 results

$ recollq mime:text/plain dir:\"/windows-d/science/math/combine/Topological group\"

Recoll query: 0 results

medoc writes

The request looks ok to me. I get the same Xapian expansion, and results, on my system.

I don’t know why you get no results. My best guess at this point would be that nothing was indexed in there (no text/plain docs at least). We’d have to look at xadump data to know what is in the index exactly.

Timo_Lee writes

I will give another example. The first command shows the existence of a plain text file under a directory without using "dir:", while the second command, using "dir:", fails to find any plain text file under that directory.

$ recollq simplex mime:text/plain

text/plain [file:///windows-d/science/math/computation/numerical computation/new file] [new file] 10486 bytes xdocid 386978

$ recollq simplex mime:text/plain dir:\"/windows-d/science/math/computation/numerical computation\"

Recoll query: 0 results

Not sure if this matters. The directory existed when indexing was performed, but has recently been changed. I guess it doesn’t matter because the database created by indexing is sufficient for searching, regardless of later changes to the directory.

medoc writes

I’m not too sure what’s happening here:

  • Would it be possible that the index was created with a version older than 1.16 ?

  • If you shorten the path does it change anything (e.g.: dir:\"/windows-d\") ?

  • Could you dump the document terms using "xadump -i 386978 -T" and send the result to me. I’m mostly interested in terms with an XP prefix, so you can delete any private information.

Timo_Lee writes

The index was created in May 2011 by the recoll installed from a repository under the Ubuntu version of that time. I have since installed a newer version of Ubuntu (12.04) and recoll from its software center. So it is possible that the index was created by a different version of recoll.

Shortening the path of the directory doesn’t change anything.

Output of "xadump -i 386978 -T" is here http://pastebin.com/NgzB4HEs. I don’t find "XP" in it.

medoc writes

Ok, so it was important to check the term dump data. Previous versions of recoll did not create XP terms for path data (another method was used). I did not force an index rebuild on the transition because this is very bothersome for some people, and the incompatibility issue is relatively minor.

Timo_Lee writes

Thanks!

I tried to use grep to process the output of recoll searching. My goal is to find all plain text files under a given directory with "Euclidean" in its name.

In the following, the first and third commands find some results, but the second doesn’t. I am curious why the second doesn’t work? Thanks!

$ recollq simplex mime:text/plain | grep Euclidean

$ recollq mime:text/plain | grep Euclidean

$ recollq mime:text/plain

medoc writes

I think that your index has problems, I can’t reproduce the issue here.

I am confused about what you are trying to accomplish here.

  • If you are trying to salvage old data, you should do what I suggested: dump the index metadata using xadump (I think I sent the method earlier, else ask). Then use an editor or grep or whatever to look for the urls you are interested in, then use the document ids to rebuild the text (xadump -i idx -b). With this approach, you are guaranteed to recover the maximum of data, with some immunity from whatever recoll bugs.

  • If you just have issues with recoll indexing, the approach is different. First upgrade to 1.18 using the PPA (I’m not going to debug an old version), then rebuild the index: recollindex -z, then we’ll take it from there.

Timo_Lee writes

Thanks! I am trying to salvage old data.

  1. Yes, I saw you wrote earlier "dump the whole document db with xadump (xadump -D for all values of docid from 1 to max), and grep in there to find what you’re looking for." My concern is that the xapiandb directory is 6GB. Is that too big to dump? Should I expect the dump to need another 6GB of space and to take a long time?

  2. Why does "recollq mime:text/plain" work, while "recollq mime:text/plain | grep Euclidean" doesn’t? Is it because it only prints the first 2000 of the 50384 results found ("50384 results (printing 2000 max)"), and "Euclidean" doesn’t show up among the first 2000 although it appears somewhere in the 50384? If so, is it possible to print all 50384 results instead of just 2000?

medoc writes

  • 1 I think that there should be less than 1kb per document. I saw a docid around 400000 above, but the docids are not necessarily contiguous. Let’s say there are 1 million documents in the index, this would be at most 1 GB of data. It may take a few hours to dump at most (mostly because using a shell loop for this is inefficient, a loop inside the C program would probably do it ten times faster). Once it’s dumped, you can work on it all you want with usual shell tools. I would try to use a loop like the following:

{{{
#!bash
# Dump the data record of every docid from 1 to 999999 into one file,
# each record preceded by a DOCID=<n> marker line.
i=1
while test $i -lt 1000000; do
    echo DOCID=$i
    xadump -d /path/to/xapiandb -i $i -D
    i=`expr $i + 1`
done > /path/to/dump/file
}}}

And then something like egrep 'DOCID=|url=/my/interesting/path' on the output will yield interesting docids, which you can then use further.

  • 2 Yes, when doing this you most definitely do not want the output to be truncated. Use "-n 0" for this (it’s in the recollq -h output if you read it).
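Both points can be sketched as small shell helpers. This is only a sketch under assumptions: docids_for_path and search_plain_text are hypothetical names, the dump file is assumed to have the layout produced by the loop above (a DOCID=<n> line followed by that document's data record), and the data record is assumed to contain a line starting with url=. The -n 0 option is the one mentioned in point 2.

```shell
# Hypothetical helper: print the docid of every dumped record whose url= line
# contains a given path fragment.
# $1: dump file written by the loop above, $2: path fragment (literal text).
docids_for_path() {
    awk -v frag="$2" '
        /^DOCID=/ { id = substr($0, 7); next }    # remember the current docid
        /^url=/ && index($0, frag) { print id }   # literal substring match
    ' "$1"
}

# Hypothetical helper: list every text/plain result line that mentions a word.
# "-n 0" asks recollq not to truncate the result list (see point 2 above).
search_plain_text() {
    recollq -n 0 mime:text/plain | grep -- "$1"
}
```

For instance, docids_for_path /path/to/dump/file /my/interesting/path would print one docid per line, each of which can then be given to xadump -i as described earlier, and search_plain_text Euclidean stands in for the truncated pipeline.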

Timo_Lee writes

Thanks!