Unknown reporter writes
The problem occurs when I try to index a big collection of files that includes many plain text files with huge amounts of numerical data. I cannot filter these files out either by file extension (this also filters out "good" files) or by MIME type ("good" files are affected in this case, too). The only thing that might help is a special option to ignore numerical data. Recoll could either ignore all numerical data, or skip a file if it contains too much numerical data. Of course, this behavior should be customizable by the user through Recoll options. I emphasize that sometimes (for example, in my case) there is no way to separate "bad" files with numerical data from "good" files with sensible text. So my Recoll index is overloaded with terms such as "1.234434e-20", "1.e-308" and so on :). Maybe you could help users like me and make Recoll even more flexible and fully ready for huge numerical data collections! Thanks.
Sergey.
Unknown User writes
At the very least, an option to completely ignore all numerical data (integer and floating-point) in all indexed files, without any special analysis, should be easy to implement. And this would be sufficient for many cases (is anybody going to search Recoll by NUMERICAL terms?!).
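To illustrate the idea, here is a minimal sketch of such a filter in Python. It is not Recoll's code: the names `NUMERIC_TOKEN`, `index_terms` and `skip_numbers` are hypothetical, and the whitespace tokenizer is a simplifying assumption. Only terms that are entirely numeric are dropped, so a mixed term like "value123" is kept.

```python
import re

# Assumption: a "number" is an optional sign, digits with an optional decimal
# point, and an optional exponent, e.g. "42", "1.234434e-20", "1.e-308".
NUMERIC_TOKEN = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def index_terms(text, skip_numbers=True):
    """Yield lowercase terms from text, optionally dropping purely numeric ones.

    Hypothetical helper for illustration only; a real indexer splits text
    into terms far more carefully than str.split() does.
    """
    for token in text.split():
        # Strip common surrounding punctuation, but not "." (it may belong
        # to a number such as "1.").
        token = token.strip(",;:()[]\"'")
        if not token:
            continue
        # fullmatch ensures the whole token is a number, not just a prefix.
        if skip_numbers and NUMERIC_TOKEN.fullmatch(token):
            continue
        yield token.lower()


if __name__ == "__main__":
    sample = "Pressure 1.234434e-20 at step 42, value123 1.e-308"
    print(list(index_terms(sample)))
```

With the filter enabled, the sample line yields only the words "pressure", "at", "step" and "value123"; the three numeric terms never reach the index.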
Unknown User writes
I also want to mention that with mixed plain text / numerical data collections, the problem is not only the Recoll index size but also very low performance. For example, indexing a 40 GB directory takes more than 9 hours (I was not patient enough to wait for it to finish and stopped it). And I am absolutely sure that Recoll was stuck because of the numerical data! I can see it from my index, which contains a huge number of purely numerical terms (just floating-point and integer numbers).
medoc writes
Added the nonumbers option, which disables term generation for numbers. Closes #16
→ <<cset 3cefa1c240bd>>