r_mano writes

Hi!

This report is for version 1.23.1-1~ppa1~xenial1, on a 16.04 Ubuntu Gnome machine. Lately recollindex started to die while examining mailbox files in Thunderbird directory; I suppose that it is due to some Thunderbird update because it started out of the blue.

The error is

#!shell
2:internfile/internfile.cpp:738::FileInterner::internfile: next_document error [/home/romano/
.thunderbird/rd6f5x9j.default/ImapMail/localhost/to_2010_07|4742:2] application/x-7z-compressed

[...]

:2:internfile/mh_mail.cpp:640::walkmime: transcode failed from cs 'ISO8859_15_FDIS' to UTF-8
:2:common/textsplit.cpp:495::Textsplit: error occured while scanning UTF-8 string
:2:common/textsplit.cpp:495::Textsplit: error occured while scanning UTF-8 string

[...]

[...]
:3:rcldb/rcldb.cpp:611::Db::add: docid 168438 added [/home/romano/.thunderbird/rd6f5x9j.defaul
t/ImapMail/localhost/to_2012_12|2602]
:2:common/textsplit.cpp:495::Textsplit: error occured while scanning UTF-8 string
:3:rcldb/rcldb.cpp:611::Db::add: docid 168439 added [/home/romano/.thunderbird/rd6f5x9j.defaul
t/ImapMail/localhost/to_2013_07|6039:0:Gu<A1 >a Docente Regulaci<A2 >n Autom<A0 >tica (2013-2014)
.docx]
:2:common/textsplit.cpp:495::Textsplit: error occured while scanning UTF-8 string
:3:rcldb/rcldb.cpp:611::Db::add: docid 168440 added [/home/romano/.thunderbird/rd6f5x9j.default/ImapMail/localhost/to_2013_07|6039:0:Gu<A1 >a Docente Sistemas Din<A0 >micos (2013-2014) v1.doc]
:3:rcldb/rcldb.cpp:611::Db::add: docid 168441 added [/home/romano/.thunderbird/rd6f5x9j.default/ImapMail/localhost/to_2012_12|2602:0]
:3:rcldb/rcldb.cpp:611::Db::add: docid 168442 added [/home/romano/.thunderbird/rd6f5x9j.default/ImapMail/localhost/to_2012_12|2603]
:3:rcldb/rcldb.cpp:611::Db::add: docid 168443 added [/home/romano/.thunderbird/rd6f5x9j.default/ImapMail/localhost/to_2013_07|6039:1]
:2:common/textsplit.cpp:495::Textsplit: error occured while scanning UTF-8 string
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 8)  > this- >size() (which is 4)
Traceback (most recent call last):
  File "/usr/share/recoll/filters/rclxls.py", line 80, in <module >
    rclexecm.main(proto, extract)
  File "/usr/share/recoll/filters/rclexecm.py", line 338, in main
    proto.mainloop(extract)
  File "/usr/share/recoll/filters/rclexecm.py", line 265, in mainloop
    self.processmessage(processor, params)
  File "/usr/share/recoll/filters/rclexecm.py", line 245, in processmessage
    self.answer(data, ipath, eof)
  File "/usr/share/recoll/filters/rclexecm.py", line 184, in answer
    self.senditem("Document", docdata)
  File "/usr/share/recoll/filters/rclexecm.py", line 178, in senditem
    self.breakwrite(sys.stdout, data)
  File "/usr/share/recoll/filters/rclexecm.py", line 94, in breakwrite
    outfile.write(data)
IOError: [Errno 32] Broken pipe

I tried to isolate more, but the mailbox-archive files are quite big, and I couldn’t find how to isolate the snippet that is creating the problem (I think is .../to_2010_07|4742:2 if the :2: in the log is the id of the thread doing the mining). That file was just read by TB since 2011, and shouldn’t have changed, and started giving error now.

The error is not always on this file --- sometime is on a different one, but always is something about an UTF-8 string out-of-range. It could be a malformed file… but it could be something different, because I think that a malformed file will be ignored and the index would go on.

I tried to rebuild the index, same problem…

medoc writes

Hi,

Thanks for reporting this, and sorry about the late reply, I was not around last week. Actually, …/to_2010_07|4742:2 should be the second attachment to message 4742 inside mailbox file to_2010_07 (if indeed there is a second attachment, this could also be from the next file as the log verbosity is low).

It is not clear what happens, the preceding UTF-8 error could be unrelated.

I think that the best approach would be to try and extract the attachements for this message into a separate directory and try to index this to see what happens. You will need to set the log verbosity level higher. Or at least set the verbosity higher (6) and retry the indexing. This will allow seeing exactly what is being processed when the error happens.

The information in the 2 following wiki pages may be useful to you for better pinpointing the problem:

Of course, if you want your indexing to go back to work meanwhile, you can just add the file to the skippedPaths variable.

Thank you for your help in solving this !

r_mano writes

Hi, I am trying to gather data by following the instructions on the page https://bitbucket.org/medoc/recoll/wiki/ProblemSolvingData, but It seems I am not able to restrict recollindex to just one file. What I did is create a directory ~/tmp/recoll-test-data and copied the file which is (probably) failing there; then create a dir named recoll-test with this config file:

[romano:~/tmp/recoll-test] 127 % cat recoll.conf
topdir = /home/romano/tmp/recoll-test-data/
loglevel = 6
logfilename = /home/romano/tmp/recolltrace.txt

and run, from the same directory, recollindex -z -c .

The process starts indexing my home dir… for example it says:

 RCLMFILT: rclopxml.py: no/bad metadata in /home/romano/education/Notas/Z-LAB-SE-ME/Proyectos/NotasSanti/PROYECTO A.xlsx: "There is no item named 'docProps/core.xml' in the archive"

which shouldn’t even touch… what am I doing wrong?

Simply letting loglevel to 6 and starting the indexing I think will be impossible… the log file is growing too fast.

medoc writes

Tried to answer by email last night, but apparently this did not work. The problem is here:

topdir = /home/romano/tmp/recoll-test-data/

It’s topdirs (plural) not topdir, this is a list.

r_mano writes

Ah, ok. Will try again this night. Thanks.

(PS I also tried as per instructions to put just the name of the file in the "mini" recoll.conf, but it didn’t work either…)

r_mano writes

Ok; the file I moved seems not to be the culprit --- I added the options I have in my original recoll.conf, changing that to

 indexstemminglanguages = english italian spanish
defaultcharset = UTF-8//
topdirs = /home/romano/tmp/recoll-test-data/
loglevel = 6
logfilename = /home/romano/tmp/recolltrace.txt

…but here the commands runs ok, it runs very fast (less than 2 seconds for a big file:)

[romano:~/tmp/recoll-test] % ls -l ../recoll-test-data
total 1353368
-rw------- 1 romano romano 1385844232 May 10 22:03 to_2013_07

and creates a 9.5 Mb log file where I can not find any error. Short of enabling loglevel=6 for a normal run, I am quite at a loss…

medoc writes

If this is a thunderbird mbox, you need to tell it to recoll as their mbox format is broken. At the end of recoll.conf:

[~/tmp/recoll-test]
mhmboxquirks = tbird

If this does not work, yes, the next thing to do is a full run with the log at level 6

r_mano writes

Ok, now it’s running… let’s see what happens.

BTW, I do not have this option in my main recoll.conf --- is it autodetected if it’s in a thunderbird directory?

r_mano writes

Ok, it crashed now. The terminal errors are:

[Removed for privacy]

Error: ppt-dump: dump error /tmp/rcltmpfV2NwYE.pot
Error: getDirectoryStreamByName(Current User): 'PropertySetStream' object has no attribute 'getDirectoryStreamByName' - /tmp/rcltmpfmCUCe2.pot
Error: ppt-dump: dump error /tmp/rcltmpf8C5zJ7.pot
Error: getDirectoryStreamByName(Current User): 'PropertySetStream' object has no attribute 'getDirectoryStreamByName' - /tmp/rcltmpfcab9xd.pot
Error: XLDirStream:fillModel: 'Chart' object has no attribute 'addShape'
Error: XLDirStream:fillModel: 'Chart' object has no attribute 'addShape'
Error: getDirectoryStreamByName(Current User): 'PropertySetStream' object has no attribute 'getDirectoryStreamByName' - /tmp/rcltmpfLhfi16.pot
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 8)  > this- >size() (which is 4)
[1]    11173 abort (core dumped)  recollindex -z -c .

The trace could contains some personal data, so I have put it here: http:/www.rgtti.com/recolltrace.txt.gz --- as soon as you have it please tell me, so that I can remove the file.

Thanks!

medoc writes

the thunderbird quirks are set by default on the default thunderbird directories (in /usr/share/recoll/examples/recoll.conf, which is not an example but the defaults file).

I got the trace, you can remove it. But actually, given the error what we need is a stack trace. the core file should be in /tmp.

medoc writes

I don’t see anything really interesting in the trace file, I deleted my copy.

Maybe we can re-run this in single-threaded mode later, I’m not even sure what doc caused the problem because of multiple things going on at the same time.

But the priority would be the core file and a stack dump…

r_mano writes

Ok --- I didn’t find either file in /tmp. I will double check at home this evening, I probably have some limit set to forbid it. What should be the names of these files?

If you want I can try to run it single-threaded (to see if the crash is still there) and under strace. How should I do that?

medoc writes

To run in single-threaded mode, add the following to recoll.conf:

thrQSizes = -1 -1 -1

About the core file: have a look at the core dump section in https://bitbucket.org/medoc/recoll/wiki/ProblemSolvingData

With a bit of luck, you don’t need to rebuild recollindex, after the ulimit change, you should get a core file. As recollindex does a chdir to /tmp (because many external commands leave temp files in the current dir), the core should be in /tmp.

If all fails, maybe you might make a copy of the mbox available to me, but I understand the privacy issues (actually I don’t want the file if there are indeed privacy issues…).

r_mano writes

Ok, core and trace obtained, in single-thread mode. The core is available at http://www.rgtti.com/core.gz. Trace is following, but I suspect I have no symbols compiled in… :

[romano:/tmp] % gdb /usr/bin/recollindex core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html >
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/ >.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/ >.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/recollindex...(no debugging symbols found)...done.
[New LWP 23848]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `recollindex -z -c .'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fca45f9e428 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) thread apply all bt full

Thread 1 (Thread 0x7fca4743e780 (LWP 23848)):
#0  0x00007fca45f9e428 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
        resultvar = 0
        pid = 23848
        selftid = 23848
#1  0x00007fca45fa002a in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0},
          sa_mask = {__val = {0, 140506737857856, 140506734402559,
              140506737857856, 1, 140506737857987, 8, 10, 140506734408713,
              140506755753824, 140506734917344, 1, 140506755756176, 65054320,
              140506737862512, 140506755753856}}, sa_flags = 1180518528,
          sa_restorer = 0x3e0a718}
        sigs = {__val = {32, 0 <repeats 15 times >}}
#2  0x00007fca465d784d in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#3  0x00007fca465d56b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#4  0x00007fca465d5701 in std::terminate() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
---Type <return > to continue, or q <return > to quit---
No symbol table info available.
#5  0x00007fca465d5919 in __cxa_throw ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#6  0x00007fca465fe3f7 in std::__throw_out_of_range_fmt(char const*, ...) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#7  0x00007fca46ead374 in TextSplit::words_from_span(unsigned long) ()
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#8  0x00007fca46eaea20 in TextSplit::text_to_words(std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&) ()
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#9  0x00007fca46f78d4d in Rcl::TextSplitDb::text_to_words(std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&) ()
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#10 0x00007fca46f6559c in Rcl::Db::addOrUpdate(std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&, std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&, Rcl::Doc&) ()
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
---Type <return > to continue, or q <return > to quit---
#11 0x00007fca46ec9d6f in FsIndexer::processonefile(RclConfig*, std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&, stat const*, std::map<std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  >, std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  >, std::less<std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  >  >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const, std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  >  >  >  > const&) ()
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#12 0x00007fca46ecc1b9 in FsIndexer::processone(std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&, stat const*, FsTreeWalker::CbFlag) () from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#13 0x00007fca46fd7684 in FsTreeWalker::iwalk(std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&, stat*, FsTreeWalkerCB&)
    () from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#14 0x00007fca46fd7feb in FsTreeWalker::walk(std::__cxx11::basic_string<char, std::char_traits<char >, std::allocator<char >  > const&, FsTreeWalkerCB&) ()
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#15 0x00007fca46ec1f44 in FsIndexer::index(int) ()
---Type <return > to continue, or q <return > to quit---
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#16 0x00007fca46ed4962 in ConfIndexer::index(bool, ConfIndexer::ixType, int) ()
   from /usr/lib/recoll/librecoll-1.23.1.so
No symbol table info available.
#17 0x00000000004084ad in ?? ()
No symbol table info available.
#18 0x00007fca45f89830 in __libc_start_main (main=0x406080, argc=4,
    argv=0x7ffdfdc22988, init=<optimised out >, fini=<optimised out >,
    rtld_fini=<optimised out >, stack_end=0x7ffdfdc22978)
    at ../csu/libc-start.c:291
        result = <optimised out >
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {0, -4659668861382123999,
                4234944, 140728860813696, 0, 0, 4660799954103475745,
                4666024605697089057}, mask_was_saved = 0}}, priv = {pad = {
              0x0, 0x0, 0x416d90, 0x7fca472688e0 <_dl_fini >}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 4287888}}}
        not_first_call = <optimised out >
#19 0x0000000000409ee9 in ?? ()
No symbol table info available.
(gdb)
(gdb)

medoc writes

Thanks, this narrows things a bit. I have built an updated package on the PPA, same code but it builds the debug symbols package, so maybe you can update (the new package has ppa2 in its name), and install the debug package, named recoll-dbgsym

Also: you can send any link to me by email: jf@dockes.org, rather than publish them here, it will be better.

Also, now that we have a single-thread trace, it would be very interesting to look at the last 100 lines or so to know what document was being worked on, maybe extract it and see if you can reproduce the crash on it alone.

Given that the crash occurs in the text splitter, I think that there is a good chance that it is independant of the general context.

r_mano writes

Hi, I am back, sorry for the delay. Busy week-end, thanks goodness.

First of all, these are the last lines of the trace:

:2:common/textsplit.cpp:495::Textsplit: error occured while scanning UTF-8 string
:4:rcldb/rcldb.cpp:1079::TextSplitDb: TextSplit::text_to_words failed
:5:rcldb/rcldb.cpp:1341::Db::add: no prefix for field [md5], no indexing
:5:rcldb/rcldb.cpp:1572::Rcl::Db::add: new doc record:
url=file:///home/romano/tmp/recoll-test-data/to_2013_07
mtype=application/msword
fmtime=01494446637
dmtime=01374090266
origcharset=CP1252
fbytes=208384
pcbytes=1385844232
dbytes=17430
sig=13858442321494446637

[...]

…about the debug symbols, I have a problem --- I have version 1.23.1-1~ppa2~xenial1 (I am on 16.04) from the recoll-backport PPA, and it seems that the debug symbols have not been added there. But I could be doing something wrong, I switched from a Linux hacker to a Linux user around a decade ago…

medoc writes

About the debug symbols: did you install the recoll-dbgsym package ?

About the log: it seems to me that the crash occurs while indexing a file named Cambios_GD_y_Fichas_asignaturas_electr�nica_anal�gica13-14.doc

(the ? characters are most certainly originally accented iso8859 characters)

From what I see (not 100 % certain but almost), this file should be a member of a zip archive named Electrónica.zip, which is an attachment to message 6039. There is a doubt about the message number because I’m not quite sure that Thunderbird is showing the same numbers. To be tried anyway…

r_mano writes

The problem I have is that the recoll-dbgsym is in the "normal" PPA (https://launchpad.net/ubuntu/+source/recoll) but not in the one I installed, https://launchpad.net/~recoll-backports/+archive/ubuntu/recoll-1.15-on?field.series_filter=xenial --- so I have a package not found error…

About the file --- I’ll try tomorrow. Thanks!

medoc writes

Not sure that I understand the thing about the PPAs, but in any case, the dbgsym package is here: https://launchpad.net/recoll-backports/+archive/ubuntu/recoll-1.15-on/+files/recoll-dbgsym_1.23.1-1ppa2~xenial1_amd64.ddeb

r_mano writes

OK, installed. Could the fact that it did not go through via apt is because the extension is ddeb instead of deb? Anyway, I have a trace:

[Removed for privacy]
(gdb)

and the core is available in the same link as the last time.

I extracted the file that seems to be the culprit. I’ll try with it alone and come back if it crash or not.

medoc writes

Thanks, I think that we are making progress. Could you please redo the stack trace after running the following inside gdb:

set print elements 0

This will print the full strings instead of truncating them. With a bit of luck maybe I can reproduce the crash with just this (maybe not, in which case narrowing to a single zip or zip member is still very useful).

Edit: actually, please just make the core file available, there may be other things in there which I can look at.

r_mano writes

I have extracted the file (using the timestamp of the message, the number is not reliable…) and I can reproduce the crash.

The file is not really "private", but it was a work in progress, so I sent a link to you in email with all the data (including core and stack trace, configuration, etc.). I hope you can reproduce it!

Edit: the core file for the crash with the full mailbox is still available at http://www.rgtti.com/core.gz

medoc writes

In case you don’t get my reply email: thanks ! I can indeed reproduce the issue. I hope I’ll have a fix soon.

medoc writes

Hopefully fixed in release 1.23.2. Thanks again !

r_mano writes

Yes --- seems fixed. Tomorrow morning I’ll confirm after the full indexing run… Thanks to you!

r_mano writes

Confirmed. Big recollindex run without any problem. Thanks!