Unknown reporter writes

Reproduce:

  • open "Language support"

  • "Language for menus and windows" tab

  • install Japanese language

  • drag "Japanese" to top of list

  • logout and login

  • start recoll, start indexing

  • "Indexing failed" dialog message

using ppa:recoll-backports/recoll-1.15-on since 1.23.1-1ppa1xenial1

Thank you for reading this.

medoc writes

I guess that you are running ubuntu xenial. Is this true (could be mint or some others too) ?

What desktop are you running (Unity/xfce/gnome, etc.) ?

prevh writes

  • Ubuntu MATE

  • 16.04.2 LTS (Xenial Xerus) 64 bit

  • MATE 1.12.1

  • I installed with 16.04.2 iso image.

  • All packages are up to date. (Linux 4.8.0-46-generic x86_64)

medoc writes

Thanks.

Before I try to reproduce the issue, could you please follow the instructions in here: https://bitbucket.org/medoc/recoll/wiki/WhyIsMyFileNotIndexed and attach the end of the indexing log file so that we have an idea of what the cause could be ?

prevh writes

Is this the correct way?


~$ mkdir .recoll
mkdir: ディレクトリ `.recoll' を作成できません: ファイルが存在します
~$ rm -r .recoll
~$ mkdir .recoll
~$ ls -a .recoll
.  ..
~$ echo "loglevel = 6" >> .recoll/recoll.conf
~$ echo "logfilename = stderr" >> .recoll/recoll.conf
~$ echo "thrQSizes = -1 -1 -1" >> .recoll/recoll.conf
~$ cat .recoll/recoll.conf
loglevel = 6
logfilename = stderr
thrQSizes = -1 -1 -1
~$ recollindex > /tmp/myindexlog 2>&1
~$ mv /tmp/myindexlog ./
~$ recollindex -i myindexlog > myindexlog-i 2>&1
~$ ls my*
myindexlog  myindexlog-i
~$ grep failed my*
myindexlog::5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
myindexlog::5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
myindexlog::5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
myindexlog::2:index/indexer.cpp:346::ConfIndexer::createAspellDict: aspell buildDict failed: aspell dictionary creation command failed:
myindexlog:Indexing failed
myindexlog-i::5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
~$ grep errno my* | grep -v failed
myindexlog::2:utils/netcon.cpp:439::NetconData::send: send(11) errno 32 (Broken pipe)
~$

last stack of myindexlog

:3:rcldb/rcldb.cpp:611::Db::add: docid 130 added [/home/prevh/.bashrc|]
:3:index/fsindexer.cpp:240::fsindexer index time:  1667 mS
:4:rcldb/rcldb.cpp:1888::Db::purge
:4:rcldb/rcldb.cpp:1891::Db::purge: m_isopen 1 m_iswritable 1
:4:rcldb/rcldb.cpp:855::Db::i_close(0): m_isopen 1 m_iswritable 1
:4:rcldb/rcldb.cpp:869::Rcl::Db:close: xapian will close. May take some time
:4:rcldb/rcldb.cpp:873::Rcl::Db:close() xapian close done.
:4:rcldb/rcldb.cpp:758::Db::open: m_isopen 0 m_iswritable 0 mode 1
:5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
:4:rcldb/rcldb.cpp:230::RclDb:: threads: haveWriteQ 0, wqlen -1 wqts 0
:4:rcldb/rcldb.cpp:796::Db::open: lastdocid: 130
:4:rcldb/rcldb.cpp:1842::Db::getStemLang
:4:rcldb/rcldb.cpp:1871::Db::createStemDbs
:4:rcldb/expansiondbs.cpp:44::StemDb::createExpansionDbs: languages: english
:4:rcldb/expansiondbs.cpp:151::StemDb::createExpansionDbs: done: 0.026924 S
:4:rcldb/rcldb.cpp:855::Db::i_close(0): m_isopen 1 m_iswritable 1
:4:rcldb/rcldb.cpp:869::Rcl::Db:close: xapian will close. May take some time
:4:rcldb/rcldb.cpp:873::Rcl::Db:close() xapian close done.
:4:rcldb/rcldb.cpp:758::Db::open: m_isopen 0 m_iswritable 0 mode 0
:5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
:4:index/indexer.cpp:344::ConfIndexer::createAspellDict: creating dictionary
:4:utils/execmd.cpp:457::ExecCmd::startExec: (1|0) /usr/bin/aspell {--lang=ja} {--encoding=utf-8} {create} {master} {/home/prevh/.recoll/aspdict.ja.rws}
:2:utils/netcon.cpp:439::NetconData::send: send(11) errno 32 (Broken pipe)
:2:utils/execmd.cpp:704::ExecWriter: data: can't write
:5:utils/netcon.cpp:277::Netcon::selectloop: fd 11 has 0x0 mask, erasing
:5:utils/execmd.cpp:795::ExecCmd::doexec: selectloop returned 0
:4:utils/execmd.cpp:961::ExecCmd::wait: got status 0x256
:4:utils/execmd.cpp:457::ExecCmd::startExec: (0|1) /usr/bin/aspell {dicts}
:5:utils/netcon.cpp:277::Netcon::selectloop: fd 10 has 0x0 mask, erasing
:5:utils/execmd.cpp:795::ExecCmd::doexec: selectloop returned 0
:4:utils/execmd.cpp:961::ExecCmd::wait: got status 0x0
:2:index/indexer.cpp:346::ConfIndexer::createAspellDict: aspell buildDict failed: aspell dictionary creation command failed:
/usr/bin/aspell --lang=ja --encoding=utf-8 create master /home/prevh/.recoll/aspdict.ja.rws
One possible reason might be missing language data files for lang = ja. Maybe try to execute the command by hand for a better diag.
:4:internfile/mimehandler.cpp:129::clearMimeHandlerCache()
Indexing failed
:4:rcldb/rcldb.cpp:737::Db::~Db: isopen 1 m_iswritable 0
:4:rcldb/rcldb.cpp:855::Db::i_close(1): m_isopen 1 m_iswritable 0

head of myindexlog-i

:4:common/rclconfig.cpp:563::RclConfig::initThrConf: chosen config (ql,nt): (-1, 0) (-1, 0) (-1, 0)
:5:common/rclinit.cpp:346::rclinit: will use vfork() for starting commands
:3:index/recollindex.cpp:518::recollindex: changing current directory to [/tmp]
:4:utils/execmd.cpp:457::ExecCmd::startExec: (0|0) /usr/share/recoll/filters/rclcheckneedretry.sh
:4:utils/execmd.cpp:961::ExecCmd::wait: got status 0x0
:3:index/recollindex.cpp:548::recollindex: starting up
:4:utils/execmd.cpp:457::ExecCmd::startExec: (0|0) /usr/bin/ionice {-c} {3} {-p} {28448}
:4:utils/execmd.cpp:961::ExecCmd::wait: got status 0x0
:4:rcldb/rcldb.cpp:758::Db::open: m_isopen 0 m_iswritable 0 mode 1
:5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :

medoc writes

Thanks. Probably aspell does not work at all for japanese. Please try to set noaspell = 1 in the configuration file (~/.recoll/recoll.conf)

prevh writes

command line:

~$ echo "noaspell = 1" >> .recoll/recoll.conf
~$ cat .recoll/recoll.conf
loglevel = 6
logfilename = stderr
thrQSizes = -1 -1 -1
noaspell = 1
~$ recollindex > /tmp/myindexlog 2>&1
~$ grep failed /tmp/myindexlog
:5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
:5:rcldb/stoplist.cpp:35::StopList::StopList: file_to_string(/home/prevh/.recoll/stoplist.txt) failed: open/stat: errno: 2 :
~$

GUI:

  • Global parameters -> check box "No aspell usage" ON

  • Start indexing.

With both the command line and the GUI, "Indexing failed" no longer appears.

Oh, spelling suggestions disappeared, too… I am sad.

medoc writes

noaspell was the nuclear option to check that the aspell dict creation was the cause.

I don’t think that aspell supports Japanese at all, but setting aspellLanguage = en should get you back the English suggestions. I think that Recoll should correctly avoid sending Japanese words to aspell in this situation (because of the different script). If this proves not to work, I’ll try to fix it, because there is no logical reason why it could not.

Automatic language detection is a very difficult issue, especially in mixed texts, but this case should be feasible, because of the entirely different character sets allowing unambiguous detection.
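As an illustration of why this case is tractable, here is a minimal sketch of script-based filtering. It is not Recoll's code: the helper names are hypothetical, and it only assumes the standard Unicode block ranges for hiragana, katakana and CJK ideographs.

```python
from typing import List

# Standard Unicode blocks. The CJK ideograph block is shared with Chinese,
# so this detects "CJK script" rather than Japanese specifically.
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji)
)


def is_japanese_char(ch: str) -> bool:
    """True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in JAPANESE_RANGES)


def words_for_speller(query: str) -> List[str]:
    """Hypothetical filter: keep only the whitespace-separated words
    that contain no Japanese characters, so that only those would be
    passed on to a western-language speller."""
    return [
        w for w in query.split()
        if not any(is_japanese_char(c) for c in w)
    ]


kept = words_for_speller("firefax cromium 比較")
```

With the example query from this thread, only the two western words would be kept for the speller.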

How does the search for Japanese words work for you by the way ? It’s based on multigrams, and works very differently from the western language search, and I have not heard about it for a long time.

prevh writes

Thanks. That is a bit over my head, but very useful for me.

# Topic 1. I use recoll as follows. #

Example:

  • [en] compare firefax cromium

  • [ja] firefax cromium 比較

Reliable suggestions are shown in an English environment.

  • compare : compare comparer compre comprar comparor

  • cromium : chromium premium crime grim

  • firefax : firefox refx prefix fireargs

Convenient suggestions are shown in a Japanese environment (aspellLanguage = en).

  • cromium : chromium premium crime grim

  • firefax : firefox refx prefix fireargs

  • 比較 : w y ð ø a

No suggestions are shown in a Japanese environment (default or noaspell).

(I am sad.)

One ideal set of suggestions (the same view in every language environment) would be:

  • cromium : chromium[world], chrome[world], Cr[world], premium[en]

  • firefax : firefox[world], fire fax[en]

  • 比較 : 比べ[ja], 比较[zh]

# Topic 2. Suggestions of japanese words. #

Many major distros install the fcitx-mozc package (a Japanese input method) as part of Japanese language support.

Mozc suggests corrections for typical mistakes.

Example: "simulation" is hard to pronounce for native Japanese speakers.

  • "シュミレーション" (sumilation). This is easy to pronounce, but a mistake.

  • "シミュレーション" (simulation). This is what Mozc suggests.

Note: As a rule of thumb, the default configuration of the fcitx-mozc package is incomplete in many distros. The one in Ubuntu MATE (and some other distros) is excellent for native Japanese speakers.

# Topic 3. Results of the search for Japanese words. #

For Japanese words, Recoll and grep behave similarly; they feel almost equal to me. It is a simple mechanical match.

~$ echo "シュミレーション sumilation"  > sim.txt
~$ echo "シミュレーション simulation" >> sim.txt
~$ cat sim.txt
シュミレーション sumilation
シミュレーション simulation
~$ grep "imu" sim.txt
シミュレーション simulation
~$ grep "ミュ" sim.txt
シミュレーション simulation
~$

Does this answer your question?

medoc writes

I am sorry for the delay in responding. Many thanks for your very detailed answer.

I am glad to know that the Japanese search works reasonably well.

I am going to look into why you are seeing strange spelling suggestions for Japanese words when aspellLanguage is set to en.

I think that there is an outright bug (it makes no sense that the suggestions are in western characters). The bug might be in aspell or in recoll. Either I will find a way to fix it, or if I can’t, at least suppress nonsensical suggestions…

prevh writes

Thank you for supporting Japanese! I’m looking forward to the next version.

Looking through Japanese articles (personal blogs, tech sites, etc.), it seems that the standard configuration when using aspell with a Japanese locale is "lang=en". This workaround is practically an FAQ.

Additionally, Pluma (the default text editor in Ubuntu MATE) has a Check Spelling function. By default, it works for English words with a Japanese locale. If there are Japanese words, it misdetects them erratically.

For your reference.

medoc writes

Hi,

I have produced a quite experimental version (only where katakana is concerned; the rest did not change). This indexes katakana words as a whole (not as digrams, as is done for e.g. kanji), and uses the Xapian spellcheck function, not aspell, for making suggestions. At most one replacement is proposed for these words, as this is how the Xapian spellcheck function works. It seems to work fine with your simulation example above.
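To picture the "at most one replacement" behaviour, here is a rough stand-in written with Python's standard difflib module. It is not Xapian's spellcheck algorithm and not what Recoll does internally; the function name and the small term list are assumptions made for illustration only.

```python
import difflib
from typing import List, Optional


def suggest_one(term: str, indexed_terms: List[str]) -> Optional[str]:
    """Return the single closest indexed term, or None if nothing is
    similar enough. difflib stands in for the real spelling engine."""
    matches = difflib.get_close_matches(term, indexed_terms, n=1, cutoff=0.6)
    return matches[0] if matches else None


# Hypothetical set of whole katakana words found in the index.
indexed = ["シミュレーション", "サウンド", "ビデオ"]

# The common misspelling from the simulation example above.
suggestion = suggest_one("シュミレーション", indexed)
```

The point of the sketch is only that suggestions need whole words in the index to compare against, which is why katakana words are indexed as a whole in this experimental version.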

I have built a Xenial x64 package for trying this. It is there:

If you want to give it a try, you can download the .deb file and install it with:

sudo dpkg -i recoll_1.23.2-1~ppa1~xenial1_amd64.deb

If you want to go back you just need to do:

sudo dpkg -r recoll
sudo apt-get install recoll

When you have a moment, please let me know how this works for you, and I’ll decide if I keep it for the real 1.23.2

prevh writes

How exciting! It’s meaningful and interesting. Here is my review.

# Key points characteristic of Japanese for this review. #

The trailing "ー" is often omitted.

  • プレイヤー (player)

  • プレイヤ (player)

  • コンピューター (computer)

  • コンピュータ (computer)

Generally, there are no delimiters between the words of a compound.

  • VLCメディアプレイヤー (VLC media player)

  • サウンド&ビデオ (Sound & Video)

  • サウンドとビデオ (Sound & Video), using hiragana.

  • キーボードショートカット (Keyboard Shortcuts), in most cases.

  • キーボード・ショートカット (Keyboard Shortcuts), if they use delimiters.

Japanese is often indifferent to plural forms.

  • グラフィック (Graphic)

  • グラフィック (Graphics)

  • グラフィック (Graphics)

  • グラフィック (Graphic)

  • プレゼント (present)

  • プレゼン (presents)

  • プレゼント (presents)

# Testing samples #

List-A and List-B have the same meaning. In the English text, List-A uses plural forms and List-B uses singular forms.

List-A.txt

VLCメディアプレイヤー (VLC media player)
グラフィックス (Graphics)
キーボードショートカット (Keyboard Shortcuts)
サウンドとビデオ (Sound & Video)

List-B.txt

VLCメディアプレイヤ (VLC media player)
グラフィック (Graphic)
キーボード・ショートカット (Keyboard Shortcut)
サウンド&ビデオ (Sound & Video)

List-AJ.txt (only japanese of List-A)

VLCメディアプレイヤー
グラフィックス
キーボードショートカット
サウンドとビデオ

List-BJ.txt (only japanese of List-B)

VLCメディアプレイヤ
グラフィック
キーボード・ショートカット
サウンド&ビデオ

# Some example search words for testing. #

Good results or good suggestions.

  • サウンド (sound)

  • ビデオ (video)

No results for List-A(J).

  • キーボード (keyboard)

  • ショートカット (shortcut)

  • ショートカット キーボード (shortcuts keyboard)

  • グラフィック (graphic)

No results for List-AJ and List-BJ.

  • VLC (VLC)

No results, no suggestions.

  • メディアプレイヤー (media player)

  • メディアプレイヤ (media player)

# Conclusion #

Although it will be a lot of trouble, I was impressed with the katakana suggestions, because they work perfectly with the simulation example above.

If Recoll created two DBs, one for suggestions and one for simple mechanical matching, and used each appropriately, it might be very useful for all languages, because it would be like a "GUI grep (with suggestions)".

medoc writes

Hi,

I have made a new package, with a few fixes:

  • It should detect katakana/western transitions and process them as word breaks (for VLC and メディアプレイヤ)

  • It should remove ー at the end of Katakana words

  • It should process ・ as a word separator
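The three rules above can be sketched roughly as follows. This is an illustrative tokenizer, not Recoll's actual text splitter: the function name and regular expressions are assumptions, and it only handles katakana and ASCII letters and digits.

```python
import re
from typing import List

# Katakana letters plus the prolonged sound mark "ー" (U+30FC). The middle
# dot "・" (U+30FB) is deliberately left out so that it acts as a separator.
KATAKANA_RUN = r"[\u30A1-\u30FA\u30FC]+"
WESTERN_RUN = r"[A-Za-z0-9]+"
TOKEN_RE = re.compile(f"{KATAKANA_RUN}|{WESTERN_RUN}")


def split_terms(text: str) -> List[str]:
    """Split text at katakana/western transitions and at any other
    character (including "・"), then drop a trailing "ー" from katakana terms."""
    terms = []
    for tok in TOKEN_RE.findall(text):
        if re.fullmatch(KATAKANA_RUN, tok):
            tok = tok.rstrip("ー")
        if tok:  # skip tokens that consisted only of "ー"
            terms.append(tok)
    return terms


vlc_terms = split_terms("VLCメディアプレイヤー")
shortcut_terms = split_terms("キーボード・ショートカット")
```

Note that a separator-less compound such as キーボードショートカット would still come out as a single term under these rules.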

I don’t think that I can handle plural forms or separator-less compound terms: unless I am missing something, this would need a complicated language-sensitive processor.

The new package is named recoll_1.23.2-1~ppa2~xenial1_amd64.deb (note: ppa2 not ppa1), at the same place as the previous one; you can install it in the same way.

Thank you for your testing !

prevh writes

Thanks. I tried it.

It seems that these were fixed.

  • detect katakana/western transitions

  • remove "ー"

About "・" as a word separator, I couldn’t find any difference between ppa1 and ppa2. Already in ppa1, "・" seems to be processed as a separator.

About plural forms and separator-less compound terms, I think so, too. As far as I know, it’s a hard, language-sensitive problem.

Therefore,

  • Practically, the search results of the current stable version (like grep) are essential for Japanese data.

  • I think the new indexing has reached a suitable level for a first release of the "katakana suggestions" function.

Incidentally, I think it’s good that suggestions are always shown, whether or not there are search results.

# Some words to test, for reference. #

Good results or good suggestions.

  • サウンド (sound)

  • ビデオ (video)

  • VLC (VLC)

  • メディアプレイヤー (media player)

  • メディアプレイヤ (media player)

No results for List-A(J).

  • キーボード (keyboard)

  • ショートカット (shortcut)

  • ショートカット キーボード (shortcuts keyboard)

  • グラフィック (graphic)

No results, no suggestions.

  • メディア (media)

  • プレイヤ (player)

medoc writes

At the moment, we can’t have both word suggestions and search results for parts of terms like キーボードショートカット (searching for a component of a separator-less compound term).

The latter requires indexing the data as sequences of characters without trying to break it into words, which in turn rules out word suggestions.

It would be feasible in principle to index both words (used for suggestions) and n-grams (used for finding, grep-like), but this is a big change and quite a lot of work, and can’t be done right now.
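A toy comparison of the two approaches may make the trade-off clearer. This is only a sketch with hypothetical helper names; the real index also records positions, which this ignores.

```python
from typing import List, Set


def bigrams(text: str) -> List[str]:
    """Overlapping two-character sequences, e.g. 'abc' -> ['ab', 'bc']."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_match(query: str, document: str) -> bool:
    """Grep-like candidate test: every bigram of the query occurs in the
    document. A real index would also check that they are adjacent."""
    doc_grams: Set[str] = set(bigrams(document))
    return all(g in doc_grams for g in bigrams(query))


def word_match(query: str, document_terms: List[str]) -> bool:
    """Whole-term test, as when the compound is indexed as a single word."""
    return query in document_terms


doc = "キーボードショートカット"

found_by_bigrams = bigram_match("キーボード", doc)
found_by_words = word_match("キーボード", [doc])
```

With n-grams, the component キーボード is found inside the compound; with whole-word indexing it is not, although the whole words are what a speller needs for suggestions.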

This is a relatively hard choice, but it seems to me that the grep-like results are more important, so my opinion would be to deactivate the recent changes for indexing katakana words (keeping them around for a better future…), and go back to the previous approach.

But really, only someone whose language is Japanese can give good advice on this. What do you think ?

prevh writes

I agree with your opinion, but to be more precise, the details may differ.

Looking back, the beginning of talk was "Indexing failed".

To improve Recoll, I think there are several phases.

1st phase. Ensuring grep-like results for n-gram languages such as Japanese, permanently.

2nd phase. Resolving the enigmatic error message.

I think the enigmatic message unfairly damages Recoll's reliability in users' eyes. It is really mottainai (a waste). This is my core motivation.

At the very least, when "Indexing failed" occurs in a desktop environment for an n-gram language (including Japanese, of course), I think a notice about "aspellLanguage = en" or "noaspell" is needed.

3rd phase. In n-gram environments, enabling English suggestions by default with no side effects.

I guess that "aspellLanguage = en" has no side effects in a Japanese environment, because I cannot find any Japanese aspell dictionaries on the Internet today.

4th phase. First-level suggestions for n-grams (like 1.23.2-1~ppa2).

5th phase. High-level suggestions for n-grams.

As for the release schedule (or milestones) of each phase, I trust your judgment.

Thank you for asking about my opinion !

medoc writes

Thank you for this very well thought out road map ! I am implementing the first steps.

medoc writes

I have uploaded new packages to the same place (still a bogus 1.23.2 version).

This restores grep-like, n-gram based search for japanese katakana.

An aspell dictionary creation error will not result in indexing failed any more. The GUI should display a more informative warning.

aspellLanguage should default to en in a japanese environment. Japanese words are not sent to the speller, so only english words (or other words in western character sets) will present suggestions.

Having spelling suggestions for words written in katakana will have to wait, because this would need a bigger change in recoll.

prevh writes

I tried version ppa3.

  • grep-like results for japanese.

  • informative warning in chinese environment by default.

  • english suggestions in japanese environment by default.

Everything is going along fine!

Is this RC1?

medoc writes

I don’t really have release candidates: the .0 minor version, and possibly the next few, serve as candidates.

Strictly speaking, this should have been 1.24.x, because the changes are a bit more than a bug fix. In practice, Recoll is not changing a lot these days, and I’ll wait a bit and just release the code you have as 1.23.2

If I find other issues during the "stabilisation period", it will be 1.23.3

You can continue using the current 1.23.2: apart from the Japanese environment changes, it is identical to 1.23.1, and should be pretty stable.

prevh writes

Thanks. I understand very well.

medoc writes

Closing this as there is nothing more to be done at the moment (I have an abstract in my todo)