Korean Hangul text is not processed well either by the whitespace-based word splitter used for Western languages or by the n-gram based splitter used for Chinese characters.
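To make the problem concrete, here is a small sketch of what the two generic approaches produce on a short Korean sentence. The two functions are toy stand-ins written for this illustration, not Recoll's actual splitters, and the sample sentence is an assumption chosen for the example.

```python
# Toy splitters for illustration only; they are NOT Recoll's real code.

def whitespace_terms(text: str) -> list[str]:
    """Split on whitespace, the way Western-language text is usually handled."""
    return text.split()


def bigram_terms(text: str) -> list[str]:
    """Emit overlapping character bigrams inside each whitespace-separated chunk."""
    terms = []
    for chunk in text.split():
        if len(chunk) < 2:
            terms.append(chunk)
        else:
            terms.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return terms


sample = "나는 학교에 갔다"  # "I went to school"

print(whitespace_terms(sample))  # ['나는', '학교에', '갔다']
print(bigram_terms(sample))      # ['나는', '학교', '교에', '갔다']
```

With whitespace splitting, the particles stay glued to the words (학교에 is "school" plus the particle 에), so an exact search for 학교 would not match that term. The bigram approach produces fragments such as 교에 which are not meaningful words. A morpheme analyzer is meant to separate 학교 from 에 instead.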
As of version 1.27 Recoll has support for using an external text analyzer for splitting Korean text into appropriate terms.
The initial implementation was based on the Konlpy Python package, which has support for several morpheme analyzers for Korean (Hannanum, Kkma, Komoran, Mecab, Twitter/Okt).
Testing with a kind Korean Recoll user led to choosing the Mecab-ko package, as the one presenting the best performance/quality compromise. It is written in C++, unlike the others which are in Java.
The current Recoll implementation can still work with Konlpy if you want to experiment with different analyzers, but the default setup now uses python-mecab-ko, which is a direct interface to Mecab-ko and avoids the multiple Konlpy dependencies.
Installing Mecab-ko and python-mecab-ko on Windows
Unzip both zip files under C:\Mecab. The location is currently mandatory; I may check one day whether it can be made configurable.
Edit the Recoll index configuration file (default: C:\Users\[me]\Appdata\Local\Recoll\recoll.conf) with, e.g., Notepad, and add the following line:
hangultagger = Mecab
Reset the index.
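If you prefer to script the configuration change rather than use an editor, something like the following sketch could do it. The helper name set_hangultagger is hypothetical, and the code assumes the file is UTF-8 text made of simple "name = value" lines. It puts a new line at the top of the file rather than at the end, on the assumption that the file may contain bracketed per-directory sections and that the setting should stay in the global part.

```python
from pathlib import Path


def set_hangultagger(conf_path, value="Mecab"):
    """Make sure the configuration file contains 'hangultagger = <value>'.

    Hypothetical helper, not part of Recoll. Returns True if the file changed.
    """
    path = Path(conf_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    wanted = f"hangultagger = {value}"

    for i, line in enumerate(lines):
        # Compare only the parameter name, ignoring surrounding spaces.
        if line.split("=", 1)[0].strip() == "hangultagger":
            if line.strip() == wanted:
                return False  # already set as requested
            lines[i] = wanted
            break
    else:
        # Not present: add it at the top, before any bracketed section.
        lines.insert(0, wanted)

    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True


# Usage (replace the path with your own configuration file):
# set_hangultagger(r"C:\Users\[me]\Appdata\Local\Recoll\recoll.conf")
```

The index still needs to be reset afterwards.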
Installing Mecab-ko and python-mecab-ko on Linux
The following installs Mecab to
/usr/local. Use the
argument to the
configure commands to install to
Create a directory to build Mecab-ko:
cd
mkdir mecab
cd mecab
Retrieve, extract, build and install the software itself:
wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.1.tar.gz
tar xvzf mecab-0.996-ko-0.9.1.tar.gz
cd mecab-0.996-ko-0.9.1
./configure
make
make check
sudo make install
Retrieve, extract, build and install the dictionary:
cd .. # Now in the top mecab directory
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz
tar xvzf mecab-ko-dic-1.6.1-20140814.tar.gz
cd mecab-ko-dic-1.6.1-20140814
./configure
# If you get an error about the version of automake files, re-bootstrap
# and run configure again. You will need autoconf and automake installed
# sh autogen.sh
# ./configure
make
# Tell mecab where its dictionary lives
sudo sh -c 'echo "dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic" > /usr/local/etc/mecabrc'
sudo make install
# The following is necessary for the python-mecab-ko build to succeed if
# you installed mecab to /usr/local
sudo ln -s /usr/local/bin/mecab-config /usr/bin/mecab-config
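Once the dictionary steps have completed, a quick sanity check is to confirm that the mecabrc file names a dicdir and that this directory exists. The sketch below assumes the /usr/local paths used in the commands above. The function names are hypothetical, and treating lines starting with ";" or "#" as comments is an assumption about the file format.

```python
from pathlib import Path


def read_dicdir(mecabrc_text):
    """Return the value of the 'dicdir' entry in mecabrc-style text, or None."""
    for line in mecabrc_text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with ';' or '#' carry no settings.
        if not line or line.startswith((";", "#")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "dicdir":
            return value.strip()
    return None


def dictionary_looks_installed(mecabrc_path="/usr/local/etc/mecabrc"):
    """True if mecabrc exists, names a dicdir, and that directory exists."""
    path = Path(mecabrc_path)
    if not path.is_file():
        return False
    dicdir = read_dicdir(path.read_text(encoding="utf-8"))
    return dicdir is not None and Path(dicdir).is_dir()


print("dictionary found:", dictionary_looks_installed())
```

This only checks that the files are where the configuration says they are; it does not verify that the dictionary itself is usable.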
Later versions of the python-mecab-ko package (>= 1.0.9) do not currently work with the above version of mecab. You need to build and install version 1.0.8:
sudo python3 -m pip install python-mecab-ko==1.0.8
Done… Reset the index to get the new Korean terms.
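If the new Korean terms do not show up after the reset, one thing worth checking is which python-mecab-ko version actually got installed. The following sketch uses only the Python standard library. The helper names are hypothetical, and the "below 1.0.9" rule simply restates the constraint given above.

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn a version string such as '1.0.8' into a tuple of up to 3 integers."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def is_compatible(version):
    """True for versions below 1.0.9, the range stated to work with this mecab."""
    return parse_version(version) < (1, 0, 9)


def installed_version(package="python-mecab-ko"):
    """Return the installed version of the package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("python-mecab-ko is not installed for this Python interpreter")
elif is_compatible(version):
    print("python-mecab-ko", version, "should work with mecab-0.996-ko-0.9.1")
else:
    print("python-mecab-ko", version, "is too recent; install 1.0.8 instead")
```

Run it with the same python3 that was used for the pip command, since the package is installed per interpreter.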