idonttellyou_myname writes

Hello,

I consider Recoll a great tool. I wonder whether it would be possible to port it to Windows, or at least some of its components. Could you give me some hints on whether you consider this task possible? It would be sufficient for me to get the indexer working cross-platform, including Windows, and to provide a C++ interface to start a search and retrieve results. The Qt GUI and X11 would not be important to me as a first step.

I have done a lot of C++ programming in recent years, on many different platforms.

It would be great if you could give me an overview of the core components and of where you think portability is an issue (maybe including a short description of what you do and which library you use to achieve it).

medoc writes

I’ve thought about this a number of times, and of course, a port of Recoll to Windows would be of great interest, because the Windows system search is not so great (there are competent commercial offers, but a free one would be nice). I don’t consider that this would be impossible, but there are a number of difficulties.

Most of the core code should be portable, and this includes the GUI. I recently ported another Qt-based Linux GUI app to Windows, and there were no major issues. Xapian is supported on Windows.

The main issue is going to be with the text extraction framework (the so-called "filters").

There are 2 problem areas:

  1. Most Recoll filters are executed as separate processes. Many have a bidirectional dialog with the main app, so this is not simply a popen() kind of thing. This is handled by the execmd/netcon files inside the utils/ directory: execmd does the fork/exec part, netcon helps with asynchronous communications. I have no idea how to do this under Windows, about which I know next to nothing. It seems quite probable that execmd will need a full reimplementation, and that netcon (which is only used by execmd inside Recoll) might not be involved at all. Maybe a look at how Cygwin reimplements the Unix pipes for bash etc. would help.

  2. When the communication issue is solved, we need to study the port of the actual text extraction programs. Obviously, the first priority is to support MS formats:

    • Excel and PPT are processed by Python filters, so there is probably no major issue here (apart from the Python dependency).

    • MS-Word is processed by antiword. This is a relatively simple C program. I see that there is already a Cygwin-based Windows port, and I guess that it should be possible to get rid of the Cygwin dependency.

    • Mail: this is processed by internal C++ code under Linux. I guess that this would need to be supplemented to process Outlook storage, which should be possible by using libpst.

Other interesting formats:

  • PDF is handled by poppler, which has a Windows port. We’d need to check how it works for us. Obviously the shell script used to wrap pdftotext would need to be replaced.

  • Many other XML-based formats (the LibreOffice ones among them) are processed by shell scripts based on the xsltproc command. This would also need adaptation or replacement. The nice point is that solving one will give us the others.

Among the dependencies, libiconv is ported to Windows, and we’ll obviously have to do without inotify, either using a Windows equivalent, or restricting things to batch indexing.

And then there are also all the boring issues: converting the project to a Visual C++ solution file (or using qmake for everything?), changing the way user configuration and index storage is arranged (the ~/.recoll-xx directories), and the probable myriad of unforeseen problems…

In any case, if you want to get into this, I’m really willing to help and maybe do part of the work (I can’t do Windows-specific stuff, but, for example, using libpst should be in my ballpark).

idonttellyou_myname writes

Thanks for your interest in a port. Yes, Recoll is great, and having it on Windows would indeed help in my daily work.

On the problem areas:

  1. Separate processes: I will need to dig a little deeper into that subject. As far as I understand, it is about IPC between Recoll and the filter processes. Is this some special IPC, or do you just redirect stdin/stdout, for example? I have already done some IPC on both Linux and Windows (especially on Windows, for a university course), so I think we can manage this. Also, I remember a colleague dug into Boost IPC, which should work cross-platform.

  2. I would say let’s keep the Python dependency. There is still a lot of work, so why extend it if we can go with Python on Windows?

On XML: here we are lucky. I needed XML conversion on Windows, and there is an xsltproc for Windows. I have already used it successfully to run an XML conversion known from Linux on a Windows machine. As to indexing: we could replace inotify with this library: https://bitbucket.org/SpartanJ/efsw. I have already investigated other desktop search tools. Having a cross-platform filesystem watcher seems to be a common need, and this seems a reasonable library, though I have not tested it personally.

I would use qmake (is it similar to CMake?). Directly managing a Visual C++ solution might be complicated if we have dependencies on other libraries. As for ~/.recoll, we can just replace it with AppData under Windows. Maybe Boost’s path support would allow us to give a platform-independent path for storing application settings.
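To make the configuration directory idea concrete, here is a minimal sketch in Python (chosen only for brevity, the real code would be C++). The function name `recoll_config_dir`, the `Recoll` subdirectory under AppData, and the fallback to `AppData/Roaming` are assumptions made for illustration; none of them come from the Recoll source.

```python
import os
from pathlib import Path


def recoll_config_dir(platform_name=os.name, env=None, home=None):
    """Pick a per-user configuration directory for the given platform.

    Assumptions for this sketch: Windows uses %APPDATA%/Recoll, and every
    other platform keeps the Unix-style ~/.recoll directory.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if platform_name == "nt":
        appdata = env.get("APPDATA")
        # Fall back to the usual roaming profile location if APPDATA is unset.
        base = Path(appdata) if appdata else home / "AppData" / "Roaming"
        return base / "Recoll"
    return home / ".recoll"
```

Passing the platform name and environment as parameters keeps the choice in one place and makes both branches testable from either system.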

Digging into process communication seems to be the first big step. How do you see this? If you point me to the relevant files in the source code and explain a little more about what kind of communication is necessary and what data gets exchanged, I will try to write a Windows replacement for it. Or would it be OK if I tried to find a cross-platform lib for IPC that does the job?

medoc writes

The interface between the main process and the filters is definitely the first problem to solve.

I will not replace the current solution on Linux: it works well, so we are looking for an alternate Windows implementation rather than a cross-platform library (this does not preclude using a cross-platform lib for the Windows solution, I just won’t use it on Linux, at least for now).

The general filter interface is described here: http://recoll.sourceforge.net/usermanual/rcl.program.html

The interface to the filters is defined by the following files:

  • utils/execmd.h: the interface to execute commands, and send/receive data.

  • internfile/mh_exec.[h,cpp] the interface to the "simple" filters (popen-like).

  • internfile/mh_execm.[h,cpp] the interface to the persistent filters, which have an actual dialog with the indexer.

The simple filters mostly use "doexec" from execmd. The persistent ones use startexec/send/receive.

Note that the indexer is multithreaded, and that it may have dozens of filters in execution (with multiple, but not all, active: the persistent filter processes are reused to save on startup times).

medoc writes

I just had a look at portability by grepping for unistd.h, which is a good indicator of non-Unix portability issues, and it’s all over the place… For example, Recoll makes liberal use of raw Unix file descriptors in many places. I don’t think that there is anything really blocking, but this is going to be a lot of work…
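For anyone who wants to repeat that check without grep (on Windows, for example), a rough equivalent could look like the sketch below. The function name `files_including_unistd` and the list of source suffixes are assumptions made for this example, and the regular expression only catches plain `#include <unistd.h>` or `#include "unistd.h"` lines.

```python
import re
from pathlib import Path

# Matches '#include <unistd.h>' or '#include "unistd.h"' at the start of a line.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]unistd\.h[>"]', re.MULTILINE)
SOURCE_SUFFIXES = {".h", ".hpp", ".c", ".cc", ".cpp"}


def files_including_unistd(root):
    """Return the sorted relative paths of source files that include unistd.h."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            if INCLUDE_RE.search(text):
                hits.append(path.relative_to(root).as_posix())
    return hits
```

Calling `files_including_unistd(".")` from the top of a source tree gives a first list of modules to review.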

idonttellyou_myname writes

Agreed, never touch a running system. I was just considering it (for Windows only, too), as a small library might provide some abstraction.

I read about the interface. So the simple filters are just executables which are passed a filename and write the data to index to stdout? What about the protocol for the complex filters? Do they work via stdin/stdout, named pipes, or sockets? The interface defined in the headers above: is it only used on the Recoll side, or also for implementing the filters themselves? Are the majority of filters simple or complex ones?

Is the multithreading and locking (I am sure you need some synchronization) implemented in a POSIX-conformant way, or is it Linux-specific?

Where should I start in the source? The project consists of so many files. For porting, I would like to start with a small subset covering the core functionality. Is the functionality separated into library parts? You know the project best. It would be great if you could give me a starting point with a subset of source files/dirs. Then we can go step by step.

What do you need the raw file descriptors for?

I agree that it is certainly a lot of work, but from my side, I would say let’s try it. First, I just want a more detailed overview.

medoc writes

You are right about the simple filters. There is additionally an environment variable which is set to specify if the operation is done for indexing or previewing (some of the filters behave slightly differently in both cases). It would probably be no big deal to change this for a command line option if using the environment was an issue.
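As an illustration of the simple filter contract (file name as an argument, extracted data on stdout, an environment variable for the indexing/preview mode), here is a small Python sketch of the calling side. The variable name `FILTER_FOR_PREVIEW` and the function `run_simple_filter` are made up for this example and are not the names Recoll uses. `subprocess.run` works the same way on Linux and Windows, which is the point of the sketch.

```python
import os
import subprocess
import sys

# Hypothetical variable name, chosen for this sketch only.
PREVIEW_ENV_VAR = "FILTER_FOR_PREVIEW"


def run_simple_filter(command, filename, for_preview=False, timeout=30):
    """Run one filter process on one file and return its stdout as bytes."""
    env = dict(os.environ)
    if for_preview:
        env[PREVIEW_ENV_VAR] = "yes"
    else:
        env.pop(PREVIEW_ENV_VAR, None)
    result = subprocess.run(
        list(command) + [filename],  # the file name is the last argument
        stdout=subprocess.PIPE,
        env=env,
        timeout=timeout,             # do not let a stuck filter block forever
        check=True,                  # a non-zero exit status raises an error
    )
    return result.stdout
```

A stand-in filter can be any executable, for instance `run_simple_filter([sys.executable, "myfilter.py"], "doc.txt")`, where `myfilter.py` is a hypothetical script.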

Complex/persistent filters talk to the recollindex process on stdin/stdout (simple pipes). The "protocol", which is just a synchronous question/answer one (with timeouts though), is described at the top of the internfile/mh_execm.h include file. Of course, the include is not used by the filters, which are typically Python (one is in Perl though). The "protocol" is simple enough that reimplementing it in Perl was not an issue. All the Python filters share a common module which implements the dialog (filters/rclexecm.py).
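To show the general shape of such a dialog, here is a hedged Python sketch of a synchronous question/answer exchange with a long-lived child process over stdin/stdout, with a timeout. It does not implement the actual message format from mh_execm.h: the one-line-per-message exchange, the class `PersistentFilter`, and the upper-casing `ECHO_FILTER` stand-in are all invented for illustration. A reader thread is used for the timeout because `select()` does not work on pipes under Windows.

```python
import queue
import subprocess
import sys
import threading

# Stand-in child for the sketch: answers each input line with its upper case.
ECHO_FILTER = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n",
]


class PersistentFilter:
    """Keep one child process alive and exchange one line per request."""

    def __init__(self, command):
        self.proc = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        self.answers = queue.Queue()
        # The reader thread moves answers into a queue so that ask() can
        # wait with a timeout on any platform.
        threading.Thread(target=self._pump, daemon=True).start()

    def _pump(self):
        for line in self.proc.stdout:
            self.answers.put(line)
        self.answers.put(None)  # marks the end of the child's output

    def ask(self, question, timeout=10.0):
        """Send one question line and wait for exactly one answer line."""
        self.proc.stdin.write(question.encode("utf-8") + b"\n")
        self.proc.stdin.flush()
        try:
            answer = self.answers.get(timeout=timeout)
        except queue.Empty:
            self.proc.kill()
            self.proc.wait()
            raise TimeoutError("filter did not answer in time")
        if answer is None:
            raise RuntimeError("filter exited before answering")
        return answer.decode("utf-8").rstrip("\r\n")

    def close(self):
        """Close the child's stdin so it sees end of input, then reap it."""
        self.proc.stdin.close()
        self.proc.wait()
```

The same process answers several questions in a row, which is what saves the interpreter startup time: create it once with `PersistentFilter(ECHO_FILTER)`, call `ask()` per document, and `close()` at the end.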

Just looking at the MS formats, the msdoc filter is a simple one, and the Excel and PPT ones are persistent (mostly because of Python startup times). So both kinds are necessary.

Multithreading and locking is all pthreads (I don’t know of anything else on Linux, actually).

The raw descriptors which are used around are not really necessary, it’s very probably possible to change them for stdio files. This should not be difficult, but might take a lot of time. This was just an example of the kind of stuff which is going to take time.

As for how to start, I think that you should begin with execmd.cpp, the module which executes the filters and does the raw communication with them. I think that the whole implementation is going to be discarded, but hopefully the interface itself (the .h) will not need too much modification. Have a look at the .h to see what it does, and I can answer questions. This might be simpler through email: jf at dockes.org, up to you.

There is a small test driver at the bottom of the .cpp (ifdefed). This only does what is needed for a simple filter currently, but it could be expanded to handle the dialog with a persistent filter, and some version of it could be used to test the windows version.

execmd is one of the bottom modules and has very few dependencies (only the logging module, which should be easy to port or emulate). A Windows implementation is a strict requirement for the port, so I think that it is a good choice for what to start with.

Meanwhile I’ll start cleaning things up, so that all the modules which don’t really need to be Linux-specific become standard C/C++.

What do you think ?

idonttellyou_myname writes

I will write you an email tomorrow or in the next few days, once I have had a first look at execmd.cpp. Email seems easier.

Just for other people reading here: we are giving it a try! If we make notable progress, we will report back here.

medoc writes

Just for a progress update: recollindex indexes text files, and recollq can find them. Still a lot of work ahead, but we at least have a testing platform (the WINDOWSPORT branch).

medoc writes

Closing this, as there is now a working Windows port. See the News item on www.recoll.org, or http://www.lesbonscomptes.com/recoll/pages/recoll-windows.html and http://www.lesbonscomptes.com/recoll/pages/recoll-mingw.html

The port is still a bit experimental, but any of the many remaining problems should be reported as separate issues.