thanks_for_the_fish writes

Hey,

I read somewhere here that recoll somehow supports TMSU but I couldn’t find an explanation of it in the manual. So how do I use TMSU within recoll, e.g. to limit a search to files with particular tags?

happy easter!

medoc writes

Hi,

For processing the TMSU tags at index time, you need to set the "metadatacmds" field in the config file, see here: http://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.TERMS.html

In the example, the output of tmsu is used to set a field named tags. Of course it could be named tmsu just the same, but using tags will just augment the standard recoll tags field (an alias for keywords), meaning that you don’t need to extend the configuration for processing yet another field (which can still easily be done, see the manual).

You can then search the tags field through any of its aliases, from the query language:

tags:some/alternate/values
tags:all,these,values

The above syntax is supported for recoll 1.20 and later. For older versions, you would need to repeat the tags: specifier for each term, e.g. tags:some OR tags:alternate
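
For versions older than 1.20, the repeated specifiers can be generated mechanically. This is only a minimal sketch: `expand_field_query` is a hypothetical helper, not part of Recoll, and it assumes that clauses without an operator are ANDed by the query language.

```python
def expand_field_query(field, values, any_of=True):
    """Repeat the field specifier for every value.

    Hypothetical helper for Recoll versions without the
    tags:a/b (any) and tags:a,b (all) shorthands.
    """
    clauses = ["%s:%s" % (field, value) for value in values]
    # OR has to be spelled out; plain juxtaposition is assumed to mean AND.
    return (" OR " if any_of else " ").join(clauses)


print(expand_field_query("tags", ["some", "alternate", "values"]))
print(expand_field_query("tags", ["all", "these", "values"], any_of=False))
```

The resulting string can be pasted into the query language entry as is.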

One gotcha is that tags changes will not be detected by the indexer if the file itself did not change. One possible workaround would be to update the ctime when you add TMSU tags, which would be consistent with how extended attributes function. A pair of chmod could accomplish this, or a touch -a. Alternatively, just couple the tag update with a recollindex -e -i filename.
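
The last workaround (tag, then reindex the same file) could be wrapped in a small script. The sketch below assumes that both `tmsu` and `recollindex` are on the PATH; the function names are made up for illustration.

```python
import subprocess


def tag_commands(path, tags):
    """Return the two commands: apply the TMSU tags, then reindex the file."""
    return [
        ["tmsu", "tag", path] + list(tags),
        # Reindex just this file, as suggested in the post above.
        ["recollindex", "-e", "-i", path],
    ]


def tag_and_reindex(path, tags):
    """Run both commands in order; stop at the first one that fails."""
    for command in tag_commands(path, tags):
        subprocess.check_call(command)
```

Calling `tag_and_reindex("paper.pdf", ["physics", "toread"])` would tag the file and refresh its index entry in one step.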

I think that the above will make its way into the manual if you confirm that it works for you…

Cheers,

jf

thanks_for_the_fish writes

^^I’ll give it a try .

Any chance that recoll will be able to read tags from files in the future?

medoc writes

I am not too sure that I understand what you mean. Recoll is already able to index extended attributes data, which is all I can think of in terms of "tags from files". Could you please explain in more detail?

medoc writes

in the manual, now improved

thanks_for_the_fish writes

> Recoll is already able to index extended attributes data

You mean XMP-Meta data in PDF’s?

medoc writes

No, I meant system-level file extended attributes.

PDF XMP metadata is not currently indexed, but this would be quite easy to add as exiftool can extract the data.

thanks_for_the_fish writes

^^ That doesn’t solve all my problems but I guess it would be a step forward!

medoc writes

Ok, I’m adding PDF XMP to the todo…

thanks_for_the_fish writes

Thank you! My beef with Exiftool is that it doesn’t support all filetypes that I need (such as .ePub and .djvu - and Recoll is the only desktop search engine that supports .djvu … :-/ It also has difficulties with special characters from other languages. Have you thought about using the python-xmp toolkit instead?

medoc writes

I only mentioned exiftool because I saw that it did support pdf xmp. But I’d rather use a python tool actually. I’ll take a look at python-xmp.

thanks_for_the_fish writes

I spent quite some time searching for software that suits my needs (1. edits keywords in files of certain types, 2. in a way that can be indexed by Recoll) and found none so far. However, I stumbled upon TagSpaces https://www.tagspaces.org/ that has a very simple (?) way of storing those keywords - perhaps this might be of interest for recoll users, too? It’s open-source, supports loads of platforms, has very few limitations etc…

cheers

medoc writes

Hi,

I took a look, and it should be quite feasible to extract XMP data using python-xmp.

Are there other file types beyond pdf and djvu for which it would make sense?

How should we select the tags to index ? I’m a bit afraid that we are going to extract a lot of noise if I take everything.

medoc writes

As I understand it, TagSpaces tags are stored in file names, so they will be naturally indexed by recoll. Storing tags in file names has a number of big advantages (portability, searching with standard tools) compared to using extended attributes, but polluting the file names will not be to the taste of all users…

medoc writes

Hi,

You never said what XMP tags you would be most interested in ?

thanks_for_the_fish writes

Sorry, I was a bit busy, I didn’t forget you though :-) I’d consider a "subject"-tag as most important. Other interesting ones would be ones for "publication date", "filename", "author", "language", "title" "entry type" (e.g. book, article, presentation, dissertation etc.), "publisher" and perhaps one for "rating" (perhaps one from 1-5?). A few other ones that come to my mind would be ones for ISBN and/or ISSN, "Institution" and file format (I don’t need those though).

medoc writes

Hi again. I gave a try to the python-xmp library, but it does not seem to give access to all the file xmp data.

What I did was write XMP metadata to a PDF with jabref, then try to extract it, but I had no success. The only tool which would actually extract the title and author information was the libexempi "dumpxmp" tool. python-xmp and the other libexempi samples only seem to extract dc:format, which incidentally is the only one without an rdf wrapper in the dumpxmp output:

<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Determinants of Protein Abundance and Translation Efficiency in S. cerevisiae</rdf:li>
</rdf:Alt>
</dc:title>

Looking into this would be too much effort given that I know little about the XMP format, and the tools involved, so I am giving this up.

By the way, there is a page about using PDF XMP data with recoll here: http://www.lesbonscomptes.com/recoll/recoll_XMP/index.html It describes an approach based on pdfinfo though, so it’s specific to PDF.

thanks_for_the_fish writes

>The only tool which would actually extract the title and author information

How about Exiftool?

medoc writes

Exiftool does show the entries, but it’s perl, which means that I’d have to execute it from the python input filter, meaning a performance hit for people who are not interested in this. I have filed an issue with python-xmp-toolkit. Hopefully someone will tell me what I am doing wrong. Else, I think that your best approach is a custom input filter, such as the one linked above.

Johannes_Me writes

I’d like to revive this topic. The patch mentioned above for getting PDF XMP metadata (as stored by e.g. JabRef) is outdated, because it patches an old rclpdf .sh file (the current one is a .py script), so I’d like to know if somebody can give me a brief introduction to getting XMP metadata into recoll. I have some basic skills in bash, but I’m far from programming in Python. First, if the data is stored by JabRef, you can get a specific value as follows:

#!bash
pdfinfo -meta %f | grep 'bibtex:[field]' | sed 's/<[^>]\+>//g'

So, if I want to get the journaltitle (biblatex-format) of my file "Adam - Time.pdf", I’d perform

#!bash

pdfinfo -meta Adam\ -\ Time.pdf | grep 'bibtex:journaltitle' | sed 's/<[^>]\+>//g'

Now I don’t know how to continue. Can somebody tell me, which steps to perform to insert the %(journal) tag into my recoll html string?
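
For readers who would rather stay in Python, the grep and sed steps of the pipeline above can be sketched as follows. This assumes the output of `pdfinfo -meta` has already been captured as a string; `bibtex_values` is a made-up name and the sample text is invented for the example. Like the shell version, it works line by line, so a multi-value field spread over several lines (an rdf:Seq list) is not picked up.

```python
import re


def bibtex_values(meta_text, field):
    """Return the tag-stripped lines that mention bibtex:<field>."""
    wanted = "bibtex:" + field
    values = []
    for line in meta_text.splitlines():
        if wanted in line:                       # the grep step
            text = re.sub(r"<[^>]+>", "", line)  # the sed step
            values.append(text.strip())          # extra: trim whitespace
    return values


# Invented sample in the style of "pdfinfo -meta" output.
sample = """<bibtex:journal>Journal of Advanced Python Scripting Modification</bibtex:journal>
<bibtex:volume>12</bibtex:volume>"""

print(bibtex_values(sample, "journal"))
```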

medoc writes

I can probably modify the python filter to do this, in a configurable way so that it won’t hurt people not wanting the function for performance reasons.

A sample document, and a list of fields to extract, would be useful for testing.

If you have one around which you can share, it would be nice if you could attach it here (never tried…) or email it to jf@dockes.org

Johannes_Me writes

You would do this? That would be great! Tell me if it’s too much work, then you could present us a sample rclpdf.py and tell us how to add more entries by example.

Since I don’t find the attach function, I’ll email you the pdf file.

First: the entry list. The most common bibtex entries can be found [here](https://verbosus.com/bibtex-style-examples.html?lang=de), while the entrytypes (@*) article, book, inbook, incollection are the most interesting ones. so my favorite list would be:

Entries recoll already handles:

  • author, title, keywords. Note that author includes strange fields like dc:creator, so that sometimes Adobe Photoshop is the book author. I already edited that in my fields configuration, but maybe a custom author field would be cool.

Entries recoll doesn’t handle, by entrytype, duplicates removed:

  • general: bibtexkey, entrytype (other formats: type)

  • for articles: journal (journaltitle for BibLaTeX, can be wildcarded), volume, number, year (if not present: date), pages, issn, doi

  • for book: address (location if BibLaTeX), publisher, isbn

  • for inbook: chapter

  • for incollection: editor

Maybe you can explain how you edit the file, in the future I’d like to have a bibliographic info field in addition for each entrytype, the article field for example "%author(s): %title, in: %journal %volume, %number (%year), S. %pages."

I don’t want to flood this post, otherwise I would give you my output of "pdfinfo -meta %f" and "exiftool -all %f". But note that JabRef stores entries in several formats, like common XMP, bibtex, rdf. The latter causes the mentioned conflicts, since in example pdf the author is stored like that (pdfinfo -meta %f):

#!xml

<dc:creator>
<rdf:Seq>
<rdf:li>SurnameX Name</rdf:li>
<rdf:li>SurnameY Name</rdf:li>
</rdf:Seq>

while "pdfinfo %f" of another file returns also:

#!bash

Creator:        Adobe InDesign CS4 (6.0.4)

while the more interesting bibtex entries return like this ("pdfinfo -meta" again, for multiple values):

#!xml

<bibtex:author>
<rdf:Seq>
<rdf:li>Surname Name</rdf:li>
<rdf:li>Name Surname</rdf:li>
</rdf:Seq>
</bibtex:author>

or (for single values):

#!xml

<bibtex:journal>Journal of Advanced Python Scripting Modification</bibtex:journal>

Thanks a lot!

EDIT

What I forgot to mention:

It would be great to know how to continue afterwards. Following the outdated [guide](https://www.lesbonscomptes.com/recoll/recoll_XMP) I guess it’s something like:

  • moving modified rclpdf.py to ~/.recoll/filters/rclpdf.py

  • editing ~/.recoll/mimeconf with entry

#!sh

application/pdf = exec /home/<username>/.recoll/filters/rclpdf.py

  • doing lots of changes in /usr/share/recoll/examples/filters or in a custom version in ~/.recoll/examples/filters; in the latter case there must be some modification so that recoll knows which ../examples/filters to use.

medoc writes

Ok, I think that I have what I need to work… Of course, I’ll help with setting up the system.

It’s quite probable that I won’t get it right the first time, we may need a few iterations, if that’s ok with you.

What recoll version are you currently running and on what system ? The question is for the case where I can’t do everything needed in the input handler and I need to modify recollindex, and so that I can build a package for you.

Johannes_Me writes

Thanks! Every iteration, yes.

I’m running Recoll 1.23.1 + Xapian 1.2.22 on two different systems, which are:

  • xubuntu xenial 64bit

  • ubuntu studio zesty 32 bit

medoc writes

I have come up with a first version of the new pdf input handler, and pushed it to bitbucket. To use it:

[index]
application/pdf = execm /path/to/rclpdf.py

The fields to be extracted from the XML PDF metadata are defined by a new variable in the configuration file (e.g. ~/.recoll/recoll.conf): pdfextrameta. This is a space-separated list. Each element is either a field name (e.g. bibtex:location), or a pair like fieldname|recollfieldname. E.g., with non-sensical associations:

pdfextrameta =  bibtex:location bibtex:title|keywords \
dc:creator|author bibtex:chapter|keywords
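
To make the format of the variable concrete, here is a rough sketch of how such a value could be split into pairs. It is not the code used by rclpdf.py; `parse_pdfextrameta` is a hypothetical name, and it assumes that an element without a `|` is stored under its own tag name.

```python
def parse_pdfextrameta(value):
    """Split a pdfextrameta value into (xml_tag, recoll_field) pairs."""
    # Join continuation lines (backslash at end of line) first.
    value = value.replace("\\\n", " ")
    pairs = []
    for item in value.split():
        xml_tag, sep, recoll_field = item.partition("|")
        # Assumption: no explicit target means "use the tag name itself".
        pairs.append((xml_tag, recoll_field if sep else xml_tag))
    return pairs


example = ("bibtex:location bibtex:title|keywords \\\n"
           "dc:creator|author bibtex:chapter|keywords")
for xml_tag, recoll_field in parse_pdfextrameta(example):
    print(xml_tag, "->", recoll_field)
```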

This works at the XML level, not grep-like, so you can use any XML tag name present in the data, all text in the subtree will be extracted and concatenated (e.g. for the bibtex:author example above).
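
As an illustration of "all text in the subtree", the following sketch uses the standard xml.etree.ElementTree module on a small hand-written sample. It is not the handler's actual code, and the bibtex namespace URI is a placeholder chosen for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "bibtex": "http://example.org/bibtex/",  # placeholder URI for this sketch
}

SAMPLE = """
<rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:bibtex="http://example.org/bibtex/">
  <bibtex:author>
    <rdf:Seq>
      <rdf:li>Surname Name</rdf:li>
      <rdf:li>Name Surname</rdf:li>
    </rdf:Seq>
  </bibtex:author>
</rdf:Description>
"""


def subtree_text(root, qualified_name, namespaces):
    """Concatenate all text found under every element with the given name."""
    prefix, _, local = qualified_name.partition(":")
    tag = "{%s}%s" % (namespaces[prefix], local)
    parts = []
    for element in root.iter(tag):
        for chunk in element.itertext():
            chunk = chunk.strip()
            if chunk:  # skip the whitespace between nested tags
                parts.append(chunk)
    return " ".join(parts)


root = ET.fromstring(SAMPLE)
print(subtree_text(root, "bibtex:author", NS))
```

Both list items end up in one space-separated value, which matches the concatenation behaviour described for multi-value fields.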

Multiple sources of data for the same recoll field should get concatenated in most cases, but there are exceptions for some well-known fields, e.g. title/caption, I did not do an exhaustive test. Concatenation should work for all custom fields. The concatenated values are separated by space characters. I’d have no objection to changing this to any separator compatible with the text splitter, but this would need a c++ modification.

There is no way to control the order in which the values are concatenated, nor to insert descriptive data like in your bibliographic info field, but this can be handled by the result list paragraph format if the different elements are extracted separately.

I gathered that you already know about the fields file and recoll field data processing. Once the data is extracted by the PDF input handler, it behaves like any recoll field.

There does not seem to be an issue with colon characters in recoll field names, but I have little experience with these, there may be surprises lurking, translation to a safe name is the safe approach but maybe I’m over-cautious.

I probably missed some issues, we’ll address them after your initial testing !

Johannes_Me writes

Wow, that was fast. I’m on the road and cannot get back to my computer before tomorrow, but then I’ll do the initial test immediately. I have to add that I know the fields file, but edited it by trial and error, so I’ll probably need help with this as well.

Greetings from a happy user from Berlin.

medoc writes

Nothing urgent at all !

Johannes_Me writes

Hello again. was just looking for the new input handler. To be honest: I’m a little lost.

First question would be: how exactly are custom config files in ~/.recoll handled? Does recoll add the entries there to the config variables stored in /usr/share/recoll/, or does it only use these user configurations? How do I tell recoll which mimeconf it has to read?

So, after setting the filter, editing the mimeconf and editing the fields file by trial and error, I ran recollindex -z, which prints more or less the same lines for each file:

#!bash
:2:internfile/mh_execm.cpp:89::MHExecMultiple: getline error
:2:internfile/internfile.cpp:738::FileInterner::internfile: next_document error [/home/hannes/Texte/A/Adorno - Towards a New Manifesto - 1956.pdf] application/pdf
:3:rcldb/rcldb.cpp:611::Db::add: docid 2 added [/home/hannes/Texte/A/Adorno - Towards a New Manifesto - 1956.pdf|]
:2:utils/netcon.cpp:439::NetconData::send: send(11) errno 32 (Broken pipe)
:2:utils/execmd.cpp:838::ExecCmd::send: send failed
:2:internfile/mh_execm.cpp:209::MHExecMultiple: send error
:2:internfile/internfile.cpp:738::FileInterner::internfile: next_document error [/home/hannes/Texte/A/Arendt - Vita Activa.pdf] application/pdf
:3:rcldb/rcldb.cpp:611::Db::add: docid 3 added [/home/hannes/Texte/A/Arendt - Vita Activa.pdf|]
Traceback (most recent call last):
  File "/home/hannes/.recoll/filters/rclpdf.py", line 39, in <module>
    import rclexecm
ImportError: No module named rclexecm

The rclexecm.py is located in /usr/share/recoll/filters, but the indexing app seems not to know that. What can I do?

In general, could you confirm that I had more or less the right idea when editing the files? In recoll.conf I added the line "pdfextrameta = bibtex:journal …", listing each possible bibtex entry. In the fields file I added values to the "stored" section ("journal=") and in the aliases section I listed, for each field I wanted to add, the names it should collect ("journal = bibtex:journal bibtex:journaltitle"). If that is right, I should be able to insert the field into the HTML string using %(journal). Right?

Johannes_Me writes

Ok, forget what I said before. It works!

there were some extraction failures like

RCLMFILT: rclpdf.py: Metadata extraction failed: 'ascii' codec can't decode byte 0xc3 in position 94: ordinal not in range(128)

or

RCLMFILT: rclpdf.py: Metadata extraction failed: 'NoneType' object has no attribute 'findall'

or

Traceback (most recent call last):
  File "/usr/share/recoll/filters/rclpdf.py", line 522, in <module>
    rclexecm.main(proto, extract)
  File "/usr/share/recoll/filters/rclexecm.py", line 338, in main
    proto.mainloop(extract)
  File "/usr/share/recoll/filters/rclexecm.py", line 265, in mainloop
    self.processmessage(processor, params)
  File "/usr/share/recoll/filters/rclexecm.py", line 245, in processmessage
    self.answer(data, ipath, eof)
  File "/usr/share/recoll/filters/rclexecm.py", line 184, in answer
    self.senditem("Document", docdata)
  File "/usr/share/recoll/filters/rclexecm.py", line 176, in senditem
    l = len(data)
TypeError: object of type 'NoneType' has no len()
:2:internfile/mh_execm.cpp:89::MHExecMultiple: getline error
:2:internfile/internfile.cpp:738::FileInterner::internfile: next_document error [/home/hannes/Texte/A/Arendt - Kultur und Politik.pdf] application/pdf
:3:rcldb/rcldb.cpp:611::Db::add: docid 16 added [/home/hannes/Texte/A/Arendt - Kultur und Politik.pdf|]
RCLMFILT: rclpdf.py: Metadata extraction failed: 'ascii' codec can't decode byte 0xc3 in position 122: ordinal not in range(128)

The only thing is, as I mentioned before: It came to life, when I moved and linked everything to systemwide locations. So, when I inserted the new rclpdf.py to /usr/share/recoll/filters it also found the scripts to import like rclexecm.py.

That will probably cause problems in the future, since each update of recoll will overwrite the rclpdf.py, the fields file and so on.

So would you help me set everything up as user configuration files? I guess I have to copy the whole file into my user-specific recoll folder. And I need to know how recoll is pointed to the custom mimeconf file. Additionally, I think I have to modify rclpdf.py so that it finds the scripts it needs to import.

One further question. Is it possible for a beginner like me to "sed" the imported fields somewhere along the way? For example, the bibtex:pages field is stored in the format "25--50", which I’d like to sed to "25-50", and possibly, for something like the bibliographic info field, to ", S. 25-50" (S. = Seite = page), so that I could just concatenate the strings in the output to get a proper reference. I’d like to perform these modifications because if I built a reference string in the HTML output in the form AUTHOR: "TITLE", in: JOURNAL, Jg. VOLUME, H. NUMBER (YEAR), S. PAGES., a file which has no metadata would come out as ": "Title", in: , , (), S. ."

On the following screenshot you can see my first search with the new fields: ![Bildschirmfoto_2017-05-14_20-24-27.png](https://bitbucket.org/repo/LoKMq/images/2779685685-Bildschirmfoto_2017-05-14_20-24-27.png)

So thanks for everything until now!

medoc writes

I should have tried what I suggested, things go wrong every time I suggest something without actually trying…

You are right that it’s not a good idea to change system-wide files.

The values in the files from your configuration directory (~/.recoll or whatever $RECOLL_CONFDIR or the -c value point to) override the system-wide values.

Any modification to recoll.conf or mimeconf should be done in your local configuration directory.

You should NOT copy the whole recoll.conf or mimeconf to your local directory, just add the appropriate lines to the local file (so the mimeconf below is the whole local file if you have no other custom modifications).

I don’t think that you need to copy the whole fields file either, just take care to use the appropriate sections when adding values.

But I had forgotten that the Python filters depend on python modules which they can’t find if they are copied to some other place.

I think that the best approach is to use symbolic links to the system-wide files. For example, to have the modified rclpdf.py live in ~/.recoll, and supposing that the modified rclpdf.py is currently in /usr/share/recoll/filters:

cd
cd .recoll
cp /usr/share/recoll/filters/rclpdf.py .
ln -s /usr/share/recoll/filters/rclexecm.py .
ln -s /usr/share/recoll/filters/rclconfig.py .

In mimeconf (replace /home/dockes with the proper value of course):

[index]
application/pdf = execm /home/dockes/.recoll/rclpdf.py
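
A different technique from the symbolic links above, shown only as an untested alternative: a relocated filter could extend Python's module search path before importing its helpers. The directory is the one mentioned earlier in this thread; `add_module_dir` is a made-up helper name.

```python
import os
import sys

# Location of the stock filters as reported above; adjust to your installation.
SYSTEM_FILTERS = "/usr/share/recoll/filters"


def add_module_dir(directory, search_path=None):
    """Append directory to the module search path once; return the path list."""
    if search_path is None:
        search_path = sys.path
    directory = os.path.abspath(directory)
    if directory not in search_path:
        # Appending keeps modules next to the script ahead of the system ones.
        search_path.append(directory)
    return search_path


add_module_dir(SYSTEM_FILTERS)
# After this call, "import rclexecm" in a relocated rclpdf.py should be
# resolved from SYSTEM_FILTERS (not verified against a real installation).
```

The drawback compared with the links is that the main script itself has to be edited, so the change must be redone whenever rclpdf.py is replaced.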

About the extraction errors: I’d be interested by the files which cause them, if you can share samples, please send them to jf@dockes.org

About the custom field editing: I don’t really know how to let you change specific fields with an external command, this seems too involved. But if you can use sed, you can do the same thing in Python: python uses the same regular expressions. If you can confirm an example of what you want to do, I can add what’s needed to the script, and you should be able to use the example for other modifications.

And I had to type the above twice because I clicked on your screen copy, and when I got back, bitbucket had lost my text… Every time I get bitten, everytime I swear that I’ll use an external editor next time, and every next time I get bitten again :)

medoc writes

Very nice result paragraph format by the way !

Johannes_Me writes

Concerning the paragraph format. I could share it here, when it works properly.

About the symbolic links: don’t you think that if I link the system-wide files into my user folder with ln -s, the problem of overwriting during updates would still be there? If Recoll gets updated, a new version of rclpdf.py would be pushed via the symbolic link to my ~/.recoll.

medoc writes

You are only using links for the system files which you do not change.

So the files may be updated by a new version, but the links will still point to them, which is what you want.

There is a possible problem if there is an incompatible change, but the only change which you may have to port to the new version is the few lines of code used to edit the fields.

When we’re done and satisfied with the result, yes, it would be nice to publish the whole thing ! The new rclpdf.py will become the standard in any case, but the specific work with the bibliographic fields and the paragraph format may interest other people.

Johannes_Me writes

Ah, you mean, i should create symbolic links for these two scripts the new rclpdf.py needs to import, namely rclexecm.py and rclconfig.py, right? (EDIT: of course, you mean that. you exactly wrote that… sorry)

I sent an email to you with my recollindex -z output and two pdf documents causing problems while indexing. I think that should be an encoding problem in the metadata.

There is another critical issue, but I should open another topic for this. Short: Sometimes while using Recoll, the system gets slower using huge amount of memory, and i figured out that the problem is caused by ~/.config/Recoll.org/recoll.conf growing fast to some hundreds of MB. There is an entry created named

#!conf

prefs\adv\clauseList="4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 4 4 0 2 5 ..."

which keeps growing while the program is running. But this has nothing to do with the new handler, it happened before.

Johannes_Me writes

Ok, I migrated the config files and it seems to work. The local mimeconf is used and rclpdf.py no longer misses the scripts it imports. What about the fields file? Is it recommended to create a custom one in /home/user, and how does recoll locate it? A freshly installed recoll has no fields file in the user directory.

medoc writes

Fields is the same as mimeconf and recoll.conf. The values set in your local file complement or override the values in the system file.
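
The "complement or override" rule can be pictured as a dictionary merge. The sketch below is a simplified model, not Recoll's configuration code: it ignores sections such as [index], and the system-wide values are illustrative only.

```python
def effective_values(system_values, local_values):
    """Model of the layering: local entries win, the rest is inherited."""
    merged = dict(system_values)   # start from the system-wide defaults
    merged.update(local_values)    # local entries complement or override
    return merged


# Illustrative values only.
system_mimeconf = {
    "application/pdf": "execm rclpdf.py",
    "text/plain": "internal",
}
local_mimeconf = {
    "application/pdf": "execm /home/dockes/.recoll/rclpdf.py",
}

merged = effective_values(system_mimeconf, local_mimeconf)
for mime_type in sorted(merged):
    print(mime_type, "=", merged[mime_type])
```

This is why the local file only needs the lines that differ from the system-wide one.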

medoc writes

By the way, given your questions, I wonder if you may have missed that the manual has a rather extensive section about the configuration files: https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.html

About the clauses list bug: this is a long-lurking thing, and your input finally prodded me to fix it ! It should be gone from the next version. What happened was that, as soon as the clauses list was longer than the default, its size doubled each time you opened the preferences window !
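
To get a feeling for how quickly doubling reaches hundreds of MB, here is a back-of-the-envelope calculation. The starting size of 10 bytes and the 100 MB target are assumed numbers, not measurements.

```python
def openings_until(size, limit):
    """Count how many doublings take size to at least limit."""
    if size <= 0:
        raise ValueError("size must be positive")
    count = 0
    while size < limit:
        size *= 2
        count += 1
    return count


# Assumed numbers: a 10-byte entry growing to a 100 MB file.
print(openings_until(10, 100 * 1024 * 1024))
```

A couple of dozen openings of the preferences window are enough under these assumptions.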

Johannes_Me writes

Wonderful. the latter reminds me of the [wheat and chess problem](https://en.wikipedia.org/wiki/Wheat_and_chessboard_problem).

medoc writes

Yes, exponentials are nice functions :)

medoc writes

This has an additional method to alter the metadata fields:

    def _extrametafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt

The method is called each time a field is found. You should easily find info about re.sub() in any Python regular expression tutorial, if you can do sed, you can do this :)
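
For people coming from sed, the main difference is that re.sub() replaces every match by default, like the g flag in sed. A small sketch (`sed_like` is just an illustrative wrapper):

```python
import re


def sed_like(pattern, replacement, text, replace_all=True):
    """re.sub with a switch that mimics sed with or without the g flag."""
    # count=0 means "no limit" for re.sub; count=1 stops after the first match.
    return re.sub(pattern, replacement, text, count=0 if replace_all else 1)


print(sed_like(r"--", "-", "25--50"))                         # like s/--/-/g
print(sed_like(r"--", "-", "1--2, 7--9", replace_all=False))  # like s/--/-/
```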

Don’t hesitate if you have questions anyway.

Johannes_Me writes

New handler works like magic. Updated the well known partial B*-library, 149 documents. 4 errors, all the same type:

#!shell

RCLMFILT: rclpdf.py: Metadata extraction failed: 'NoneType' object has no attribute 'findall'

If you want to take a look, the documents causing the last few errors are the articles of Berg, Burke, Butler, Borensztein.

Johannes_Me writes

One question about the filter (Did not know that I would possibly start python coding through this project):

If I’d like to change the bibtex:pages field from the format "5--20" to ", S. 5-20", I need not just to replace a string but to insert one. I’d like to know how to insert text at the beginning and at the end of a string. Can I do it like this?

#!python

    def _extrametafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
            txt = re.sub(r'.*', '\, S. \1', txt)
            txt = re.sub(r'.*', '\1XYZ', txt) # just in case i want to insert XYZ at the end of a string?
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt

Does that work or is there a more elegant solution?

Edit:

Trial and error again. It should be solved by

#!python

            txt = re.sub(r'^', ', S. ', txt)
            txt = re.sub(r'$', 'any string at the end', txt)

Johannes_Me writes

Look what finally happened to my output after some modifications with your filter:

My output string is built this way:

#!html

<table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5">

<thead style="vertical-align: top;">
<tr>
<td colspan="3" style="border-bottom: 1pt dotted #004070; font-size: smaller;"><a href="E%N">%u</a> | %S | Relevanz: %R</td>
</tr>
</thead>

<tbody style="vertical-align: top;">
<tr>
<td><a href="P%N"><img src="%I" alt="" width="64" height="auto" /></a></td>
<td style="width: 250px;"><span style="color: #004070;">
  <div style="font-style: italic;">%(author)</div>
  <div style="font-weight: bold;"><a href="E%N">&raquo;%T&laquo;</a></div>
  <div style="text-transform: uppercase; margin-top: 5pt">%(reftype)</div></td>
<td>
  <div style="font-size: smaller;">
    %(refauthor)%(refchapter) %(reftitle)%(refeditor)%(refbooktitle)%(refjournal)%(refvolume)%(refnumber)%(refaddress)%(reflocation)%(refpublisher)%(refyear)%(refpages).</div>
  <div style="text-align: justify; font-family: serif; margin-top: 5pt; margin-bottom: 5pt">&raquo;<a href="A%N">%A</a>&laquo;</div>
  <div>%(refkeywords)</div>
  <div style="font-size: smaller;"><a href="%(refurl)">%(refurl)</a></div>
  <div style="font-size: smaller"> %(refkey) %(refisbn) %(refissn) %(refdoi)</div></td>
</tr>
</tbody>

</table>

Please note that my html skills are from the times, when web pages were blinking, framed and rainbow coloured. As underlying css style I just have slight modifications from the recoll manual, so that rows are coloured.

medoc writes

About the document errors: I had a quick look, and there is nothing obvious, I need to investigate.

About adding text at the beginning and at the end: you don’t need re.sub for this, you can just do string concatenation:

txt = 'S. ' + txt + 'whatever at the end'

For doing simple stuff like this you will quickly find out that Python is easier than the shell with all its quoting problems.
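
Concatenation also makes it easy to leave empty values alone, which avoids stray punctuation for documents without metadata. A minimal sketch; `decorate` is a hypothetical helper, not part of the filter:

```python
def decorate(txt, prefix="", suffix=""):
    """Add prefix and suffix, but leave empty values empty."""
    txt = txt.strip()
    if not txt:
        return ""
    return prefix + txt + suffix


print(repr(decorate("25-50", prefix=", S. ")))
print(repr(decorate("", prefix=", S. ")))
```

Inside the field-editing method this could be used as `txt = decorate(txt, prefix=', S. ')`.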

It’s too late for me to look at HTML. Just know that my familiarity with the thing dates from the same era as yours :)

Johannes_Me writes

Thanks for the advice above. About the html thing: It was just to share the style if somebody wants to use it (and hopefully modify it).

Another question, but I don’t know how much effort it would take:

It’s about the concatenation of "aliased" fields. Assuming that users don’t know which fields a document contains (in my case it’s clear, because I know that JabRef writes bibtex:*** fields), one may add e.g. an entry in the fields file:

#!

refauthor = bibtex:author dc:creator xesam:author # and so on

If the document contains more than one of these fields (xesam:author for example is used by the original recoll config) the output string would be something like "Karl Marx - Karl Marx - Karl Marx", because multiple values are being concatenated.

Maybe it would be possible not to concatenate entries that have exactly the same value?

As said, for Jabref users and most other cases it would be suitable in the current state.
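
The idea can be written down in a few lines of Python. This only illustrates the requested behaviour on a list of values; it is not a patch for Recoll, and the " - " separator is taken from the example output quoted above.

```python
def join_unique(values, separator=" - "):
    """Join values in their original order, skipping exact repeats."""
    seen = set()
    unique = []
    for value in values:
        key = " ".join(value.split())  # ignore differences in whitespace
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return separator.join(unique)


print(join_unique(["Karl Marx", "Karl Marx", "Karl Marx"]))
print(join_unique(["Karl Marx", "Friedrich Engels", "Karl Marx"]))
```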

medoc writes

The issue was initially about TMSU, but quickly switched to XMP metadata. Should have opened another one…

medoc writes

I pushed an updated rclpdf.py, you will need to cut and paste your field-editing routine in there (we could push it into a separate module if this proves an issue at some point).

The new version processes the problem files. The issue was a slightly different XML tree (different root). As I’m doing this by looking at examples, not the formal spec, it’s possible that we will find other unexpected variations, so keep an eye for errors !

I do understand your intent with the HTML. But I’m curious, and I want to have a look at it.

About the duplicate entries, I think that this is a general problem, and we should make it another, longer term, issue because it needs modifications to the indexer code.

medoc writes

New version of rclpdf.py again. This was modified to load the field-editing code from a separate file, avoiding the need to copy and edit the main script itself. Your editing code can remain the same, you just need to move it to a separate file. There is a bit of documentation, but don’t hesitate to ask me for more: http://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INDEXING.PDF.XMP.html

Johannes_Me writes

Ok, can you briefly explain where to create the editing script and, if necessary, how to point to it? And what’s the difference in usage between metafix() and wrapup()?

The documentation seems to be for professionals.

Meanwhile I worked a little on my get-meta-from-internet-and-make-bibtex bash ;)

Edit: Meanwhile I found the right documentation. But can you still explain the wrapup() method, or point me to the documentation I cannot find?

Johannes_Me writes

I would share my user recoll files in case you want to use it for documentation. I edited them erasing fields, which no other user could use, because I added them manually through my bash script getting metadata from web.

recoll.conf looks like this (just self edited lines):

#!

pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords bibtex:comment bibtex:language bibtex:edition bibtex:totalpages dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
defaultcharset = UTF-8//

pdfextrametafix = /home/hannes/.recoll/metafix.py

metafix.py looks like this, I translated it to a more international format:

#!python

import sys
import re

# This can be used for local XMP field editing.
#
# A new instance is created for each PDF document (so the object could
# keep state to avoid, e.g. duplicate values)
#
# The metafix method receives an (original) field name, and the text
# value, and should return the possibly modified text.
class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        # The r':\ ' replacements are raw strings: the backslash is kept
        # literally, as before, without an invalid-escape warning.
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
            txt = re.sub(r'^', ', p. ', txt)
        elif nm == 'bibtex:author':
            txt = re.sub(r'$', r':\ ', txt)
        elif nm == 'bibtex:chapter':
            txt = re.sub(r'^', ', in: id.: ', txt)
        elif nm == 'bibtex:editor':
            txt = re.sub(r'^', ', in: ', txt)
            txt = re.sub(r'$', r' (ed.):\ ', txt)
        elif nm == 'bibtex:year':
            txt = re.sub(r'^', ', ', txt)
        elif nm == 'bibtex:date':
            txt = re.sub(r'^', ', ', txt)
        elif nm == 'bibtex:volume':
            txt = re.sub(r'^', ', vol. ', txt)
        elif nm == 'bibtex:number':
            txt = re.sub(r'^', ', no. ', txt)
        elif nm == 'bibtex:journaltitle':
            txt = re.sub(r'^', ', in: ', txt)
        elif nm == 'bibtex:journal':
            txt = re.sub(r'^', ', in: ', txt)
        elif nm == 'bibtex:title':
            txt = re.sub(r'^', '"', txt)
            txt = re.sub(r'$', '"', txt)
        elif nm == 'bibtex:location':
            txt = re.sub(r'^', ', ', txt)
            txt = re.sub(r'$', r':\ ', txt)
        elif nm == 'bibtex:address':
            txt = re.sub(r'^', ', ', txt)
            txt = re.sub(r'$', r':\ ', txt)
        elif nm == 'bibtex:isbn':
            txt = re.sub(r'^', 'ISBN: ', txt)
        elif nm == 'bibtex:issn':
            txt = re.sub(r'^', 'ISSN: ', txt)
        elif nm == 'bibtex:doi':
            txt = re.sub(r'^', 'DOI: ', txt)
        elif nm == 'bibtex:bibtexkey':
            txt = re.sub(r'^', 'Key: ', txt)

        return txt
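
An editing file like this can be tried out by hand before running the indexer. The following is only a sketch: it uses a cut-down MetaFixer with two of the fields, the helper apply_fixer() is hypothetical (it is not part of recoll), and the sample values are made up.

```python
import re


class MetaFixer(object):
    """Cut-down version of an editing class, two fields only (illustration)."""

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
            txt = re.sub(r'^', ', p. ', txt)
        elif nm == 'bibtex:title':
            txt = re.sub(r'^', '"', txt)
            txt = re.sub(r'$', '"', txt)
        return txt


def apply_fixer(fields):
    """Hypothetical helper: run metafix() over (name, value) pairs."""
    fixer = MetaFixer()  # one instance per document
    return [(nm, fixer.metafix(nm, txt)) for nm, txt in fields]


# Made-up sample values, not taken from a real PDF.
sample = [('bibtex:title', 'Some Title'), ('bibtex:pages', '12--34')]
for nm, txt in apply_fixer(sample):
    print(nm, '=>', txt)
```

Fields that have no branch in metafix() come back unchanged, so it is safe to pass every extracted field through it.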

The fields file looks like this. It also creates entries that can be interpreted by the HTML output string:

#!python

[prefixes]

refjournal=RFJOURNAL
refpages=RFPAGES
reftitle=RFTTITLE
refvolume=RFVOLUME
refauthor=RFAUTHOR
refyear=RFYYEAR
refisbn=RFISBN
refissn=RFISSN
refdoi=RFDOI
refeditor=RFEDITOR
refpublisher=RFPUBLISHER
refaddress=RFADDRESS
reflocation=RFLOCATION
refbooktitle=RFBOOKTITLE
refurl=RFURL
reftype=RFTYPE
refkey=RFKEY
refabstract=RFABSTRACT
refkeywords=RFKEYWORDS
refcomment=RFCOMMENT
refedition=RFEDITION
reflanguage=RFLANGUAGE

[stored]

refjournal=
refpages=
reftitle=
refvolume=
refauthor=
refyear=
refisbn=
refissn=
refdoi=
refeditor=
refpublisher=
refaddress=
reflocation=
refbooktitle=
refurl=
reftype=
refkey=
refabstract=
refkeywords=
refcomment=
refedition=
reflanguage=
refid=

[aliases]

refjournal = bibtex:journal bibtex:journaltitle
refpages = bibtex:pages
reftitle = bibtex:title
refvolume = bibtex:volume
refauthor = bibtex:author
refyear = bibtex:year bibtex:date
refid = dc:identifier bibtex:isbn bibtex:issn
refisbn = bibtex:isbn
refissn = bibtex:issn
refdoi = bibtex:doi
refeditor = bibtex:editor
refpublisher = bibtex:publisher
refaddress = bibtex:address
reflocation = bibtex:location
refbooktitle = bibtex:booktitle
refurl = bibtex:url
reftype = bibtex:entrytype bibtex:type
refkey = bibtex:bibtexkey
refabstract = bibtex:abstract
refkeywords = bibtex:keywords
refcomment = bibtex:comment
refedition = bibtex:edition
reflanguage = bibtex:language
author = xesam:author

In your documentation you recommend using e.g. year and journal, but in another document you said to be careful about reusing system-wide field names for custom ones. Is that still right?

Johannes_Me writes

By the way: it might interest you that I updated my core PDF library of 1330 files WITHOUT ANY obvious error.

The other errors did not seem to be document-related, for example:

  • failed extraction of *.epub files,

  • failed extraction of a corrupted PDF, neither of which is surprising,

  • the following errors. Each one appeared at a different place in the recollindex -z output, so they are not related to each other. I added the line for the file extracted next, in case you think the error has to do with the file.

#!

Syntax Warning: Illegal annotation destination
:3:rcldb/rcldb.cpp:614::Db::add: docid 420 added [/home/hannes/Texte/M/Marx Alltogether.pdf|]

Syntax Error: Marked object is wrong type (boolean)
:3:rcldb/rcldb.cpp:614::Db::add: docid 944 added [/home/hannes/Texte/R/Rehberg - Differenz und Integration. Die Zukunft moderner Gesellschaften.pdf|]

:2:internfile/internfile.cpp:738::FileInterner::internfile: next_document error [/home/hannes/Texte/S/Strauss - Continual Permutations of Action.pdf] application/pdf
# this is the corrupted file

Syntax Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array
:3:rcldb/rcldb.cpp:614::Db::add: docid 1214 added [/home/hannes/Texte/O/Opitz - Ausnahme mit System.pdf|]

Syntax Error: Marked object is wrong type (boolean)
:3:rcldb/rcldb.cpp:614::Db::add: docid 1228 added [/home/hannes/Texte/J/Jameson - Representing Capital A Reading of Volume One.pdf|]

So: extraordinarily good job with the new handler!

medoc writes

Thanks ! I will incorporate the configuration files data in the wiki article.

I think that the above errors all come from pdftotext disagreeing with details of the pdf format.

About wrapup(): it mostly allows doing the same things as metafix() with the difference that it gets the whole data at once, so that it could do things like, for example, deleting a redundant field. But you need a bit more Python programming to use wrapup() than metafix(), which is why I did not document it much (it’s evident for a programmer and difficult to use for a non-programmer).
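
A minimal sketch of what a wrapup() method could look like. It assumes that wrapup() receives the whole metadata of one document as a list of (name, value) pairs and may edit that list in place; this calling convention is an assumption and should be checked against the rclpdf.py actually installed. The rule itself (drop bibtex:date when bibtex:year is present) is only an example of deleting a redundant field.

```python
class MetaFixer(object):
    def metafix(self, nm, txt):
        # Per-field editing: nothing to do in this sketch.
        return txt

    def wrapup(self, metaheaders):
        # ASSUMPTION: metaheaders is a list of (name, value) pairs
        # holding all the fields of one document.
        names = set(nm for nm, _ in metaheaders)
        if 'bibtex:year' in names:
            # Example rule: bibtex:date is redundant when a year exists.
            # Slice assignment edits the caller's list in place.
            metaheaders[:] = [(nm, val) for nm, val in metaheaders
                              if nm != 'bibtex:date']
        # Also return the list, in case the caller expects a result.
        return metaheaders
```

Because the method sees all fields at once, it can make decisions that metafix() cannot, since metafix() only ever sees one field at a time.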

medoc writes

The new page about processing PDF XMP is here: http://www.lesbonscomptes.com/recoll/recoll_XMP/

Johannes_Me writes

Was a pleasure to be a learning junior assistant.

Further questions for the future: does the Output paragraph support JavaScript? If so, I think I found a method to replace strings in the output, but I’m not sure. The goal would be to have these string modifications in the output without destructively editing the meta fields. For current use, the existing version (not the modification below) works like a charm.

The idea is to add functions to the header paragraph. It could look similar to this, which still lacks an if clause (to skip empty values) and doesn’t work yet:

#!javascript

<script type="text/javascript">
function loadpages() {
    // getElementById returns a single element, so only one result is edited.
    var el = document.getElementById("pages");
    // Normalize the page range first, then prepend the prefix.
    var res = el.innerHTML.replace("--", "-");
    el.innerHTML = ", S. " + res;
}
window.onload = function() {
    loadpages();
};
</script>

But if it works, one could add the field like that:

#!html

<span id="pages">%(refpages)</span>

What do you think?

Johannes_Me writes

And I forgot something. My Output paragraph header (to make e.g. links aesthetically more interesting) is as follows, derived from your string from the manual:

#!css

<!-- Custom Header -->

<script type="text/javascript">
  function altRows() {
      var rows = document.getElementsByClassName("rclresult");
      for (var i = 0; i < rows.length; i++) {
          if (i % 2 == 0) {
              rows[i].style.backgroundColor = "#f0f0f0";
          }
      }
  }
  window.onload = function() {
      altRows();
  }
</script>

<style type="text/css">
a:link {
    color: #004070;
    text-decoration: none;
}
a:visited {
    color: #004070;
    text-decoration: none;
}
a:hover {
    color: #0050a0;
    text-decoration: none;
}
a:active {
    color: #005080;
    text-decoration: none;
}
</style>
<!-- End of Custom Header -->

medoc writes

Thanks for the header, I’ll add it to the doc.

Yes, the result list should fully support javascript, so it should be quite possible to perform any editing there. However, I think that some types of data edits make more sense during indexing.

medoc writes

Closing this as we seem to have accomplished what was needed.