XMP fields extraction

In Recoll 1.23.2 and later, the rclpdf.py script can extract XMP metadata fields by executing the pdfinfo command (usually provided by the poppler-utils package). Extraction is controlled by the pdfextrameta configuration variable, which specifies which tags to extract and, optionally, how to rename them.
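
The value syntax of pdfextrameta is not spelled out in this section. As an illustration only, the sketch below assumes a whitespace-separated list of tags in which an optional "|newname" suffix renames the field; both this syntax and the parse_extrameta() helper are assumptions made for the example, not the handler's actual code, and should be checked against the configuration reference.

```python
def parse_extrameta(value):
    """Split a pdfextrameta-style value into (xmp_tag, field_name) pairs."""
    pairs = []
    for item in value.split():
        # ASSUMPTION: "tag|newname" renames the field, a bare "tag" keeps its name.
        tag, sep, newname = item.partition('|')
        pairs.append((tag, newname if sep and newname else tag))
    return pairs
```

Under these assumptions, a value such as "bibtex:location|location bibtex:pages" would yield one renamed field and one field kept under its original tag name.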

The pdfextrametafix variable can designate a file containing Python code that edits the metadata fields. It is available in Recoll 1.23.3 and later; version 1.23.2 has equivalent code inside the handler script. Example:

    import sys
    import re

    class MetaFixer(object):
        def __init__(self):
            pass

        def metafix(self, nm, txt):
            if nm == 'bibtex:pages':
                txt = re.sub(r'--', '-', txt)
            elif nm == 'someothername':
                # do something else
                pass
            elif nm == 'stillanother':
                # etc.
                pass

            return txt

        def wrapup(self, metaheaders):
            pass

If the 'metafix()' method is defined, it is called for each metadata field. A new MetaFixer object is created for each PDF document, so the object can keep state, for example to eliminate duplicate values. If the 'wrapup()' method is defined, it is called at the end of XMP field processing with the whole metadata as its parameter, in the form of an array of '(nm, val)' pairs. This allows an alternate approach for editing, adding, or deleting fields.
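
As a rough illustration of how the two methods can work together, here is a minimal sketch of a fixer that normalizes values in 'metafix()' and removes duplicate pairs in 'wrapup()'. It assumes that the handler reads the 'metaheaders' list back after 'wrapup()' returns, so the list is edited in place; that behavior is an assumption and should be verified against the handler script. The simulate_handler() function is a hypothetical stand-in for the handler's call sequence, useful only for trying the class outside of Recoll.

```python
import re


class MetaFixer(object):
    """Example fixer: normalizes page ranges and drops duplicate fields."""

    def __init__(self):
        # One instance is created per PDF document, so this set only tracks
        # the (name, value) pairs seen in the current document.
        self.seen = set()

    def metafix(self, nm, txt):
        # Called once per extracted field; returns the edited text.
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        return txt.strip()

    def wrapup(self, metaheaders):
        # ASSUMPTION: the caller uses the list after this call, so it is
        # modified in place rather than replaced by a returned value.
        unique = []
        for nm, val in metaheaders:
            if (nm, val) not in self.seen:
                self.seen.add((nm, val))
                unique.append((nm, val))
        metaheaders[:] = unique


def simulate_handler(fields):
    """Hypothetical driver mimicking the calls described above (testing only)."""
    fixer = MetaFixer()
    metaheaders = [(nm, fixer.metafix(nm, txt)) for nm, txt in fields]
    fixer.wrapup(metaheaders)
    return metaheaders
```

Calling simulate_handler() with a list of '(nm, val)' pairs that contains repeated values returns the list with page ranges normalized and each pair kept only once, in its original order.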