The PDF input handler

The PDF format is very important for scientific and technical documentation and for document archival. It has extensive facilities for storing metadata along with the document, and these facilities are actually used in the real world.

As a consequence, the rclpdf.py PDF input handler has more complex capabilities than most other handlers, and it is also more configurable. Specifically, rclpdf.py can automatically use tesseract to perform OCR if the document text is empty, and it can be configured to extract specific metadata tags from an XMP packet and to extract PDF attachments.
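
To make the idea of extracting specific metadata tags from an XMP packet more concrete, here is a minimal sketch in Python. It is not the code used by rclpdf.py: the function name extract_xmp_tags, the NS table and the sample packet are all hypothetical, and only the Python standard library is used. It assumes the packet has already been pulled out of the PDF as a string, and it only handles properties written as XML elements (XMP also allows properties written as attributes of rdf:Description, which this sketch ignores).

```python
import xml.etree.ElementTree as ET

# Well-known XMP namespace URIs, keyed by their conventional prefixes.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}


def extract_xmp_tags(xmp_packet, wanted):
    """Return {tag: text} for each wanted 'prefix:name' tag found.

    Hypothetical helper for illustration only; tags that are absent
    from the packet are simply left out of the result.
    """
    root = ET.fromstring(xmp_packet)
    li_name = "{%s}li" % NS["rdf"]
    found = {}
    for tag in wanted:
        prefix, name = tag.split(":", 1)
        qname = "{%s}%s" % (NS[prefix], name)
        for elem in root.iter(qname):
            # A value is either plain text or a list wrapped in
            # rdf:Alt / rdf:Seq / rdf:Bag, with one rdf:li per item.
            items = [li.text.strip() for li in elem.iter(li_name) if li.text]
            text = ", ".join(items) if items else (elem.text or "").strip()
            if text:
                found[tag] = text
                break
    return found


# Made-up sample packet, reduced to the parts the sketch needs.
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Sample report</rdf:li></rdf:Alt></dc:title>
   <pdf:Keywords>indexing, ocr</pdf:Keywords>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

if __name__ == "__main__":
    print(extract_xmp_tags(SAMPLE, ["dc:title", "pdf:Keywords", "dc:creator"]))
```

Passing the list of wanted tags as a parameter mirrors the configurable nature of the handler: only the tags that were asked for are looked up, and everything else in the packet is ignored.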