espinosa writes

Two consecutive headings are joined into one word (no space between them), which makes some terms from the affected headings unsearchable. This looks like a text extractor (filter) bug.

Example (please see the attached file): {{{ 30.4. - pondělí 1.5.2012 – úterý aa }}}

Preview (how Recoll sees it): {{{ 30.4. - pondělí1.5.2012 – úterý aa }}}

As a result, nothing is found when searching for 1.5.2012. Searching by date otherwise works well, and I use it quite often.

{{{
espinosa@espinosadell: > rpm -qa | grep -i recoll
recoll-1.17.0-4.1.i586
recoll-runner-0.3-4.2.i586
espinosa@espinosadell: > rpm -qa | grep -i xapian
libxapian22-1.2.8-16.1.i586
}}}

Relevant part of content.xml:
{{{
<text:h text:style-name="Heading_20_1" text:outline-level="1"/
><text:p text:style-name="Text_20_body"/
><text:h text:style-name="Heading_20_2" text:outline-level="2"
>30.4. - pondělí</text:h
><text:h text:style-name="Heading_20_2" text:outline-level="2"
>1.5.2012 – úterý</text:h
><text:p text:style-name="P1"
>aa</text:p
></office:text
></office:body
></office:document-content
>
}}}
(looks good, doesn’t it?)
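To illustrate the kind of fix involved, here is a minimal sketch in Python. It is not Recoll's actual filter, and its behaviour is simplified: the names `extract_text`, `separate_blocks`, and `SAMPLE` are made up for this example, and `SAMPLE` is a trimmed stand-in for the content.xml fragment above. The idea is that a text extractor has to emit a separator after each block-level element (`text:h` and `text:p`), because the XML itself contains no whitespace between them.

```python
import xml.etree.ElementTree as ET

# Standard OpenDocument namespaces.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Trimmed stand-in for the reported fragment (assumption: styles omitted).
SAMPLE = (
    f'<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    '<text:h text:outline-level="2">30.4. - pondělí</text:h>'
    '<text:h text:outline-level="2">1.5.2012 – úterý</text:h>'
    '<text:p>aa</text:p>'
    '</office:text>'
)

# Elements that should end with a separator in the extracted text.
BLOCK_TAGS = {f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p"}


def extract_text(xml_source, separate_blocks=True):
    """Return the plain text of an ODF fragment.

    With separate_blocks=True a newline is emitted after every heading
    and paragraph; with False the element texts are simply concatenated,
    which reproduces the joined-words symptom.
    """
    out = []

    def walk(elem):
        if elem.text:
            out.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                out.append(child.tail)
        if separate_blocks and elem.tag in BLOCK_TAGS:
            out.append("\n")

    walk(ET.fromstring(xml_source))
    return "".join(out)


if __name__ == "__main__":
    print(extract_text(SAMPLE).split())
    print(extract_text(SAMPLE, separate_blocks=False).split())
```

With the separator in place, `1.5.2012` comes out as its own term; without it, the date is glued to the end of the previous heading and cannot be matched as a word.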

Please fix soon. Thanks, Espinosa

medoc writes

Hi, could you please try the following patch and tell me whether it fixes the problem?

Cheers, jf

medoc writes

Hi, a simpler option than applying the patch: the updated filter is available on the Recoll web site. Retrieve it from the "updated open document filter" section, copy it to /usr/share/recoll/filters, and make it executable. I checked, and it handles your attached file properly.
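The copy-and-make-executable step can be sketched as follows. This is only an illustration: `install_filter` is a hypothetical helper, the name of the downloaded filter file is not given in this thread and must be supplied by the caller, and writing to /usr/share/recoll/filters normally requires root privileges.

```python
import shutil
import stat
from pathlib import Path

# Destination directory as stated in the post above.
DEFAULT_FILTERS_DIR = Path("/usr/share/recoll/filters")


def install_filter(downloaded, filters_dir=DEFAULT_FILTERS_DIR):
    """Copy a downloaded filter script into filters_dir and mark it executable.

    `downloaded` is the path of the file retrieved from the web site
    (its name is an input here, not something this sketch assumes).
    Returns the path of the installed copy.
    """
    target = Path(shutil.copy(downloaded, filters_dir))
    mode = target.stat().st_mode
    # Add execute permission for user, group, and others (like chmod a+x).
    target.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target
```

After installing the filter, the affected documents may need to be reindexed before a search for 1.5.2012 finds them.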