[daisy] Full text search: PDF word boundary problem

COELMONT, Geert Geert.COELMONT at headbird.com
Fri Jul 25 14:52:46 CEST 2008


Hi list,

We're experimenting with storing a lot of PDFs in Daisy, as a way of
archiving our print-outs. We convert the PCL to PDF and then upload the
PDF to Daisy. This works very well.

However, when we do a full text search for text inside a PDF, the
expected document is not returned. When we do a "fuzzy Lucene search"
(e.g. FOO~ instead of FOO), the document does match.

In the short summary text returned by the search, we see that the text
that follows the invoice reference is "glued" to it, e.g. it says
FOOBAR where FOO and BAR are clearly separated in the printed copy.
In the original PCL there are some escape sequences in between FOO and
BAR.  In the uncompressed PDF they are also nicely separated, something
like this:

1.00 g BT /Fo1 10.00 Tf 487.65 552.59 Td 0.000 Tc (FOO) Tj ET 0 g
BT /Fo1 10.00 Tf 300.93 552.59 Td 0.000 Tc (BAR) Tj ET

It doesn't seem to matter whether we upload compressed or uncompressed
PDF: neither returns the document for the "FOO" search term, and both
show "FOOBAR" in the short summary.

Who performs the indexing of the text inside a PDF document? I guess
it's Lucene, so this would be a Lucene issue?
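
One way to narrow this down: Lucene only indexes the plain text it is
handed, so the PDF text extraction step in front of it is a likely
suspect. The Python sketch below is only an illustration under
assumptions, not Daisy's or Lucene's actual code. It assumes an
extractor that concatenates the strings of consecutive text objects in
content-stream order, and compares that with a hypothetical variant
that treats every BT ... ET object as a separate word, ordered by its
Td position. All function names are made up for the example, and the
regular expressions only handle the simple operators that appear in the
fragment above.

```python
import re

# The two text objects from the uncompressed PDF quoted above.
CONTENT = (
    "1.00 g BT /Fo1 10.00 Tf 487.65 552.59 Td 0.000 Tc (FOO) Tj ET 0 g\n"
    "BT /Fo1 10.00 Tf 300.93 552.59 Td 0.000 Tc (BAR) Tj ET"
)

# Simplified patterns: one Td and one Tj per text object, no escaped
# parentheses inside strings. Real content streams need a real parser.
TEXT_OBJECT = re.compile(r"BT\b(.*?)\bET\b", re.S)
TD = re.compile(r"([-+]?\d*\.?\d+)\s+([-+]?\d*\.?\d+)\s+Td\b")
TJ = re.compile(r"\((.*?)\)\s*Tj\b")


def text_objects(stream):
    """Return a list of (x, y, text) tuples, one per BT ... ET object."""
    found = []
    for body in TEXT_OBJECT.findall(stream):
        pos, show = TD.search(body), TJ.search(body)
        if pos and show:
            found.append((float(pos.group(1)), float(pos.group(2)),
                          show.group(1)))
    return found


def naive_extract(stream):
    """Concatenate strings in stream order with no separator (assumed bug)."""
    return "".join(text for _, _, text in text_objects(stream))


def position_aware_extract(stream):
    """Treat each text object as its own word, top-to-bottom, left-to-right."""
    ordered = sorted(text_objects(stream), key=lambda o: (-o[1], o[0]))
    return " ".join(text for _, _, text in ordered)


if __name__ == "__main__":
    print(naive_extract(CONTENT))
    print(position_aware_extract(CONTENT))
```

The naive variant reproduces the "FOOBAR" seen in the search summary.
The position-aware variant keeps FOO and BAR as separate words, and
puts BAR first because its x position (300.93) is to the left of FOO
(487.65) on the same line. Splitting on every text object is a
deliberate simplification; a real extractor would compare the gap
between positions with the glyph widths before inserting a space.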

Any ideas/suggestions on what to do before contacting the Lucene guys?

Thanks
Geert
