[daisy] indexing and searching PDF

Bruno Dumon bruno at outerthought.org
Wed Jan 23 10:00:59 CST 2008


Hi,

On Wed, 2008-01-23 at 12:45 +0100, Jano Kula wrote:
> Hi,
> 
> while browsing the indexstore I've noticed that the text extracted from 
> PDFs is stored in the index line by line, just as it is laid out in the 
> PDF file itself. As a result, words hyphenated across a line break are not 
> found when searching. This is probably a Lucene issue; should I report it 
> to its developers? Applying one global substitution before saving the text 
> to the indexstore should solve this.

It's not a Lucene problem: Lucene is only the indexing component, and
doesn't concern itself with extracting text from various formats.

I don't know how feasible it would be to merge hyphenated words. If
you're interested in working on this, just search for the class
PDFTextExtractor in the source code.
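
As a starting point, here is a minimal sketch of what such a merge could
look like, assuming the extracted text is available as a single string with
the PDF's line breaks preserved. The class and method names (HyphenMerger,
mergeHyphenatedWords) are hypothetical and not part of Daisy, and the real
PDFTextExtractor may structure its output differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Daisy: joins words that were split
// with a hyphen at the end of a line in extracted text.
public class HyphenMerger {

    // A hyphen that directly follows a letter, sits at the end of a line
    // (optional trailing spaces/tabs), and is followed on the next line
    // (optional leading spaces/tabs) by a lowercase letter.
    private static final Pattern LINE_END_HYPHEN =
            Pattern.compile("(?<=\\p{L})-[ \\t]*\\r?\\n[ \\t]*(?=\\p{Ll})");

    public static String mergeHyphenatedWords(String text) {
        // Drop the hyphen and the line break so the two halves become one token.
        return LINE_END_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(mergeHyphenatedWords("hyphen-\nated words are not found"));
    }
}
```

Note that this heuristic cannot tell a line-break hyphen from a genuine one:
a compound such as "well-known" that happens to break at the hyphen would be
merged into "wellknown", so whether that trade-off is acceptable depends on
the documents being indexed.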

> 
> Is there any information on document parts in the indexstore?

Nope, each document (all of its parts) is indexed as a whole.

>  Imagine 
> this minimal example:
> 
> document-type: book description
>    part-type (daisy-html): annotation
>    part-type (PDF): table of contents
>    part-type (PDF): sample chapter      <--- here is the string
>    part-type (PDF): index
> 
> If one searches for the string and this compound document is found, I 
> can't see a way to find out which of the PDFs matches the string, and 
> it is confusing that the user can't find it in the displayed daisy-html. 
> If there is any information on document parts in the indexstore, some 
> mark (an arrow or a highlighted link) could indicate the part where the 
> string was found. Opening just that one part would lose the information 
> on its context, I think. But there might be no information on parts in 
> the indexstore. How would you deal with this situation?
> 
> And one last small issue with PDFs. When uploading a PDF with Firefox 
> on Linux, the file is marked as application/binary, and this can't be 
> overridden to application/pdf by changing the text in the Mime-type field.

Really? So if you change the mime-type, that change is ignored?

> The document then fails to save because it does not conform to the 
> mime-type of the document part. Uploading the same file with IE on 
> Windows works. Is this solely a browser issue?
> 
> Thank you.
> 
> Jano

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org                          bruno at apache.org

