[daisy] indexing and searching PDF
Bruno Dumon
bruno at outerthought.org
Wed Jan 23 10:00:59 CST 2008
Hi,
On Wed, 2008-01-23 at 12:45 +0100, Jano Kula wrote:
> Hi,
>
> while browsing the indexstore I've noticed that the PDF text is
> stored in the index file line by line, just as in the PDF file itself. Thus
> hyphenated words are not found while searching. This is probably a
> Lucene thing; should I report it to its developers? Saving the index in
> the indexstore after one global substitution should solve this.
It's not a Lucene problem: Lucene is only the indexing component, and
doesn't concern itself with extracting text from various formats.
I don't know how feasible it would be to merge hyphenated words. If
you're interested in working on this, just search for the class
PDFTextExtractor in the source code.
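To illustrate the kind of substitution being discussed, here is a minimal
sketch, not code from Daisy. The class Dehyphenator and its method are
hypothetical names, and where such a step would hook into PDFTextExtractor
is left open. It assumes that a word is split only when a letter, a hyphen
and a line break are followed by a lowercase letter.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Daisy: rejoins words that were split
// by a hyphen at the end of a line in extracted text.
final class Dehyphenator {
    // A letter, a hyphen, optional trailing blanks, a line break,
    // optional indentation, then a lowercase letter.
    private static final Pattern HYPHEN_BREAK =
            Pattern.compile("(\\p{L})-[ \\t]*\\r?\\n[ \\t]*(\\p{Ll})");

    static String mergeHyphenatedLineBreaks(String text) {
        // Keep the two letters, drop the hyphen and the line break.
        return HYPHEN_BREAK.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String extracted = "Full text index-\ning of hyphen-\nated words";
        // prints: Full text indexing of hyphenated words
        System.out.println(mergeHyphenatedLineBreaks(extracted));
    }
}
```

The obvious weakness is that genuine compounds broken at their hyphen
("well-" / "known") would be merged too, which is part of why I'm not sure
how feasible this is in general.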
>
> Is there any information on document parts in the indexstore?
Nope, each document (all of its parts) is indexed as a whole.
> Imagine
> this minimal example:
>
> document-type: book description
> part-type (daisy-html): annotation
> part-type (PDF): table of contents
> part-type (PDF): sample chapter <--- here is the string
> part-type (PDF): index
>
> If one searches for the string and this compound document is found, I
> can't see a way to find out which of the PDFs matches the string, and
> it is confusing that the user can't find it in the displayed daisy-html. If
> there is information on document parts in the indexstore, some mark
> (an arrow or highlighted link) could mark the part where the string was
> found. Opening just one part would lose information on its context, I
> think. But there might be no information on parts in the indexstore. How
> would you deal with this situation?
>
> And one last small issue with PDFs. When uploading a PDF with Firefox
> on Linux, the file is marked as application/binary and this can't be
> overridden to application/pdf by changing the text in the Mime-type field.
Really? So if you change the mime-type, that change is ignored?
> The document fails to save because it does not conform to the mime-type
> of the document part. Uploading the same file with IE on Windows works.
> Is this solely a browser thing?
>
> Thank you.
>
> Jano
--
Bruno Dumon http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org bruno at apache.org