[daisy] Fulltext search enhancements

Bruno Dumon bruno at outerthought.org
Wed Aug 2 03:09:54 CDT 2006


On Wed, 2006-08-02 at 10:35 +1000, Nick dos Remedios wrote:
> On 02/08/2006, at 2:58 AM, Bruno Dumon wrote:
> > Also in Daisy 2.0, there are now some important enhancements to the
> > full text search (implemented by Paul, I'm just writing this since he
> > seems to have forgotten it ;-) ):
> >
> >  * it is possible to retrieve relevant fragments from the found
> > documents, with matching query words highlighted. This is simply done
> > using a function in the query language, taking as a parameter the
> > number of fragments you want to retrieve.
> >
> >  * the score of the fulltext search is now accessible via an
> > identifier in the query language
> >
> >  * it is possible to retrieve chunks from query results (to show
> > paginated results; this was previously only available for the faceted
> > search)
> >
> >  * the fulltext search page in the wiki has been enhanced to make
> > use of these new features
> >
> >  * upgraded the Lucene engine to version 2.0
> >
> >  * the "too many open files" problem mentioned recently on this
> > list is also solved (in the 1.5 branch as well)
> >
> > For people working on svn trunk: these changes mean you'll have to
> > rebuild your fulltext index: delete the content of the indexstore
> > directory and trigger rebuilding via the JMX console.
> >
> > While I'm at it, another (unrelated) change (which was needed for the
> > import/export tools): using the remote Java API no longer requires
> > that you have a user with the Administrator role (for the "cache
> > user"), which is also a rather important improvement.
> >
> > -- 
> > Bruno Dumon                             http://outerthought.org/
> 
> One feature I'd like to be able to see is the ability to search pages
> for a fragment of (Daisy) HTML.
> 
> I apologise if this is already possible; I have tried to find such a
> feature in the past but was not successful. It seems (to me) that
> Lucene only indexes the rendered text, not the HTML code(?).
> 
> My current workaround is to search directly against the repo server
> blobstore directory.

Being able to search the exact content of the repository has long been
on my wishlist too. The way I see it, such a search would simply scan
document by document, part by part, without using an index.
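
To make the idea concrete, here is a minimal sketch of such an
index-free scan. It assumes the part data can be read as plain files
under some directory (much like the blobstore workaround mentioned
above) and that the content is UTF-8 text. The class and method names
are hypothetical; this is not Daisy API, just an illustration of the
brute-force approach.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Daisy: scans raw files for an exact fragment.
class ExactContentScan {

    // Returns, in sorted order, the regular files under root whose content
    // (decoded as UTF-8, an assumption) contains the given fragment verbatim.
    static List<Path> findContaining(Path root, String fragment) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> contains(p, fragment))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    // Reads one file fully into memory and does a plain substring test.
    private static boolean contains(Path file, String fragment) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return text.contains(fragment);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage: java ExactContentScan <directory> <fragment>
    public static void main(String[] args) throws IOException {
        for (Path hit : findContaining(Paths.get(args[0]), args[1])) {
            System.out.println(hit);
        }
    }
}
```

Every file is read on every query, so the cost grows linearly with the
size of the repository; that is the price of not maintaining an index.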

For the special case of Daisy-HTML or other XML formats, it would also
be possible to maintain some sort of XML index and search that.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org                          bruno at apache.org


