[daisy] [discussion] Paging of documents (without ever loading them all into memory) instead of maxClauseCount error

Bruno Dumon bruno at outerthought.org
Thu May 3 09:33:38 CDT 2007


On Thu, 2007-05-03 at 15:35 +0200, Geoffrey De Smet wrote:
> In our application a user might enter "d*" as a query,
> which matches 5000+ documents and could take minutes.
> 
> If the user sees the first 20 matches quickly,
> and is able to move on to matches 20 through 40,
> he'd probably never look at the 4960+ other documents.
> 
> 
> Faceted browsing currently has some "paging" support,
> but it looks like Daisy fetches all the results into memory.

All queries support paging (in Daisy 2.0): see the query options
chunk_offset and chunk_length:
http://cocoondev.org/daisydocs-2_0/373-cd/9-cd.html

Retrieving only a chunk can make a huge performance difference when
there are a lot of query results, since a lot of time is spent in
building up the resultset XML on the server, transporting it, and
parsing it on the client. The smaller the resultset, the faster.

In fact, once all relevant documents are in Daisy's cache (make sure to
configure it large enough, the default is 10,000 docs), and you limit the
results to a small chunk, query performance will probably be reasonable.
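
To illustrate what a chunk is, here is a minimal Java sketch that slices an
already computed result list. ChunkSketch and its chunk method are
hypothetical names, not part of the Daisy API, and the 1-based offset is an
assumption; check the documentation linked above for the actual convention.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of the Daisy API.
class ChunkSketch {
    /**
     * Returns the slice of results that starts at chunkOffset (assumed to be
     * 1-based here) and contains at most chunkLength entries.
     */
    static <T> List<T> chunk(List<T> results, int chunkOffset, int chunkLength) {
        if (chunkOffset < 1 || chunkLength < 0) {
            throw new IllegalArgumentException(
                    "chunkOffset must be >= 1 and chunkLength must be >= 0");
        }
        // Clamp both ends so an offset past the end yields an empty chunk.
        int from = Math.min(chunkOffset - 1, results.size());
        int to = (int) Math.min((long) from + chunkLength, results.size());
        return results.subList(from, to);
    }
}
```

Only the entries in the returned slice would then need to be serialized,
transported, and parsed.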

> 
> I'd like to question whether it's necessary to fetch them all into memory:
> - The MySQL JDBC driver has scrollable support (which is paging support at
> the MySQL level). If MySQL has the correct indexes, it doesn't need to load
> all results of a query into memory to deliver the first 20.
> - Hibernate-JPA supports paging if you're using a scrollable JDBC driver.
> - Does Lucene have paging/scrolling support?
> 
> - Hibernate-Search (which combines JPA with Lucene) apparently already
> pulls off real scrollable paging:
>    luceneSession.createLuceneQuery(luceneQuery).scroll()
> Somehow they seem to have figured out how to combine Lucene WHERE clauses
> and MySQL WHERE clauses without loading all results into memory.

A pointer would have been useful. I found this:

http://www.hibernate.org/hib_docs/search/reference/en/html_single/

but I can't find any mention of the ability to combine fulltext and SQL
searches.

> 
> - Of course, daisy would also need to combine ACL into it...
> 

ACL isn't the biggest problem: we could make the implementation smarter
so that it gradually fetches more records if too many are filtered out by
the ACL.
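
A rough Java sketch of that idea, purely as an illustration: keep pulling
fixed-size batches from the backend until enough rows have passed the access
check. AclPagingSketch, fetchReadable, fetchBatch and mayRead are all
hypothetical names; nothing like this exists in Daisy today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names exist in Daisy.
class AclPagingSketch {
    /**
     * Skips the first `skip` readable rows, then collects up to `wanted`
     * readable rows, fetching more batches whenever the ACL check has
     * filtered out too many.
     *
     * fetchBatch.apply(offset, size) is assumed to return the rows
     * [offset, offset + size) of the unfiltered result (0-based), and an
     * empty list once the backend is exhausted.
     */
    static <T> List<T> fetchReadable(BiFunction<Integer, Integer, List<T>> fetchBatch,
                                     Predicate<T> mayRead,
                                     int skip, int wanted, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<T> page = new ArrayList<>();
        int offset = 0;   // position in the unfiltered backend result
        int skipped = 0;  // readable rows that belong to earlier pages
        while (page.size() < wanted) {
            List<T> batch = fetchBatch.apply(offset, batchSize);
            if (batch.isEmpty()) {
                break; // nothing left to fetch
            }
            for (T row : batch) {
                if (!mayRead.test(row)) {
                    continue; // filtered out by the ACL
                }
                if (skipped < skip) {
                    skipped++;
                    continue;
                }
                page.add(row);
                if (page.size() == wanted) {
                    break;
                }
            }
            offset += batch.size();
        }
        return page;
    }
}
```

Note that this only works when the backend already returns rows in their
final order.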

A more problematic thing, as far as I can see right now, is the
"order by": since the sorting is also performed by Daisy, the only
solution is to fetch all results, because the last result returned by the
database or Lucene might be the one that sorts first. For queries without
an order-by clause, things could be optimized, but in most cases you
probably want to order the results.
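
The order-by problem can be shown with a small Java sketch. OrderBySketch and
its two methods are hypothetical names used only for this illustration; they
are not Daisy code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not Daisy code.
class OrderBySketch {
    /** Wrong for ordered queries: cuts off the backend rows first, then sorts only that prefix. */
    static <T> List<T> truncateThenSort(List<T> backendRows, Comparator<T> order, int limit) {
        List<T> prefix = new ArrayList<>(
                backendRows.subList(0, Math.min(limit, backendRows.size())));
        prefix.sort(order);
        return prefix;
    }

    /** Correct: sorts every row, then returns the first `limit` rows of the sorted list. */
    static <T> List<T> sortThenTruncate(List<T> backendRows, Comparator<T> order, int limit) {
        List<T> sorted = new ArrayList<>(backendRows);
        sorted.sort(order);
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

If the backend returns the rows "delta", "charlie", "bravo", "alpha" and the
limit is 2, truncateThenSort gives "charlie", "delta", whereas sorting by
name over all rows gives "alpha", "bravo".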

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org                          bruno at apache.org


