[daisy] Re: issue with large amount of documents
Bruno Dumon
bruno at outerthought.org
Mon Jun 11 05:05:21 CDT 2007
Thanks for sharing these results. They basically confirm what I assumed;
things stay fast as long as all documents of the query resultset are
cached.
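
The pattern in the timings quoted below (a reload is instant, but going
back to query 1 after running query 2 is slow again) is what a
least-recently-used cache does once the combined resultset no longer
fits. A minimal simulation, not Daisy code: the LRU policy, the class
and function names, and the effective capacity of 10000 documents (for
example under memory pressure) are all assumptions made for
illustration. Only the resultset sizes 7353 and 4757 come from the
quoted measurements.

```python
from collections import OrderedDict


class LRUDocumentCache:
    """Hypothetical LRU document cache that counts slow loads (misses)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.docs = OrderedDict()
        self.misses = 0

    def get(self, doc_id):
        if doc_id in self.docs:
            # Cache hit: mark the document as most recently used.
            self.docs.move_to_end(doc_id)
            return self.docs[doc_id]
        # Cache miss: stands in for a slow load from the repository.
        self.misses += 1
        doc = {"id": doc_id}
        self.docs[doc_id] = doc
        if len(self.docs) > self.capacity:
            # Evict the least recently used document.
            self.docs.popitem(last=False)
        return doc


def run_query(cache, doc_ids):
    """Touch every document of a resultset; return the number of misses."""
    before = cache.misses
    for doc_id in doc_ids:
        cache.get(doc_id)
    return cache.misses - before


query1 = range(0, 7353)      # resultset size of the first query
query2 = range(7353, 12110)  # resultset size of the second query

small = LRUDocumentCache(capacity=10000)  # assumed effective capacity
print(run_query(small, query1))  # cold cache: every document is a miss
print(run_query(small, query1))  # reload: everything is cached
print(run_query(small, query2))  # pushes part of query 1 out
print(run_query(small, query1))  # query 1 misses on every document again
```

In this model the last call misses on every document, not only on the
evicted ones: a sequential scan through an LRU cache keeps evicting the
documents it is about to need. With a capacity of at least 12110 the
same sequence only misses on the cold runs.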
I can't help with concrete fixes for this in the short term. Eventually
we'll need to rethink the design of our query implementation to make it
scale better (possibly by doing all querying using one engine, e.g.
Lucene).
On Fri, 2007-06-08 at 11:59 +0200, Bart Van den Abeele wrote:
> Bruno Dumon wrote:
> > Hi Bart,
> >
> Hi,
>
> > Something that I'm still interested in having feedback on is if you have
> > already tried how things work once all documents are loaded in the
> > cache. Regardless of whether that's an acceptable option for your use
> > case, it would be good to know how much that helps. At least the second
> > execution of a query should go quite fast then? If not, could you
> > provide the timings included in the query results to see what takes the
> > most time?
>
>
> First query:
>
> select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar,
> $sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam,
> $sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where
> (documentType = 'sb_dd_Geboorte' or documentType = 'sb_dd_Huwelijk' or
> documentType = 'sb_dd_BijgevoegdGeboorte')
>
> Second query:
>
> select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar,
> $sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam,
> $sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where
> (documentType = 'sb_dd_Overlijden')
>
> Sum of query 1 and query 2:
>
> select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar,
> $sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam,
> $sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where
> (documentType = 'sb_dd_Geboorte' or documentType = 'sb_dd_Huwelijk' or
> documentType = 'sb_dd_Overlijden' or documentType =
> 'sb_dd_BijgevoegdGeboorte')
>
> Results on 128 MB + cache of 20K docs:
>
> first query = 7353 documents (slow, > 2 min)
> (when I press reload, it reloads almost instantly)
> second query = 4757 documents (slow, > 2 min)
> (when I try to reload query 1, it is slow again)
> 1 + 2 (12110 documents) = out of memory (slow, > 2 min)
>
> Results on 256 MB + cache of 20K docs:
>
> 1 & 2 are slow the first time and return the same number of documents.
> 1 + 2 = pretty fast (30 sec) :)
>
> After this I ran:
> 1 (fast)
> 1 + 2 (fast)
> 2 (slow > 2 min)
> 1 + 2 (slow > 2 min)
> 1 + 2 (slow > 2 min)
>
> Results on 512 MB + cache of 100K docs:
>
> My first request is one for the documents of all types (slow, > 5 min),
> to load them into the cache. All other requests are fast.
>
> Unfortunately this is not a solution for me because the set of documents
> is not yet complete and could go well over 1M.
>
> > I'm not sure what you mean by the sorting of Lucene? You mean the
> > score-based sorting or other sorting? IIRC when there's a fulltext
> > search we keep the order of the documents as returned by lucene, unless
> > there's an explicit order by clause that orders them differently.
>
> Indeed, that is what I mean.
>
> > Indeed, the implementation could be optimized for the case where we
> > can determine beforehand that the ACL allows everyone read access to
> > all documents, and when there's no order-by clause, but again I'm not
> > sure that solves a lot as IMHO in many cases you'll want these features.
>
> If this would speed things up and use less memory, this would be a big help!
>
> Grtz,
> Bart
>
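
For illustration only, the optimization discussed in the quoted text
(skipping per-document work when the ACL is known to give everyone read
access and there is no order-by clause) might take roughly the following
shape. This is a sketch under assumptions, not how Daisy implements
queries: execute_query, load_document, can_read and sort_key are
hypothetical names, and "can_read is None" stands for "the ACL was
determined beforehand to allow everyone to read everything".

```python
from itertools import islice


def execute_query(matching_ids, load_document, can_read=None, sort_key=None):
    """Yield result documents for the given matching document ids.

    can_read=None: ACL known beforehand to allow everyone read access.
    sort_key=None: the query has no order-by clause.
    """
    if can_read is None and sort_key is None:
        # Fast path: no filtering or sorting needed, so documents can be
        # loaded lazily, one at a time, in the order the ids arrive.
        for doc_id in matching_ids:
            yield load_document(doc_id)
        return

    # General path: every matching document must be loaded before the
    # first result can be returned.
    docs = [load_document(doc_id) for doc_id in matching_ids]
    if can_read is not None:
        docs = [doc for doc in docs if can_read(doc)]
    if sort_key is not None:
        docs.sort(key=sort_key)
    yield from docs


loaded = []


def load_document(doc_id):
    """Hypothetical loader that records which documents were fetched."""
    loaded.append(doc_id)
    return {"id": doc_id}


# Taking one page of 10 results on the fast path loads only 10 documents.
page = list(islice(execute_query(range(12110), load_document), 10))
print(len(page), len(loaded))
```

On the general path the same page would require loading all 12110
documents first, which is where the time and memory go when they are
not already cached.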
--
Bruno Dumon http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org bruno at apache.org