[daisy] Re: issue with large amount of documents

Bart Van den Abeele bvdaspam at gmail.com
Fri Jun 8 04:59:17 CDT 2007


Bruno Dumon wrote:
 > Hi Bart,
 >
Hi,

 > Something that I'm still interested in having feedback on is if you have
 > already tried how things work once all documents are loaded in the
 > cache. Regardless of whether that's an acceptable option for your use
 > case, it would be good to know how much that helps. At least the second
 > execution of a query should go quite fast then? If not, could you
 > provide the timings included in the query results to see what takes the
 > most time?


First query:

select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar, 
$sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam, 
$sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where 
(documentType = 'sb_dd_Geboorte' or documentType = 'sb_dd_Huwelijk' or 
documentType = 'sb_dd_BijgevoegdGeboorte')

Second query:

select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar, 
$sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam, 
$sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where 
(documentType = 'sb_dd_Overlijden')

Sum of query 1 and query 2:

select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar, 
$sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam, 
$sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where 
(documentType = 'sb_dd_Geboorte' or documentType = 'sb_dd_Huwelijk' or 
documentType = 'sb_dd_Overlijden' or documentType = 
'sb_dd_BijgevoegdGeboorte')
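
The third query is just the first two with their documentType conditions combined. A minimal sketch of how the three could be generated from one field list; build_query is a hypothetical helper written for this mail, not part of the Daisy API, and it does no escaping of quotes in the type names:

```python
# Fields selected by all three queries above.
FIELDS = [
    "documentType",
    "$sb_dd_gemeente",
    "$sb_dd_aktenummer",
    "$sb_dd_jaar",
    "$sb_dd_datum",
    "$sb_dd_pictures",
    "$sb_dd_betrokkenen_voornaam",
    "$sb_dd_betrokkenen_achternaam",
    "$sb_dd_betrokkenen_hoedanigheid",
]

QUERY_1_TYPES = ["sb_dd_Geboorte", "sb_dd_Huwelijk", "sb_dd_BijgevoegdGeboorte"]
QUERY_2_TYPES = ["sb_dd_Overlijden"]


def build_query(document_types):
    """Hypothetical helper: build the query text for a list of document types."""
    field_list = ", ".join(FIELDS)
    conditions = " or ".join(
        "documentType = '{}'".format(t) for t in document_types
    )
    return "select {} where ({})".format(field_list, conditions)


query_1 = build_query(QUERY_1_TYPES)
query_2 = build_query(QUERY_2_TYPES)
query_sum = build_query(QUERY_1_TYPES + QUERY_2_TYPES)
```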

Results on 128 MB + cache of 20K docs:

first query = 7353 documents (slow, > 2 min)
(when I push reload, it reloads almost instantly)
second query = 4757 documents (slow, > 2 min)
(when I then try to reload query 1, it is slow again)
1 + 2 (12110 documents) = out of memory (slow, > 2 min)
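
This pattern (query 1 fast on reload, slow again once query 2 has run) is what I would expect if the cache can effectively hold fewer documents than the two result sets together. A toy model of that, assuming a plain least-recently-used policy and a made-up capacity of 8000; I do not know how the Daisy cache actually evicts, so this only illustrates the effect:

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache keyed by document id."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def load(self, doc_id):
        """Return True on a cache hit, False on a miss."""
        if doc_id in self.entries:
            self.entries.move_to_end(doc_id)  # mark as most recently used
            return True
        self.entries[doc_id] = object()  # stand-in for a loaded document
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return False


def run_query(cache, doc_ids):
    """Touch every document of a result set; return the number of misses."""
    return sum(0 if cache.load(doc_id) else 1 for doc_id in doc_ids)


# Result-set sizes taken from the measurements above.
query_1_docs = range(0, 7353)
query_2_docs = range(7353, 7353 + 4757)

cache = LRUCache(capacity=8000)  # hypothetical effective capacity

first_run = run_query(cache, query_1_docs)   # cold cache: every document misses
reloaded = run_query(cache, query_1_docs)    # everything is cached now
second_run = run_query(cache, query_2_docs)  # pushes query 1 documents out
again = run_query(cache, query_1_docs)       # query 1 has to be loaded again
```

In this model the reload of query 1 has no misses at all, while running it again after query 2 misses on every document, because a sequential pass over a set larger than the remaining space keeps evicting entries just before they are needed.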

Results on 256 MB + cache of 20K docs:

1 and 2 are slow the first time and return the same number of documents.
1 + 2 = pretty fast (30 sec) :)

After this I ran, in this order:
1 (fast)
1 + 2 (fast)
2 (slow > 2 min)
1 + 2 (slow > 2 min)
1 + 2 (slow > 2 min)

Results on 512 MB + cache of 100K docs:

My first request is one that selects the documents of all types (slow, > 5 min), to 
load them into the cache. All other requests are fast.

Unfortunately this is not a solution for me, because the set of documents 
is not yet complete and could go well over 1M.

 > I'm not sure what you mean with the sorting of lucene? You mean the
 > score-based sorting or other sorting? IIRC when there's a fulltext
 > search we keep the order of the documents as returned by lucene, unless
 > there's an explicit order by clause that orders them differently.

Indeed, that is what I mean.

 > Indeed, the implementation could be optimized for the case we can on
 > beforehand determine that the ACL allows everyone read access to all
 > documents, and when there's no order-by clause, but again I'm not sure
 > that solves a lot as IMHO in many cases you'll want these features.

If this would speed things up and use less memory, this would be a big help!
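
To make sure we mean the same thing, here is how I picture that optimization. This is only a sketch of the idea, not Daisy code: query_results, load_document, can_read and the two flags are names I made up, and the real implementation surely looks different.

```python
def query_results(matching_ids, load_document, acl_allows_everyone,
                  order_by=None, can_read=lambda doc: True):
    """Hypothetical sketch: yield the documents for a list of matching ids."""
    if acl_allows_everyone and order_by is None:
        # Fast path: no ACL check and no sorting needed, so documents can be
        # handed out one at a time without keeping the whole set in memory.
        for doc_id in matching_ids:
            yield load_document(doc_id)
        return

    # General path: load everything, filter on read access, then sort.
    documents = [load_document(doc_id) for doc_id in matching_ids]
    documents = [doc for doc in documents if can_read(doc)]
    if order_by is not None:
        documents.sort(key=order_by)
    for doc in documents:
        yield doc


# Usage with a stand-in loader that records which ids were loaded.
loaded_ids = []


def load_document(doc_id):
    loaded_ids.append(doc_id)
    return {"id": doc_id}


results = query_results([3, 1, 2], load_document, acl_allows_everyone=True)
first_document = next(results)  # only one document has been loaded so far
```

On the fast path only the document currently being returned has to be loaded, which is where I would hope to see the memory saving.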

Grtz,
Bart




