[daisy] Re: issue with large amount of documents
Bart Van den Abeele
bvdaspam at gmail.com
Fri Jun 8 04:59:17 CDT 2007
Bruno Dumon wrote:
> Hi Bart,
>
Hi,
> Something that I'm still interested in having feedback on is if you have
> already tried how things work once all documents are loaded in the
> cache. Regardless of whether that's an acceptable option for your use
> case, it would be good to know how much that helps. At least the second
> execution of a query should go quite fast then? If not, could you
> provide the timings included in the query results to see what takes the
> most time?
First query:
select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar,
$sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam,
$sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where
(documentType = 'sb_dd_Geboorte' or documentType = 'sb_dd_Huwelijk' or
documentType = 'sb_dd_BijgevoegdGeboorte')
Second query:
select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar,
$sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam,
$sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where
(documentType = 'sb_dd_Overlijden')
Sum of query 1 and query 2:
select documentType, $sb_dd_gemeente, $sb_dd_aktenummer, $sb_dd_jaar,
$sb_dd_datum, $sb_dd_pictures, $sb_dd_betrokkenen_voornaam,
$sb_dd_betrokkenen_achternaam, $sb_dd_betrokkenen_hoedanigheid where
(documentType = 'sb_dd_Geboorte' or documentType = 'sb_dd_Huwelijk' or
documentType = 'sb_dd_Overlijden' or documentType =
'sb_dd_BijgevoegdGeboorte')
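The three queries differ only in the list of document types in the where
clause. A minimal sketch of how they could be generated, assuming a
hypothetical helper build_query (my own name, not part of Daisy); the field
and document type names are the ones from the queries above:

```python
# Hypothetical helper, not part of Daisy: it only assembles the query strings.
FIELDS = [
    "documentType",
    "$sb_dd_gemeente",
    "$sb_dd_aktenummer",
    "$sb_dd_jaar",
    "$sb_dd_datum",
    "$sb_dd_pictures",
    "$sb_dd_betrokkenen_voornaam",
    "$sb_dd_betrokkenen_achternaam",
    "$sb_dd_betrokkenen_hoedanigheid",
]

QUERY1_TYPES = ["sb_dd_Geboorte", "sb_dd_Huwelijk", "sb_dd_BijgevoegdGeboorte"]
QUERY2_TYPES = ["sb_dd_Overlijden"]


def build_query(doc_types):
    """Return a select statement matching any of the given document types."""
    conditions = " or ".join(f"documentType = '{t}'" for t in doc_types)
    return f"select {', '.join(FIELDS)} where ({conditions})"


query1 = build_query(QUERY1_TYPES)
query2 = build_query(QUERY2_TYPES)
# The combined query is the union of both type lists.
combined = build_query(QUERY1_TYPES + QUERY2_TYPES)
```

The combined query lists the types in a different order than the one I
typed by hand, but since the conditions are joined with "or" it selects the
same documents.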
Results on 128 MB + cache of 20K docs:
first query = 7353 documents (slow, > 2 min)
(when I push reload, it reloads almost instantly)
second query = 4757 documents (slow, > 2 min)
(when I try to reload query 1, it is slow again)
1 + 2 (12110 documents) = out of memory (slow, > 2 min)
Results on 256 MB + cache of 20K docs:
1 & 2 are slow the first time and return the same number of documents as above.
1 + 2 = pretty fast (30 sec) :)
After this I ran, in order:
1 (fast)
1 + 2 (fast)
2 (slow, > 2 min)
1 + 2 (slow, > 2 min)
1 + 2 (slow, > 2 min)
Results on 512 MB + cache of 100K docs:
My first request is a query for the documents of all types (slow, > 5 min),
to load them into the cache. All other requests are fast.
Unfortunately this is not a solution for me, because the set of documents
is not yet complete and could go well over 1M.
> I'm not sure what you mean with the sorting of lucene? You mean the
> score-based sorting or other sorting? IIRC when there's a fulltext
> search we keep the order of the documents as returned by lucene, unless
> there's an explicit order by clause that orders them differently.
Indeed, that is what I mean.
> Indeed, the implementation could be optimized for the case we can on
> beforehand determine that the ACL allows everyone read access to all
> documents, and when there's no order-by clause, but again I'm not sure
> that solves a lot as IMHO in many cases you'll want these features.
If this would speed things up and use less memory, this would be a big help!
Regards,
Bart