[daisy] issue with large amount of documents

Bart Van den Abeele bvda at schaubroeck.be
Wed Jun 6 01:16:55 CDT 2007


Thx for the info. I have some additional questions.

We work with a lot of documents: 100,000 is no exception, and even
1,000,000 is possible. A user can start a query and, if he isn't
careful, end up querying for all of them. At the moment I have no way
to warn the user that the action he started will result in an error.

My first question is whether it is possible to get the list no matter
how long it is. At the moment such a query results in an out-of-memory
exception. Would it be possible to retrieve only the list with its
fields, without loading the complete documents, so that it takes less
memory and less time? Or should I perhaps go directly to the database
instead of working via the API?
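
One way to keep client memory bounded is to walk the result in
fixed-size chunks instead of asking for the whole list at once. The
sketch below shows only the paging loop. `ChunkSource` is a hypothetical
stand-in for whatever repository call returns one slice of a result set
(the reply quoted below mentions chunking as new in Daisy 2.0); it is
not the actual Daisy API, and `ChunkedQuery` and `forEachChunked` are
illustrative names.

```java
import java.util.List;
import java.util.function.Consumer;

final class ChunkedQuery {
    /** Hypothetical stand-in for a repository call returning one slice of a result set. */
    interface ChunkSource<T> {
        List<T> fetch(int offset, int length);
    }

    private ChunkedQuery() {}

    /**
     * Passes every row to the handler while holding at most chunkSize rows
     * in memory at a time. Returns the total number of rows processed.
     */
    static <T> int forEachChunked(ChunkSource<T> source, int chunkSize, Consumer<T> handler) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int offset = 0;
        while (true) {
            List<T> chunk = source.fetch(offset, chunkSize);
            for (T row : chunk) {
                handler.accept(row);
            }
            offset += chunk.size();
            if (chunk.size() < chunkSize) {
                return offset; // a short (or empty) chunk marks the end of the results
            }
        }
    }
}
```

This only limits what the client holds at once. How much work the
server does per chunk depends on the repository and is not addressed by
the sketch.
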

My other question is how I can handle this situation. At the moment I
can't detect the out-of-memory exception, because when I use the API to
talk to the repository I don't get that exception itself; I get a
DaisyPropagatedException. Could this be wrapped in a specific
exception, so I can handle it properly? Or should I compare the string
DaisyPropagatedException.getRemoteClassName() to
"java.lang.OutOfMemoryError"?
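
A minimal sketch of the string-comparison approach, assuming that
getRemoteClassName() returns the fully qualified class name of the
remote throwable as a String (the method name suggests this, but it is
not confirmed here). `RemoteErrors` and `isRemoteOutOfMemory` are
hypothetical names introduced for illustration. The JVM class to
compare against is java.lang.OutOfMemoryError.

```java
final class RemoteErrors {
    // Fully qualified name of the JVM error thrown when the heap is exhausted.
    static final String OOM_CLASS = "java.lang.OutOfMemoryError";

    private RemoteErrors() {}

    /**
     * Returns true if the given remote class name denotes an out-of-memory
     * error. A null name (no remote class known) yields false.
     */
    static boolean isRemoteOutOfMemory(String remoteClassName) {
        return OOM_CLASS.equals(remoteClassName);
    }
}
```

In a catch block for DaisyPropagatedException, the value of
e.getRemoteClassName() would be passed to this helper, and the client
could then show a "result too large" message instead of a generic
error.
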

Are there other suggestions for handling this situation?

Thx,
Bart



Bruno Dumon wrote:
> Hi,
>
> I think you should first look at how much memory the repository server
> has available, which you can find in the config of the wrapper (our
> default scripts give the repository 128MB).
>
> Creation of single documents indeed doesn't take much memory, but if the
> VM uses the default memory limit (64 MB) it might run out of
> memory due to the document cache and ActiveMQ.
>
> The decision to kill the VM on an OutOfMemoryError seems to be made
> by the wrapper; again, I'd look at the config of the wrapper.
> Nevertheless, in some cases this is indeed the best decision.
>
> The slow first-time execution of queries which return large datasets is
> because the documents still need to be loaded into the cache. Also,
> building a result set of 6000 documents can take quite some time because
> of the XML construction. Each query result contains timing information
> about how much time each part of the query execution took, which can
> help to see where most time was spent. When interpreting this data, you
> should however be aware that:
>  - the timing for the ACL filtering includes the time for retrieving the
> document, which is negligible if cached but otherwise not.
>  - the timing for building the result includes the time to retrieve the
> live version of the document; the same remark about caching applies.
>
> (when executing a query in the wiki, quite some time will also be spent
> in XSL etc. It might be that the repository only takes 1 minute of those
> 2 minutes, meaning a throughput of about 100 uncached docs per second,
> which is not so bad)
>
> The limit clause won't [always] help indeed, since applying the limit is
> only done at the end of the query execution, before the search result is
> built. So if you select e.g. 6000 documents with a 'limit 50' clause, it
> will first ACL-filter the 6000 documents, then sort them according to
> the order by clause, and only then apply the limit (this could be
> optimized for the case where there is no order by clause, but I guess in
> most cases there will be one anyway). The limit clause does, however,
> avoid building a large XML result, which might be one of the largest
> time-consumers when all documents are already in the cache.
>
> For optimal performance, especially if all documents in the repository
> might be frequently accessed, you need to make the cache as large as the
> number of document variants in the repository. The cache size is
> configured in myconfig.xml and is 10,000 by default. The JVM memory
> should be configured correspondingly. I roughly calculate 100 MB per
> 10,000 docs, though half of that (or less) often suffices. (of course,
> there should be enough physical memory to match that, not forgetting
> that the OS and other apps also need memory)
>
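
The rule of thumb above (roughly 100 MB of heap per 10,000 cached
documents) can be turned into a quick upper estimate for the document
counts mentioned in this thread. This is only a sketch of the
arithmetic; `CacheSizing` and `estimateHeapMb` are hypothetical names,
and the note that half or less often suffices means real usage may be
well below these figures.

```java
final class CacheSizing {
    // Rule of thumb from the thread: roughly 100 MB of heap per 10,000 cached documents.
    static final long MB_PER_10K_DOCS = 100;

    private CacheSizing() {}

    /** Upper heap estimate in MB for the given number of cached documents, rounded up. */
    static long estimateHeapMb(long cachedDocs) {
        // Ceiling division so a partial block of 10,000 documents still counts.
        return (cachedDocs * MB_PER_10K_DOCS + 9_999) / 10_000;
    }
}
```

Under that rule, caching 100,000 documents would call for about
1,000 MB of heap, and 1,000,000 documents for about 10,000 MB, before
accounting for anything else the repository server needs.
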
> For the missing count function, you can use chunking (new in 2.0) to
> limit the results you get (e.g. to zero); the search result will still
> include information on the total size of the result set.
>
> Hope this helps a bit.
>
> While I'm at it, here's a list of things we could look at to improve
> performance in the future:
>  - profile the speed of loading a non-cached document, and consider
> using a persistent cache to improve this loading speed.
>  - load the data of the live version immediately while loading a
> document in the cache
>  - avoid concurrent threads loading the same document at the same time
> (using Java 1.5's futures can help here)
>  - improve the speed of building large(r) searchresults, avoiding
> XMLBeans could help here
>
> On Tue, 2007-03-27 at 15:05 +0200, Bart Van den Abeele wrote:
>   
>> We are developing a system with daisy 1.5.1 that uses the following
>> structure:
>>
>> 1 documenttype with some fields and a multi-value link field to the
>> picture documenttype 
>> 2 picture-documenttype = documenttype that contains the data of the
>> picture (mostly tiff-format), a preview (png-format) and a thumbnail
>> (also png) (the latter 2 are generated on the client.) 
>>
>> I tried to upload a lot of documents of type 1, each with 1 link
>> (sometimes 2 or 3) to a document of type 2.
>> When I was at around 12,000 documents of type 1 (which means 24,000
>> documents, because they all contain a link to type 2), the
>> daisy-repository-server halted; the process was killed (see screen
>> shot) because of out-of-memory.
>> I found this strange because I expected that not much memory would be
>> needed to create 1 document and 1 picture document.
>> What happens in each iteration is:
>> 1. checking whether there is already a document with such fields, if
>> the fields hold values (that was almost never the case)
>> 2. creating the picture document, also converting the original data to
>> a preview and thumbnail
>> 3. if no document with the given fields exists: creating
>> the document with a link to the picture
>>    else: adding the link to the picture to the existing document.
>>
>> For my next test, I limited the number of documents to upload to around
>> 12,000 and it succeeded. But when I connected with my client and the
>> wiki and ran some queries, I got an out-of-memory exception in my
>> client (propagated by the server), while the server kept running.
>>
>> Can I get some help and guidelines on memory configuration? I also
>> think the first test should be looked at; I don't see how the
>> repository could crash there.
>>
>> Another remark is that queries take a long time. When I query for about
>> 6,000 documents I have to wait about 2 minutes. The next time I ask
>> for them, it's almost instant. But when I ask for another set of about
>> 5,000 documents I again have to wait 2 minutes. When the criteria
>> select only about 100 documents, it's acceptable (< 5 sec). We tried
>> using e.g. 'limit 50' so users don't have to wait so long, but that
>> fix doesn't work. Also, a count function was not to be found.
>>
>> Grtz, 
>> Bart 
>>     
>
>   
