[daisy] issue with large amount of documents

Bruno Dumon bruno at outerthought.org
Thu Jun 7 07:09:20 CDT 2007


On Thu, 2007-06-07 at 07:35 -0400, Ross Singer wrote:
> On 6/7/07, Bruno Dumon <bruno at outerthought.org> wrote:
> 
> > >   Perhaps if I go directly to the database instead of working via the
> > > API?
> >
> > Then there's not much point in using the repository server: just use an
> > RDBMS then.
> >
> 
> This is a strange response.  Could you not just get back the document
> IDs from the RDBMS and then go back to interfacing with the repository
> server?  I'm not sure how that would be in opposition to 'using the
> repository server'.

Because you would be bypassing the repository abstraction. I'd rather
see the repository improved than resort to this sort of thing.

BTW, how would you retrieve the IDs from the RDBMS? Getting a list of
just all document IDs is easy of course (and could be added in a
streaming manner to the API if considered useful), but what if you need
them ACL-filtered and sorted?
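
To make the easy case concrete, here is a rough sketch of streaming IDs in
fixed-size batches using keyset pagination. It is not Daisy code: the
`documents` table, its `id` column, and the `stream_ids` helper are all
hypothetical, and an in-memory SQLite database stands in for the real RDBMS.

```python
import sqlite3


def stream_ids(conn, batch_size=1000):
    """Yield document IDs in ascending order, fetching one batch at a time."""
    last_id = -1  # assumption: IDs are non-negative integers
    while True:
        # Keyset pagination: resume after the last ID seen rather than using
        # OFFSET, so each batch is a bounded range scan on the primary key.
        rows = conn.execute(
            "SELECT id FROM documents WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        for (doc_id,) in rows:
            yield doc_id
        last_id = rows[-1][0]


# Hypothetical stand-in schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO documents (id) VALUES (?)", [(i,) for i in range(1, 2501)]
)

ids = list(stream_ids(conn, batch_size=1000))
```

Note that this only covers the unfiltered, ID-ordered case. It does nothing
for ACL filtering or for sorting on other fields, which is the hard part.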

> I guess the problem I see is that Daisy doesn't handle large document
> result sets sanely at all, so there's little recourse /but/ to find
> some sort of alternative to the API when dealing with them.

Again, I'd be interested in what those alternatives to the API would be,
since then we could use these techniques to improve the repository
itself.

> For example, in our setup, we have about 750K documents (of what would
> be around 2M total) that we might need to extract and either delete
> and resubmit or possibly tweak.  Since LIMITs don't apply until
> /after/ the query has taken place, it's nearly impossible to
> effectively handle this.
> 
> What do you suggest doing in these scenarios?

I have no suggestions, but I'm very interested in hearing other people's
suggestions.

Daisy's first aim was to support at most tens of thousands of documents,
not millions. Handling such large datasets requires a different way of
working, just as supporting hundreds of millions of documents would again
require a different design.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org                          bruno at apache.org


