[daisy] issue with large amount of documents

Ross Singer ross.singer at library.gatech.edu
Thu Jun 7 08:03:41 CDT 2007


On 6/7/07, Bruno Dumon <bruno at outerthought.org> wrote:
> On Thu, 2007-06-07 at 07:35 -0400, Ross Singer wrote:
> > On 6/7/07, Bruno Dumon <bruno at outerthought.org> wrote:
> >
> > > >   Perhaps if I go directly to the database instead of working via the
> > > > API?
> > >
> > > Then there's not much point in using the repository server: just use an
> > > RDBMS then.
> > >
> >
> > This is a strange response.  Could you not just get back the document
> > IDs from the RDBMS and then go back to interfacing with the repository
> > server?  I'm not sure how that would be in opposition to 'using the
> > repository server'.
>
> Because you would be bypassing the repository abstraction. I'd rather
> see the repository improved than doing these sort of things.

No argument here :)  I think it's a case of reconciling what's good
for the long term vs. what needs to be done locally in the short term
(especially since this is solely for maintenance, currently).
>
> BTW, how would you retrieve the IDs from the RDBMS? Getting a list of
> just all document IDs is easy of course (and could be added in a
> streaming manner to the API if considered useful), but what if you need
> them ACL-filtered and sorted?

Well, in this case I don't, but your point is well taken.  It's my
understanding that sorting is the real obstacle here.  I'm not sure of
all the possible sort scenarios, so I'll defer to somebody else on
whether all the results would need to be retrieved first.  In the
absence of an ORDER BY clause, could the LIMIT clause take precedence,
so that retrieval stops once enough rows have been read?
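
To make the question concrete, here is a rough Python sketch of the
difference I have in mind.  It is not Daisy code: CountingSource,
limit_only and sort_then_limit are made-up names, and the list of IDs
is just sample data standing in for a streamed result set.

```python
from itertools import islice


class CountingSource:
    """Hypothetical stand-in for a streaming result set of document IDs."""

    def __init__(self, ids):
        self._ids = ids
        self.rows_read = 0  # how many rows the consumer actually pulled

    def __iter__(self):
        for doc_id in self._ids:
            self.rows_read += 1
            yield doc_id


def limit_only(source, limit):
    # No ordering requested: stop as soon as `limit` rows have been seen.
    return list(islice(source, limit))


def sort_then_limit(source, limit):
    # Ordering requested with no index to help: every row has to be read
    # before the first result can be returned.
    return sorted(source)[:limit]


ids = [42, 7, 19, 3, 88, 61, 5, 23]  # sample data only

unsorted_source = CountingSource(ids)
first_three = limit_only(unsorted_source, 3)

sorted_source = CountingSource(ids)
smallest_three = sort_then_limit(sorted_source, 3)

print(first_three, unsorted_source.rows_read)
print(smallest_three, sorted_source.rows_read)
```

In the first case only as many rows are read as the limit asks for; in
the second the whole set is consumed regardless of the limit.  Whether
the repository could take the first path whenever no ordering is
requested is exactly what I'm unsure about.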
>
> Again, I'd be interested in what those alternatives to the API would be,
> since then we could use these techniques to improve the repository
> itself.

Again, I don't really know, but would be interested in helping where I can.
>
> Daisy's first aim was to support at most tens of thousands of documents,
> not millions. Handling such large datasets requires a different way of
> working, just like supporting hundreds of millions of documents would
> again require a different design.

It actually holds up quite well (well, we're just working with the
repository-server and haven't gotten anywhere near the 2M mark, yet):
fielded queries are /very/ fast.  This could be partially due to the
fact that we're indexing the documents outside of Daisy, however, so
we're really only using it to retrieve small sets of documents by
document ID (outside of the above scenario, of course).
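
Roughly, the access pattern looks like the sketch below.  Everything in
it is a hypothetical stand-in (plain dictionaries instead of our real
index and instead of the repository server), so treat it as an
illustration of the shape of the calls, not of Daisy's API.

```python
# Hypothetical external index: query string -> matching document IDs.
external_index = {
    "author:smith": [101, 205, 309],
    "subject:maps": [205],
}

# Hypothetical repository: document ID -> document content.
repository = {
    101: "Document 101",
    205: "Document 205",
    309: "Document 309",
}


def search_ids(index, query):
    # The search itself never touches the repository.
    return index.get(query, [])


def fetch_documents(repo, doc_ids):
    # The repository only ever sees small, explicit sets of IDs.
    return [repo[doc_id] for doc_id in doc_ids if doc_id in repo]


hits = search_ids(external_index, "author:smith")
documents = fetch_documents(repository, hits)
print(documents)
```

The repository never has to produce or sort a large result set in this
arrangement, which may be why we haven't run into the same limits.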

I can understand this not being a pressing need, however, if our
application of Daisy is that edge-case-y (and if we come up with a
workaround, we'll share it here).

Thanks,
-Ross.

