[daisy] Some feedback/questions

Bruno Dumon bruno at outerthought.org
Wed Nov 15 03:46:35 CST 2006


Thanks for the, as usual, good feedback. See comments inline.

On Tue, 2006-11-14 at 16:28 -0500, Mindaugas Idzelis wrote:
> I'm having growing pains with Daisy. Our database dump file is now 16meg. 
> Our blobstore directory is 1.8GB and 33123 files. A couple of things come 
> to mind. It would be nice if each version of a file was stored as a diff 
> against the previous version. This may make things more space efficient.

It certainly would, however the development of such a diff-based store
is not really trivial (subversion does this). It would also need to take
care that random access to large blobs stays possible.

>  
> Storing the blobs in the database itself may also make better space 
> utilization and file system access.

Depends much on the database I guess (what if the database simply stores
the blobs on the filesystem again?). Reading blobs from a database is
slower than reading a file from the filesystem (filesystems usually have
good caching). It also doesn't allow for random access (which we
currently don't do, but I imagine this is something we might want to add
in the future).

Support for blobs varies a lot across databases. Sometimes the database
supports it well, but then the JDBC driver doesn't, often reading the
entire blob completely into memory, which very much limits file sizes
and concurrency, and is a lot slower (even downloads of small files became
faster in Daisy once we introduced streaming part downloads).
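
To make the streaming point concrete, here is a minimal sketch (not
Daisy's actual code; the class and method names are made up for
illustration). It copies data in fixed-size chunks, so memory use is
bounded by the buffer rather than by the blob size. With JDBC the input
would typically come from ResultSet.getBinaryStream(), but whether that
really streams or buffers the whole value depends on the driver.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, for illustration only.
final class StreamCopy {
    // Copies everything from 'in' to 'out' using an 8 KB buffer and
    // returns the number of bytes copied. At most one buffer's worth of
    // data is held in memory at any time.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```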

IIRC other CMSes like alfresco and hippo are using the same approach as
we do.

>  Having this many small files around 
> does create a lot of wasted slack space on the drive (each file is 
> rounded up to the next 4K boundary, so a 1K file takes up 4K of space, 
> etc.). It also takes a long time to copy that many files. It might also make 
> the database backup simpler to perform, since you would not have to 
> back up the files separately. It might also make a "hot backup" possible - if you 
> perform the mysqldump using "--single-transaction" it should not back up 
> any new blob entries that were added after the start of the backup. So you 
> get a consistent snapshot of the data. This is important to us because 
> using the backup tool, it takes over an hour to back up the database - 
> during which time no new documents can be saved. 

I think with some thought we could remove the complete blocking of the
blobstore while backing it up. New file additions aren't a problem, and
deletions could be added to some log file to be executed after the
back-lock is released.
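
A rough sketch of the deferred-deletion idea, assuming hypothetical
names throughout (this is not existing Daisy code). Deletes requested
while the backup lock is held are only recorded, then replayed once the
lock is released. For brevity the log here is an in-memory list; a real
implementation would presumably write it to a file so it survives a
restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: queue blob deletions while a backup is running.
final class DeferredDeleteLog {
    private final Consumer<String> deleter;          // performs the real delete
    private final List<String> pending = new ArrayList<>();
    private boolean backupLocked = false;

    DeferredDeleteLog(Consumer<String> deleter) {
        this.deleter = deleter;
    }

    synchronized void lockForBackup() {
        backupLocked = true;
    }

    // Deletes immediately when unlocked, otherwise records the key for later.
    synchronized void delete(String blobKey) {
        if (backupLocked) {
            pending.add(blobKey);
        } else {
            deleter.accept(blobKey);
        }
    }

    // Releases the lock and replays the deletions recorded during the backup.
    synchronized void unlockAfterBackup() {
        backupLocked = false;
        for (String key : pending) {
            deleter.accept(key);
        }
        pending.clear();
    }
}
```

New blobs need no special handling in this scheme: they can simply be
written, and the backup either picks them up or not.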

> 
> If the repository is locked, a new document can still be created - but not 
> saved. This thoroughly confuses our users. 

Indeed. There's currently no API to check if the repository is locked,
but if we added one it would be possible to perform a check up-front.
(maybe add a jira issue for this)

> 
> The following message is also not very user-friendly. Users don't know 
> what continuations, repository servers, or blobstores are. When the blob 
> write lock is enabled, it should say the site is currently undergoing a 
> backup. Please go back, and copy/paste your document in notepad 
> temporarily until it is back online. (An ETA would also be helpful - maybe 
> as simple as timing the last backup, and using that value) 
> 
> Sitemap: error calling continuation
> Received exception from repository server.
> Problem storing document.
> Error storing part data to blobstore.
> Write access to the blobstore is currently disabled. Try again later. 

This could be solved by throwing a more specific exception and then
handling it in the error.xsl, a relatively easy improvement. (maybe also
make a jira issue for this one)
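
On the Java side the idea would be roughly as follows. This is only a
sketch: the exception class, the helper, and the message text are all
invented for illustration, and in Daisy the final message would come
from error.xsl rather than from Java code.

```java
// Hypothetical exception type for the "blobstore is write-locked" case.
final class BlobStoreWriteLockedException extends RuntimeException {
    BlobStoreWriteLockedException(String message) {
        super(message);
    }
}

// Hypothetical helper that picks a user-facing message for an error.
final class FriendlyErrors {
    // Walks the cause chain, since the specific exception is usually
    // wrapped in several more generic ones by the time it reaches the UI.
    static String messageFor(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof BlobStoreWriteLockedException) {
                return "The site is currently undergoing a backup. "
                     + "Please try saving again later.";
            }
        }
        return "An unexpected error occurred.";
    }
}
```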

> 
> The backup tool does some unnecessary copying. First, it copies all the 
> files to the target. Then it zips all those files up, then it deletes the 
> copy. Copying and even deleting 30K+ files takes a long time. Especially 
> if you are copying these files to a remote filesystem. It also requires a 
> lot of temporary space. Instead of copying, zipping the files directly 
> would be the best idea. 

I think this approach was taken to be able to release the backup lock
ASAP, however it might indeed be that zipping on-the-fly is faster than
first copying everything, and certainly would save a lot of space.
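
For anyone who wants to experiment, zipping on-the-fly could look
something like the sketch below. It is not the backup tool's code; the
class and method names are made up, and it assumes the zip file is
written outside the directory being archived.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: zip a directory tree without an intermediate copy.
final class DirectZip {
    // Writes every regular file under sourceDir into zipFile and returns
    // the number of entries written. zipFile must not be inside sourceDir.
    static int zipDirectory(Path sourceDir, Path zipFile) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(sourceDir)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        try (OutputStream fileOut = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(fileOut)) {
            for (Path file : files) {
                // Zip entry names use forward slashes on every platform.
                String name = sourceDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);   // streams the file straight into the archive
                zip.closeEntry();
            }
        }
        return files.size();
    }
}
```

The trade-off mentioned above still applies: the lock has to be held
for as long as the zipping takes, so this only wins if compressing is
not much slower than a plain copy.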

> 
> The backup tool doesn't provide a lot of feedback. It should be possible 
> to determine progress. Maybe controlled with --verbose flag? Timing 
> information would be nice to have. Emails on success also nice to have 
> (cmd line option?) 

Seems like useful suggestions.

[hint: since the backup tool is a rather small independent codebase, it
should be easy for anyone to jump in and improve]

> 
> Bug with display of queries... You have a document type that contains a 
> multi-valued field. You have some documents that have no values in this 
> field, and some that have more than one value. You create a query to 
> display the multivalue field. The table that is created by this query 
> doesn't generate table cells for these "null" multi valued fields, making 
> all the following columns off by one, severely affecting the display. 

I seem to remember this one, it might already be fixed in Daisy trunk.
Not sure though, maybe you can test this on demo.daisycms.org

> 
> Weirdness with the publisherResponse in skins. In my custom 
> document-to-html.xsl template, I inherit the base, and add some 
> customization. One thing I want to do is display the last modified date of 
> the document under the title. The only way I could generate an XPath to 
> pick out the field I wanted was by doing something like this:
> 
>  <xsl:variable name="lastModified" 
> select="/document/p:publisherResponse/d:document/@*[position()=13]"/>
> 
> This is because the attributes of the included document have no 
> namespace (I think - xslt is wacky). Trying to do the following didn't work. 
> 
> 
>  <xsl:variable name="lastModified" 
> select="/document/p:publisherResponse/d:document/@variantlastmodified"/>

It needs to be @variantLastModified (if you do a view-source in firefox,
it tends to lowercase everything).
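
XPath name tests are case-sensitive, and an unprefixed attribute is in
no namespace, so no prefix is needed on the attribute itself. A small
standalone check using the JDK's XPath API (the sample XML and the class
name are made up, and namespaces are left out to keep it short):

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class AttributeCase {
    // Evaluates an XPath expression against an XML string and returns the
    // string value of the result (an empty string if nothing matches).
    static String eval(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }
}
```

Given <document variantLastModified="2006-11-14T16:28:00"/>, the
expression /document/@variantLastModified returns the date, while
/document/@variantlastmodified returns an empty string.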

> 
> One more thing that would be nice for skinning. A utility to transform 
> those pesky XSLT-formatted dates into normal date formats. I haven't tried 
> it - but 
> http://www-128.ibm.com/developerworks/java/library/x-xalanextensions.html 
> looks promising. Turns out you can call java methods directly from xalan. 
> A little trickery with SimpleDateFormat should be the way to go. This 
> would make a good addition to the util.xsl class. To format a xslt-format 
> date into the locale specific date. 

Usually we annotate the XML with pre-formatted copies of dates (the
formatted copy would then be stored in @variantLastModifiedFormatted),
however this is currently not done consistently for all dates. In the
future we'd like to have some global (and user-specific) configurable
date patterns, so it would be easier if these things are not executed
from the XSLT.

If you want to avoid any coding you could also generate instructions for
Cocoon's i18n transformer, which is also capable of formatting dates.
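
If you do go the SimpleDateFormat route, the Java side could be as small
as the sketch below. The class name and the input pattern are
assumptions: it expects dates that start with yyyy-MM-dd'T'HH:mm:ss and
ignores anything after the seconds (fractions, time zone offset), so
check it against the dates you actually get. How to hook it up as a
Xalan extension is not shown.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class DateReformat {
    // Assumed shape of the incoming date; text after the seconds is ignored.
    static final String INPUT_PATTERN = "yyyy-MM-dd'T'HH:mm:ss";

    // Parses xmlDate and formats it with the given pattern and locale.
    // A new SimpleDateFormat is created per call because instances are
    // not thread-safe. Both steps use the JVM's default time zone.
    static String reformat(String xmlDate, String outputPattern, Locale locale)
            throws ParseException {
        SimpleDateFormat input = new SimpleDateFormat(INPUT_PATTERN, Locale.ROOT);
        input.setLenient(false);
        Date parsed = input.parse(xmlDate);
        return new SimpleDateFormat(outputPattern, locale).format(parsed);
    }
}
```

For example, DateReformat.reformat("2006-11-14T16:28:00", "dd/MM/yyyy",
Locale.ENGLISH) gives 14/11/2006.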

> 
> One last comment. Keeping custom modifications (skins, config) away from 
> the daisy install location has come a LONG way since Daisy 1.2. But there 
> are still 2 things that I have to change every time I upgrade. 
> 
> 1) I need to make a "work" directory inside of /daisywiki/webapp/WEB-INF/ 
> so that jetty doesn't store temp file in /tmp. (Red hat deletes them every 
> 30 days, and after that file uploads don't work) 

We have an open jira issue for this, if anyone can put some thought to
it and supply a patch, that would be helpful.

> 
> 2) drop my custom authentication scheme jar into /lib/daisy/jars/ and then 
> modify /repository-server/conf/block.xml and add 
> 
> <include name="ibmauth" id="daisy:my_auth" version="1.5"/> to <container 
> name="authentication"> 
> 
> Would be nice if I could somehow specify these in myconfig.xml, and maybe 
> have a directory that is put on the classpath for additions like this. 

Yep, it's something that annoys me too as it raises the barrier for this
sort of customisation. I hope to address it in the future.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org                          bruno at apache.org


