[daisy] summer of code projects

Guy Van den Broeck guyvdb at gmail.com
Fri Mar 23 19:14:22 CDT 2007


thank you for the quick answers. 
first of all: Mark and i will end our affair. everything goes through
the mailing list now ;)
i've given it some thought (not too much as the application deadline is
approaching soon) and the html diff project is what i'm going for. i've
done some initial research and brainstorming and i've come up with
something less vague than my first mail :)

first i'd like to request write access to the community scratchpad. i'll
post the stuff below there and keep it up to date. my account name is
"guyvdb"

if you find the time please read and comment on the draft below; thanks

***************************
***HTML DIFF PROJECT***
***************************

PROJECT DESCRIPTION:
-----------------------------------------
The goal of the "HTML diff" project is to provide a Java library that
compares XML files. The minimum inputs are 2 XML files, and the minimum
output file is a Java representation of the basic operations that turned
the first data structure into the second. 
This output can be converted to an XML-based representation or, given
the proper formatting guidelines (in a yet undecided data type), into an
html file with a certain syntax to denote changes. 
In addition, distance metrics can be computed that allow the user to
quantify by how much 2 files differ. This enables them to distinguish
minor from major changes, and tells them what the (chronological)
relationships are between 3 files.
For computing the changes there can be several algorithms:
- A simple LCS (Longest Common Subsequence) implementation that will
work on plain text or simple HTML files. This algorithm is the algorithm
used in the Unix "diff" command, in CVS, patch, etc. It's
computationally the best algorithm but it will destroy complex XML
hierarchies. There are many open source implementations available of
this algorithm. For instance the Wikipedia and Drupal software uses LCS,
as does the current text-based implementation in Daisy.
- A more complicated graph algorithm that uses a greedy method to
compute a difference but not the optimal one. this can be optimal for
XML files with little reproduction.
- A complex algorithm with perfect accurateness.
For each of these possibilities there are papers that describe
algorithms.
There should also be an option that chooses the optimal algorithm for a
given XML file at runtime (based on length, complexity, leafs, ...)

DELIVERABLES:
----------------------------
- A literature study with an outline of all the possible algorithms +
Testing of existing (proprietary) code.
- A study of the XML files used by DaisyCMS, which I would need to
prioritize on. (Do we want to compare ordered or unordered XML files?)
- A LCS implementation (could be copied from the current Daisy
codebase?)
- A HTML respresentation/change syntax that can be used for DaisyCMS
- One or more graph-based implementations
- A library with all the options from the project description
- Documentation

TIME LINE:
-------------------
June 1st - June 30th: Given the fact that in Belgium the exam period
ends somewhere in the of June this will be my least productive period of
the summer: Getting to know Daisy, and the literature study will be all
I can cope with.

July 1st - July 16th: To make up for the slow month of June, I will work
day and night to finish the LCS Implementation and the HTML
representation.

July 16th - July 26th: A well deserved vacation in Greece.

July 27th - August 15th: One or more graph algorithms.

August 15th - August 31st: Finish the package, complete as much
documentation as possible before the final deadline.

September 1st - September 25th: Integrate the package into DaisyCMS and
finish the unfinished.

BIO:
-------
todo
+ I'd like to work in the offices in Ghent
+ I'd like to do this as an internship for my university (how's that for
a guaranteed commitment; at least i wont run away after the initial
payment)









More information about the daisy mailing list