[daisy] HtmlCleaner : strange case ?
Bruno Dumon
bruno at outerthought.org
Thu Nov 9 05:49:21 CST 2006
On Thu, 2006-11-09 at 11:16 +0100, christophe blin wrote:
> Hi,
>
> I am looking for HTML cleaners and find that the one in Daisy is
> particularly good:
> - no fucking regexps all over the place
> - clean configuration by xml
> - nice test cases
>
> So I was trying some cases and found that the following seems to behave
> strangely:
> cleaner = template.newHtmlCleaner();
> result =
> cleaner.cleanToString("<html><body><p><ul><li><p>hello!</p></li></ul></p></html>");
>
> I am expecting something like :
> <ul>
> <li>
> hello!
> </li>
> </ul>
>
> but the cleaner returns:
> <ul>
> <li/>
> </ul>
>
> <p>hello!</p>
>
> What I find pretty strange is that the p is moved out of the li.
> IMHO, the only mistake here is that p is forbidden inside a li (i.e. it
> is unlikely that the user wants to have an empty li).
>
> I am currently searching for where this behavior comes from, but if you
> have any hints, do not hesitate to post them here.
When the HTML cleaner encounters a tag that is not allowed within its
current parent, it closes parents until it reaches one in which the tag
is allowed, inserts the tag there, and then opens the closed parents
again.
The HTML cleaner has mostly been designed to work on HTML produced by
the IE/Mozilla editors, not on arbitrary free-form HTML. For example, these
editors let users insert a table in the middle of a paragraph, and in
that case this is exactly the corrective behaviour we want:
<body>
<p>
boe
<table ... />
ba
</p>
</body>
will be changed to:
<body>
<p>boe</p>
<table ... />
<p>ba</p>
</body>
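
To make the close-and-reopen rule above concrete, here is a minimal
sketch in Java. It is not the actual Daisy HtmlCleaner code: the class
name ReparentSketch, its methods, and the ALLOWED rule table are all
made up for illustration, text is not escaped, and end tags are assumed
to match the innermost open element.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: not Daisy's HtmlCleaner; the rule table is hypothetical. */
public class ReparentSketch {
    // Hypothetical nesting rules: parent tag -> child tags it may contain.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
            "body", Set.of("p", "table"),
            "p", Set.<String>of(),
            "table", Set.<String>of());

    /** An open element plus the parents that were closed to make room for it. */
    private static final class Frame {
        final String tag;
        final List<Frame> reopenAfter;
        Frame(String tag, List<Frame> reopenAfter) {
            this.tag = tag;
            this.reopenAfter = reopenAfter;
        }
    }

    private final StringBuilder out = new StringBuilder();
    private final Deque<Frame> open = new ArrayDeque<>(); // innermost element first

    private static boolean allows(String parent, String child) {
        return ALLOWED.getOrDefault(parent, Set.<String>of()).contains(child);
    }

    public void startTag(String tag) {
        List<Frame> closed = new ArrayList<>();
        // Close parents until one accepts the tag (or none are left).
        while (!open.isEmpty() && !allows(open.peek().tag, tag)) {
            Frame parent = open.pop();
            out.append("</").append(parent.tag).append('>');
            closed.add(0, parent); // keep the outermost parent first
        }
        out.append('<').append(tag).append('>');
        open.push(new Frame(tag, closed));
    }

    public void text(String s) {
        out.append(s); // no escaping in this sketch
    }

    public void endTag(String tag) {
        if (open.isEmpty() || !open.peek().tag.equals(tag)) {
            return; // sketch: ignore unmatched end tags
        }
        Frame frame = open.pop();
        out.append("</").append(tag).append('>');
        // Reopen the parents that were closed for this element, outermost first.
        for (Frame parent : frame.reopenAfter) {
            out.append('<').append(parent.tag).append('>');
            open.push(parent);
        }
    }

    public String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        ReparentSketch c = new ReparentSketch();
        c.startTag("body");
        c.startTag("p");
        c.text("boe");
        c.startTag("table"); // not allowed in <p>: <p> is closed first
        c.endTag("table");   // <p> is reopened here
        c.text("ba");
        c.endTag("p");
        c.endTag("body");
        // prints <body><p>boe</p><table></table><p>ba</p></body>
        System.out.println(c.result());
    }
}
```

The parents are reopened when the moved element ends rather than when it
starts, which is what puts "ba" into a fresh <p> after the table.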
Now, the <p> tag *is* allowed inside the <li> tag, but the <ul> tag is not
allowed inside the <p> tag (from memory). Still, the result you're getting
is not quite right: it should move the ul out of the p, but not the p out
of the ul/li. This is probably a bug and needs to be looked into.
--
Bruno Dumon http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno at outerthought.org bruno at apache.org