ColdFusion-UDF Wrapper for JTidy to clean up HTML
JTidy is a Java port of HTML Tidy, a tool for cleaning up messy HTML. This comes in handy when you need to output code that was created by users. In a later post I'll show how to let users enter HTML without compromising the security of your site; today I'll just show how to clean up this user-generated code. JTidy not only generates valid XHTML from incomplete code by correctly closing open tags, it also performs a couple of "prettifying" operations to improve the quality of the result.
JTidy can be obtained from SourceForge. Tidy.jar needs to be available on the ColdFusion classpath. Here is a list of the parameters of our makexHTMLValid() function:
| Parameter | Type | Required | Default | Hint |
|---|---|---|---|---|
| strToParse | String | yes | empty | contains the input HTML |
| bStripFrame | Boolean | no | TRUE | If set to TRUE, an already present HTML frame consisting of <HTML><HEAD>…</HEAD><BODY>…</BODY></HTML> will be removed; all that remains is the code between the BODY tags. If set to FALSE, missing elements of this HTML frame, or even the complete HTML frame itself, will be added. |
| bForceOutput | Boolean | no | TRUE | If set to TRUE, makexHTMLValid() will try to generate some output even if an error occurs. If the supplied code is too broken, however, this output may still be empty, as JTidy is no magic snake oil to heal sick code. |
| bKillMS | Boolean | no | TRUE | Cleans up Microsoft clutter in the code. |
| bKillWord | Boolean | no | TRUE | Cleans up MS Word XML/HTML clutter in the code. |
| bQuoteAsEntity | Boolean | no | TRUE | Quote characters will be replaced with the corresponding HTML entities (e.g. &quot;). |
Here's an example:

    <cfsilent>
        <cfsavecontent variable="sHTMLtoTidy">
            <BODY>
            <h1>TEST</h1>
            <p>Foo bar
            <p>foo ÄÖÜ߀
            </body>
        </cfsavecontent>
    </cfsilent>
    <cfoutput>#makexHTMLValid(strToParse="#sHTMLtoTidy#",bStripFrame=false,bKillMS=true,bKillWord=true,bQuoteAsEntity=true)#</cfoutput>

This is the UDF:

    <cffunction name="makexHTMLValid" access="public" output="false" returntype="string" displayname="Tidy parser" hint="Takes a string as an argument and returns parsed and valid xHTML">
        <cfargument name="strToParse" default="" required="yes" type="string">
        <cfargument name="bStripFrame" default="true" required="no" type="boolean">
        <cfargument name="bForceOutput" default="true" required="no" type="boolean">
        <cfargument name="bKillMS" default="true" required="no" type="boolean">
        <cfargument name="bKillWord" default="true" required="no" type="boolean">
        <cfargument name="bQuoteAsEntity" default="true" required="no" type="boolean">
        <cfscript>
            var jTidy = createObject("java","org.w3c.tidy.Tidy");
            var sEncoding = 'UTF-8';
            // encode the input explicitly as UTF-8 instead of relying on the platform default
            var oReadBuffer = createObject("java","java.lang.String").init(strToParse).getBytes(sEncoding);
            var oInP = createObject("java","java.io.ByteArrayInputStream").init(oReadBuffer);
            var oOutx = createObject("java","java.io.ByteArrayOutputStream").init();

            // configuration
            jTidy.setQuiet(true);
            jTidy.setRawOut(true);
            jTidy.setIndentContent(false);
            jTidy.setSmartIndent(false);
            jTidy.setIndentAttributes(true);
            jTidy.setWraplen(1024);
            jTidy.setXHTML(true);
            jTidy.setShowWarnings(false);
            jTidy.setInputEncoding(sEncoding);
            jTidy.setOutputEncoding(sEncoding);
            jTidy.setTidyMark(false);
            // honour the bForceOutput argument (it was previously hard-coded to true)
            jTidy.setForceOutput(javaCast("boolean", bForceOutput));
            if (bStripFrame) jTidy.setPrintBodyOnly(true);
            if (bKillMS) jTidy.setMakeBare(true);
            if (bKillWord) jTidy.setWord2000(true);
            if (bQuoteAsEntity) jTidy.setQuoteMarks(true);

            // do the parsing
            jTidy.parse(oInP,oOutx);

            // close the stream and decode the result with the same encoding
            oOutx.close();
            strToParse = oOutx.toString(sEncoding);
        </cfscript>
        <cfreturn strToParse>
    </cffunction>

UPDATE! Regarding the comment by JanSR: The Tidy.jar offered as a file download on the project's SourceForge website is far too old. Unfortunately, I couldn't find an official download of a current binary distribution, so here's how to roll your own from the Subversion repository. You'll need the Java SDK and a Subversion client on your system, and you should make sure that your classpath variable is set according to your Java environment. To compile your own package, do the following:

    svn checkout https://svn.sourceforge.net/svnroot/jtidy/trunk/jtidy/ jtidy
    cd jtidy
    mkdir classes
    javac -source 1.4 src/main/java/org/w3c/tidy/*.java -d ./classes/
    cp -p ./src/main/resources/org/w3c/tidy/*.properties ./classes/org/w3c/tidy/
    cd classes
    jar cf ~/Tidy.jar org

You should now have a Tidy.jar sitting in your home directory. Simply drop that into your classpath; if in doubt, /{cfhome}/lib/Tidy.jar in a single server install or /{jrunhome}/servers/{instance}/cfusion.ear/cfusion.war/WEB-INF/cfusion/lib/Tidy.jar should do.
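If you are unsure whether the Tidy.jar on your classpath is recent enough, a small reflection check can tell you before you wire up the UDF. The following is only a sketch in plain Java, not part of the UDF: the class name `TidyJarCheck` and the helper `hasPublicMethod` are made up for illustration, and the setter names are simply the ones the UDF above calls.

```java
import java.lang.reflect.Method;

public class TidyJarCheck {

    /**
     * Returns true if the named class can be loaded and exposes a public
     * method with the given name (parameter types are not checked).
     */
    public static boolean hasPublicMethod(String className, String methodName) {
        try {
            Class<?> clazz = Class.forName(className);
            for (Method m : clazz.getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class is not on the classpath or could not be linked
            return false;
        }
    }

    public static void main(String[] args) {
        String tidyClass = "org.w3c.tidy.Tidy";
        // Setters used by the UDF; an outdated jar reports false for some of them
        String[] setters = {"setMakeBare", "setWord2000", "setPrintBodyOnly", "setQuoteMarks"};
        for (String setter : setters) {
            System.out.println(setter + ": " + hasPublicMethod(tidyClass, setter));
        }
    }
}
```

Run it with the same classpath your ColdFusion instance uses. If any line prints `false`, the jar is either missing or too old for the UDF.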
For your convenience I'll offer my Tidy.jar for download here, though please note that I cannot provide any support for this binary or the source code it was compiled from - please refer to the JTidy project page on SourceForge for this. Use this at your own risk.
UPDATE 2! Regarding the comments by Mark Woods: I have dropped the settable parameters for input and output encoding and now fix the string encoding to UTF-8; this should make the function more robust on platforms where UTF-8 is not the default platform encoding. Thanks, Mark, for your patience in explaining the issue!
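To illustrate why the encoding has to be the same on both sides of the stream, here is a minimal plain-Java sketch. It does not use JTidy at all: `passThrough` is a hypothetical stand-in that just copies bytes, and the class and method names are assumptions for this example. The point is only that a string encoded with one charset and decoded with another does not survive the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class EncodingRoundTrip {

    /** Stand-in for the parse step: copies the bytes through unchanged. */
    static void passThrough(ByteArrayInputStream in, ByteArrayOutputStream out) {
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    /** Encodes with one charset, streams the bytes, decodes with another. */
    public static String roundTrip(String text, String encodeAs, String decodeAs) {
        byte[] input = text.getBytes(Charset.forName(encodeAs));
        ByteArrayInputStream in = new ByteArrayInputStream(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        passThrough(in, out);
        return new String(out.toByteArray(), Charset.forName(decodeAs));
    }

    public static void main(String[] args) {
        // "foo ÄÖÜ߀", written with escapes so the source file encoding does not matter
        String sample = "foo \u00C4\u00D6\u00DC\u00DF\u20AC";

        // Same charset on both sides: the text survives
        System.out.println(sample.equals(roundTrip(sample, "UTF-8", "UTF-8")));        // true

        // Bytes written as windows-1252 but read as UTF-8: the text is garbled
        System.out.println(sample.equals(roundTrip(sample, "windows-1252", "UTF-8"))); // false
    }
}
```

The second case mirrors what could happen when the bytes are produced with a non-UTF-8 platform default while JTidy is told the input is UTF-8.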
UPDATE 3! Here's the result the example snippet would produce when being fed through jTidy:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title></title>
    </head>
    <body>
    <h1>TEST</h1>
    <p>Foo bar</p>
    <p>foo ÄÖÜ߀</p>
    </body>
    </html>

If you set bStripFrame=true, the HTML-frame is omitted, so you get:

    <h1>TEST</h1>
    <p>Foo bar</p>
    <p>foo ÄÖÜ߀</p>

June 16th, 2009 - 12:21
Hi. I like the way you write. Will you post some more articles?
June 16th, 2009 - 18:24
Thank you for your encouragement and yes, of course, in due time.
We have only just started our blog, but I hope that we’re going to have lots more interesting dev tidbits online soon.
July 6th, 2009 - 13:46
Hi, nice idea, however most of the Methods used are denied by the Tidy jar. E.g.:
No matching Method/Function for org.w3c.tidy.Tidy.setmakebare() found
Any hints?
Cheers, Jan
July 6th, 2009 - 17:33
@JanSR: This is probably because you downloaded the binary from SourceForge, which is way too old – you’ll need to compile and package your own from the current SVN trunk. I’ve added some instructions on how to do this to my post and you could try if the version I compiled for my own needs suits you – I’ve made it available for download here.
July 9th, 2009 - 17:35
Hi,
Just stumbled upon this post looking for some info about JTidy configuration options. Although I haven’t tested it, I suspect there is an issue with the code for your function…
Java strings are UTF-16 internally, so you should probably set the input encoding for JTidy to UTF-16. You should also make sure the output encoding for JTidy and the encoding used when you convert the output stream to a string are the same. At the moment, the output from JTidy is UTF-8 encoded, but you are using the default system encoding, which may not be UTF-8, when converting the stream to a string. To keep things simple, I’d recommend just sticking with UTF-16 all the way.
Mark
July 9th, 2009 - 18:09
@Mark Woods: Nope, in this case we’re running on ColdFusion, whose encoding defaults to UTF-8, so all is well here. Implementing an explicit character conversion in ColdFusion is simply not necessary.
On the issue of the input/output-encodings: I don’t exactly understand the reason behind jTidy actually providing separate settings for input and output encodings, as it certainly doesn’t do any encoding-conversions – but I decided to make these settable in my UDF nevertheless, though I explicitly mentioned in the parameter description that there is no encoding conversion done, so in fact input- and output-encoding needs to be the same.
July 10th, 2009 - 13:25
ColdFusion strings are just Java strings, which are UTF-16 encoded. However, ColdFusion does use UTF-8 by default when sending data to the browser, and on a typical Unix-based system, ColdFusion (i.e. the JVM) will use UTF-8 as the default encoding for input and output.
So, on a unix based system your code will work perfectly – you are converting a UTF-16 encoded string into a UTF-8 byte array, then sending that byte array to jtidy, telling it that it’s UTF-8 encoded, asking it to return a UTF-8 stream, then converting that stream to a string according to the default platform encoding, again UTF-8. All hunky dory.
When the default platform encoding isn’t UTF-8, your code may not work as expected, because you’ll be converting a string into a byte array that isn’t UTF-8 encoded, but telling jtidy that the input is UTF-8 encoded.
July 10th, 2009 - 13:28
Oops, I should have mentioned that the default encoding on a windows system usually isn’t UTF-8, but something like windows-1252
July 10th, 2009 - 13:37
The OS shouldn’t be all that relevant in this case, the default encoding of UTF-8 applies to ColdFusion on any available platform, no matter if it is Linux or Windows. This may be overridden by the application developer or the server administrator, so the developer actually needs to make sure to set sInputEncoding and sOutputEncoding to the encoding used in his script. There’s really no need to convert anything to and from UTF-16.
When it comes to reading file content, the developer needs to use an appropriate <cffile>-statement to cater for encoding.
July 10th, 2009 - 13:58
But the OS is relevant, because that’s where the JVM gets its default encoding from. Although you can override it, the default jvm configuration on a coldfusion server doesn’t.
You can check what the default encoding is on a CF installation by outputting this:
#createObject("java","java.nio.charset.Charset").defaultCharset().name()#
On a windows system set up for a western european locale, it’s likely to be windows-1252, so your calls to String.getBytes() will encode the string into an array of bytes using windows-1252 encoding.
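The same check can be sketched in plain Java. The class name `DefaultCharsetCheck` and the helper `byteCount` are assumptions made up for this illustration; the sample text is the "ÄÖÜ߀" string from the article's example. It prints the platform default and shows how many bytes the same string occupies under different charsets.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    /** Number of bytes the text occupies when encoded with the named charset. */
    public static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "ÄÖÜ߀", written with escapes so the source file encoding does not matter
        String sample = "\u00C4\u00D6\u00DC\u00DF\u20AC";

        // The charset that String.getBytes() uses when called without an argument
        System.out.println("default charset: " + Charset.defaultCharset().name());
        System.out.println("default bytes:   " + sample.getBytes().length);

        // Explicit charsets give the same answer on every platform
        System.out.println("UTF-8 bytes:        " + byteCount(sample, "UTF-8"));        // 11
        System.out.println("windows-1252 bytes: " + byteCount(sample, "windows-1252")); // 5
    }
}
```

If the "default bytes" line does not match the UTF-8 count, an argument-less getBytes() call is producing something other than UTF-8 on that machine.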
July 10th, 2009 - 14:10
Ah, now I see the problem. Unfortunately I don’t have a Windows box available to test and fix this; I’ll see if I can whip up a virtual box with Windows somewhere and get CF to run on it. If, on the other hand, you are already on Windows and can test the UDF and patch it to make it more robust, that would be very welcome.
July 10th, 2009 - 16:04
I’m not actually on windows, I’ve just run into a lot of encoding issues in the past.
You can make the UDF more robust by removing the arguments related to character encoding and making sure that all of your translations to and from streams and strings are the same. I use UTF-16 out of habit, because that’s what java strings are and I just guess that it’ll perform better, but UTF-8 will work fine too.
If you leave the input and output encoding for jtidy as UTF-8, and change the calls to String.getBytes() and OutputStream.toString() to String.getBytes("UTF-8") and OutputStream.toString("UTF-8"), it should work whatever the platform’s default encoding is.
July 13th, 2009 - 10:06
If I leave out any user defined encoding setting, wouldn’t this break when the template encoding is anything other than the platform’s default encoding?
July 13th, 2009 - 23:13
No, your function accepts a string to parse and also returns a string. Java (and coldfusion) strings are always UTF-16 internally, so the string being passed to the function and the string being returned will always be UTF-16 encoded, regardless of the platform’s default encoding.
The platform’s default encoding is used when converting strings to streams, and vice versa, when no encoding is specified. Because the platform’s default encoding won’t always be the same on all platforms, but you need to tell JTidy what encoding was used, you should specify the encoding explicitly. You could in theory just tell JTidy to use the platform’s default encoding, but if that was windows-1252 you might suffer data loss or encoding conversion exceptions or something like that because win1252 can’t represent the same range of characters as the various unicode encodings.
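The data-loss case can be sketched in plain Java as follows. The class and method names are hypothetical, and the Greek capital omega is just an arbitrary example of a character that windows-1252 has no byte for.

```java
import java.nio.charset.Charset;

public class LossyEncodingDemo {

    /** True if every character of the text has a representation in the charset. */
    public static boolean canRepresent(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    /** Encodes and decodes with the same charset, as a stream round trip would. */
    public static String viaCharset(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String western = "\u00C4\u00D6\u00DC\u00DF\u20AC"; // "ÄÖÜ߀"
        String omega = "\u03A9";                           // Greek capital omega

        System.out.println(canRepresent(western, "windows-1252")); // true
        System.out.println(canRepresent(omega, "windows-1252"));   // false

        // getBytes() silently substitutes '?' for characters it cannot map
        System.out.println(viaCharset(omega, "windows-1252").equals("?")); // true
        System.out.println(viaCharset(omega, "UTF-8").equals(omega));      // true
    }
}
```

Note that the loss is silent: String.getBytes() does not throw here, it substitutes a replacement byte, so the damage only shows up in the output.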
RE the template encoding – I’m not sure if you are referring to the encoding that can be set using cfprocessingdirective, but if so, this refers to the encoding of the source file, which will again default to the platform’s default encoding if not specified using either cfprocessingdirective or a byte order mark.
August 11th, 2009 - 10:08
Hi Markus,
I’ve attempted to explain the issue more clearly in a blog post, which may be of interest to cf developers who have assumed, understandably, that all text in ColdFusion is UTF-8 encoded.
It’s at http://www.thickpaddy.com/2009/8/10/coldfusion-is-not-utf-8-encoded. I hope my musings make sense.
Mark
November 26th, 2009 - 21:32
Do you have an example of what the output looks like? I don’t see it here.
November 27th, 2009 - 10:09
@Sami Hoda: I added the processed HTML from the example to the article.