devbox@COMPUTEC The Computec development blog

21 Jul 09

UDF to strip certain chars, but leave UBB tags alone

We are developing a commenting system which is supposed to discourage comment spam by making comments more or less unreadable once they cross a certain threshold of negative ratings. We decided to strip all vowels from the text, while keeping the UBB-style tags inside the comment unchanged.
You'll find that this last bit makes the whole task a little more complicated than a simple regex replace. We'll use a lookbehind to mark the characters we do not wish to strip, then remove any "unmarked" characters (a negative lookahead takes care of that), and finally remove our marker.

I don't know if lookaround has actually made it into CF8 (or CF9 for that matter), but working with ColdFusion always means that if you've got some more complex need, Java may come to the rescue. We've been using the Java RegEx Component by massimocorner.com for ages now (i.e. since CFMX 6.1) and we make quite extensive use of it, so we just pop it into the SERVER scope to avoid re-instantiating it.

Now to illustrate the task, this is what we have:

The quick brown fox [foo]jumped[/foo] over the lazy dog. The quick brown fox [bar]jumped[/bar] over the lazy dog. [foobar]

And this is what we want:

Th qck brwn fx [foo]jmpd[/foo] vr th lzy dg. Th qck brwn fx [bar]jmpd[/bar] vr th lzy dg. [foobar]
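To see why a single replace is not enough, here is a minimal sketch in plain Java (the class and method names are made up for illustration and are not part of our code base). It removes every vowel and therefore mangles the tags as well:

```java
// Hypothetical illustration: a naive replace that ignores UBB tags.
class NaiveStrip {
    // Removes every ASCII vowel, including those inside [tags].
    static String stripAll(String text) {
        return text.replaceAll("[aeiouAEIOU]", "");
    }

    public static void main(String[] args) {
        // The tag names lose their vowels too: [foo] becomes [f].
        System.out.println(stripAll("fox [foo]jumped[/foo] over"));
    }
}
```

The tags come out as [f] and [/f], which is exactly what we want to avoid.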

So here's our UDF:

<cffunction name="stripChars" access="public" returntype="string" output="false">
	<cfargument name="strToParse" type="string" required="yes" hint="String to parse">
	<cfargument name="bPreserveUBB" type="boolean" required="no" default="true"
				hint="should text enclosed by square brackets be left unchanged?">
	<cfargument name="strCharsToRemove" type="string" required="no" default="aeiouäöüAEIOUÄÖÜ"
				hint="characters to remove; by default all vowels will be removed; strings inside square brackets are left unchanged">
	<cfargument name="strMarker" type="string" required="no" default="µµµ" hint="marker to save characters inside brackets">
	<cfargument name="iMaxBracketTagLength" type="numeric" required="no" default="50" 
				hint="maximum length of string enclosed by square brackets">
	<cfscript>
		var strParsed = arguments.strToParse;		
		if (not structKeyExists(server,'JavaRegExp')) { 
			SERVER.JavaRegExp = createObject("component","JavaRegExp");
		}
		if (arguments.bPreserveUBB) {
			// first, append the marker to every vowel that sits inside a bracketed tag (bounded lookbehind)
			strParsed = SERVER.JavaRegExp.regExpReplace("(?<=\[[^\]]{0,#arguments.iMaxBracketTagLength#})([#arguments.strCharsToRemove#])",strParsed,"$1#arguments.strMarker#",true);
			// then strip every vowel that is not followed by the marker (negative lookahead)
			strParsed = SERVER.JavaRegExp.regExpReplace("[#arguments.strCharsToRemove#](?!#arguments.strMarker#)",strParsed,'',true);
			// finally, remove the markers again
			strParsed = Replace(strParsed,arguments.strMarker,'','ALL');
		} else {
			// tags are not preserved: remove the bracketed tags entirely, along with the vowels
			strParsed = SERVER.JavaRegExp.regExpReplace("\[[^\]]*\]|[#arguments.strCharsToRemove#]",strParsed,'',true);
		} // end if (arguments.bPreserveUBB)
	</cfscript>
	<cfreturn strParsed>
</cffunction>
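If you don't have ColdFusion at hand, the same three steps can be sketched directly against java.util.regex. This is only an approximation of the UDF under a few assumptions: that the component passes its patterns more or less unchanged to Java's engine, that matching is case sensitive, and that only ASCII vowels are stripped (the umlauts are left out to keep the source ASCII-only). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical plain-Java sketch of the mark / strip / unmark steps.
class UbbVowelStripper {
    static final String VOWELS = "aeiouAEIOU";
    static final String MARKER = "\u00B5\u00B5\u00B5"; // same marker as the UDF default

    static String strip(String text, int maxTagLength) {
        // Step 1: append the marker to each vowel that follows an opening
        // bracket with at most maxTagLength non-"]" characters in between.
        String marked = text.replaceAll(
                "(?<=\\[[^\\]]{0," + maxTagLength + "})([" + VOWELS + "])",
                "$1" + MARKER);
        // Step 2: remove every vowel that is not followed by the marker.
        String stripped = marked.replaceAll(
                "[" + VOWELS + "](?!" + Pattern.quote(MARKER) + ")", "");
        // Step 3: remove the markers again (literal replace, no regex).
        return stripped.replace(MARKER, "");
    }

    public static void main(String[] args) {
        String in = "The quick brown fox [foo]jumped[/foo] over the lazy dog. [foobar]";
        System.out.println(strip(in, 50));
    }
}
```

Note that the marker must never occur in the input text, otherwise step 3 would remove it from the comment as well.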

Have you noticed the iMaxBracketTagLength argument? It is necessary because Java's RegEx engine doesn't support lookbehind assertions of unbounded width, although it will accept a bounded quantifier such as {0,50}. This actually makes sense when you think about it: for really long text, the engine would need to "step back" an arbitrary number of characters to check the assertion every time the inner group matches, so it's in your best interest not to allow unlimited backwards searching. In most cases your bracketed expression will have a certain maximum width anyway, so you might just as well give your UDF this information to help it perform better.
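The effect of that bound is easy to observe in isolation. The following sketch (again plain Java with made-up names, and a visible "*" in place of our real marker) runs only the marking step with two different limits:

```java
// Hypothetical illustration of how the lookbehind bound limits marking.
class LookbehindBound {
    // Appends "*" to each lowercase vowel found within maxLen characters
    // after an opening bracket, as long as no "]" lies in between.
    static String markVowels(String text, int maxLen) {
        return text.replaceAll("(?<=\\[[^\\]]{0," + maxLen + "})([aeiou])", "$1*");
    }

    public static void main(String[] args) {
        System.out.println(markVowels("[foobar]", 50)); // all three vowels marked
        System.out.println(markVowels("[foobar]", 2));  // the "a" is out of reach
    }
}
```

With a limit of 2, the "a" in [foobar] is no longer marked and would be stripped in the next step, so pick a limit that covers your longest tag. The limit also caps how far a stray, unclosed "[" can protect the text that follows it.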
