devbox@COMPUTEC The Computec development blog

12 Jun 09

Using wdiff to show differences between text strings

We're currently adding versioning and history functions to CBOX, our ColdFusion-based CMS. Versioning is done entirely on the database layer, thanks to some PostgreSQL programming with lots of triggers, PL/pgSQL and PL/Perl. I might elaborate on this some other time.

One issue I had was how to display the changes between two versions of a text to users. On the database level I already use unified diffs for a similar purpose, as I don't want to store a couple of thousand unchanged characters for the sake of maybe one deleted comma. I'm using some Perl to do the diffing and patching from within PL/pgSQL functions, but you could also try something like cfdiff if you need a pure ColdFusion solution.

A simple diff might look like this:

@@ -1,3 +1 @@
-Article 1
+Article 1 UPDATED
-
-

I find diffs quite hard for humans to read though, especially as they drop at least some of the context. That might be all right for programmers, but ordinary people expect something more like the "track changes" feature in OpenOffice or Word: something that is word-based rather than line-based and displays the full text with the changes marked intuitively inside it. This is where wdiff comes in, because it provides just that.
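To make the difference between line-based and word-based output concrete, here is a minimal Java sketch of a word-level comparison. It is not how wdiff is implemented; it is a plain longest-common-subsequence comparison over whitespace-separated words, and the names WordDiffSketch, words and diff are made up for this illustration. The [-...-] and {+...+} markers mimic wdiff's default output markers.

```java
import java.util.ArrayList;
import java.util.List;

class WordDiffSketch {

    // Split on whitespace; a blank string yields no words at all.
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Compare two strings word by word and mark removed and added words.
    static String diff(String source, String target) {
        String[] a = words(source);
        String[] b = words(target);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both word lists, keeping common words and marking the rest.
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add(a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("[-" + a[i++] + "-]");
            } else {
                out.add("{+" + b[j++] + "+}");
            }
        }
        while (i < a.length) out.add("[-" + a[i++] + "-]");
        while (j < b.length) out.add("{+" + b[j++] + "+}");
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // prints: Article 1 {+UPDATED+}
        System.out.println(diff("Article 1", "Article 1 UPDATED"));
    }
}
```

Unlike the unified diff above, the unchanged words stay in the output, so the reader always sees the full text with the changes marked inside it.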

Here's a partial screenshot of the output of my wdiff wrapper function for some placeholder text:

[Screenshot: wdiff output]

Execution of this function is fairly expensive though. wdiff works on two physical text files, so in order to compare two strings, both have to be written to disk first. Examining the execution path showed, however, that the major bottleneck is the call that launches the wdiff program; writing the files and the wdiff processing itself are both fairly fast, especially after replacing the <CFFILE action="write"> calls with some inline Java code.
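The file-writing step can be sketched in plain Java as follows. This is only an illustration under a few assumptions: it uses java.nio.file (Java 7 or later), and the class and method names (TempTextFiles, write, delete) as well as the "wdiff-" file prefix are made up. It writes UTF-8 explicitly, because a FileWriter created without a charset argument falls back to the JVM's default charset, which is not guaranteed to be UTF-8 on older runtimes and matters as soon as multi-byte text is involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TempTextFiles {

    // Write text to a fresh temp file as UTF-8 and return the file's path.
    static Path write(String text, String label) {
        try {
            Path file = Files.createTempFile("wdiff-" + label + "-", ".worddiff");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Remove a temp file again; a file that is already gone is not an error.
    static void delete(Path file) {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "Gr\u00fc\u00dfe" has 5 characters but takes 7 bytes in UTF-8.
        Path source = write("Gr\u00fc\u00dfe", "source");
        try {
            System.out.println(source + " holds " + source.toFile().length() + " bytes");
        } finally {
            // Clean up even if the comparison step in between fails.
            delete(source);
        }
    }
}
```

Putting the cleanup into a finally block means the temp files also disappear when something goes wrong between writing and deleting them.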

In the current incarnation of the function, the <CFEXECUTE> call has been replaced with Java calls as well. This got rid of the helper shell script that the first version generated on the fly (a third file that needed to be written and deleted), and it made escaping the arguments to the wdiff command easier: only quotation marks need to be doubled, as is customary in ColdFusion, while the other characters that have a special meaning to the shell (like spaces, < and >) need no special treatment at all.

This allows for some added flexibility: there are now optional parameters for setting the tags that go around deleted and inserted text. The developer using the function doesn't have to remember which characters to escape and how, and we don't need any internal escaping algorithm either. If no parameters are set, span tags are used: the one for deleted text marks it in red with a line-through, the one for inserted text marks it in blue.
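Why no shell escaping is needed becomes clearer when the command is written out as a list with one element per argument. The following Java sketch shows the idea with the default tags described above; the class name WdiffCommand, its constants and the file names in main are made up for illustration. Because each argument is handed to the process as is and never parsed by a shell, the spaces, quotes and angle brackets inside the tags are harmless.

```java
import java.util.Arrays;
import java.util.List;

class WdiffCommand {

    static final String DEFAULT_START_DELETE =
            "<span style=\"color:red;text-decoration:line-through\">";
    static final String DEFAULT_START_INSERT = "<span style=\"color:blue\">";
    static final String DEFAULT_END = "</span>";

    // One list element per argument; nothing here is parsed by a shell.
    static List<String> build(String wdiffPath, String sourceFile, String targetFile) {
        return Arrays.asList(
                wdiffPath,
                "--start-delete=" + DEFAULT_START_DELETE,
                "--end-delete=" + DEFAULT_END,
                "--start-insert=" + DEFAULT_START_INSERT,
                "--end-insert=" + DEFAULT_END,
                "-n",
                sourceFile,
                targetFile);
    }

    public static void main(String[] args) {
        List<String> command = build(
                "/usr/bin/wdiff",
                "/tmp/demo.source.worddiff",
                "/tmp/demo.target.worddiff");
        // Print one argument per line to show where the boundaries are.
        for (String arg : command) {
            System.out.println(arg);
        }
        // The list could be passed to new ProcessBuilder(command) unchanged.
    }
}
```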

For demonstration purposes, the function is wrapped in its own CFC; once integrated, it should really go into some tool-factory singleton CFC. In the demo setup, comparing two multi-byte strings of 3.8k and 3.5k characters took around 350-400 ms on a production web server.

<cfcomponent displayname="wdiff">
 
    <cfscript>
        this.strWDTmpPath = '/tmp/';
        this.strWDTmpSuffix = '.worddiff';
        this.strWDPathToWdiff = '/usr/bin/wdiff';
    </cfscript>
 
    <cffunction name="diff" returntype="string" access="public" output="false">
        <cfargument name="strSource" type="string" required="yes">
        <cfargument name="strTarget" type="string" required="yes">       
        <cfargument name="strStartDeleteTag" type="string" required="no" default="<span style=""color:red;text-decoration:line-through"">">
        <cfargument name="strEndDeleteTag" type="string" required="no" default="</span>">
        <cfargument name="strStartInsertTag" type="string" required="no" default="<span style=""color:blue"">">
        <cfargument name="strEndInsertTag" type="string" required="no" default="</span>">
 
        <cfscript>
            // initializing local variables
            var strFname = this.strWDTmpPath & CreateUUID();
            var strSourceFile = strFname & '.source' & this.strWDTmpSuffix;
            var strTargetFile = strFname & '.target' & this.strWDTmpSuffix;
            var strArrCommand = '';
            var string = '';
            var array = '';           
            var strOutput = '';
            var exec = '';
            var stdOut = '';
            var stdErr = '';
            var streamReader = '';
            var bufferedReader = '';
            var check = TRUE;
            var line = '';
            var errorv = '';
            // creating the file that will hold the source string
            var SourceFile = CreateObject('java', 'java.io.File').init(strSourceFile);
            var SourceFileWriter = CreateObject('java', 'java.io.FileWriter').init(SourceFile);           
            // creating the file that will hold the target string
            var TargetFile = CreateObject('java', 'java.io.File').init(strTargetFile);
            var TargetFileWriter = CreateObject('java', 'java.io.FileWriter').init(TargetFile);           
 
            // writing source and target strings and closing the file handles
            SourceFileWriter.write(arguments.strSource);
            TargetFileWriter.write(arguments.strTarget);
            SourceFileWriter.close();
            TargetFileWriter.close();
 
            // now we construct the command to execute as a Java string array
            // each bit (i.e. command and each parameter) needs to go into a separate element
            string = CreateObject("java", "java.lang.String");
            array = CreateObject("java", "java.lang.reflect.Array");
            strArrCommand = array.newInstance(string.getClass(), 8);
            array.set(strArrCommand, 0, this.strWDPathToWdiff);
            array.set(strArrCommand, 1, "--start-delete=#arguments.strStartDeleteTag#");
            array.set(strArrCommand, 2, "--end-delete=#arguments.strEndDeleteTag#");
            array.set(strArrCommand, 3, "--start-insert=#arguments.strStartInsertTag#");
            array.set(strArrCommand, 4, "--end-insert=#arguments.strEndInsertTag#");
            array.set(strArrCommand, 5, "-n");
            array.set(strArrCommand, 6, strSourceFile);
            array.set(strArrCommand, 7, strTargetFile);
 
            // execute the command and attach to StdOut and StdErr
            exec = CreateObject('java', 'java.lang.Runtime').getRuntime().exec(strArrCommand);
            stdOut = exec.getInputStream();
            stdErr = exec.getErrorStream();
 
            // now assemble the output string from the StdOut stream
            streamReader = createObject("java", "java.io.InputStreamReader").init(stdOut);
            bufferedReader = createObject("java", "java.io.BufferedReader").init(streamReader);           
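            while (check) {
                line = bufferedReader.readLine();
                // readLine() returns a Java null at end of stream, which leaves "line" undefined
                if (not isDefined("line")) {
                    check = false;
                } else {
                    // readLine() strips the line break, so put it back to keep adjacent lines apart
                    strOutput = strOutput & line & Chr(10);
                } // if (not isDefined("line"))
            } // while (check)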
 
            // and assemble what we got from StdErr
            check = TRUE;           
            streamReader = createObject("java", "java.io.InputStreamReader").init(stdErr);
            bufferedReader = createObject("java", "java.io.BufferedReader").init(streamReader);           
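            while (check) {
                line = bufferedReader.readLine();
                // readLine() returns a Java null at end of stream, which leaves "line" undefined
                if (not isDefined("line")) {
                    check = false;
                } else {
                    // keep multi-line error messages readable
                    errorv = errorv & line & Chr(10);
                } // if (not isDefined("line"))
            } // while (check)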
        </cfscript>
 
        <!--- clean up after ourselves --->
        <cffile action="delete" file="#strSourceFile#">
        <cffile action="delete" file="#strTargetFile#">
 
        <!--- if we got something from StdErr, throw an error --->
        <cfif len(errorv)><cfthrow message="#errorv#"></cfif>
 
        <cfreturn strOutput>
    </cffunction>
 
</cfcomponent>
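Two details of the read loops in the component are worth a closer look. First, readLine() strips the line terminator, so line breaks have to be put back by hand if the output should keep them. Second, reading StdOut to the end before touching StdErr can stall if the child process writes more to StdErr than the pipe buffer holds. The Java sketch below shows one way around both, under some assumptions: the names ProcessOutput, readAll and run are made up, stderr is merged into stdout with ProcessBuilder.redirectErrorStream(true), and errors are detected through the exit status instead. As far as I understand the wdiff documentation, status 0 means no differences, 1 means differences were found and 2 means an error; treat that as an assumption to verify.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class ProcessOutput {

    // Read a stream to the end, putting back the '\n' that readLine() strips.
    // Note that the result always ends with '\n' if there was any line at all.
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Run a command with stderr merged into stdout, so that a single read
    // loop drains everything the child process writes.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
        String output = readAll(process.getInputStream());
        int status = process.waitFor();
        // Assumption: a status of 2 or more signals an error.
        if (status >= 2) {
            throw new IOException("command failed (" + status + "): " + output);
        }
        return output;
    }

    public static void main(String[] args) {
        // Stand-in for a process stream, so the sketch runs without wdiff installed.
        InputStream fake = new ByteArrayInputStream(
                "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake));
    }
}
```

The trade-off is that error text and regular output end up in the same string, which is acceptable here because the error text is only used when the exit status signals a failure.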

To use this function, you'd simply write something like this:

<cfscript>
    if ((isDefined("URL.reloadwd") and URL.reloadwd is true) or not isDefined("application.objWordDiff")) {
      application.objWordDiff=CreateObject('component','wdiff');
    }
    variables.strDiff=application.objWordDiff.diff(variables.strSource,variables.strTarget);
</cfscript>

Obviously, wdiff needs to be installed on your server. On Debian, a simple

aptitude install wdiff

will do the job.
