Calculate minimal operations to make two tree structures identical

Algorithm Tree Comparison Diff Computer Science

Algorithm Problem Overview

This is more of a CS question, but an interesting one :

Let's say we have 2 tree structures with more or less the same nodes reorganized. How would you find

any
in some sense minimal

sequence of operations

MOVE(A, B) - moves node A under node B (with the whole subtree)
INSERT(N, B) - inserts a new node N under node B
DELETE (A) - deletes the node A (with the whole subtree)

that transforms one tree to the other.

There might obviously be cases where such transformation is not possible, trivial being root A with child B to root B with child A etc.). In such cases, the algorithm would simply deliver an result "not possible".

Even more spectacular version is a generalization for networks, i.e. when we assume that a node can occur multiple times in the tree (effectively having multiple "parents"), while cycles are forbidden.

Disclaimer : This is not a homework, actually it comes from a real business problem and I found it quite interesting wondering if somebody might know a solution.

Algorithm Solutions

Solution 1 - Algorithm

There is not only a Wikipedia article on graph isomorphism (as Space_C0wb0y points out) but also a dedicated article on the graph isomorphism problem. It has a section Solved special cases for which polynomial-time solutions are known. Trees is one of them and it cites the following two references:

P.J. Kelly, "A congruence theorem for trees" Pacific J. Math., 7 (1957) pp. 961–968
Aho, Alfred V.; Hopcroft, John; Ullman, Jeffrey D. (1974), The Design and Analysis of Computer Algorithms, Reading, MA: Addison–Wesley .

Solution 2 - Algorithm

You weren't clear if you were comparing abstract syntax trees for source code, XML documents interpreted as trees, or some other type of tree.

There's a number of papers that discuss comparing syntax trees and computing minimum distances by various means. The ideas should be relevant.

A good paper is http://www.merlin.uzh.ch/publication/show/2531">Change Distilling, which tries to compare the source code for two abstract syntax trees and report a minimal difference. The paper talks about a specific method, and also briefly mentions (and provides references) to a variety of similar techniques.

Few of these algorithms are actually realized in available tools for comparing computer program source text. Our http://wwww.semanticdesigns.com/Products/SmartDifferencer">Smart Differencer is one of them; it is driven by an explicit language grammar for many languages.

Solution 3 - Algorithm

Although this question is old, i'll add a couple more references and algorithms below:

Furthermore, there are libraries and frameworks on GitHub (in javascript) which implement diffing of Tree-like structures for example applications dealing with JSON data or XML Trees (e.g for client-side MVC/MVVM):

Solution 4 - Algorithm

In case folks find this question and need something implemented for Node.js or the browser, I'm providing a link and code example for an implementation I've written that you can find on github here: (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram).

This is a tree edit distance approximation algorithm, but it is much, much faster than trying to find the true edit distance. The approximation performs in O(n log n) time and O(n) space whereas true edit distance is often O(n^3) or O(n^2) using known algorithms for true edit distance. See the academic paper from which the PQ-Gram algorithm comes: (http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf)

So using jqgram:

Example:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3 },
function(result) {
    console.log(result.distance);
});

And that gives you a number between 0 and 1. The closer to zero, the more closely related the two trees look to jqgram. One approach could be to use jqgram to narrow in on several closely related trees from among many trees given its speed, then utilize true edit distance on the few trees remaining that you need to take a closer inspection of, and for that you can find python implementations for reference or port of the Zhang & Shasha algorithm for example.

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can do funky things like comparing an object to a browser DOM for example. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees. So in that sense it is (in my opinion anyway) much easier to use than PyGram. Plus, its Javascript, so use it client or server-side!

ALSO, to answer with respect to cycle detection, check out the clone method inside of jqgram, there is cycle detection there, but credit for that goes to the author of node-clone from which that piece was slightly modified and included.

Solution 5 - Algorithm

This is called the tree to tree correction problem or the tree to tree editing problem. Most of the literature dealing with this explicitly relates to comparing XML trees for some reason, so searching for "XML diffing algorithm" yields a lot of results. In addition to Nikos's list of links, I found these:

Fine-grained Change Detection in Structured Text Documents (2014)
Change Detection by Level (CDL): An Efficient Algorithm to Detect Change on XML Documents (2010)
Comparing XML Documents as Reference-aware Labeled Ordered Trees (2011) ~~The code for this - VTracker still exists!~~ Edit: actually the interesting bit of code is not included. This pointed me to...
UMLDiff: An Algorithm for Object-Oriented Design Differencing (2005).
Revisiting the tree edit distance and its backtracing: A tutorial (2018) - looks like a good tutorial for the Zhang-Shasha algorithm, which seems to be the "classic" solution, but has terrible time complexity because it compares every sub-tree with every other sub-tree.

I also strongly recommend reading Change Detection in XML Trees: a Survey but it is from 2005 so barely any of the tools it mentions exist anymore. Comparing XML Documents as Reference-aware Labeled Ordered Trees has the best intuitive description of some of the algorithms that I have found so far (start at section 2.1.2).

Unfortunately there doesn't seem to be much open source code available that does this and isn't ancient. Just a lot of overly-complex papers. :-/

Content Type	Original Author	Original Content on Stackoverflow
Question	Tomas Vana	View Question on Stackoverflow
Solution 1 - Algorithm	Andre Holzner	View Answer on Stackoverflow
Solution 2 - Algorithm	Ira Baxter	View Answer on Stackoverflow
Solution 3 - Algorithm	Nikos M.	View Answer on Stackoverflow
Solution 4 - Algorithm	Matt Mullens	View Answer on Stackoverflow
Solution 5 - Algorithm	Timmmm	View Answer on Stackoverflow