[转]Trees in SQL: Nested Sets and Materialized Path

上一篇 / 下一篇  2008-04-10 16:22:57 / 个人分类:Oracle

From:http://www.dbazine.com/oracle/or-articles/tropashko4

————————————————————————————————————

by Vadim Tropashko

Relational databases are universally conceived of as an advance over their predecessorsnetworkandhierarchicalmodels. Superior in every querying respect, they turned out to be surprisingly incomplete when modeling transitive dependencies. Almost every couple of months a question about how to model a tree in the database pops up at the comp.database.theory newsgroup. In this article I'll investigate two out of four well known approaches to accomplishing this and show a connection between them. We'll discover a new method that could be considered as a "mix-in" between materialized path and nested sets.

Adjacency List

Tree structure is a special case of Directed Acyclic Graph (DAG). One way to represent DAG structure is:

createtableemp(
   ename  varchar2(100),
   mgrnamevarchar2(100)
);

Each record of the emp table identified by ename is referring to its parent mgrname. For example, if JONES reports to KING, then the emp table contains <ename='JONES', mgrname='KING'> record. Suppose, the emp table also includes <ename='SCOTT', mgrname='JONES'>. Then, if the emp table doesn't contain the <ename='SCOTT', mgrname='KING'> record, and the same is true for every pair of adjoined records, then it is calledadjacency list. If the opposite is true, then the emp table is atransitively closedrelation.

A typical hierarchical query would ask if SCOTT indirectly reports to KING. Since we don't know the number of levels between the two, we can't tell how many times to selfjoin emp, so that the task can't be solved in traditional SQL. If transitive closure tcemp of the emp table is known, then the query is trivial:

select'TRUE'fromtcemp
whereename = 'SCOTT'andmgrname = 'KING'

The ease of querying comes at the expense oftransitive closure maintenance.

Alternatively, hierarchical queries can be answered with SQL extensions: either SQL3/DB2 recursive query

withtcempas (
   selectename,mgrnamefromtcemp
 union
   selecttcemp.ename,emp.mgrnamefromtcemp,emp
   wheretcemp.mgrname = emp.ename
)select'TRUE'fromtcemp
whereename = 'SCOTT'andmgrname = 'KING';

that calculates tcemp as an intermediate relation, or Oracle proprietary connect-by syntax

select'TRUE'from(
 selectenamefromemp
 connect by priormgrname = ename
 start withename = 'SCOTT'
)whereename = 'KING';

in which the inner query "chases the pointers" from the SCOTT node to the root of the tree, and then the outer query checks whether the KING node is on the path.

Adjacency list is arguably the most intuitive tree model. Our main focus, however, would be the following two methods.

Materialized Path

In this approach each record stores the whole path to the root. In our previous example, lets assume that KING is a root node. Then, the record with ename = 'SCOTT' is connected to the root via the path SCOTT->JONES->KING. Modern databases allow representing a list of nodes as a single value, but since materialized path has been invented long before then, the convention stuck to plain character string of nodes concatenated with some separator; most often '.' or '/'. In the latter case, an analogy to pathnames in UNIX file system is especially pronounced.

In more compact variation of the method, we use sibling numerators instead of node's primary keys within the path string. Extending our example:

ENAMEPATH
KING  1
JONES1.1
SCOTT1.1.1
ADAMS1.1.1.1
FORD1.1.2
SMITH1.1.2.1
BLAKE1.2
ALLEN1.2.1
WARD1.2.2
CLARK1.3
MILLER1.3.1

  
Path 1.1.2 indicates that FORD is the second child of the parent JONES.

Let's write some queries.

1. An employee FORD and chain of his supervisors:

selecte1.enamefromemp e1, emp e2
wheree2.path like e1.path || '%'
ande2.name = 'FORD'

2. An employee JONES and all his (indirect) subordinates:

selecte1.enamefromemp e1, emp e2
wheree1.path like e2.path || '%'
ande2.name = 'JONES'

Although both queries look symmetrical, there is a fundamental difference in their respective performances. If a subtree of subordinates is small compared to the size of the whole hierarchy, then the execution where database fetches e2 record by the name primary key, and then performs a range scan of  e1.path, which is guaranteed to be quick.

On the other hand, the "supervisors" query is roughly equivalent to

selecte1.enamefromemp e1, emp e2
wheree2.path > e1.pathande2.path < e1.path || 'Z'
ande2.name = 'FORD'

Or, noticing that we essentially know e2.path, it can further be reduced to

selecte1.enamefromemp e1
wheree2path > e1.pathande2path < e1.path || 'Z'

Here, it is clear that indexing on path doesn't work (except for "accidental" cases in whiche2path happens to be near the domain boundary, so that predicate e2path > e1.path is selective).

The obvious solution is that we don't have to refer to the database to figure out all the supervisor paths! For example, supervisors of 1.1.2 are 1.1 and 1. A simple recursive string parsing function can extract those paths, and then the supervisor names can be answered by

selecte1.enamefromempwheree1.pathin('1.1','1')

which should be executed as a fast concatenated plan.

Nested Sets

Both the materialized path andJoe Celko's nested setsprovide the capability to answer hierarchical queries with standard SQL syntax. In both models, theglobalposition of the node in the hierarchy is "encoded" as opposed to an adjacency list of which each link is alocalconnection between immediate neighbors only. Similar to materialized path, the nested sets model suffers from supervisors query performance problem: 

selectp2.empfromPersonnel p1, Personnel p2
wherep1.lftbetweenp2.lftandp2.rgt
andp1.emp = 'Chuck'

(Note: This query is borrowed from thepreviously cited Celko article). Here, the problem is even more explicit than in the case of a materialized path: we need to find all the intervals that cover a given point. This problem is known to be difficult. Although there are specialized indexing schemes like R-Tree, none of them is as universally accepted as B-Tree. For example, if the supervisor's path contains just 10 nodes and the size of the whole tree is 1000000, none of indexing techniques could provide 1000000/10=100000 times performance increase. (Such a performance improvement factor is typically associated with index range scan in a similar, very selective, data volume condition.)

Unlike a materialized path, the trick by which we computed all the nodes without querying the database doesn't work for nested sets.

Another — more fundamental — disadvantage of nested sets is that nested sets coding isvolatile. If we insert a node into the middle of the hierarchy, all the intervals with the boundaries above the insertion point have to be recomputed. In other words, when we insert a record into the database, roughly half of the other records need to be updated. This is why the nested sets model received only limited acceptance forstatichierarchies.

Nested sets are intervals of integers. In an attempt to make the nested sets model more tolerant to insertions, Celko suggested we give up the property that each node always has (rgt-lft+1)/2 children. In my opinion, this is a half-step towards a solution: any gap in anested set model with large gaps and spreads in the numberingstill could be covered with intervals leaving no space for adding more children, if those intervals are allowed to have boundaries atdiscretepoints (i.e., integers) only. One needs to use a dense domain like rational, or real numbers instead.

Nested Intervals

Nested intervals generalize nested sets. A node [clft, crgt] is an (indirect) descendant of [plft, prgt] if:

plft <= clft and crgt >= prgt

The domain for interval boundaries is not limited by integers anymore: we admit rational or even real numbers, if necessary. Now, with a reasonable policy, adding a child node is never a problem. One example of such a policy would be finding an unoccupied segment [lft1, rgt1] within a parent interval [plft, prgt] and inserting a child node[(2*lft1+rgt1)/3, (rgt1+2*lft)/3]:

After insertion, we still have two more unoccupied segments [lft1,(2*lft1+rgt1)/3] and [(rgt1+2*lft)/3,rgt1] to add more children to the parent node.

We are going to amend this naive policy in the following sections.

Partial Order

Let's look at two-dimensional picture of nested intervals. Let's assume that rgt is a horizontal axis x, and lft is a vertical one - y. Then, the nested intervals tree looks like this:

Each node [lft, rgt] has its descendants bounded within the two-dimensionalconey >= lft & x <= rgt. Since the right interval boundary is always less than the left one, none of the nodes are allowed above the diagonal y = x.

The other way to look at this picture is to notice that a child node is a descendant of the parent node whenever a set of all points defined by the child cone y >= clft & x <= crgt is a subset of the parent cone y >= plft & x <= prgt. A subset relationship between the cones on the plane is apartialorder.

Now that we know the two constraints to which tree nodes conform, I'll describe exactly how to place them at the xy plane.

The Mapping

Tree root choice is completely arbitrary: we'll assume the interval [0,1] to be the root node. In our geometrical interpretation, all the tree nodes belong to the lower triangle of the unit square at thexyplane.

We'll describe further details of the mapping by induction. For each node of the tree, let's first define two important points at thexyplane. Thedepth-first convergence pointis an intersection between the diagonal and the vertical line through the node. For example, the depth-first convergence point for<x=1,y=1/2>is<x=1,y=1>. Thebreadth-first convergence pointis an intersection between the diagonal and the horizontal line through the point. For example, the breadth-first convergence point for<x=1,y=1/2>is<x=1/2,y=1/2>.

Now, for each parent node, we define the position of the first child as a midpoint halfway between the parent point and depth-first convergence point. Then, each sibling is defined as a midpoint halfway between the previous sibling point and breadth-first convergence point:

For example, node 2.1 is positioned at x=1/2, y=3/8.

Now that the mapping is defined, it is clear which dense domain we are using: it's not rationals, and not reals either, but binary fractions (although, the former two would suffice, of course).

Interestingly, the descendant subtree for the parent node "1.2" is a scaled down replica of the subtree at node "1.1." Similarly, a subtree at node 1.1 is a scaled down replica of the tree at node "1." A structure with self-similarities is called afractal.

Normalization

Next, we notice that x and y are not completely independent. We can tell what are both x and y if we know theirsum. Given the numerator and denominator of the rational number representing the sum of the node coordinates, we can calculate x and y coordinates back as:

functionx_numer(numerinteger, denominteger)
RETURN integer IS
   ret_numinteger;
   ret_deninteger;
BEGIN
   ret_num := numer+1;
   ret_den := denom*2;
  whilefloor(ret_num/2b = ret_num/2loop
      ret_num := ret_num/2;
      ret_den := ret_den/2;
  end loop;
  RETURNret_num;
END;

functionx_denom( numerinteger, denominteger)
...
  RETURNret_den;
END;

in which function x_denom body differs from x_numer in the return variable only. Informally, numer+1 increment would move the ret_num/ret_den point vertically up to the diagonal, and then x coordinate is half of the value, so we just multiplied the denominator by two. Next, we reduce both numerator and denominator by the common power of two.

Naturally, y coordinate is defined as a complement to the sum:

functiony_numer( numerinteger, denominteger)
RETURN integer IS
   numinteger;
   deninteger;
BEGIN
   num := x_numer(numer, denom);
   den := x_denom(numer, denom);
  whileden < denom loop
      num := num*2;
      den := den*2;
  end loop;
   num := numer - num;
  whilefloor(num/2) = num/2loop
      num := num/2;
      den := den/2;
  end loop;
  RETURNnum;
END;

functiony_denom( numerinteger, denominteger)
...
  RETURNden;
END;

Now, the test (where 39/32 is the node 1.3.1):

selectx_numer(39,32)||'/'||x_denom(39,32),
y_numer(39,32)||'/'||y_denom(39,32)fromdual

5/8     19/32select5/8+19/32, 39/32fromdual

1.21875 1.21875

I don't use a floating point to represent rational numbers, and wrote all the functions with integer arithmetics instead. To put it bluntly, the floating point number concept in general, and the IEEE standard in particular, is useful for rendering 3D-game graphics only. In the last test, however, we used a floating point just to verify that 5/8 and 19/32, returned by the previous query, do indeed add to 39/32.

We'll store two integer numbers —numeratoranddenominatorof the sum of thecoordinates xandy— as anencoded nodepath. Incidentally, Celko's nested sets use two integers as well. Unlike nested sets, our mapping isstable: each node has a predefined placement at thexyplane, so that the queries involving node position in the hierarchy could be answered without reference to the database. In this respect, our hierarchy model is essentially a materialized path encoded as a rational number.

Finding Parent Encoding and Sibling Number

Given a child node with numer/denom encoding, we find the node's parent like this:

functionparent_numer( numerinteger, denominteger)
RETURN integer IS
   ret_numinteger;
   ret_deninteger;
BEGIN
  ifnumer=3then
     return NULL;
  end if;
   ret_num := (numer-1)/2;
   ret_den := denom/2;
  whilefloor((ret_num-1)/4) = (ret_num-1)/4 loop
      ret_num := (ret_num+1)/2;
      ret_den := ret_den/2;
  end loop;
  RETURNret_num;
END;

functionparent_denom( numerinteger, denominteger)
...
  RETURNret_den;
END;

The idea behind the algorithm is the following: If the node is on the very top level — and all these nodes have a numerator equal to 3 — then the node has no parent. Otherwise, we must move vertically down thexyplane at a distance equal to the distance from the depth-first convergence point. If the node happens to be the first child, then that is the answer. Otherwise, we must move horizontally at a distance equal to the distance from the breadth-first convergence point until we meet the parent node.

Here is the test of the method (in which 27/32 is the node 2.1.2, while 7/8 is 2.1):

selectparent_numer(27,32)||'/'||parent_denom(27,32)fromdual

7/8

In the previous method, counting the steps when navigating horizontally would give the sibling number:

functionsibling_number( numerinteger, denominteger)
RETURN integer IS
   ret_numinteger;
   ret_deninteger;
   retinteger;
BEGIN
  ifnumer=3then
     return NULL;
  end if;
   ret_num := (numer-1)/2;
   ret_den := denom/2;
   ret     := 1;
  whilefloor((ret_num-1)/4) = (ret_num-1)/4loop
     ifret_num=1 and ret_den=1then
        returnret;
     end if;
      ret_num := (ret_num+1)/2;
      ret_den := ret_den/2;
      ret     := ret+1;
  end loop;
  RETURNret;
END;

For a node at the very first level a special stop condition, ret_num=1andret_den=1 is needed.

The test:

select sibling_number(7,8) from dual

1

Calculating Materialized Path and Distance between nodes

Strictly speaking, we don't have to use a materialized path, since our encoding is an alternative. On the other hand, a materialized path provides a much more intuitive visualization of the node position in the hierarchy, so that we can use the materialized path for input and output of the data if we provide the mapping to our model.

Implementation is a simple application of the methods from the previous section. We print the sibling number, jump to the parent, then repeat the above two steps until we reach the root:

functionpath( numerinteger, denominteger)RETURN varchar2 ISBEGINifnumerisNULLthenreturn'';end if;RETURNpath(parent_numer(numer, denom),
               parent_denom(numer, denom)) 
     || '.' || sibling_number(numer, denom);END;selectpath(15,16)fromdual

.2.1.1

Now we are ready to write the main query:given the 2 nodes, P and C, when P is the parent of C? A more general query would return the number of levels between P and C if C is reachable from P, and some exception indicator; otherwise:

functiondistance( num1integer, den1integer, 
                   num2integer, den2integer)RETURN integer IS 
BEGINifnum1isNULLthenreturn-999999;end if;ifnum1=num2andden1=den2thenreturn0;end if;RETURN1+distance(parent_numer(num1, den1),
                     parent_denom(num1, den1), 
            num2,den2);END;selectdistance(27,32,3,4)fromdual

2

Negative numbers are interpreted as exceptions. If the num1/den1 node is not reachable from num2/den2, then the navigation converges to the root, and level(num1/den1)-999999 would be returned (readers are advised to find a less clumsy solution).

The alternative way to answer whether two nodes are connected is by simply calculating the x and y coordinates, and checking if the parent interval encloses the child. Although none of the methods refer to disk, checking whether the partial order exists between the points seems much less expensive! On the other hand, it is just a computer architecture artifact that comparing two integers is an atomic operation. More thorough implementation of the method would involve a domain of integers with a unlimited range (those kinds of numbers are supported by computer algebra systems), so that a comparison operation would be iterative as well.   

Our system wouldn't be complete without a function inverse to the path, which returns a node's numer/denom value once the path is provided. Let's introduce two auxiliary functions, first:

functionchild_numer 
( numinteger, deninteger, childinteger)RETURNinteger ISBEGINRETURNnum*power(2, child)+3-power(2, child);END;functionchild_denom 
( numinteger, deninteger, childinteger)RETURNinteger ISBEGINRETURNden*power(2, child);END;selectchild_numer(3,2,3) || '/' ||
       child_denom(3,2,3)fromdual

19/16

For example, the third child of the node 1 (encoded as 3/2) is the node 1.3 (encoded as 19/16).

The path encoding function is:

functionpath_numer( pathvarchar2)RETURN integer ISnuminteger; 
   deninteger;
   postfixvarchar2(1000); 
   siblingvarchar2(100); 
BEGIN
   num := 1;
   den := 1;
   postfix := '.' || path || '.';whilelength(postfix) > 1loopsibling := substr(postfix, 2, 
                        instr(postfix,'.',2)-2);
      postfix := substr(postfix, 
                        instr(postfix,'.',2), 
                              length(postfix)
                             -instr(postfix,'.',2)+1);
      num := child_numer(num,den,to_number(sibling)); 
      den := child_denom(num,den,to_number(sibling));end loop;RETURNnum;END;functionpath_denom( pathvarchar2) 
...RETURNden;END;

select path_numer('2.1.3') || '/' ||
       path_denom('2.1.3') from dual
51/64

The Final Test

Now that the infrastructure is completed, we can test it. Let's create the hierarchy

createtableemps (
   namevarchar2(30),
   numerinteger,
   denominteger
)

alter tableemps
ADD CONSTRAINTuk_nameUNIQUE(name)USING INDEX
   (CREATE UNIQUE INDEXname_idxonemps(name))
ADD CONSTRAINTUK_node
  UNIQUE(numer, denom)USING INDEX
   (CREATE UNIQUE INDEXnode_idxonemps(numer, denom))

and fill it with some data:

insertintoempsvalues('KING',
      path_numer('1'),path_denom('1'));
insert intoempsvalues('JONES',
      path_numer('1.1'),path_denom('1.1'));
insert intoempsvalues('SCOTT',
      path_numer('1.1.1'),path_denom('1.1.1'));
insert intoempsvalues('ADAMS',
      path_numer('1.1.1.1'),path_denom('1.1.1.1'));
insert intoempsvalues('FORD',
      path_numer('1.1.2'),path_denom('1.1.2'));
insert intoempsvalues('SMITH',
      path_numer('1.1.2.1'),path_denom('1.1.2.1'));
insert intoempsvalues('BLAKE',
      path_numer('1.2'),path_denom('1.2'));
insert intoempsvalues('ALLEN',
      path_numer('1.2.1'),path_denom('1.2.1'));
insert intoempsvalues('WARD',
      path_numer('1.2.2'),path_denom('1.2.2'));
insert intoempsvalues('MARTIN',
      path_numer('1.2.3'),path_denom('1.2.3'));
insert intoempsvalues('TURNER',
      path_numer('1.2.4'),path_denom('1.2.4'));
insert intoempsvalues('CLARK',
      path_numer('1.3'),path_denom('1.3'));
insert intoempsvalues('MILLER',
      path_numer('1.3.1'),path_denom('1.3.1'));
commit;

All the functions written in the previous sections are conveniently combined in a single view:

create or replace
viewhierarchyas
 selectname, numer, denom,
         y_numer(numer,denom) numer_left,
         y_denom(numer,denom) denom_left,
         x_numer(numer,denom) numer_right,
         x_denom(numer,denom) denom_right,
         path (numer,denom) path,
         distance(numer,denom,3,2) depth
 fromemps

And, finally, we can create the hierarchical reports.

      • Depth-first enumeration, ordering by left interval boundary
selectlpad(' ',3*depth)||name
fromhierarchyorderbynumer_left/denom_left

LPAD('',3*DEPTH)||NAME
-----------------------------------------------
KING
   CLARK
      MILLER
   BLAKE
      TURNER
      MARTIN
      WARD
      ALLEN
   JONES
      FORD
         SMITH
      SCOTT
         ADAMS

      • Depth-first enumeration, ordering by right interval boundary
selectlpad(' ',3*depth)||name
fromhierarchyorder bynumer_right/denom_right desc

LPAD('',3*DEPTH)||NAME
-----------------------------------------------------
KING
   JONES
      SCOTT
         ADAMS
      FORD
         SMITH
   BLAKE
      ALLEN
      WARD
      MARTIN
      TURNER
   CLARK
      MILLER

      • Depth-first enumeration, ordering by path (output identical to #2)
selectlpad(' ',3*depth)||name
fromhierarchyorder bypath

LPAD('',3*DEPTH)||NAME
-----------------------------------------------------
KING
   JONES
      SCOTT
         ADAMS
      FORD
         SMITH
   BLAKE
      ALLEN
      WARD
      MARTIN
      TURNER
   CLARK
      MILLER

      • All the descendants of JONES, excluding himself:
selecth1.namefromhierarchy h1, hierarchy h2
whereh2.name = 'JONES'
anddistance(h1.numer, h1.denom,
             h2.numer, h2.denom)>0;

NAME
------------------------------
SCOTT
ADAMS
FORD
SMITH

      • All the ancestors of FORD, excluding himself:
selecth2.namefromhierarchy h1, hierarchy h2
whereh1.name = 'FORD'
anddistance(h1.numer, h1.denom,
             h2.numer, h2.denom)>0;

NAME
------------------------------
KING
JONES

--

Vadim Tropashkoworks for Real World Performance group at Oracle Corp. In prior life he was application programmer and translated "The C++ Programming Language" by B.Stroustrup, 2nd edition into Russian. His current interests include SQL Optimization, Constraint Databases, and Computer Algebra Systems.


贡献人: Vadim Tropashko

TAG:

 

评分:0

我来说两句

显示全部

:loveliness: :handshake :victory: :funk: :time: :kiss: :call: :hug: :lol :'( :Q :L ;P :$ :P :o :@ :D :( :)

日历

« 2008-10-16  
   1234
567891011
12131415161718
19202122232425
262728293031 

数据统计

  • 访问量: 6980
  • 日志数: 47
  • 建立时间: 2007-12-12
  • 更新时间: 2008-10-07

RSS订阅

Open Toolbar