JMay 2005        Issue: 35

Journal of Conceptual Modeling
www.inconcept.com/jcm

Using ANSI SQL as a Conceptual Hierarchical Data Modeling

and Processing Language for XML
Michael M David

Advanced Data Access Technologies, Inc.

 Abstract

 This paper will explain and show how standard ANSI SQL processors can naturally model and automatically process complex multi-leg hierarchical data structures at a full conceptual hierarchical level. This also means the query user does not need to have structure knowledge of the hierarchical structures involved. The data modeling capability includes dynamically combining logical hierarchical relational and physical XML data structures at a full hierarchical level. This also includes the ability to link below the root of the lower level structure intuitively forming a valid unified hierarchical structure. As will be shown, ANSI SQL’s high level hierarchical data processing allows the flexible conceptual control of hierarchical node promotion, fragment processing, structure transformation, and variable structure creation.  Also described is how the complex multi-leg hierarchical data filtering and processing is automatically carried out by the relational engine’s Cartesian product processing which inherently utilizes Lowest Common Ancestor (LCA) logic. Other ANSI SQL hierarchical processing features covered include the natural ability to maintain hierarchical processing operation in distributed processing environments across heterogeneous SQL systems and to apply all of these natural hierarchical processing abilities to the transparent and full integration of XML.

 1) Hierarchical 4GL Capabilities

 Fourth generation languages are nonprocedural languages. They are also known as declarative languages because the user only needs to specify what they want and not how to get it. In order for this to work, the 4GL needs to know the hierarchical data structure being processed. Using the inherent meta information derived directly from hierarchical structures definitions, the 4GL can automatically determine the semantics in the data structure. This can be naturally applied to the processing of the query without having to be supplied with the hierarchical processing instructions or logic.

 With the advent and popularity of XML today, hierarchical data structures are coming back in a big way. Unfortunately, maybe because of the lack of knowledge about hierarchical structures and their capabilities, XML query languages have not taken full advantage of hierarchical structures and the powerful structural semantics they contain. This results in XML query languages having to be procedurally supplied with hierarchical semantic logic causing significant difficulty in specifying hierarchical queries that operate across multiple legs of the hierarchical structure. The programming complexity level increases as the number of legs involved in processing the query increases. This can make semantically complex queries not worth the time or effort to specify procedurally. It also prevents the ability to specify ad hoc queries, and increases the chances of inaccurate results from faulty hierarchical processing logic.

 Multi-leg queries greatly increase the hierarchical semantics involved and complexity level necessary for processing them. While more complex to process, multi-leg queries processed by nonprocedural 4GLs are easy and intuitive for the user to specify, are useful in decision support, and can avoid using procedurally complex queries. More importantly, being able to freely query the entire structure nonprocedurally allows any query to be processed regardless of the number of legs involved. This frees the ad hoc user or query developer of any concerns or knowledge of the structure. The 4GL user does not need to know the location of the data or hierarchical relationships of the structure being queried. These complex hierarchical operations which are determined by the interaction of the query semantics and the hierarchical data structure semantics are not well understood or documented today, but can still be performed automatically by a 4GL like SQL.

 1.1) Hierarchical Semantics

 Full hierarchical structures as described in this paper support the use of multiple node types and twins. Twins mean that there can be multiple data occurrences under their related parent node data occurrence. Hierarchical structures built on top of relational databases do not usually support these capabilities, but ANSI SQL’s inherent hierarchical support naturally supports these capabilities allowing complex hierarchical structures to be supported. Twins (multiple data occurrences) are more tightly related than siblings (multiple legs) and have different semantics. Siblings affect the hierarchical structure model and twins do not. For this reason twin semantics are often overlooked in properly processing hierarchical data meaningfully. Node occurrences not related by the same parent node occurrence are not usually considered meaningfully related and if treated as they are will usually produce an incorrect result. This is why the use of “occurrence” or “data occurrence” is used liberally and meaningfully in this paper.

 Hierarchical data structures have an unambiguous semantics because they have only a single path to every node. This allows a nonprocedural query language to navigate them automatically and process the request utilizing the unambiguous semantics in the structure. The different legs of a hierarchical structure are independent of one another, but are related indirectly by the Lowest Common Ancestor (LCA) [Ullman] nodes that connect them. This makes the nodes on different legs indirectly related which is referred to here as cousins. This means that a single node data occurrence on one leg of the structure is also related to all other node data occurrences for sibling legs under their Lowest Common Ancestor node data occurrence. Each Lowest Common Ancestor data node occurrence naturally coordinates the hierarchical control logic necessary for hierarchical processing across legs. The number of Lowest Common Ancestors nodes necessary to process the query increases with the number of legs referenced in a query.

 Hierarchical data structure semantics today are currently being utilized only on a single leg at a time. A primary advantage of a full hierarchical structure is that it can define any combination of the legs necessary to process any multi-leg query. This semantics applied across the multiple legs means that the semantics between every node in the structure are meaningful and can be utilized for hierarchical processing. Typically a 4GL user querying data nonprocedurally from a full hierarchical structure is naturally going to reference multiple legs. This is because the user does not need to know the structure or be restricted by it. This multi-leg processing can be handled automatically by a 4GL utilizing the inherent semantics and principles that occur inherently between all nodes in the entire structure. Hierarchical semantics remain consistent independent of how the structure is stored and represented physically or logically.

 1.2) Hierarchical SQL Opportunity 

 SQL’s underutilized and overlooked inherent hierarchical processing capabilities offers proof of what has been stated above about current underutilized hierarchical capabilities in XML. By modeling hierarchical structures directly in SQL using the ANSI SQL Left Outer Join operation where tables and XML elements are nodes in the hierarchical structure, the relational engine automatically operates at a full hierarchical level by inherently utilizing the semantics in the structure [David 1]. This means hierarchical processing is a valid subset of relational processing. This makes XML integration in SQL seamless and capable of full hierarchical coverage carried out completely nonprocedurally and automatically.

 SQL’s nonprocedural processing utilizes the semantics naturally present in hierarchical data structures. This drives the hierarchical processing which can automatically correlate information throughout the entire structure and can also be performed in ad hoc mode to support powerful decision support. This significantly increases the value of the data by automatically being able to dynamically process very semantically complex queries. This takes on staggering importance and value with the goldmine of hierarchical semantics naturally existing in XML data structures and also available for legacy hierarchical data sources which can also be processed seamlessly by SQL hierarchical processing.

 2) ANSI SQL as a Hierarchical 4GL Query Processor

 With ANSI SQL being a  4GL, data modeling SQL syntax and semantics automatically instruct the SQL relational engine how to perform the hierarchical query processing. This is because all nodes are inherently related semantically and can be automatically processed as stated earlier. This enables SQL to perform hierarchical processing naturally by following the semantics in the data structure being processed. The data structure is defined hierarchically by the Left Outer Join operation which unlike the standard and default Inner Join operates hierarchically. This enables complex hierarchical multi-leg queries to be processed nonprocedurally and automatically. For example, this allows an SQL query that selects data from one area of a structure based on data in another area of the structure to be performed automatically. The logic involved uses the Lowest Common Ancestor nodes between the referenced legs to coordinate and correlate the multi-leg processing based on the hierarchical semantics.

 Prior SQL hierarchical processing research involved different approaches for extending SQL syntax and semantics, a representative example can be found in [Abiteboul], [Leven] and [Sengupta]. The SQL hierarchical processing research described in this current paper, unlike previous hierarchical attempts, utilizes unrealized inherent ANSI SQL hierarchical processing capabilities [David 1] and its seamless extension to XML [David 2] to avoid the use of SQL extensions to perform full hierarchical processing against XML. The research in this current paper may be the first to trace and explain how Lowest Common Ancestor logic is being inherently performed by the relational engine.

 A simple SQL query involving more than two legs would involve multiple Lowest Common Ancestors and their interrelationships making the internal logic increasingly complex. Fortunately, nonprocedural relational languages such as SQL can do this accurately and automatically regardless of the semantic complexity. This is because the relational hierarchically restricted Cartesian product processing automatically generates all valid hierarchical combinations between the legs of each Lowest Common Ancestor node data occurrence which automatically performs this hierarchical logic inherently and accurately. Current XML query languages such as XQuery are controlled procedurally by user supplied multi-leg hierarchical semantic logic. This can make them impracticable for semantically complex multi-leg hierarchical operations. For this reason there are a number of academic projects adding Lowest Common Ancestor logic on top of XQuery for the nonprocedural processing of queries across multiple legs [Chen 1], [Chen 2], [Cohen], and [Li].

 With hierarchical structure data preservation, each separate leg of a hierarchical structure can vary dynamically in number of data types from leg occurrence to leg occurrence. Surprisingly, this does not present a problem for fixed column relational data. Variable column legs in SQL relational databases are logically represented correctly hierarchically in form and length in standard fixed column relational rowsets. This is because the variable missing trailing columns of every leg occurrence are automatically padded with Nulls as a normal operation of the SQL Outer Join data preservation operation. This allows a variable number of columns in hierarchical leg occurrences to be processed accurately in fixed length rowsets dynamically preserving the fixed alignment. This enables seamless and transparent operation of variable length legs.

 SQL’s automatic processing of full multi-leg structures has all the same operational characteristics and capabilities as any physical or logical hierarchical structure being processed by older proven nonprocedural hierarchical processors. This relationally backed processing also validates that this is the correct mathematically correct way of hierarchical processing. Hierarchical processing and its semantics operate the same for physical (XML) or logical (relational) structures. This enables a consistent and seamless operation across heterogeneous structure formats when SQL is automatically performing hierarchically. There will be further coverage on physical and logical structures later.

 2.1) Hierarchical Query Specification

 A typical SQL request specification uses the familiar SELECT, FROM, and WHERE clause syntax.  The following description explains how this syntax naturally applies to hierarchical processing when SQL is processing hierarchically modeled structures. Since the SELECT clause relies on the data specified in the FROM clause, the FROM clause will be discussed first. The FROM clause specifies the  location of data types that form a pool of nodes consisting of relational tables and XML elements which will be placed in the input working set as needed. All required data types for processing the query must be included there which the FROM clause satisfies. The Left Outer Join operation is specified in the FROM clause to specify how the nodes are related hierarchically into a single structure. This can be abstracted into separate SQL views representing meaningful substructures that can be combined in multiple ways.

 The SELECT clause specifies which data items in the modeled structured defined in the FROM clause are to be returned. If any data item in a node is returned then that node is represented in the returned structure, otherwise that node is excluded in the returned structure.  The WHERE clause is used to return only the qualified data occurrences. If no WHERE clause is specified than no data filtering is performed.

 2.2) FROM Clause Controls Hierarchical Data Modeling

 The data associated with the left argument of the Left Outer Join is hierarchically preserved over the right data argument because it is preserved even when the right argument’s data is not present. Using this capability, any hierarchical structure can be modeled and processed hierarchically by the relational engine performing the hierarchical semantics associated with the SQL hierarchical modeling syntax. This is shown below in the definition of the ViewX structure in Figure 2.2 below. You can notice how legs are created by modeling them going down the structure and how multiple legs are formed when ON clauses link back to nodes already with a formed or partial formed leg. This SQL hierarchical Left Outer Join view can be easily generated automatically from existing metadata sources such as XML schemas.

                                ViewX             CREATE VIEW ViewX AS
                                    A                  SELECT * FROM A
                                   /  \                  LEFT JOIN B ON A.a=B.a            
                                 B   C                LEFT JOIN C ON A.a=C.a
                                      /  \               LEFT JOIN D ON C.c=D.d
                                     D  E             LEFT JOIN E ON C.c=E.c

            Figure 2.2: SQL hierarchical data modeling

 The SQL hierarchical data modeling syntax in Figure 2.2 above is also directly executable by SQL to perform hierarchical processing. This is possible because the associated SQL semantics of the data modeling SQL is specifying the basic principles of hierarchical structures and their processing. This makes for an extremely tight and seamless bond between SQL hierarchical data modeling and its associated hierarchical processing assuring data modeling accuracy, efficient processing, and an open and available data modeling language and its accompanying hierarchical processor. Also realize that this SQL data modeling syntax is a self defining data structure definition that accompanies the SQL were ever it may go. These powerful characteristics have very useful implications that will be examined further.

2.2.1) Joining Hierarchical Structures

In the same way that the hierarchical data modeling was performed a node at time using Left Outer Joins in Section 2.2, hierarchical substructures defined in SQL views can be joined (linked) by Left Outer Joins deriving a hierarchical superstructure. The left structure is joined over the right structure linked by the ON clause join criteria as demonstrated in Figure 2.2.1 below. The ON clause takes on added importance because it also specified the join points in each structure, in this case it is the C node linked to the X node.

                                      Relational                        XML
                                         ViewX                          ViewY
                                             
A                                  X
                                            /   \                                /   \         
                                          B    C                            Y    Z
                                                /  \
                                              D   E                              
                                                                                 
              SELECT * FROM ViewX LEFT JOIN ViewY
                                    ON C.Key=X.Key

                                                  Joined  Structure                
                                                              
A      
                                                             /   \                 
                                                           B    C   
                                                                /  |  \
                                                             D  E  X
                                                                      /  \
                                                                    Y   Z

           Figure 2.2.1: Joining hierarchical data structures

 The SQL input views in Figure 2.2.1 above are modeling hierarchical structures. Since logical and physical structures have the same semantics and operational principles, this means that an SQL view can hierarchically model relational or XML data. XML data can be retrieved and returned as a rowset which is mapped by its hierarchical SQL view. This allows it to be accessed seamlessly at the SQL data item level. This is why the data structures referenced in Figure 2.2.1 above can consist of relational and XML data. More information on XML access can be found in Section 3.1.

 Since the SQL hierarchical modeling views are SQL and the invoking statement is SQL, they combine naturally into a unified SQL executable statement. This means the SQL joining of the Left Outer Join views (ViewX and ViewY) in Figure 2.2.1 above automatically expand into a heterogeneous unified virtual SQL view that correctly maps the combined heterogeneous hierarchical structure being processed in the SQL query. This enables seamless access and processing across the heterogeneous global virtual structure. Also notice that the level of data modeling is at a very high conceptual hierarchical modeling level. It can manipulate structures as whole entities as can be seen visually above in Figure 2.2.1 and in examples to follow. 

2.2.2) Linking Below the Lower Level Structure’s Root

 The joining of the two hierarchical structures in Figure 2.2.1 above is standard for hierarchical structures since the lower level structure was linked to its root node. The semantics of the combined hierarchical structure are very intuitive as shown. But, what would the semantics of the combined structure if the lower level structure was linked to its lower Y node instead of its root node X ?  Interestingly, this capability is naturally supported in ANSI SQL and the combined result structure generated remains the same as if the actual root was linked to producing the same structure as shown above in Figure 2.2.1. This is easily understood and makes sense since the root still exhibits the same inheritance and subordination effect on its structure. It also makes a good standard way of supporting this useful capability which is backed by ANSI SQL natural hierarchical processing and will be applied consistently across logical and physical structures. Prior solutions to this semantic problem usually required awkward transformations of the lower level structure. Another example is show below in Figure 2.2.2.

                 SELECT * FROM View1 LEFT JOIN View2
                                     ON A.a=Y.y
                                                                          Result
                                View1                             Structure            
                                    A                                        A
                                     |  \      View2    à            /    \
                                    B   \       X                       B     X
                                     |      \    /  \                        |      /  \
                                    C        Y   Z                     C   Y   Z

                         Figure 2.2.2: Linking below the root

 In Figure 2.2.2 above, View1 is linked over View2  where the lower level View2 is linked to below its root at its Y node. As you might expect, data filtering is applied to the joined lower level structure at the lower link point Y node, but for determining the result structure, node X  remains the root. This means this capability is useful if there is no link available to the lower level structure’s root node or there is a reason for filtering on a non-root node(s). This capability also increases the overall data abstraction and ease of use for SQL hierarchical processing. More info on linking below the root will be discussed in Section 6.1 on Right Sided View Nesting.

2.3) SELECT Clause Controls Node Promotion

 Data selection, known as projection in relational terms, is specified by the SQL SELECT clause. It controls which data types from the input working set defined by the FROM clause are moved to the output structure (a fixed vertical/column kind of filtering). Nodes with no data selected are not moved to the output structure which is standard relational processing. This slicing out of unselected output nodes is also standard hierarchical processing and is known as node promotion because the removed node’s selected descendent nodes are preserved by being naturally promoted up and around the removed nodes following the hierarchical form of the data structure. This is shown below in Figure 2.3 where node C was not selected for output.

               View X                                                               Node Promotion
                  A
                  SELECT A.a, B.b, D.d, E.e                      A
                 /   \      à       FROM ViewX                        à             /  |  \
                B
   C                                                                             B  D  E
                     /  \
                   D  E

                 Figure 2.3: Hierarchical node promotion and collection

The selected nodes in Figure 2.3 above also maintain their hierarchical semantics because the remaining nodes still exert the same hierarchical effect and semantics on each other (subordination is maintained). This is because the structure is naturally condensed in a hierarchically controlled manner. Also notice that nodes D and E with separate paths directly under node A is an example of node collection. All of these operations are basic hierarchical processing and are represented in relational flat hierarchical structures.

 2.3.1) Fragment Isolation

 Fragments are pieces of a hierarchical structure similar to substructures but more dynamic. This  definition is extended here to include node promotion caused by node exclusion. By extending upon the same data selection principles used in node promotion demonstrated  above in Figure 2.3, it is possible to isolate a hierarchically structured fragment in a view. This is demonstrated below in Figure 2.3.1 where a fragment is isolated by only selecting on data  in the C, D, and E nodes.

               View X                                                    Fragment Isolation
                 
A              SELECT C.c, D.d, E.e                        C
                 /   \    à     FROM ViewX                à                /  \
                
B   C                                                                    D  E       
                     /  \                                                                             
                   D  E                                                                          
                    |     |                                                                     
                   F   G            

            Figure 2.3.1: Hierarchical fragment isolation

2.3.2) Structure Transformation

 By utilizing fragment processing shown above in Section 2.3.1 and hierarchical structure joining in Section 2.2.1, it is possible to isolate multiple structure fragments in a single structure and then manipulate them with join operations. This is fairly simple and intuitive, but involves another level of complexity by using high level prefixes (V1 and V2) to distinguish the two fragments. This is demonstrated in Figure 2.3.2 below where two fragments are formed and separately reconstructed from the same SQL view. These are fragments A-B and C-D, which can be used to independently manipulate and perform structure transformation by rejoining the fragments differently which is also shown below.

           ViewX                                                                                     Transformed
              A                                                                                               Structure 
            /    \                 SELECT V2.C.c, V2.D.d, V1.A.a, V1.B.b                 C
          B     C     à      FROM V2.ViewX LEFT JOIN V1.ViewX   à         /  \
                 /  \              ON V2.C.c=V1.A.a                                                D   A
                D  E                                                                                                   |
                                                                                                                          B  

            Figure 2.3.2: Structure transforming using fragment processing

 As shown in Figure 2.3.2 above, transforming data structures in standard SQL is possible by defining different fragments in the same structure by using the SQL alias/rename high level prefix capability to keep them separate and distinct. Both logical and physical structures once retrieved to the working set in memory are stored as contiguously fixed structures. This allows any logical fragment in the working set to be naturally isolated and moved separately to the result set using standard relational processing. This is possible because all accessed data types, such as logical relational, contiguous XML, and even linked IMS data forms are now stored in a homogenous rowset in the relational working set. This means the structured data in the working set can be treated as a fixed contiguous rowset and can be logically joined with other rowsets irregardless of the make up of the source data format. See Section 6.3 for more information on logical and physical structure operation.

 2.4) WHERE Clause Controls Hierarchical Data Filtering

 Hierarchical data filtering is controlled by the WHERE clause. It specifies which data row occurrences are filtered out based on their data content. This is a dynamic horizontal filtering of rows or path data occurrences. Data filtering will affect which data is in the result set on a row by row, path by path data occurrence basis.  

 One of a hierarchical query’s most powerful capabilities is its hierarchical WHERE clause data filtering which is applied to the full hierarchical structure. To fully understand hierarchical query data filtering which is more involved than flat structure filtering, it is more easily understood if examined as data qualification. This is because it is simpler to demonstrate a positive rather than negative operation when demonstrating the natural hierarchical filtering operation.  This means that when hierarchical data filtering is used, it is examined as qualifying data rather than specifically filtering out data. When no WHERE clause filtering is specified, all the queried data is qualified.

 2.4.1) Hierarchical Data Qualification

 WHERE clause data qualification operates on the entire hierarchical structure in a hierarchical fashion based on how all the nodes relate to each other This same hierarchical process is occurring in standard SQL and relational table processing, but takes on hierarchical meaning (semantics) when the entire structure is recognized and the result is examined against its hierarchical structure. This semantics is more easily traced through the hierarchical structure following the related data qualified by the WHERE clause which is described below.

 ViewX below in Figure 2.4.1 consists of tables A, B, C, D, E joined into the hierarchical structure as shown. The tables become nodes in the structure when modeled in SQL. If  “A”, “B”, “C”, “D” and “E”  are selected from this structure qualified on some value for “C” as in: SELECT A,B,C,D, E FROM ViewX WHERE C=5, what are the data selection semantics of this query? The qualification process starts at the WHERE clause condition that directly qualifies the C data node occurrence(s) with C=5. Then all path data occurrences under the qualified C data node occurrence (D and E node occurrences) and the path data occurrence above each qualified C data node occurrence (A node related data occurrence) qualify.

                     ViewX
                       A              SELECT A.a, B.b, C.c, D.e, E.e
                                  FROM ViewX   
                     B  C=5  ß WHERE C.x=5
                       
                       D  E

             Figure 2.4.1: WHERE clause data qualification flow

 The above described WHERE clause in data selection logic in Figure 2.4.1 covers directly related qualification logic familiar today, but multi-leg qualification includes any node path data occurrences indirectly related across the legs. These are connected to a qualified node data occurrence such as the related B node cousin data occurrences related by the Lowest Common Ancestor (LCA) qualified A node data occurrence in ViewX above in Figure 2.4.1. This same hierarchical result exactly reflects the result found in the relational result, with the meaning obscured by the flat representation. The hierarchical semantics and associated data are still available for hierarchical use. It can enable structured XML output to be nonprocedurally produced. See Section 3.1 for more on information on XML integration.

 2.4.2) Multi-leg Filtering Semantics

 When multiple legs take part in data filtering as in WHERE D=1 AND E=0 or arithmetic operations as WHERE D=E, using the ViewX below in Figure 2.4.3, all combinations across the qualified data value occurrences under the Lowest Common Ancestor data occurrences of node C are processed for a matching combination that qualifies. If lower level nodes are also selected, than all combinations that qualify must be processed since the lower level nodes can be qualified separately by different qualified combinations. This is also the same logic performed in hierarchical processing which is naturally reproduced by the relational Cartesian product processing.

 Many queries will have data filtering criteria applied to more than two legs which can produce more than one Lowest Common Ancestor node filtering processes as described above. Using the hierarchical structure ViewX below in Figure 2.4.3, data filtering as in WHERE D=1 AND E=0 AND B=3 will cause a more complex processing of data filtering. In this example it requires nested processing of Lowest Common Ancestor logic in this compound WHERE clause. Common ancestor node C is derived from D and E nodes. It is located under common ancestor node  A node which is derived from B node and C node’s sub result of D and E. This requires nested common ancestor processing of C under A node that naturally follows correct hierarchical processing logic.

 Regardless of the number of common ancestors involved, standard SQL operating hierarchically will perform this hierarchical logic perfectly thanks to the hierarchically restricted Cartesian product processing controlled by the WHERE clause. It generates the correct combination of row values to automatically emulate this type of hierarchical nested Lowest Common Ancestor processing a single row at a time. More information of SQL’s automatic use of Lowest Common Ancestor logic can be founding Section 6.2.

 2.4.3) Multi-leg Variable Semantics

 As a further example of the subtleties and complexities of hierarchical processing, the automatic hierarchical query internal semantic processing can become more powerful and complex with OR decision processing in the WHERE clause. This enhances and complicates the natural semantics of hierarchical data filtering processing, making it dynamically variable. Using the hierarchical structure from ViewX in Figure 2.4.3 below, what would the semantics of the query: SELECT D, E From ViewX WHERE D=1 OR E=0 ?  If D=1 is true and E=0 is not, then every occurrence of  E under the qualified Lowest Common Ancestor C node occurrence is qualified (thanks to D=1 being true) and only the D node data occurrence with the qualified D=1 qualifies since its only qualification is more specific.

                           ViewX                                                                  Result
                                   A                                                                           A
                                  /  \          SELECT D, C FROM ViewX                  /  \    
                                 B  C   à            WHERE D=1 OR E=0    à    B   C
                                ___|___                                                                 ___|___        
                                |            |                                                                |            |
                            D=1      E=1                                                           D=1       E=1
                            D=2      E=2                                                                         E=2
             Figure 2.4.3: Variable semantics for WHERE D is true and E is not

 As you would expect from the data qualification results directly in Figure 2.4.3 above, the opposite results occurs if only E is true and D is not. In this case every occurrence of D under the qualified common ancestor C node occurrence is qualified and only the E node data occurrence that tested true qualifies. If both conditions are true, then all occurrences of  the D and E nodes qualify because of cross qualification. These results also demonstrate why it is easier to examine data filtering as data qualification, because both sides of the OR operation have an additive affect. While these output qualification semantics are complex, the results they produce are logical and intuitive to the user.

 The WHERE clause OR processing semantics described above means that both sides of the OR condition must always be tested because the qualification semantics can dynamically change depending on which side of the OR operation is true. SQL handles this advanced hierarchical processing automatically with its Cartesian product processing that generates all combinations of hierarchical relations which automatically checks both sides of the OR operation. The correctness of this operation can be logically verified by replacing the OR operation with separate filtering operations on two queries and unioning the results.

 This sophisticated and naturally intuitive variable hierarchical data filtering is performed naturally in relational SQL Cartesian product processing. This natural hierarchical processing is quite remarkable and meaningful for mathematically correct relational hierarchical processing. It reinforces and validates that the hierarchical processing known and used before relational processing became popular is correct because it is the same as the hierarchical processing produced logically from relational processing.

 2.4.4) Path Data Filtering (Data Model Rules)

 WHERE clause hierarchical data filtering is very powerful operating intuitively on the entire multi-leg structure, but you may still need a more restrictive (path only) type filtering similar to XPath. This is where ON clause data filtering can be used. While the ON clause is used to specify the join condition, it can also include a data filtering condition. Being on the ON clause it only affects the join operation of the node it is specified on and the related lower level data nodes which can not exist without its parent data occurrence existence (causing a cascading delete). All nodes on other legs are not affected in any way. As you probably realize, all ON clauses are processed before the WHERE clause is processed since the WHERE clause affects the entire structure which requires all the ON clauses to have been processed (at least logically).

 The above differences in the operation of the WHERE and ON Clauses are  extremely noteworthy and useful. Using the structure Department over Employee over Dependent for example, a qualification on a dependent under the age of twenty-one on a WHERE clause can cause entire structure occurrences to be removed from the result when there are no data node occurrences that match the qualification. Placed on the ON clause of a Dependent node, this filtering can only remove the Dependent node occurrences it is used on, and any of its related descendent nodes. This is very useful as a business rule that can be included in the hierarchical data model defined in the SQL view since it is better associated with the data model and not the specific query. Without this capability, employees with no dependents under twenty-one years old would be removed and if this causes departments with no employees, the departments would be removed and so on up the structure to the root. This is shown below in Figure 2.4.4.

         SELECT Emp, Dpnd FROM Emp LEFT JOIN Dpnd ON EmpId=DpndEmpId…

                     WHERE DpndAge<18            ON…AND DpndAge<18
                                        ↓                                          ↓                                  
             Results:    Emp2  Dpnd2                       Emp1   Null
                                                                           Emp2   Dpnd2

          Figure 2.4.4: Difference between WHERE and ON clause data filtering

3) Other Hierarchical Based Capabilities

 3.1) XML Joinless Access

 While logical structures composed of relational tables use joining operations when retrieved, physical structures like XML do not need to simulate costly relational joins in the retrieval process when they are modeled by Outer Joins views. This is because the Left Outer Join hierarchical data modeling syntax for physical structures only represents the data structure metadata semantics. This means physical structures can be accessed and processed directly without joins because the hierarchical semantics of the Left Outer Join are reflected naturally in the physical hierarchical structures themselves and do not require expensive and needless joins to physically model their hierarchical structure. Query languages that simulate physical join views to integrate with relational data relationally will have join processing overhead for XML to support SQL/XML integration. But because the natural solution described in this document utilizes ANSI SQL’s hierarchical processing to perform SQL/XML integration at a hierarchical level, it can efficiently process physical hierarchical structures hierarchically. This allows constructing the rowset that matches and models the Left Outer Join data modeling without performing expensive relational joins.

 3.2) Natural Distributed Hierarchical Processing

 The performing of distributed hierarchical processing automatically when distributed processing is performed is extremely powerful and simple. When the hierarchical data modeling Left Outer Joins are broken up and sent to remote sites for processing, the hierarchical substructures they represent will automatically be performed hierarchically. The returned results have been naturally processed hierarchical at the remote sites and still remain correctly hierarchically mapped at the local site. The final result remains fully hierarchically processed. This happens automatically because the hierarchical data modeling Left Outer Join specification is self contained in the SQL and each ANSI SQL site’s ANSI SQL processor naturally performs the hierarchical processing defined by the data modeling SQL.

 3.3) Hierarchical Structure Construction Order

 Hierarchal structures can generally be built or processed in any starting order without changing their semantics or result. This processing can be top-down, left to right, bottom-up, or in any combination. This is also true of the hierarchical structure’s data modeling definition. The same hierarchical structure can be modeled in a top-down, left to right, or bottom-up order which controls the order built. The completed hierarchical structure has the same semantics and derives the same results in what ever order it is built or processed. As an indication of SQL’s hierarchical processing capability and flexibility,  all of these same capabilities and operational characteristics remain the same as hierarchical processing because the semantics are identical.

 Generally speaking, hierarchical structures are most easily defined and built top-down as all the previous examples have intuitively demonstrated. Some automatic SQL data modeling definition or view expansion processes may change the standard data modeling order while remaining semantically the same. This can produce inefficiencies that cause throwaway data from dangling tuples (no matching row) that are avoided with top-down processing. This is avoided by rewriting the expanded outer join for top-down processing at runtime.

 3.4) Full Ad Hoc Nonprocedural Processing Supported

 It is important to point out that all other SQL/XML integration solutions require proprietary solutions and XML centric syntax that is procedural. A side effect of this is that ad hoc processing is really not possible if allowed at all. The ANSI SQL hierarchical solution presented in this document is nonprocedural and transparent. It also supports full dynamic ad hoc processing support of native XML and other forms of hierarchical data format such as legacy data.

 4) Processing Unconventional Hierarchical Structures

 XML hierarchical structures are not quite as fixed as conventional hierarchical structures. This is because XML is composed of semistructured data where their structure is defined in the data allowing it to change dynamically or even define logical network data structures. These present problems for relational processing. Relational solutions to these cases based on SQL hierarchical processing are described below. While it is nice to support all XML capabilities derived from its dynamically flexible structure definition, no XML processor is capability of supporting all possible capabilities. Each has their own strengths and limitations. For this reason, SQL should not be expected to support all XML capabilities.

 4.1) Variable Structures

 Variable structures are hierarchical structures whose structure can change dynamically  between different structure occurrences and even within a given structure occurrence. This is allowed in XML. SQL can support variable structures by using variable values in the ON clause that hierarchically model the structures differently depending on the value of the variable. Since separate ON clauses are used at each join point they can dynamically control whether the join at any point in the structure building process is performed or not. By having a test condition based on a higher level data value located higher on the current path occurrence (hierarchical data inheritance is supported in SQL), the variable generation of the data structure can be dynamically controlled. This is shown below in Figure 4.1 where either a D node or an E node data occurrence is generated, but not both, controlled by the value in the C.x data field. This is similar to COBOL’s Depending ON clause in its File Definition section. This is demonstrated below with an overly simple example, but can give you an idea of how this capability is supported and utilized.

         ViewX-1                                                                      ViewX-2
             A           SELECT * FROM A                                       A
            /   \          LEFT JOIN B ON A.a=B.a                           /   \                    
          B    C        LEFT JOIN C ON A.a=C.a                         B    C
                  |   ß  LEFT JOIN D ON C.c=D.c AND C.x=1            |
                 D        LEFT JOIN E ON C.c=E.c  AND C.x=2  à     E

                           Figure 4.1: Variable structures

 4.2) Mapping Network Structures to Hierarchical Structures

 Some of XML’s advanced features require or create a logical network structure. Network structures unlike hierarchical structures allow a node type to be accessed from more than one path. This is demonstrated directly below in the XML IDRef Structure in Figure 4.2. This makes XML hierarchical structures ambiguous for nonprocedural, navigationless languages like SQL because a node can have multiple path entries each with its own semantics. A similar problem occurs with the use of duplicate element type nodes which is also permitted in XML and shown below. It is possible to model these network type structures as unambiguous hierarchical structures in SQL. SQL has an alias/rename ability that can enable a network to hierarchical structure mapping capability as shown below in Figure 4.2.

     IDRef  Structure        Hierarchical Structure      Duplicate Element

              Dept                                    Dept                                Dept
             /      \                                    /      \                               /     \
       Cust    Emp      à                Cust    Emp            ß       Cust    Emp
            \       /                                 |          |                             |          |
             Addr                           AddrC   AddrE                  Addr   Addr

       Figure 4.2: Hierarchical mapping solutions for network structures

 While the underlying storage of the Addr node is different in both usages (shared or separate) above in Figure 4.2, the semantics of both structures are basically the same and can be mapped into the same remodeled hierarchical structure which is unambiguous as shown above.

 5) Hierarchical SQL View Use and Importance

 Hierarchical outer join definitions of logical relational table structures and physical XML documents can be placed in standard SQL views. These ANSI SQL hierarchical views have unique capabilities that come together synergistically producing even more powerful capabilities making SQL hierarchical processing more powerful and very user friendly as described below.

 5.1) Hierarchical View Synergies

 Hierarchical SQL views naturally map logical relational and physical XML structures using the Left Outer Join operations. This makes hierarchical processing flexible, reusable, and intuitive because of these hierarchical views’ powerful hierarchical view abstraction and conceptual processing. These hierarchical views can be embedded or hierarchically joined with other hierarchical SQL views. This is performed the same way tables are by using the Left Outer Join directly on views to naturally form larger hierarchical views as in FROM ViewX LEFT JOIN ViewY ON X.x=Y.Y as discussed in Section 2.2.1. This joining of hierarchical views preserves and combines the hierarchical semantics and can even be preformed dynamically. When all the specified hierarchical logical and physical SQL views expand at runtime they automatically form into a seamless and consistent unified heterogeneous virtual hierarchical view. Defined naturally in SQL, this takes on significant synergistic proportions making hierarchical processing very user friendly.

 5.2) Hierarchical View Optimization

 The Left Outer Join’s natural hierarchical data preservation operation described earlier allows separate views or the entire unified heterogeneous view to be optimized hierarchically at runtime. This is an optimization to access only the nodes referenced or on a path to a referenced node, thereby saving on  unnecessary data access. This also significantly cuts down on data explosions caused by semantically incorrect and confusing data replication, and the inefficiency they cause in memory and CPU usage. This is demonstrated below in Figure 5.2 where nodes B and D from ViewX  are not accessed.

          ViewX                                                Nodes            Result
             A                                                   Accessed       Structure
            /  \               SELECT A.a, E.e              A                   A
          B   C    à     FROM ViewX         à      |        à         |         
               /  \                                                      C                   E
              D  E                                                     |
                                                                       
  E

        Figure 5.2: Hierarchical optimization

 You can also see in Figure 5.2 that node C while not referenced is still required for navigating from node A to node E. Node C will be removed from the final result since it was not selected for output. The lack of access  of  the optimized out nodes (B and D) has no negative influence on the result and improves upon the semantic accuracy of the result by reducing unnecessary data replications.

 This optimization’s power is significantly increased by how easily the SQL SELECT clause can be changed and the view can automatically adapt by eliminating unnecessary nodes from the defined structure. This is a form of semantic optimization driven by the data structure’s metadata instead of physical views that use procedural programming instructions with smaller code optimization windows.

 Global views of entire structures become very useful with this hierarchical optimization. Global views result in fewer specialized views, further increasing the reuse and hierarchical data abstraction for the user so they do not need to be concerned with details of the structure being processed. This also allows global views to be used without incurring any overhead. In a nonprocedural 4GL like SQL, global views automatically increase the database domain of the queries using these large views. This makes them very user friendly by eliminating the use of small views and having to know when and how to use which view.

 This powerful hierarchical optimization is not a replacement for the standard Inner or Outer join optimization. Nor does it have to be integrated into the current Inner or Outer join optimization. This powerful hierarchical optimization is simply applied before the standard optimization is performed. It can be applied externally and dynamically by rebuilding the query to eliminate the unneeded node paths.

 This hierarchical optimization can also be applied to accessing physical hierarchical structures described Section 3.1. It can be used to indicate paths that can avoid database access. This use made of this will vary for the type of physical structures and their access mechanisms. Structures comprised of physical pointers like IBM’s IMS database versus nested contiguous structures like XML and Structured VSAM will use different  access optimizations that can benefit from the hierarchical optimization described here.

 6) More on the Internal Processing of this Technology

 There are a number of automatic powerful internal ANSI SQL operations that play an essential role in ANSI SQL hierarchical processing that have not been fully covered yet. This was done to keep the previous descriptions more easily understood. At this point these powerful features can now be examined.

 6.1) Right Sided View Nesting

 

In section 2.2.2, Linking Below the Root was discussed, a closer examination of the expanded SQL may clear up a possible perceived problem. The simplified example used here contains View1:A LEFT JOIN B ON A.a=B.b and View2: X LEFT JOIN Y ON X.x=Y.y. It joins them with View1 over View2 with a link to View2 at a node below the root: View1 LEFT JOIN View2 ON B.Key=Y.key. The expanded view and new derived structure follows in Figure 6.1.

 

            SELECT * FROM View1 LEFT JOIN View2 ON B.Key=Y.key

 

                                                                                          Linkage    Result

            SELECT * FROM                                                       A             A

      A LEFT JOIN B ON A.a=B.b     ßView1          |               |

LEFT JOIN                                                                 B  X  à   B

     X LEFT JOIN Y ON X.x=Y.y      ßView2            \  |           |

ON B.Key=Y.key                               ßLink               Y          X 

                                                                                                     |     

                                                                                                    Y

Figure 6.1: Right sided view nesting 

 

 The apparent problem with linking below the root of the lower level joined structure (Vew2) is that it may appear in the unexpanded view  that node Y node is linked to before it is joined to root X. This would be logically invalid since node X determines node Y. On a closer look at the expansion in Figure 6.1, you will notice that the LEFT JOIN between View1 and View2 is delayed because its matching ON clause has been pushed to the far side of View2 which contains its own ON clauses. This causes expanded View2 to be nested (right sided view nesting) which causes it to be fully performed in isolation before being joined to View1. This means that all nodes of View2 are available to be referenced when joined to View1. Any depth of nested views can be automatically and transparently handled this way. The SQL coder does not even need to be aware of this powerful nesting taking place.

 

Each nested view is allocated to a new and separate working set while being processed. This also protects all other working sets from any destructive operation that the active working set may perform. This means for example that destructive Inner Joins can be performed in views without causing data to be discarded in other working sets. This is a very powerful capability that can be exploited in many ways and act as a safety valve that keeps everything running smoothly with the SQL data modeling, adding powerful SQL recombinant capabilities.

 

6.2) Automatic Lowest Common Ancestor Logic

 

In Section 1.1, the concept of Lowest Common Ancestor (LCA) was introduced.  It deserves closer examination in its intrinsic use in ANSI SQL hierarchical processing. LCA logic is used to help interpret and utilize the semantics between nodes in different legs  in a hierarchical structure. Finding the Lowest Common Ancestor of two nodes is not a trivial process and many papers have put forth fast algorithms. This presumes that the application that uses the LCA is aware of its importance and its uses.

 

The ANSI SQL hierarchical processing presented in this document is an inherent operation that was not intentionally designed into ANSI SQL. So if the LCA logic is necessary to hierarchical processing, how is it being performed if it was not designed  or coded into ANSI SQL?

 

The hierarchically restricted Cartesian product produced from a multi-table Left Outer Join that models the data structure automatically also naturally performs the logic of the LCA. It builds restricted Cartesian sub products under each Lowest Common Ancestor node at each hierarchical join point. This means the logic of determining the LCA for node pairs is naturally built into the Cartesian product processor of the relational processor. Using Figure 6.2 below, the predicate WHERE E=C would test all combinations of E=C generated under Lowest Common Ancestor node A, while WHERE F=G would test all combinations of F=G under Lowest Common Ancestor node C. WHERE E=C AND F=G  would still test the correct combinations of each LCA  thanks to the hierarchically restricted Cartesian product and its underlying inherent LCA logic.

 

                 ViewX

                        A  ß WHERE E=C  or   SELECT D + WHERE F=G

                      /    \

                   B       C   ß WHERE F=G

                 /   \     /   \

                D  E   F   G

 

          Figure 6.2: Determining Lowest Common Ancestors

 

SELECT clause operation also utilizes Lowest Common Ancestor logic when a WHERE clause is used to filter selected data as in SELECT D FROM ViewX WHERE F=G using ViewX in Figure 6.2 above. The range of selected D values is determined by its Lowest Common Ancestor Node derived by its matching WHERE clause tested node(s). In this case the WHERE clause participating node is the C node as determined by its LCA as shown above. This makes the Lowest Common Ancestor node for SELECT D the A node, derived from its  participating  D and C nodes. This means that all D node data occurrences under qualified A node occurrences are qualified for selection. Also note the WHERE clause testing is still performed using LCA combinations under the C node and not the A node. These LCA operations enable the WHERE and SELECT clauses hierarchical qualification to qualify a single row at a time which is required for relational processing and avoids a walking tree logic.

 

Lowest Common Ancestor (LCA) is also known as Nearest Common Ancestor (NCA) and Closest Common Ancestor. Lowest for lowest level common ancestor node in the structure, Nearest and Closest for the nearest/closest common ancestor node. They all derive the same Common  Ancestor node.

 

6.3) Logical and Physical Structure Processing Consistency

 

There is a hierarchical consistency that exists between logical and physical hierarchical structures, including between all the different types of physical hierarchical structures. The commonality is that the hierarchical structure maintains its basic principles and how it is logically operated upon irregardless of its makeup. The Left Outer Join syntax can model any conventional hierarchical structure and the associated semantics defines how it is operated on and the subsequent semantics of its result. This means all hierarchical structures can be defined in a global logical structure and accessed consistently by SQL.

 

The access of each different type hierarchical structure will require its own access routine use and the result would be converted to a relational rowset that defines the hierarchical structure modeled by its Left Outer Join operation. The Left Outer Join hierarchical operation performed on tables does this by preserving the hierarchical structure by inserting  nulls to keep the variable length leg segments aligned.  These same data modeling Left Outer Joins can model physical structures and their access routines will duplicate the Left Outer Join format in their returned rowset. In this way the entire unified heterogeneous hierarchical structure is seamlessly defined by the fully expanded SQL.  SQL operates consistently across all the joined rowsets because their make up is physically the same while their physical source may not have been.

 

These rowsets add relational flexibility that may not have been previously recognized. Section 2.3.2 demonstrated  structure transformation. This may seem simple to expect of relational data, but transforming physical data may seem a lot more difficult, having to pull different data formats apart, filter, and reassembling it. But since the data from physical structures is now also in rowsets, it can also be easily accessed and reassembled by SQL as shown in Figure 2.3.2.

 

Section 3.1, XML Joinless Access, explained why physical structures do not need to be accessed by performing joins. The physical structure processing operation defined by the Left Outer Join view  also requires the natural processing of the Right Sided View Nesting describe in Section  6.1. This is because a full hierarchical physical structure must be fully materialized before it can be processed below the root.  The Right Sided View Nesting enables this to happen. While directly  accessing the root node of a physical structure does not present a problem, hierarchically locating all nodes out of physically stored order can be a problem because of the required dynamic navigation required of SQL. This is why physical structures need to be materialized before being processed.

 

6.4) Data Structure Extraction Technology

 

Almost all of the hierarchical processing capabilities are occurring automatically in ANSI SQL because of the Left Outer Join data modeling support. The ANSI SQL relational engine is not even aware of this hierarchical processing. This means it can not automatically extend this naturally capability to other areas such as seamless access to external hierarchical data sources to automatically take advantage of the  natural hierarchical processing. The relational engine would also need to be cognitive of the dynamically modeled hierarchical structure being accessed to automatically format the data to hierarchically structured output data types like XML. The Data Structure Extraction (DSE) technology developed by Advanced Data Access Technologies can automatically derive the dynamically generated structure of the expanded unified hierarchical structure being processed. With this information, advanced capabilities can be naturally added to the SQL engine to extend its inherent nonprocedural hierarchical processing naturally and transparently to XML and legacy data as discussed in this document.

 

Conclusion

 ANSI SQL makes a very powerful, complete, and flexible nonprocedural 4GL hierarchical querying language for relational and native XML data. It is unique because it automatically utilizes the semantics in the data structure to automatically process the most semantically complex multi-leg hierarchical queries without extending the relational model. This means the user does not have to know the data structure or supply the hierarchical semantic processing logic, increasing the value of the data. The results are simultaneously hierarchical and relationally accurate making the result mathematically correct. This allows it to integrate logical or physical, and heterogeneous hierarchical structures transparently at a full interactive conceptual hierarchical processing level. SQL naturally performs hierarchical processing so accurately  that it can be used as a model for determining correct hierarchical processing. This has already been done to determine and validate the semantics of linking below the root of the lower level structure when joining hierarchical data structures.

References

 [Abiteboul]; Serge Abiteboul and Nicole Bidoit; Non First Normal Form Relations to Represent Hierarchically Organized Data; PODS; 1984.

 [Chen 1];Ya Bing Chen, Tok Wang Ling, and Mong Li Lee; Automatic Generation of XQuery View Definitions from ORA-SS Views; School of Computing National University of Singapore, 2004.

 [Chen 2]; Zhuo Chen, Tok Wang Ling, Mengchi Liu, and Gillian Dobbie; XTree Declaritive XML Quering, School of Computing National University of Singapore; 2004

 [Cohen]; Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv; XSEarch: A Semantic Search Engine for XML; Proceedings of the 29th VLDB Conference, 2003.

 [David 1]; Michael M David; Advanced Capabilities of the Outer Join; ACM SIGMOD Record; March 1992.

 [David 2]; Michael M David; ANSI SQL Hierarchical Processing Can Fully Integrate Native XML; ACM SIGMOD Record; March 2003.

 [Leven]; Mark Leven and George Loizou; Semantics for Null Extended Nested Relations; ACM Transactions on Database Systems; Sept 1993.

 [Li]; Yunyao Li, Cong Yu, and H. V. Jagadish; Schema-Free XQuery; 30th VLDB Conference, Toronto, Canada, 2004.

 [Sengupta]; Arijit Sengupta and Melmet Dalkilie; DSQL – An SQL for Structured Documents; Department of Accounting and Information Systems Kelley School of Business Indiana University; 2002.

 [Ullman]; J. D. Ullman, A. V. Aho, J. E. Hopcroft; On Finding Lowest Common Ancestors in Trees; Annual ACM Symposium on Theory of Computing, 1973.

Michael M David is CTO and founder of Advanced Data Access Technologies, Inc.  Previously a staff scientist and the lead XML architect for NCR/Teradata and their representative to the SQLX group, he has over twenty years experience designing commercial nonprocedural  heterogeneous database access products. Based on this experience, he has authored the book Advanced ANSI SQL Data Modeling and Structure Processing and many papers and articles on this subject. Mike can be reached at mike@adatinc.com. More information on Mike’s company and research can be found at www.adatinc.com.

[ICM/Frames/JCMCopyright.html]