JMay
2005 Issue: 35
Journal of Conceptual Modeling
www.inconcept.com/jcm
This paper will explain and show how standard ANSI SQL processors can naturally model and automatically process complex multi-leg hierarchical data structures at a full conceptual hierarchical level. This also means the query user does not need to have structure knowledge of the hierarchical structures involved. The data modeling capability includes dynamically combining logical hierarchical relational and physical XML data structures at a full hierarchical level. This also includes the ability to link below the root of the lower level structure intuitively forming a valid unified hierarchical structure. As will be shown, ANSI SQL’s high level hierarchical data processing allows the flexible conceptual control of hierarchical node promotion, fragment processing, structure transformation, and variable structure creation. Also described is how the complex multi-leg hierarchical data filtering and processing is automatically carried out by the relational engine’s Cartesian product processing which inherently utilizes Lowest Common Ancestor (LCA) logic. Other ANSI SQL hierarchical processing features covered include the natural ability to maintain hierarchical processing operation in distributed processing environments across heterogeneous SQL systems and to apply all of these natural hierarchical processing abilities to the transparent and full integration of XML.
1) Hierarchical 4GL Capabilities
Fourth generation languages are nonprocedural languages. They are also known as declarative languages because the user only needs to specify what they want and not how to get it. In order for this to work, the 4GL needs to know the hierarchical data structure being processed. Using the inherent meta information derived directly from hierarchical structures definitions, the 4GL can automatically determine the semantics in the data structure. This can be naturally applied to the processing of the query without having to be supplied with the hierarchical processing instructions or logic.
With the advent and popularity of XML today, hierarchical data structures are coming back in a big way. Unfortunately, maybe because of the lack of knowledge about hierarchical structures and their capabilities, XML query languages have not taken full advantage of hierarchical structures and the powerful structural semantics they contain. This results in XML query languages having to be procedurally supplied with hierarchical semantic logic causing significant difficulty in specifying hierarchical queries that operate across multiple legs of the hierarchical structure. The programming complexity level increases as the number of legs involved in processing the query increases. This can make semantically complex queries not worth the time or effort to specify procedurally. It also prevents the ability to specify ad hoc queries, and increases the chances of inaccurate results from faulty hierarchical processing logic.
Multi-leg queries greatly increase the hierarchical semantics involved and complexity level necessary for processing them. While more complex to process, multi-leg queries processed by nonprocedural 4GLs are easy and intuitive for the user to specify, are useful in decision support, and can avoid using procedurally complex queries. More importantly, being able to freely query the entire structure nonprocedurally allows any query to be processed regardless of the number of legs involved. This frees the ad hoc user or query developer of any concerns or knowledge of the structure. The 4GL user does not need to know the location of the data or hierarchical relationships of the structure being queried. These complex hierarchical operations which are determined by the interaction of the query semantics and the hierarchical data structure semantics are not well understood or documented today, but can still be performed automatically by a 4GL like SQL.
1.1) Hierarchical Semantics
Full hierarchical structures as described in this paper support the use of multiple node types and twins. Twins mean that there can be multiple data occurrences under their related parent node data occurrence. Hierarchical structures built on top of relational databases do not usually support these capabilities, but ANSI SQL’s inherent hierarchical support naturally supports these capabilities allowing complex hierarchical structures to be supported. Twins (multiple data occurrences) are more tightly related than siblings (multiple legs) and have different semantics. Siblings affect the hierarchical structure model and twins do not. For this reason twin semantics are often overlooked in properly processing hierarchical data meaningfully. Node occurrences not related by the same parent node occurrence are not usually considered meaningfully related and if treated as they are will usually produce an incorrect result. This is why the use of “occurrence” or “data occurrence” is used liberally and meaningfully in this paper.
Hierarchical data structures have an unambiguous semantics because they have only a single path to every node. This allows a nonprocedural query language to navigate them automatically and process the request utilizing the unambiguous semantics in the structure. The different legs of a hierarchical structure are independent of one another, but are related indirectly by the Lowest Common Ancestor (LCA) [Ullman] nodes that connect them. This makes the nodes on different legs indirectly related which is referred to here as cousins. This means that a single node data occurrence on one leg of the structure is also related to all other node data occurrences for sibling legs under their Lowest Common Ancestor node data occurrence. Each Lowest Common Ancestor data node occurrence naturally coordinates the hierarchical control logic necessary for hierarchical processing across legs. The number of Lowest Common Ancestors nodes necessary to process the query increases with the number of legs referenced in a query.
Hierarchical data structure semantics today are currently being utilized only on a single leg at a time. A primary advantage of a full hierarchical structure is that it can define any combination of the legs necessary to process any multi-leg query. This semantics applied across the multiple legs means that the semantics between every node in the structure are meaningful and can be utilized for hierarchical processing. Typically a 4GL user querying data nonprocedurally from a full hierarchical structure is naturally going to reference multiple legs. This is because the user does not need to know the structure or be restricted by it. This multi-leg processing can be handled automatically by a 4GL utilizing the inherent semantics and principles that occur inherently between all nodes in the entire structure. Hierarchical semantics remain consistent independent of how the structure is stored and represented physically or logically.
SQL’s underutilized and overlooked inherent hierarchical processing capabilities offers proof of what has been stated above about current underutilized hierarchical capabilities in XML. By modeling hierarchical structures directly in SQL using the ANSI SQL Left Outer Join operation where tables and XML elements are nodes in the hierarchical structure, the relational engine automatically operates at a full hierarchical level by inherently utilizing the semantics in the structure [David 1]. This means hierarchical processing is a valid subset of relational processing. This makes XML integration in SQL seamless and capable of full hierarchical coverage carried out completely nonprocedurally and automatically.
SQL’s nonprocedural processing utilizes the semantics naturally present in hierarchical data structures. This drives the hierarchical processing which can automatically correlate information throughout the entire structure and can also be performed in ad hoc mode to support powerful decision support. This significantly increases the value of the data by automatically being able to dynamically process very semantically complex queries. This takes on staggering importance and value with the goldmine of hierarchical semantics naturally existing in XML data structures and also available for legacy hierarchical data sources which can also be processed seamlessly by SQL hierarchical processing.
2) ANSI SQL as a Hierarchical 4GL Query Processor
With ANSI SQL being a 4GL, data modeling SQL syntax and semantics automatically instruct the SQL relational engine how to perform the hierarchical query processing. This is because all nodes are inherently related semantically and can be automatically processed as stated earlier. This enables SQL to perform hierarchical processing naturally by following the semantics in the data structure being processed. The data structure is defined hierarchically by the Left Outer Join operation which unlike the standard and default Inner Join operates hierarchically. This enables complex hierarchical multi-leg queries to be processed nonprocedurally and automatically. For example, this allows an SQL query that selects data from one area of a structure based on data in another area of the structure to be performed automatically. The logic involved uses the Lowest Common Ancestor nodes between the referenced legs to coordinate and correlate the multi-leg processing based on the hierarchical semantics.
Prior SQL hierarchical processing research involved different approaches for extending SQL syntax and semantics, a representative example can be found in [Abiteboul], [Leven] and [Sengupta]. The SQL hierarchical processing research described in this current paper, unlike previous hierarchical attempts, utilizes unrealized inherent ANSI SQL hierarchical processing capabilities [David 1] and its seamless extension to XML [David 2] to avoid the use of SQL extensions to perform full hierarchical processing against XML. The research in this current paper may be the first to trace and explain how Lowest Common Ancestor logic is being inherently performed by the relational engine.
A simple SQL query involving more than two legs would involve multiple Lowest Common Ancestors and their interrelationships making the internal logic increasingly complex. Fortunately, nonprocedural relational languages such as SQL can do this accurately and automatically regardless of the semantic complexity. This is because the relational hierarchically restricted Cartesian product processing automatically generates all valid hierarchical combinations between the legs of each Lowest Common Ancestor node data occurrence which automatically performs this hierarchical logic inherently and accurately. Current XML query languages such as XQuery are controlled procedurally by user supplied multi-leg hierarchical semantic logic. This can make them impracticable for semantically complex multi-leg hierarchical operations. For this reason there are a number of academic projects adding Lowest Common Ancestor logic on top of XQuery for the nonprocedural processing of queries across multiple legs [Chen 1], [Chen 2], [Cohen], and [Li].
With hierarchical structure data preservation, each separate leg of a hierarchical structure can vary dynamically in number of data types from leg occurrence to leg occurrence. Surprisingly, this does not present a problem for fixed column relational data. Variable column legs in SQL relational databases are logically represented correctly hierarchically in form and length in standard fixed column relational rowsets. This is because the variable missing trailing columns of every leg occurrence are automatically padded with Nulls as a normal operation of the SQL Outer Join data preservation operation. This allows a variable number of columns in hierarchical leg occurrences to be processed accurately in fixed length rowsets dynamically preserving the fixed alignment. This enables seamless and transparent operation of variable length legs.
SQL’s automatic processing of full multi-leg structures has all the same operational characteristics and capabilities as any physical or logical hierarchical structure being processed by older proven nonprocedural hierarchical processors. This relationally backed processing also validates that this is the correct mathematically correct way of hierarchical processing. Hierarchical processing and its semantics operate the same for physical (XML) or logical (relational) structures. This enables a consistent and seamless operation across heterogeneous structure formats when SQL is automatically performing hierarchically. There will be further coverage on physical and logical structures later.
2.1) Hierarchical Query Specification
A typical SQL request specification uses the familiar SELECT, FROM, and WHERE clause syntax. The following description explains how this syntax naturally applies to hierarchical processing when SQL is processing hierarchically modeled structures. Since the SELECT clause relies on the data specified in the FROM clause, the FROM clause will be discussed first. The FROM clause specifies the location of data types that form a pool of nodes consisting of relational tables and XML elements which will be placed in the input working set as needed. All required data types for processing the query must be included there which the FROM clause satisfies. The Left Outer Join operation is specified in the FROM clause to specify how the nodes are related hierarchically into a single structure. This can be abstracted into separate SQL views representing meaningful substructures that can be combined in multiple ways.
The SELECT clause specifies which data items in the modeled structured defined in the FROM clause are to be returned. If any data item in a node is returned then that node is represented in the returned structure, otherwise that node is excluded in the returned structure. The WHERE clause is used to return only the qualified data occurrences. If no WHERE clause is specified than no data filtering is performed.
2.2) FROM Clause Controls Hierarchical Data Modeling
The data associated with the left argument of the Left Outer Join is hierarchically preserved over the right data argument because it is preserved even when the right argument’s data is not present. Using this capability, any hierarchical structure can be modeled and processed hierarchically by the relational engine performing the hierarchical semantics associated with the SQL hierarchical modeling syntax. This is shown below in the definition of the ViewX structure in Figure 2.2 below. You can notice how legs are created by modeling them going down the structure and how multiple legs are formed when ON clauses link back to nodes already with a formed or partial formed leg. This SQL hierarchical Left Outer Join view can be easily generated automatically from existing metadata sources such as XML schemas.
ViewX CREATE VIEW ViewX AS
A SELECT * FROM A
/ \ LEFT JOIN B ON A.a=B.a
B C LEFT JOIN C ON A.a=C.a
/ \ LEFT JOIN D ON C.c=D.d
D E LEFT JOIN E ON C.c=E.c
Figure 2.2: SQL hierarchical data modeling
The SQL hierarchical data modeling syntax in Figure 2.2 above is also directly executable by SQL to perform hierarchical processing. This is possible because the associated SQL semantics of the data modeling SQL is specifying the basic principles of hierarchical structures and their processing. This makes for an extremely tight and seamless bond between SQL hierarchical data modeling and its associated hierarchical processing assuring data modeling accuracy, efficient processing, and an open and available data modeling language and its accompanying hierarchical processor. Also realize that this SQL data modeling syntax is a self defining data structure definition that accompanies the SQL were ever it may go. These powerful characteristics have very useful implications that will be examined further.
2.2.1) Joining Hierarchical Structures
In the same way that the hierarchical data modeling was performed a node at time using Left Outer Joins in Section 2.2, hierarchical substructures defined in SQL views can be joined (linked) by Left Outer Joins deriving a hierarchical superstructure. The left structure is joined over the right structure linked by the ON clause join criteria as demonstrated in Figure 2.2.1 below. The ON clause takes on added importance because it also specified the join points in each structure, in this case it is the C node linked to the X node.
Relational
XML
ViewX ViewY
A
X
/ \
/ \
B C
Y Z
/ \
D E
↓
↓
SELECT * FROM ViewX LEFT JOIN ViewY
ON C.Key=X.Key
Joined Structure
A
/
\
B C
/ |
\
D E X
/
\
Y Z
Figure 2.2.1: Joining hierarchical data structures
The SQL input views in Figure 2.2.1 above are modeling hierarchical structures. Since logical and physical structures have the same semantics and operational principles, this means that an SQL view can hierarchically model relational or XML data. XML data can be retrieved and returned as a rowset which is mapped by its hierarchical SQL view. This allows it to be accessed seamlessly at the SQL data item level. This is why the data structures referenced in Figure 2.2.1 above can consist of relational and XML data. More information on XML access can be found in Section 3.1.
Since the SQL hierarchical modeling views are SQL and the invoking statement is SQL, they combine naturally into a unified SQL executable statement. This means the SQL joining of the Left Outer Join views (ViewX and ViewY) in Figure 2.2.1 above automatically expand into a heterogeneous unified virtual SQL view that correctly maps the combined heterogeneous hierarchical structure being processed in the SQL query. This enables seamless access and processing across the heterogeneous global virtual structure. Also notice that the level of data modeling is at a very high conceptual hierarchical modeling level. It can manipulate structures as whole entities as can be seen visually above in Figure 2.2.1 and in examples to follow.
2.2.2) Linking Below the Lower Level Structure’s Root
The joining of the two hierarchical structures in Figure 2.2.1 above is standard for hierarchical structures since the lower level structure was linked to its root node. The semantics of the combined hierarchical structure are very intuitive as shown. But, what would the semantics of the combined structure if the lower level structure was linked to its lower Y node instead of its root node X ? Interestingly, this capability is naturally supported in ANSI SQL and the combined result structure generated remains the same as if the actual root was linked to producing the same structure as shown above in Figure 2.2.1. This is easily understood and makes sense since the root still exhibits the same inheritance and subordination effect on its structure. It also makes a good standard way of supporting this useful capability which is backed by ANSI SQL natural hierarchical processing and will be applied consistently across logical and physical structures. Prior solutions to this semantic problem usually required awkward transformations of the lower level structure. Another example is show below in Figure 2.2.2.
SELECT * FROM View1 LEFT JOIN View2
ON A.a=Y.y
Result
View1
Structure
A A
| \ View2
à / \
B \ X
B X
| \ / \
| / \
C Y Z
C Y Z
Figure 2.2.2: Linking below the root
In Figure 2.2.2 above, View1 is linked over View2 where the lower level View2 is linked to below its root at its Y node. As you might expect, data filtering is applied to the joined lower level structure at the lower link point Y node, but for determining the result structure, node X remains the root. This means this capability is useful if there is no link available to the lower level structure’s root node or there is a reason for filtering on a non-root node(s). This capability also increases the overall data abstraction and ease of use for SQL hierarchical processing. More info on linking below the root will be discussed in Section 6.1 on Right Sided View Nesting.
2.3) SELECT Clause Controls Node Promotion
Data selection, known as projection in relational terms, is specified by the SQL SELECT clause. It controls which data types from the input working set defined by the FROM clause are moved to the output structure (a fixed vertical/column kind of filtering). Nodes with no data selected are not moved to the output structure which is standard relational processing. This slicing out of unselected output nodes is also standard hierarchical processing and is known as node promotion because the removed node’s selected descendent nodes are preserved by being naturally promoted up and around the removed nodes following the hierarchical form of the data structure. This is shown below in Figure 2.3 where node C was not selected for output.
View X Node
Promotion
A SELECT A.a, B.b, D.d, E.e
A
/ \ à
FROM ViewX à
/ | \
B
C B
D E
/ \
D E
Figure 2.3: Hierarchical node promotion and collection
The selected nodes in Figure 2.3 above also maintain their hierarchical semantics because the remaining nodes still exert the same hierarchical effect and semantics on each other (subordination is maintained). This is because the structure is naturally condensed in a hierarchically controlled manner. Also notice that nodes D and E with separate paths directly under node A is an example of node collection. All of these operations are basic hierarchical processing and are represented in relational flat hierarchical structures.
2.3.1) Fragment Isolation
Fragments are pieces of a hierarchical structure similar to substructures but more dynamic. This definition is extended here to include node promotion caused by node exclusion. By extending upon the same data selection principles used in node promotion demonstrated above in Figure 2.3, it is possible to isolate a hierarchically structured fragment in a view. This is demonstrated below in Figure 2.3.1 where a fragment is isolated by only selecting on data in the C, D, and E nodes.
View X Fragment
Isolation
A SELECT C.c, D.d, E.e
C
/ \ à
FROM ViewX à
/ \
B C
D E
/
\
D E
| |
F G
Figure 2.3.1: Hierarchical fragment isolation
2.3.2) Structure Transformation
By utilizing fragment processing shown above in Section 2.3.1 and hierarchical structure joining in Section 2.2.1, it is possible to isolate multiple structure fragments in a single structure and then manipulate them with join operations. This is fairly simple and intuitive, but involves another level of complexity by using high level prefixes (V1 and V2) to distinguish the two fragments. This is demonstrated in Figure 2.3.2 below where two fragments are formed and separately reconstructed from the same SQL view. These are fragments A-B and C-D, which can be used to independently manipulate and perform structure transformation by rejoining the fragments differently which is also shown below.
ViewX
Transformed
A
Structure
/ \ SELECT V2.C.c, V2.D.d,
V1.A.a, V1.B.b C
B C à
FROM V2.ViewX LEFT JOIN V1.ViewX
à / \
/ \ ON V2.C.c=V1.A.a
D A
D
E
|
B
Figure 2.3.2: Structure transforming using fragment processing
As shown in Figure 2.3.2 above, transforming data structures in standard SQL is possible by defining different fragments in the same structure by using the SQL alias/rename high level prefix capability to keep them separate and distinct. Both logical and physical structures once retrieved to the working set in memory are stored as contiguously fixed structures. This allows any logical fragment in the working set to be naturally isolated and moved separately to the result set using standard relational processing. This is possible because all accessed data types, such as logical relational, contiguous XML, and even linked IMS data forms are now stored in a homogenous rowset in the relational working set. This means the structured data in the working set can be treated as a fixed contiguous rowset and can be logically joined with other rowsets irregardless of the make up of the source data format. See Section 6.3 for more information on logical and physical structure operation.
2.4) WHERE Clause Controls Hierarchical Data Filtering
Hierarchical data filtering is controlled by the WHERE clause. It specifies which data row occurrences are filtered out based on their data content. This is a dynamic horizontal filtering of rows or path data occurrences. Data filtering will affect which data is in the result set on a row by row, path by path data occurrence basis.
One of a hierarchical query’s most powerful capabilities is its hierarchical WHERE clause data filtering which is applied to the full hierarchical structure. To fully understand hierarchical query data filtering which is more involved than flat structure filtering, it is more easily understood if examined as data qualification. This is because it is simpler to demonstrate a positive rather than negative operation when demonstrating the natural hierarchical filtering operation. This means that when hierarchical data filtering is used, it is examined as qualifying data rather than specifically filtering out data. When no WHERE clause filtering is specified, all the queried data is qualified.
2.4.1) Hierarchical Data Qualification
WHERE clause data qualification operates on the entire hierarchical structure in a hierarchical fashion based on how all the nodes relate to each other This same hierarchical process is occurring in standard SQL and relational table processing, but takes on hierarchical meaning (semantics) when the entire structure is recognized and the result is examined against its hierarchical structure. This semantics is more easily traced through the hierarchical structure following the related data qualified by the WHERE clause which is described below.
ViewX below in Figure 2.4.1 consists of tables A, B, C, D, E joined into the hierarchical structure as shown. The tables become nodes in the structure when modeled in SQL. If “A”, “B”, “C”, “D” and “E” are selected from this structure qualified on some value for “C” as in: SELECT A,B,C,D, E FROM ViewX WHERE C=5, what are the data selection semantics of this query? The qualification process starts at the WHERE clause condition that directly qualifies the C data node occurrence(s) with C=5. Then all path data occurrences under the qualified C data node occurrence (D and E node occurrences) and the path data occurrence above each qualified C data node occurrence (A node related data occurrence) qualify.
ViewX
A SELECT A.a, B.b, C.c, D.e, E.e
↓ ↑ FROM ViewX
B C=5 ß
WHERE C.x=5
↓ ↓
D E
The above described WHERE clause in data selection logic in Figure 2.4.1 covers directly related qualification logic familiar today, but multi-leg qualification includes any node path data occurrences indirectly related across the legs. These are connected to a qualified node data occurrence such as the related B node cousin data occurrences related by the Lowest Common Ancestor (LCA) qualified A node data occurrence in ViewX above in Figure 2.4.1. This same hierarchical result exactly reflects the result found in the relational result, with the meaning obscured by the flat representation. The hierarchical semantics and associated data are still available for hierarchical use. It can enable structured XML output to be nonprocedurally produced. See Section 3.1 for more on information on XML integration.
2.4.2) Multi-leg Filtering Semantics
When multiple legs take part in data filtering as in WHERE D=1 AND E=0 or arithmetic operations as WHERE D=E, using the ViewX below in Figure 2.4.3, all combinations across the qualified data value occurrences under the Lowest Common Ancestor data occurrences of node C are processed for a matching combination that qualifies. If lower level nodes are also selected, than all combinations that qualify must be processed since the lower level nodes can be qualified separately by different qualified combinations. This is also the same logic performed in hierarchical processing which is naturally reproduced by the relational Cartesian product processing.
Many queries will have data filtering criteria applied to more than two legs which can produce more than one Lowest Common Ancestor node filtering processes as described above. Using the hierarchical structure ViewX below in Figure 2.4.3, data filtering as in WHERE D=1 AND E=0 AND B=3 will cause a more complex processing of data filtering. In this example it requires nested processing of Lowest Common Ancestor logic in this compound WHERE clause. Common ancestor node C is derived from D and E nodes. It is located under common ancestor node A node which is derived from B node and C node’s sub result of D and E. This requires nested common ancestor processing of C under A node that naturally follows correct hierarchical processing logic.
Regardless of the number of common ancestors involved, standard SQL operating hierarchically will perform this hierarchical logic perfectly thanks to the hierarchically restricted Cartesian product processing controlled by the WHERE clause. It generates the correct combination of row values to automatically emulate this type of hierarchical nested Lowest Common Ancestor processing a single row at a time. More information of SQL’s automatic use of Lowest Common Ancestor logic can be founding Section 6.2.
2.4.3) Multi-leg Variable Semantics
As a further example of the subtleties and complexities of hierarchical processing, the automatic hierarchical query internal semantic processing can become more powerful and complex with OR decision processing in the WHERE clause. This enhances and complicates the natural semantics of hierarchical data filtering processing, making it dynamically variable. Using the hierarchical structure from ViewX in Figure 2.4.3 below, what would the semantics of the query: SELECT D, E From ViewX WHERE D=1 OR E=0 ? If D=1 is true and E=0 is not, then every occurrence of E under the qualified Lowest Common Ancestor C node occurrence is qualified (thanks to D=1 being true) and only the D node data occurrence with the qualified D=1 qualifies since its only qualification is more specific.
As you would expect from the data qualification results directly in Figure 2.4.3 above, the opposite results occurs if only E is true and D is not. In this case every occurrence of D under the qualified common ancestor C node occurrence is qualified and only the E node data occurrence that tested true qualifies. If both conditions are true, then all occurrences of the D and E nodes qualify because of cross qualification. These results also demonstrate why it is easier to examine data filtering as data qualification, because both sides of the OR operation have an additive affect. While these output qualification semantics are complex, the results they produce are logical and intuitive to the user.
The WHERE clause OR processing semantics described above means that both sides of the OR condition must always be tested because the qualification semantics can dynamically change depending on which side of the OR operation is true. SQL handles this advanced hierarchical processing automatically with its Cartesian product processing that generates all combinations of hierarchical relations which automatically checks both sides of the OR operation. The correctness of this operation can be logically verified by replacing the OR operation with separate filtering operations on two queries and unioning the results.
2.4.4) Path Data Filtering (Data Model Rules)
WHERE clause hierarchical data filtering is very powerful operating intuitively on the entire multi-leg structure, but you may still need a more restrictive (path only) type filtering similar to XPath. This is where ON clause data filtering can be used. While the ON clause is used to specify the join condition, it can also include a data filtering condition. Being on the ON clause it only affects the join operation of the node it is specified on and the related lower level data nodes which can not exist without its parent data occurrence existence (causing a cascading delete). All nodes on other legs are not affected in any way. As you probably realize, all ON clauses are processed before the WHERE clause is processed since the WHERE clause affects the entire structure which requires all the ON clauses to have been processed (at least logically).
SELECT Emp, Dpnd FROM Emp LEFT JOIN Dpnd ON EmpId=DpndEmpId…
WHERE DpndAge<18 ON…AND
DpndAge<18
↓ ↓
Results: Emp2 Dpnd2 Emp1
Null
Emp2 Dpnd2
Figure 2.4.4: Difference between WHERE and ON clause data filtering
3) Other Hierarchical Based Capabilities
3.1) XML Joinless Access
While logical structures composed of relational tables use joining operations when retrieved, physical structures like XML do not need to simulate costly relational joins in the retrieval process when they are modeled by Outer Joins views. This is because the Left Outer Join hierarchical data modeling syntax for physical structures only represents the data structure metadata semantics. This means physical structures can be accessed and processed directly without joins because the hierarchical semantics of the Left Outer Join are reflected naturally in the physical hierarchical structures themselves and do not require expensive and needless joins to physically model their hierarchical structure. Query languages that simulate physical join views to integrate with relational data relationally will have join processing overhead for XML to support SQL/XML integration. But because the natural solution described in this document utilizes ANSI SQL’s hierarchical processing to perform SQL/XML integration at a hierarchical level, it can efficiently process physical hierarchical structures hierarchically. This allows constructing the rowset that matches and models the Left Outer Join data modeling without performing expensive relational joins.
3.2) Natural Distributed Hierarchical Processing
3.3) Hierarchical Structure Construction Order
Hierarchal structures can generally be built or processed in any starting order without changing their semantics or result. This processing can be top-down, left to right, bottom-up, or in any combination. This is also true of the hierarchical structure’s data modeling definition. The same hierarchical structure can be modeled in a top-down, left to right, or bottom-up order which controls the order built. The completed hierarchical structure has the same semantics and derives the same results in what ever order it is built or processed. As an indication of SQL’s hierarchical processing capability and flexibility, all of these same capabilities and operational characteristics remain the same as hierarchical processing because the semantics are identical.
Generally speaking, hierarchical structures are most easily defined and built top-down as all the previous examples have intuitively demonstrated. Some automatic SQL data modeling definition or view expansion processes may change the standard data modeling order while remaining semantically the same. This can produce inefficiencies that cause throwaway data from dangling tuples (no matching row) that are avoided with top-down processing. This is avoided by rewriting the expanded outer join for top-down processing at runtime.
3.4) Full Ad Hoc Nonprocedural Processing Supported
4) Processing Unconventional Hierarchical Structures
XML hierarchical structures are not quite as fixed as conventional hierarchical structures. This is because XML is composed of semistructured data where their structure is defined in the data allowing it to change dynamically or even define logical network data structures. These present problems for relational processing. Relational solutions to these cases based on SQL hierarchical processing are described below. While it is nice to support all XML capabilities derived from its dynamically flexible structure definition, no XML processor is capability of supporting all possible capabilities. Each has their own strengths and limitations. For this reason, SQL should not be expected to support all XML capabilities.
4.1) Variable Structures
Variable structures are hierarchical structures whose structure can change dynamically between different structure occurrences and even within a given structure occurrence. This is allowed in XML. SQL can support variable structures by using variable values in the ON clause that hierarchically model the structures differently depending on the value of the variable. Since separate ON clauses are used at each join point they can dynamically control whether the join at any point in the structure building process is performed or not. By having a test condition based on a higher level data value located higher on the current path occurrence (hierarchical data inheritance is supported in SQL), the variable generation of the data structure can be dynamically controlled. This is shown below in Figure 4.1 where either a D node or an E node data occurrence is generated, but not both, controlled by the value in the C.x data field. This is similar to COBOL’s Depending ON clause in its File Definition section. This is demonstrated below with an overly simple example, but can give you an idea of how this capability is supported and utilized.
ViewX-1
ViewX-2
A SELECT * FROM
A A
/ \ LEFT JOIN B ON A.a=B.a
/ \
B C LEFT JOIN C ON A.a=C.a B C
| ß
LEFT JOIN D ON C.c=D.c AND C.x=1 |
D LEFT JOIN E ON C.c=E.c AND C.x=2
à E
Figure 4.1: Variable structures
4.2) Mapping Network Structures to Hierarchical Structures
Some of XML’s advanced features require or create a logical network structure. Network structures unlike hierarchical structures allow a node type to be accessed from more than one path. This is demonstrated directly below in the XML IDRef Structure in Figure 4.2. This makes XML hierarchical structures ambiguous for nonprocedural, navigationless languages like SQL because a node can have multiple path entries each with its own semantics. A similar problem occurs with the use of duplicate element type nodes which is also permitted in XML and shown below. It is possible to model these network type structures as unambiguous hierarchical structures in SQL. SQL has an alias/rename ability that can enable a network to hierarchical structure mapping capability as shown below in Figure 4.2.
IDRef Structure Hierarchical Structure Duplicate Element
Dept
Dept Dept
/ \ /
\ / \
Cust Emp à
Cust Emp ß
Cust Emp
\
/
| |
| |
Addr AddrC AddrE
Addr Addr
Figure 4.2: Hierarchical mapping solutions for network structures
While the underlying storage of the Addr node is different in both usages (shared or separate) above in Figure 4.2, the semantics of both structures are basically the same and can be mapped into the same remodeled hierarchical structure which is unambiguous as shown above.
5) Hierarchical SQL View Use and Importance
Hierarchical outer join definitions of logical relational table structures and physical XML documents can be placed in standard SQL views. These ANSI SQL hierarchical views have unique capabilities that come together synergistically producing even more powerful capabilities making SQL hierarchical processing more powerful and very user friendly as described below.
5.1) Hierarchical View Synergies
Hierarchical SQL views naturally map logical relational and physical XML structures using the Left Outer Join operations. This makes hierarchical processing flexible, reusable, and intuitive because of these hierarchical views’ powerful hierarchical view abstraction and conceptual processing. These hierarchical views can be embedded or hierarchically joined with other hierarchical SQL views. This is performed the same way tables are by using the Left Outer Join directly on views to naturally form larger hierarchical views as in FROM ViewX LEFT JOIN ViewY ON X.x=Y.Y as discussed in Section 2.2.1. This joining of hierarchical views preserves and combines the hierarchical semantics and can even be preformed dynamically. When all the specified hierarchical logical and physical SQL views expand at runtime they automatically form into a seamless and consistent unified heterogeneous virtual hierarchical view. Defined naturally in SQL, this takes on significant synergistic proportions making hierarchical processing very user friendly.
5.2) Hierarchical View Optimization
The Left Outer Join’s natural hierarchical data preservation operation described earlier allows separate views or the entire unified heterogeneous view to be optimized hierarchically at runtime. This is an optimization to access only the nodes referenced or on a path to a referenced node, thereby saving on unnecessary data access. This also significantly cuts down on data explosions caused by semantically incorrect and confusing data replication, and the inefficiency they cause in memory and CPU usage. This is demonstrated below in Figure 5.2 where nodes B and D from ViewX are not accessed.
ViewX Nodes Result
A
Accessed Structure
/ \ SELECT A.a, E.e A
A
B C à
FROM ViewX à
| à |
/ \
C E
D E
|
E
Figure 5.2: Hierarchical optimization
You can also see in Figure 5.2 that node C while not referenced is still required for navigating from node A to node E. Node C will be removed from the final result since it was not selected for output. The lack of access of the optimized out nodes (B and D) has no negative influence on the result and improves upon the semantic accuracy of the result by reducing unnecessary data replications.
This optimization’s power is significantly increased by how easily the SQL SELECT clause can be changed and the view can automatically adapt by eliminating unnecessary nodes from the defined structure. This is a form of semantic optimization driven by the data structure’s metadata instead of physical views that use procedural programming instructions with smaller code optimization windows.
Global views of entire structures become very useful with this hierarchical optimization. Global views result in fewer specialized views, further increasing the reuse and hierarchical data abstraction for the user so they do not need to be concerned with details of the structure being processed. This also allows global views to be used without incurring any overhead. In a nonprocedural 4GL like SQL, global views automatically increase the database domain of the queries using these large views. This makes them very user friendly by eliminating the use of small views and having to know when and how to use which view.
This powerful hierarchical optimization is not a replacement for the standard Inner or Outer join optimization. Nor does it have to be integrated into the current Inner or Outer join optimization. This powerful hierarchical optimization is simply applied before the standard optimization is performed. It can be applied externally and dynamically by rebuilding the query to eliminate the unneeded node paths.
This hierarchical optimization can also be applied to accessing physical hierarchical structures described Section 3.1. It can be used to indicate paths that can avoid database access. This use made of this will vary for the type of physical structures and their access mechanisms. Structures comprised of physical pointers like IBM’s IMS database versus nested contiguous structures like XML and Structured VSAM will use different access optimizations that can benefit from the hierarchical optimization described here.
6) More on the Internal Processing of this Technology
There are a number of automatic powerful internal ANSI SQL operations that play an essential role in ANSI SQL hierarchical processing that have not been fully covered yet. This was done to keep the previous descriptions more easily understood. At this point these powerful features can now be examined.
ANSI SQL makes a very powerful, complete, and flexible nonprocedural 4GL hierarchical querying language for relational and native XML data. It is unique because it automatically utilizes the semantics in the data structure to automatically process the most semantically complex multi-leg hierarchical queries without extending the relational model. This means the user does not have to know the data structure or supply the hierarchical semantic processing logic, increasing the value of the data. The results are simultaneously hierarchical and relationally accurate making the result mathematically correct. This allows it to integrate logical or physical, and heterogeneous hierarchical structures transparently at a full interactive conceptual hierarchical processing level. SQL naturally performs hierarchical processing so accurately that it can be used as a model for determining correct hierarchical processing. This has already been done to determine and validate the semantics of linking below the root of the lower level structure when joining hierarchical data structures.
References
[Abiteboul]; Serge Abiteboul and Nicole Bidoit; Non First Normal Form Relations to Represent Hierarchically Organized Data; PODS; 1984.
[Chen 1];Ya Bing Chen, Tok Wang Ling, and Mong Li Lee; Automatic Generation of XQuery View Definitions from ORA-SS Views; School of Computing National University of Singapore, 2004.
[Chen 2]; Zhuo Chen, Tok Wang Ling, Mengchi Liu, and Gillian Dobbie; XTree Declaritive XML Quering, School of Computing National University of Singapore; 2004
[Cohen]; Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv; XSEarch: A Semantic Search Engine for XML; Proceedings of the 29th VLDB Conference, 2003.
[David 1]; Michael M David; Advanced Capabilities of the Outer Join; ACM SIGMOD Record; March 1992.
[David 2]; Michael M David; ANSI SQL Hierarchical Processing Can Fully Integrate Native XML; ACM SIGMOD Record; March 2003.
[Leven]; Mark Leven and George Loizou; Semantics for Null Extended Nested Relations; ACM Transactions on Database Systems; Sept 1993.
[Li]; Yunyao Li, Cong Yu, and H. V. Jagadish; Schema-Free XQuery; 30th VLDB Conference, Toronto, Canada, 2004.
[Sengupta]; Arijit Sengupta and Melmet Dalkilie; DSQL – An SQL for Structured Documents; Department of Accounting and Information Systems Kelley School of Business Indiana University; 2002.
[Ullman]; J. D. Ullman, A. V. Aho, J. E. Hopcroft; On Finding Lowest Common Ancestors in Trees; Annual ACM Symposium on Theory of Computing, 1973.
Michael
M David is CTO and founder of Advanced Data Access Technologies, Inc.
Previously a staff scientist and the lead XML architect for NCR/Teradata and
their representative to the SQLX group, he has over twenty years experience
designing commercial nonprocedural heterogeneous database access products.
Based on this experience, he has authored the book Advanced ANSI SQL Data
Modeling and Structure Processing and many papers and articles on this subject.
Mike can be reached at
mike@adatinc.com. More information on Mike’s company and research can be
found at
www.adatinc.com.
![]()