May 2002        Issue: 25

Journal of Conceptual Modeling
www.inconcept.com/jcm


The Logical-Physical Confusion
by Fabian Pascal

 Note: Each article in this series will start with a very carefully selected real-world quote representative of a major fallacy so entrenched in the industry that few practitioners are even aware of it. To test your ability to see thru these misconceptions, the reader should try to figure out what's wrong with the quote before reading the article. Quotes will be posted on this site in advance of the articles clarifying them and those who identify any of the misconceptions prior to their clarification in subsequent articles are invited to submit their explanation via email to the editor. The best one for each quote will receive a copy of my PRACTICAL ISSUES IN DATABASE MANAGEMENT, courtesy of the Journal of Conceptual Modeling. Here are the first five quotes.

 “The more you drift away from any physical implementation, the more performance is going to suffer ... The choice is between best logical structure or best physical structure, or a compromise.” 

A confusion between logical and physical levels of data representation has always plagued the database field. It usually is implicit in most of what is said and done in the field and, therefore, not readily discerned by the uneducated eye -- read: most of the industry -- but the above quote is quite explicit (though I still doubt that it's recognized as an example of the confusion).

 In order to appreciate how fallacious the quote is, let's use an analogy. Suppose you are to retrieve a set of books from a library and you have a choice between a set of five books and a set of ten books. If asked which set will take you longer to retrieve, would you be able to say? When I ask database practitioners this question, there is instinctive tendency to say the five set, but upon further thought most realize they can't say, because how long a set will take will be determined not by the number of books per se, but by how books are shelved. It is entirely possible for ten books to take less time if, for example, they are all shelved together, while the five are scattered across the library; or if the ten books are on the first floor, near the elevator, while the five are on the seventh floor, further away from it; or if the ten are cataloged, but the five require contacting a librarian (if any of the books were in use, retrieval could take a very long time indeed). You get the picture: retrieval performance is determined by how books are physically stored and accessed. Smaller numbers of books won't necessarily minimize retrieval time relative to larger numbers. 

Note: Applying the quote to the library case would yield: "The more you drift away from the library's book storage and access methods, the more performance is going to suffer; the choice is between the best number of books to be retrieved, the best storage and access methods, or a compromise". What does it mean to "drift away" from the physical storage in a library? And how can one compromise between the number of books and their storage? The notion of choosing between the logical and physical levels is absurd on its face.  

The same principle holds for database retrieval: it is not the number of logical tables involved in a query, but the data storage and access methods -- the physical implementation of the DBMS and the database -- that determine perfomance. If so, why has widespread confusion between the two levels persisted for so long in the industry? It underlies most pronouncements by practitioners and pundits (without them realizing it) and much of what is published. It is so entrenched that even when they recognize the similarity between libraries and databases, practitioners still defend the best known example of the confusion, the "denormalize for performance" argument (see Chapter 5 in UNDERSTANDING RELATIONAL DATABASES and several articles in the Against the Grain  series), according to which decreasing the number of (logical) tables -- which is what denormalization does -- enhances performance. When I question this argument, I usually get a response along the following line: "Well, it's all fine and dandy 'in theory', but we live in the real world and with current DBMS products I cannot get acceptable performance unless I denormalize my databases."  

Let us assume, for the sake of argument, that (a) a SQL DBMS always performs poorly with a fully normalized database and performance always improves when it is denormalized and (b) all other potential performance factors have been carefully ruled out (these are very strong and, therefore, quite unrealistic assumptions). Should the conclusion in even such circumstances be that it is the degree of normalization -- the number of logical tables -- which determines performance? Of course not. 

The only conclusion must, logically, always be that there is something in the physical implementation of the database, the DBMS, or both that is responsible for the poor performance of a fully normalized database, and for improved performance when the database is denormalized. This can be readily demonstrated in the SQL world. One characteristic of current SQL implementations (and the non-SQL products that preceded them) is that they maintain a one-to-one relationship between logical rows and physical records -- there is a physically stored file for every logical table. The obvious consequence is, of course, that when a database is further normalized (number of logical tables increases), so does the number of physical files; and when a database is denormalized (number of logical tables decreases), so does the number of physical files. And it is the physical variation that affects performance, not the logical one! Hence the confusion between the two levels and the weird notion of compromising between them: SQL DBMSs simply fail to adequately separate the two.

 This is a very good example of an implementation flaw of which the industry is not aware due to insufficient foundational knowledge. SQL DBMSs' poor insulation of the logical from the physical level is a direct result of this inadequacy and a major contributor to the confusion between the two levels. Nothing requires a DBMS to maintain the 1:1 relationship and, in fact, a new technology has recently been devised that can be used to implement truly relational DBMSs and databases that do away with physically stored tables altogether. Such a true RDBMS promises to be more optimizable, much better performing and easier to administer -- e.g. indexes may no longer be necessary! -- than the current SQL crop (unfortunately, I cannot say more about the technology at this point, but stay tuned to DATABASE DEBUNKINGS  

What is ironic about all this is that the whole concept of denormalization for performance is actually an illusion. While performance does improve sometimes, this is possible only if the integrity implications of denormalization are ignored. The industry does not realize that denormalized databases suffer from redudancy and, thus, have a considerably higher risk of data corruption, which imposes a highly prohibitive integrity burden on users. But if the necessary additional integrity constraints are imposed to prevent corruption, they will cancel out ( if not overtake) performance gains, if any, from denormalization (see Chapter 5 again and What You Don't Know about Denormalization Can Hurt you, Parts 1 and II, in the Against the Grain series).

 As long as practitioners believe, erroneously, that normalization inhibits performance, they will continue to blissfully denormalize their databases without controlling the ensuing redundancy, oblivious to corruption risks, and will never express, via their collective purchasing pattern, a desire for the real solution -- product improvements -- and will never get it. This holds true for each and every one of the many misconceptions prevalent in the database field. 

For more of those, stay tuned.

Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management.  He was affiliated with Codd & Date and for more than 15 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on database technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder and editor of DATABASE DEBUNKINGS a web site dedicated to dispelling prevailing fallacies and misconceptions in the database industry, where C.J. Date is a senior contributor. He has contributed extensively to most trade publications, including Database Programming and Design, DBMS, DataBased Advisor, Byte, Infoworld and Computerworld and is author of the contrarian column Against the Grain.  His third book, PRACTICAL ISSUES IN DATABASE MANAGEMENT serves as text for a seminar bearing the same name.

 

© Copyright, 1998-2004 InConcept (Information Conceptual Modeling, Inc.) All Rights Reserved. Privacy Statement. ISSN: 1533-3825