May 2003        Issue: 28

Journal of Conceptual Modeling
www.inconcept.com/jcm

 

What Do You Mean?

by Fabian Pascal

“I am designing a database system but I am a little unsure about normalization - could anyone tell me if the information below is in third normal form? If it isn't could anyone suggest where I've gone wrong.

·            PROTOCOL (PROTOCOLREF,TITLE,GUIDELINELINK,URL,FORMAT,CATEGORY,ABSTRACT)

·            USERS (PROTOCOLREF,USERNO) `

·            USERPROFILE(USERNO,FIRSTNAME,SURNAME,E-MAIL,HOMEHOSPITAL,NHSTRUST,POSITION)

·            COMMENTS (PROTOCOLREF,USERNO,DATEMADE,MAINBODY)

·            DATES (PROTOCOLREF,DATECREATED,NEXTREVIEW,LASTREVIEW)

·            AUTHOR (PROTOCOLREF,FIRSTNAME,SURNAME,PUBLISHER)

·            OTHERAUTHORS (PROTOCOLREF,FIRSTNAME,SURNAME)

·            WATERMARK (PROTOCOLREF,AUTHORITYSCORE,ORGANISATION)

·            OTHERPROTOCOLS (PROTOCOLREF,TITLE,RELATEDPROTOCOLREF)
where PROTOCOLREF and USERNO are key fields.”

 

Most common in the IT world are requests for advice on specific database designs, particularly ascertaining whether a particular database is fully normalized. Such requests, of which the above example is representative, indicate lack of fundamental knowledge necessary for database management. Chris Date interprets such requests to mean “I don’t know how to do my job and I am looking for somebody to do it for me.”

A logical model represents in the database a business model. The business model is the users’ understanding of what the data means—formally, the external predicate—while the logical model is the meaning of the data as understood by the DBMS—formally, the internal predicate. It is desirable, of course, to capture as much of the meaning in the business model into the logical model, and that is exactly what a data model is intended for: it is a general, formal theory of data via which we map enterprise-specific business models (such as the one on the left in Figure 1) to their logical representations in databases (on the right in Figure 1).

Note: The data model is to logical models what a programming language is to programs. In the same manner in which programs are task-specific and the language is generic (that is, it is used to write many programs), the logical models are enterprise-specific and the data model is generic. The data model instills meaning into data by structuring or organizing it, otherwise it would be random noise rather than data, because it would carry no informational value.

Figure 1:  Business(left) and Logical (right) Models

It follows that without the business-logical mapping that the data model facilitates, the DBMS won’t understand what the data means and, therefore, won’t be able to guarantee correctness of representation and manipulation. Otherwise put, there cannot be data management without some data model. And the only known data model that has all four desirable properties—generality, completeness, formality and simplicity—is the relational data model, which maps business models (external predicates) to R-tabular logical models (internal predicates).

Note: Before we had DBMSs, management of data  (integrity and manipulation) was performed by users in each and every application, a redundant, error-prone, and costly to maintain approach.

Loosely speaking, an entity type in the business model has a uniquely identifying attribute and, consequently, all its other attributes are “about”—dependent on—the identifier e.g. customers have a customer-ID and all customer attributes are dependent on it. The identifying attribute is represented logically in a relational database by a key column e.g. CUSTID. Full normalization means, again loosely, “one entity-type per table”. If this design principle is adhered to, all non-key columns in a table are about “the key, the whole key (if the key is composite), and nothing but the key (no dependence on other non-key columns)--KWKNK for short. For this reason fully normalized databases avoid

a)     Database bias for certain applications and against others

b)     Update anomalies: certain reasonable data operations will not be possible

c)     Redundancy: raises the risk of data corruption and, thus, the integrity burden to control it

If a database is undernormalized--multiple entity-types are “bundled” into one table--the KWKNK property will no longer hold, causing the three problems above (see The Dangerous Illusion: Normalization, Performance and Integrity, Parts 1 and 2 and PRACTICAL ISSUES IN DATABASE MANAGEMENT.)

Note very carefully, therefore, that it is impossible to establish the level of normalization of a database by just looking at a set of tables, without knowing what they are supposed to represent, what they mean—the business model. Consequently, requests such as the one at the beginning of this article are meaningless.

Normalization applies only to R-tables, which have the following properties:
1.
      No meaningful ordering of rows and columns
2.
      No duplicate rows
3.
      No nulls
4.
      Single-valued columns (SVC)—that is, R-tables are in first normal form (1NF) or normalized, by definition (see Papers #1 and 2 in the recently launched DATABASE FOUNDATION SERIES at DATABASE DEBUNKINGS).

It should be obvious that the request fails to provide the necessary business model information on the basis of which we can ascertain whether the tables are first and foremost R-tables.

1.      We cannot tell whether columns or rows are meaningfully ordered or not;
2.
      We do not know the unique identifiers and, thus, the keys and whether they were properly selected to prevent duplicates or not;
3.
      We don’t know whether certain attribute values can be unknown and, thus, whether the pertinent columns permit nulls or not;
4.
      We do not know what the attributes mean and how the data types from which the columns representing those attributes are defined, hence we cannot tell whether they are SVCs or multi-valued columns (MVC);

In other words, without knowing what the tables mean—that is, their predicates—we cannot tell whether they are R-tables and, therefore, normalized or not!

Moreover, even if we assume they are normalized (or in 1NF), we still lack the information necessary to establish whether they are fully normalized:

1.      We don’t know what the entity types and their relationships in the business model mean, and therefore, whether the tables “bundle” entity types or not.

2.      We don’t know if any identifiers are multi-attribute and, thus, if some keys are composite. We cannot tell, therefore, whether tables with such keys that are in third normal form (3NF) are automatically in fifth normal form (5NF) or not.

What is more, neither can we tell whether the table designer’s understanding of the business model is accurate and complete, and we should not assume that it is.

So next time when you encounter such requests, particularly via email, and see somebody fulfilling them in the absence of a business model properly understood by both parties, the only conclusion you can draw is that neither party knows what it’s doing.

Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management.  He was affiliated with Codd & Date and for more than 15 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on database technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder and editor of DATABASE DEBUNKINGS a web site dedicated to dispelling prevailing fallacies and misconceptions in the database industry, where C.J. Date is a senior contributor. He has contributed extensively to most trade publications, including Database Programming and Design, DBMS, DataBased Advisor, Byte, Infoworld and Computerworld and is author of the contrarian column Against the Grain.  His third book, PRACTICAL ISSUES IN DATABASE MANAGEMENT serves as text for a seminar bearing the same name.

 

© Copyright, 1998-2004 InConcept (Information Conceptual Modeling, Inc.) All Rights Reserved. Privacy Statement. ISSN: 1533-3825