August 1998        Issue: 4

Journal of Conceptual Modeling
www.inconcept.com/jcm

Normalization and ORM
by Scot A. Becker

Introduction

I have been advised not to write this article.

You see, I read a lot of books and anything else about modeling that I can get my hands on. I am on a first name basis with the folks at Amazon, Barnes and Noble, and Borders. Needless to say, I tend to bore some of my colleagues with my interest in modeling.

Due to my fifty-dollar-a-week habit, I see a lot of errors. I don't even want to get into them all here because it would be impolite to pick on respected authors. For that exact reason, many people have advised me not to lead this topic. And, for a while, I listened to those folks.

This all finally bubbled to the surface on a client site when I watched someone try to explain normalization to the customer project lead. Now this guy (the customer) is bright; he didn't get to where he was by being dumb. But still, he couldn't grasp the discussion of the Boyce-Codd normal form, functional dependencies, tuples, entities, attributes, relationships, foreign keys, cardinality, multi-valued attributes, referential integrity, or, God forbid, relational algebra. And why couldn't he grasp that? Because our lingo is arcane and he has bigger and better things to worry about. But still, the person leading this hopeless lecture had to try to explain these concepts in order to validate our existence.

There has to be a better way to do this. In fact there is, but you don't see it talked about very often (except in pioneering publications such as this, of course!). And that (Object-Role Modeling, or ORM) is my main focus for this article. But first, a disclaimer:

Disclaimer

It is not my intention to criticize anyone. The folks who have written the stuff I am about to talk about have been doing this a lot longer than me. But still, there are massive problems with the models and approaches given to us in the traditional publications.

I would rather have just presented the problems without using any names. However, due to copyrights and ethics, I have to name my sources because I am going to use the same models for illustrative purposes. Thus, you will find the sources in the bibliography at the end of this article, but that is the only place I will 'point fingers'. If you happen to be the author of one of these publications, I would be happy to continue this discussion with you and to even publish a rebuttal.

The real intention behind this piece (which, by the way, is intended for data modelers and business experts alike) is to show how a different approach and a different thought pattern can lead to equivalent (and often, better) results. Hopefully, someone who has never seen the concepts I will talk about below will both understand them and realize there is a better way to do this stuff. For the folks who already use ORM (who will probably mutter something about 'preaching to the choir' while reading this), I hope I am giving you some fodder for the next time you have to convince someone of this approach. And, finally, for you die hard ER/Normalization folks, I hope I can convince you. (If I can save just one bad model, then it is all worth it!)

The Traditional Normalization Approach

It astounds me every time I see a book listing traditional normalization as a technique to figuring out your models. By traditional normalization, I mean the approach where you take actual data, lumped in one huge, ugly table, and ferret out repeating groups and functional dependencies. For example [1, pp.25-50]:

We start with some data (taken from a form) of employees, their departments, and their college degrees:

Emp # Name Dept # Dept Name Location Qualification1 Qualification2 Qualification3
01267 Clark 05 Auditing HO BA 1970 MA 1973 PhD 1976
70964 Smith 12 Legal MS BA 1969        
22617 Walsh 05 Auditing HO BA 1972 MA 1977    
50607 Black 05 Auditing HO            

Table One: Sample data

The first step is to remove the multi-valued attributes, or, in other words, remove the repeating groups and insert a cross-reference (foreign key) to the employee, like below:

Emp # Name Dept # Dept Name Location
01267 Clark 05 Auditing HO
70964 Smith 12 Legal MS
22617 Walsh 05 Auditing HO
50607 Black 05 Auditing HO

 

Emp # Qualification Year
01267 BA 1970
01267 MA 1973
01267 PhD 1976
70964 BA 1969
22617 BA 1972
22617 MA 1977

Table(s) Two: Removing repeating groups

Next, we want to remove duplicated data to eliminate redundancy, wasted space, and update anomalies (having to make a change to a piece of data to all instances of that data). We also spot a functional dependency that needs to be removed, as below:

Emp # Name Dept #
01267 Clark 05
70964 Smith 12
22617 Walsh 05
50607 Black 05

 

Dept # Dept Name Location
05 Auditing HO
12 Legal MS

 

Emp # Qualification Year
01267 BA 1970
01267 MA 1973
01267 PhD 1976
70964 BA 1969
22617 BA 1972
22617 MA 1977

Table(s) Three: Eliminating redundancy and separating functional dependencies

Let's do the same 'model' in Object-Role Modeling (ORM). First we need to make some initial assumptions. The first assumption is that the sample data given in table one is a complete and representative set of the Universe of Discourse (UoD). We all know that it isn't, but humor me for a bit. The second assumption is that we have access to a business expert who can answer our questions and speak a language.

The sample data and a little bit of knowledge of the UoD leads to the formation of the following facts:

Employee has Name
Employee was granted Qualification during Year
Department is in Location
Department has Name
Employee works in Department

Now, let's figure out some details. We need some identifiers, so we can also add the facts:

Employee is identified by Emp #
Department is identified by Dept #

We'll also need to know how the data is populated. Using the data provided, we note the following:

Emp # is unique
Dept # is unique
An employee can have a given qualification (degree) only once
An employee must have a Emp #, a name, and belong to a department
A Department must have a Dept #, a name, and a location
Department name is unique

I am going to add one other rule that was not reflected in the original model but is needed to illustrate the comparison: a person cannot have the same degree twice.

The data also leads to a few more conclusions which knowledge of similar Universe of Discourses would rule out. For example, the data suggests that no location can have more than one department, and that only one qualification can be granted per year. It also suggests that Employee Name is also unique. If we were taking a restrictive approach and/or we believed the data set to be representative of this UoD, we would have to include these constraints in the model. Because this is a UoD I am familiar with, and because I have access to a subject matter expert who rebuked the above restrictions, I left them out (see figure 1).

Image33.gif (7170 bytes)

Figure One: An ORM model of the data in Table one

Note that at no time did I mention 'functional dependencies' and/or 'redundant tuples'. I never care about foreign keys or multi-valued attributes. And still, when I derive a schema from the model in figure 1, I get the same results. Further, our users can talk to us about their UoD and lead to the formation of a better model. Your users will understand English (or whatever your preferred language) better than Relational Algebra, I guarantee it.

You'll note that the preceding normalization example took us up to third normal form (actually, it is fully normalized). We never discussed the other steps of normalization. It turns out that ORM natively handles these as well, as we'll see in the next section.

Higher Normal Forms

Most people only talk about third normal form (3NF). Many people say this is because most models that are in 3NF also happen to be in the higher normal forms as well. While that may be true, I personally think that most people only talk about third normal form because they don't understand the formal definitions of the higher normal forms. And for good reason; they are quite obscure.

The first higher normal form is Boyce-Codd Normal Form (BCNF). The shortest definition of BCNF is that (in addition to already being in third normal form) 'Every determinant must be a candidate key.' Let's see an example [1, pp. 177-200]:

Image34.gif (2148 bytes)

Figure Two: An ER diagram of a model in 3NF but not BCNF

Figure two is an ER (Entity-Relationship) diagram of a model in 3NF but not BCNF. There are several problems here. The first problem is the conceptually awkward middle entity representing the relationship between a Branch and a Customer (but I digress). The normalization problem lies in the relationship between a salesperson and a branch; you end up salesperson and their branch appearing in every row such as below:

BRANCH-CUSTOMER RELATIONSHIP (Customer No, Branch No, Visiting Frequency, Date Relationship Established, Salesperson No)

A brief explanation of this notation may be in order. The above notation is used by the original author (of this example) to denote a relational table. The underlined columns are used in the key. And, before you start picking on me, I know there are no attributes in the ER diagram but there are attributes listed in the relational table. I have pretty much left this as the original author did. I think the ER diagram in Figure Two is supposed to represent a 'conceptual' model in ER, which is clumsy to say the least.

The original author proceeds to hack out the branch-salesperson relationship into the following tables, which satisfy BCNF:

CUSTOMER-SALESPERSON RELATIONSHIP (Customer No, Salesperson No, Visiting Frequency, Date Relationship Ended)
SALESPERSON (Salesperson No, Branch No)

Let's do this in ORM, but I'll speed this process up and do a bit less explaining of the ORM mechanics. All we need to do is start with some facts:

Customer establishes a relationship with Salesperson.
This relationship has a beginning date.
This relationship has a total number of visits.
Salesperson works in (exactly one) branch.

These rules yield an ORM model that can be used to create the same relational schema above. This ORM conceptual schema is shown in figure three.

Image35.gif (3593 bytes)

Figure Three: An ORM schema which generates and equivalent BCNF schema. Can you spot the problem?

This model is not entirely correct, however, nor is the second relational schema (in BCNF) that was created by the original author. Take a look at the original ER schema in figure two and its corresponding relational schema. Then, take a look at the original author's revised (BCNF) schema and the ORM schema above in figure three. Did you find the problem? The original ER schema also implied another rule: a customer can only be served by one salesperson at a branch. In creating the second relational schema above, the author eliminated the BCNF problem but lost a business rule. This is undesirable, as you can imagine. We can capture this rule by including one more constraint (an external uniqueness constraint) in our ORM conceptual schema, as below in figure four.

Image36.gif (3677 bytes)

Figure Four: The correct and complete conceptual schema, which yields an equivalent BCNF relational schema.

Not only did we create a BCNF schema without worrying about normalization rules, but we also avoided an error!

The author did realize this business rule loss. He states, "We have traded the enforcement of a rule for the advantages of normalization. It is almost certainly a good trade, because it is likely to be easier to enforce the rule within program logic than to live with the problems of redundant data, update complexity, and unwanted data dependencies." [1, p. 181]

I disagree. Let me address those arguments one at a time. InfoModeler generates this rule with a push of a button, which is certainly as easy as, if not easier than, passing the buck to your programmers. Further, you know this rule will be enforced with every revision of the schema; you do not have this assurance from your future programming staff. The problems of redundant data are gone as well, as the schema is now in BCNF. The same goes for unwanted data dependencies. The only argument left is update complexity. I believe that this is a moot point with modern technology. Update complexity may have been an issue in older RDBMS's, but not anymore. And when it is an issue, you are likely implementing a different structure, such as a data warehouse or data mart, and I doubt you will care about BCNF (or, in many cases, even 3NF) then.

We still have to talk about fourth and fifth normal forms (4NF and 5NF, respectively). These issues only arise in tables that have 3 or more columns (and occurs seldom even then), all of which comprise the key. I can cover the resolution of these normal forms (by using ORM) quickly.

4NF and 5NF basically entail the splitting of all key tables into smaller subsets without any information loss. This is nothing more than the concept of elementary facts as used in ORM. Splitting a key table into smaller tables is analogous to splitting a ternary (or higher fact) into smaller facts. For example, if a ternary was not elementary, it should be split into two binary facts.

There are 2 sufficient conditions for splitting a fact type. First, a fact type should split if it has more than one role excluded from the key (a.k.a. 'n-1 rule', a uniqueness constraint must span at least n-1 roles where n is the number of roles in the fact). Second, a nested object type should split if it has a shorter key. [2, 117] In other words, if you can split the fact into smaller facts without any information loss, then the first fact is not elementary, nor will it be in 4/5NF. This is an essential stage in the ORM data modeling process and occurs long before you ever worry about relational tables. Halpin discusses a formal proof of this technique, the projection-join check, in section 4.6. That's it: 4th and 5th normal forms just by following the fundamental rule of ORM.

Conclusion

We know that ORM is vastly superior to any other modeling technique at the conceptual level. It's rich constraints and intuitive approach lead to better models that even your least technical users can understand and verify. We've also seen that at the logical level, ORM still prevails as it generates equivalent fully normalized schemas without ever wondering what a functional dependency was. We were even able to catch an error produced by manual normalization. The decision to use ORM should be a simple one to reach.

Bibliography

[ 1 ] Data Modeling Essentials: Analysis, Design, and Innovation by Graeme Simsion. 1994 International Thompson Computer Press; ISBN: 1-850-32877-3.

[ 2 ] Conceptual Schema and Relational Database Design, Second Edition by Terry Halpin. 1995 Prentice Hall Australia; ISBN 0-13-355702-2.

Scot A. Becker is a software consultant and the founder of Orthogonal Software Corporation. He is also a certified ORM consultant and trainer, a certified Visio trainer, and former Editor of the Journal of Conceptual Modeling.  

Contact Information:

Scot A. Becker
Orthogonal Software Corporation
scot@orthogonalsoftware.com

www.orthogonalsoftware.com

© Copyright, 1998-2004 InConcept (Information Conceptual Modeling, Inc.) All Rights Reserved. Privacy Statement.
ISSN: 1533-3825