June 1999                   Issue: 9

Journal of Conceptual Modeling
www.inconcept.com/jcm

Data Schema Normalization
By Scot A. Becker

Preface

This paper is being simultaneously published in the June, 1999 issue of the Journal of Conceptual Modeling (JCM) and as a Virtual FoxPro User Group (VFUG) "lecture" (see the Internet Resources section at the end of this paper).

A VFUG lecture is a combination of a pre-published paper followed by an internet-based (chat) Q & A session to summarize the paper, provide additional information, and answer questions. In addition, during the Q & A session I will also discuss some other topics that are a bit outside of the scope of this paper. Such topics may include de-normalization, which normal forms are the most important (in practice), and a bit more about ORM’s role in normalization.

All VFUG members and JCM readers are invited (and encouraged!) to join us for the Q&A session on Thursday, June 10, 1999.

Server information, times, and software instructions are in the footnotes below [1].

Click here for an (edited) transcript of the online lecture.

Abstract

The purpose of this white paper is to outline the various normal forms used in ("manual") data schema normalization. This paper begins with a discussion as to why a database designer is concerned with normalization. Immediately following is a discussion of functional dependencies and an introduction to the eight normal forms. This paper is closed with a discussion as to how normalization fits in with the data modeling technique, Object-Role Modeling (ORM). The reader should be familiar with basic database concepts (such as tables, columns, rows, and keys) before reading this paper.

Introduction

The main reasons that a designer normalizes data structures include:

  1. Semantic grouping of related elements, reduction of redundant values in tuples (one or more columns) which can cause insert, delete and update anomalies.

  2. Reduction of null values in tuples.

  3. Disallow spurious tuples (incorrect join combinations of data values due to improper structures).

Further, normalization tends to aid in the discovery process, allows more precise capture of business logic, and aids in model management and integration (even across the entire Enterprise).

Functional Dependencies

The main root of a normalization technique centers on functional dependencies [2]. This simply means that the values of a component "Y" in a relationship depend on, or are determined by, the value of a component "X". Conversely, one could state that the values of X determine the values of Y (as a side note, this functional dependency is also denoted: X ® Y). Further, a functional dependency is invalid if we have two records with the same "X" value but different "Y" values.

For example, in many data structures, a ‘Social Security Number (SSN)’ determines a ‘Person’, which is commonly (but usually not uniquely) referenced by a ‘NAME’. Thus, we say that NAME is functionally dependent on SSN. Keep in mind that the attributes X and Y may be composite structures (more than one attribute or, in other words, column).

The root of normalization centers on ensuring that non-key attributes of a table are functionally dependent on the key of the same table.

The Normal Forms

A given data structure can be at one of several levels, or stages of completeness, of normalization. These stages are known as normal forms. The eight normal forms are First Normal Form, Second Normal Form, Third Normal Form, Elementary Key Normal Form, Boyce-Codd Normal Form, Fourth Normal Form, Fifth Normal Form, and Project-Join Normal Form.

First Normal Form

First Normal Form (1NF) is now generally considered part of the formal definition of a relation. Historically, 1NF was intended to disallow multi-valued attributes. 1NF dictates that the domains (allowable values) of attributes must include only atomic (simple, indivisible) values and that any given value of an instance of an attribute must be a single value from the domain of that attribute. In short, a given cell of a column in a table can contain only one value.

The following table violates 1NF because the second row contains more than one value in the COLOR column:

CAR

COLOR

123 ABC

Blue

321 CBA

Black
Grey
White

Figure 1: This table violates 1NF.

To ensure a table is in 1NF, one simply needs to decompose grouped attributes into separate rows (or in some case, tables). The following representation of the above data is in 1NF (and others):

CAR

COLOR

123 ABC

Blue

321 CBA

Black

321 CBA

Grey

321 CBA

White

Figure 2: This table is in (at least) 1NF.

Second Normal Form

Second Normal Form (2NF) is based on the concept of "full" functional dependency. A functional dependency, X ® Y, is a full functional dependency if removal of any attribute from X means that the dependency does not hold anymore. For example, Given a table that tracks hours (HOURS) a given employee (SSN) devotes to a given project (PROJNUM), we note that HOURS is functionally dependent on the combination of SSN and PROJNUM because a given employee can work on more than one project. Removal of either SSN or PROJNUM from the functional dependency results in an incorrect relationship. For example, if we were to remove SSN from the previous functional dependency, we are left with HOURS and PROJNUM, but we don’t know which SSN worked those HOURS! Thus, we say SSN, PROJNUM is a full functional dependency of HOURS.

A table is in 2NF if that table is in 1NF and every non-prime (is not involved in a primary key of the table) attribute in that table is fully functionally dependent on the primary key of the table.

For example, the following table is in 1NF but is not in 2NF because PNAME and PLOCATION are dependent on only part of the primary key (PROJNUM and SSN). Likewise, ENAME is also only dependent on SSN. (Primary keys will be denoted in this document by underlining the included columns.)

Employee_Project_Table

SSN

PROJNUM

HOURS

ENAME

PNAME

PLOCATION

Figure 3: This table violates 2NF.

To correct this schema, we need to create additional tables and decompose the partial dependencies into these new tables, as in figure 4.

(Please note that, for simplicity, these diagrams will not denote the resulting referential integrity, such as foreign keys, that would need to be added to these decomposed schemas.)

Employee_Table

SSN

ENAME

Project_Table

PROJNUM

PNAME

PLOCATION

Hours_Table

SSN

PROJNUM

HOURS

Figure 4: The table from Figure 3 has been decomposed to form a 2NF schema.

Third Normal Form

Third Normal Form (3NF) is based on the concept of "transitive dependency". A transitive dependency can be loosely defined as a dependency that does not involve the primary key. For example, in the table below, we see that while all elements have functional dependencies on the key, SSN, there also exist other, transitive, dependencies. Namely, DEPTNAME is dependent on DEPTNUM.

Employee_Department_Table

SSN

ENAME

BDATE

DEPTNUM

DEPTNAME

Figure 5: This table violates 3NF.

A table is in 3NF if it is in 2NF and no non-key attributes are dependent on other non-key attributes.

We can decompose the above table into 3NF by creating a second table for department. Thus, the following structure is in 3NF:

Employee_Table

SSN

ENAME

BDATE

DEPTNUM

Department_Table

DEPTNUM

DEPTNAME

Figure 6: The table from Figure 5 has been decomposed to form a 3NF schema.

There is a subtle difference between 2NF and 3NF. In 2NF, we were concerned about non-key fields being dependent on subsets of the key. In 3NF, we are concerned about non-key fields being dependent on other non-key fields. Another way to say this has been nicely summarized as: any non-key field must be "… Dependent on the key, the whole key, and nothing but the key." [Kent]

Elementary Key Normal Form

Elementary Key Normal Form (EKNF) is a subtle enhancement on 3NF (by definition, EKNF tables are also in 3NF) that most often occurs when there is more than one unique composite key (more than one column) which overlap (one or more columns are involved in both keys) in a table 3. Such cases can cause redundant information in the overlapping column(s). For example, in the following table, let’s assume that a subject title (SUBJECTTITLE) is also a unique identifier for a given subject in the following table:

Enrollment_Table

STUDENTNUM

SUBJECTCODE

SUBJECTTITLE

1

CS100

ER

1

CS114

ORM

2

CS114

ORM

Figure 7: This table, although it is in 3NF, violates EKNF.

The primary key of the above table is the combination of STUDENTNUM and SUBJECTCODE. However, we can also see a (non-primary) uniqueness constraint (alternate key) that should span the STUDENTNUM and SUBJECTTITLE columns as well. The above schema could result in update and deletion anomalies because values of both SUBJECTCODE and SUBJECTTITLE tend to be repeated for a given subject.

The following schema is a decomposition of the above table in order to satisfy EKNF:

Subject_Table

SUBJECTCODE

SUBJECTTITLE

CS100

ER

CS114

ORM

Enrollment_Table

STUDENTNUM

SUBJECTCODE

1

CS100

1

CS114

2

CS114

Figure 8: The table from Figure 7 has been decomposed to form an EKNF schema.

For reasons that will become obvious in the following section, ensuring a table is in EKNF is usually skipped, as most designers will move directly on to Boyce-Codd Normal Form after ensuring that a schema is in 3NF. Thus, EKNF is included here only for reasons of historical accuracy and completeness.

Boyce-Codd Normal Form

Like EKNF, the only time a table is in 3NF but is not in Boyce-Codd Normal form (BCNF) is when the table contains two or more candidate keys that overlap. Beyond that, there is only a subtle difference between EKNF and BCNF, which I will outline below.

Consider the same example we used to illustrate EKNF, but we now add a column (GRADE) to denote a student’s grade received in the course. (Further, for illustrative simplicity, let’s assume that a student can only take a course once.)

Enrollment_Grade_Table

STUDENTNUM

SUBJECTCODE

SUBJECTTITLE

GRADE

1

CS100

ER

C

1

CS114

ORM

A

2

CS114

ORM

A

Figure 9: This table violates BCNF.

We see here that the GRADE column is dependent only on a given enrollment pair and that the keys are now elementary 3 (which satisfies EKNF). However, SUBJECTTITLE is dependent on SUBJECTCODE. Since the key of the table is STUDENTNUM and SUBJECTCODE, we decompose this structure into the following two tables which satisfy BCNF:

Course_Table

SUBJECTCODE

SUBJECTTITLE

CS100

ER

CS114

ORM

Enrollment_Grade_Table

STUDENTNUM

SUBJECTCODE

GRADE

1001

CS100

C

1001

CS114

A

1002

CS114

A

Figure 10: The table from Figure 9 has been decomposed to form a BCNF schema.

One may note that this would also have happened to solve our EKNF problem in the previous section. For that very reason, most designers seldom worry about EKNF and move straight on to BCNF.

Fourth Normal Form

The final normal forms are concerned with multi-valued facts. We can also note that they are concerned with composite keys, as they tend to minimize the number of fields involved in a composite key.

A table is in Fourth Normal form if it is in BCNF and all functional dependencies are "single valued". Another way to state this is to say that a table cannot contain two or more independent "multi-valued" 4 facts [Kent]. By "independent", we mean to say that there is no direct connection between the two (or more) multi-valued facts. This vague definition is better handled by example. In the following table (in BCNF, since it is entirely composed of attributes involved in the key), we record people (NAME), instruments they play (INSTRUMENT), and music styles (MUSICSTYLE) they play.

NAME

INSTRUMENT

MUSICSTYLE

Hallock

Piano

Classical

Hallock

French Horn

Classical

Hallock

Kazoo

Blues

Barden

Trumpet

Jazz

Hallock

Piano

Blues

Figure 11: This table violates 4NF.

We see that redundancy occurs because a given person (NAME) can play more than one INSTRUMENT and play more than one MUSICSTYLE (the fact that ‘Hallock’ plays the ‘Piano’ is repeated, as is the fact that he plays the ‘Blues’ and ‘Classical’). Further, this table seems to suggest a link between instruments and music styles. Can ‘Hallock’ play ‘Blues’ with a French Horn 5? (Yes, and if you know Hallock, you know he plays the blues with anything, including spoons!)

In other words, we see that there are two independent multi-valued facts in the above table. The first is that a person (NAME) can play more than one INSTRUMENT while the second is that a person (NAME) can play more than one MUSICSTYLE. These facts are independent because these two facts have no bearing on each other. Decomposing this table into two tables (below) solves the problem.

Plays_Table

NAME

INSTRUMENT

Hallock

Piano

Hallock

French Horn

Hallock

Kazoo

Barden

Trumpet

Styles_Table

NAME

MUSICSTYLE

Hallock

Classical

Hallock

Blues

Barden

Jazz

Figure 12: The table in Figure 11 has been decomposed into a 4NF schema.

One should note that 4NF only applies to tables with three or more attributes (it eliminates overlapping multi-valued dependencies, which, by definition, require three or more attributes) and only when all attributes compose the primary key of the table.

Fifth Normal Form and Project Join Normal Form

Cases where a table is in 4NF but is not in Fifth Normal Form (5NF) are extremely rare. Further, Project Join Normal Form (PJNF) is a slightly stronger (although this is debated) case of 5NF, and in virtually all cases it can be treated as an equivalent. Therefore, PJNF is included here for completeness.

As in 4NF, 5NF considerations apply only to tables with three or more attributes, all of which comprise the primary key.

The formal definition of 5NF and PJNF requires that we must first define a "projection". A projection of a table is a subset of the total number of columns with no duplicate rows. For example, the following table:

Trains_Table

EMPLOYEE

CLASSTYPE

COMPANY

Hallock

ORM

Visioâ

Hallock

UML

Visio

Hallock

ER

Visio

Hallock

Diagramming

Visio

Becker

ER

Oracleâ

Becker

ORM

Visio

Becker

UML

Visio

Becker

ER

Visio

Becker

Diagramming

Visio

Figure 13: This table, we shall see later, can violate 5NF.

has the following projections:

Trains1

EMPLOYEE

CLASSTYPE

Hallock

ORM

Hallock

UML

Hallock

ER

Hallock

Diagramming

Becker

ER

Becker

ORM

Becker

UML

Becker

Diagramming

Trains2

EMPLOYEE

COMPANY

Hallock

Visio

Becker

Visio

Becker

Oracle

Trains3

CLASSTYPE

COMPANY

ORM

Visio

UML

Visio

ER

Visio

Diagramming

Visio

ER

Oracle

Figure 14: Projections of the table in Figure 13.

The training table in Figure 13 may or may not be in 5NF depending on the business rules. Say we have to enforce the rule:

An EMPLOYEE trains a CLASSTYPE for a COMPANY if and only if an EMPLOYEE trains a CLASSTYPE, the EMPLOYEE trains for a COMPANY, and the COMPANY the EMPLOYEE trains for makes a tool that implements the CLASSTYPE the EMPLOYEE trains.

If we enforce the above rule the table in Figure 13 is not in 5NF and must be reduced to three tables represented by the above projections of the original table.

To achieve 5NF, one checks all-key tables for decompositions whose joins result in the same information. A cautionary note, however, is that such decompositions can lead to a loss of constraint knowledge. For example, in the above case, we need to create database code to handle the specified rule between an EMPLOYEE, the CLASSTYPEs they train, and the COMPANY who makes the tool that implements the CLASSTYPE.

The root concept behind 4NF, 5NF, and PJNF is that the tables not in these normal forms can be derived from simpler, more fundamental relationships. [Simsion] Further, 5NF does not differ from 4NF unless there are other rules (symmetric constraints) that dictate correct data population [Kent]. Lastly, 5NF differs from 4NF in that the fact combinations we are concerned with are no longer independent from each other (due to the semantic constraints) [Kent].

Normalization and Data Modeling

The previous discussion centers on what I call "manual normalization". As you may now realize, this normalization technique can be tricky and difficult to explain (the fact that I used five references, not counting my own, to help me explain this topic is a good indication of this). I prefer to use a data modeling technique that, in addition to many, many other benefits, also happens to completely normalize my data structures with little extra effort on my part.

This technique, called Object-Role Modeling (ORM), centers around objects playing roles. We don’t worry about entities (tables) and attributes (columns) at the level that ORM is concerned with (called the conceptual level). Because of this, dependencies and symmetric constraints are rooted out right away, and the resulting (generated) logical and physical schemas are in "optimal normal form" (assuming the fact types are "elementary" [Becker]). The details behind ORM are beyond the scope of this paper, but I encourage the reader to explore this technique further. In particular, the reader should reference "Normalization and ORM" [Becker] and see the Internet Resources listed in the last section of this document.

Summary

It is important to note that while the normal forms are subsets of each other (for example, a schema in 3NF is in 2NF by definition) it is not necessary to move from one to the other in a sequential fashion. In most cases, initial schemas already meet the criteria for 1NF and 2NF. Further, as we have seen, many structures that are decomposed to meet the requirements of 3NF also happen to meet the requirements of EKNF, BCNF, 4NF, 5NF, and PJNF with no additional decompositions. For this reason, many data modelers strive to ensure that their designs meet BCNF and only worry about ensuring the higher normal forms in special circumstances. Further, the reader will note that authors seldom delve into discussions of EKNF, 4NF, 5NF, and PJNF, as they are often met by the earlier normal forms (up to BCNF).

Further Reading

The reader is encouraged to explore these topics further. More information on normalization and Object-Role Modeling (ORM) can be found in the following References and Internet Resources.

Foot Notes

1) Primary IRC server: 200.37.94.194 (the most reliable way) or www.continental.edu.pe Port: 6667

In the event that we have problems with the primary server, please connect to us.sorcery.net, Port 6667.

Chat Room: #VFUG

Time: 19:00 UTC (GMT) (some time zone translations are below)

Europe: Most of Europe including España, France, Germany, etc. 7 PM (19:00)
Europe: Great Britain, Portugal, Canary Islands at 6 PM (18:00)
Americas: (US & Canada): 3 PM EDT (15:00), 2 PM CDT (14:00), 1 PM MDT (13:00), and 12 PM PDT (12:00 Noon)
Philippines- Manila: Friday 2 AM (02:00)

IRC Software:

There are many different companies producing IRC software. One you may already have installed is Microsoftâ Chatä (often comes packaged with Microsoft Internet Explorerä ). If you don’t already have it, you can download it from the Microsoft website at http://www.microsoft.com/msdownload/iebuild/chat25_win32/en/chat25_win32.htm.

Once installed and started, select Room | Connect. Then, enter ‘200.37.94.194’ (no quotes) in the "Server" box and ‘#VFUG’ (no quotes) in the "Go to chat room" box. On the "Personal Info" tab, fill out as much information as you desire (nickname is mandatory).

Lastly, click "OK" to connect to the server.

2) A functional dependency is formally defined as follows: A functional dependency, denoted by X ® Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation instance r of R. The constraint states that, for any two tuples t1 and t2 in r such that t1[X] = t2[X], that we must also have t1[Y] = t2[Y] [Elmasri, et al.].

3) A more formal definition of EKNF requires some additional definitions. First, a functional dependency, X ® Y, is said to be "trivial" if and only if X contains (is the same as) Y. The same functional dependency is said to be "full" (as opposed to "partial") if and only if there is a functional dependency from part of X to part of Y (where X and Y are composite attributes, or, in other words, have more than one column). Now, we can define an "elementary functional dependency" as a full, non-trivial functional dependency. Further, a key is an "elementary key" if there exists some elementary functional dependency from it to some other attribute in the table. We can now say a table is in EKNF if and only if all of its elementary functional dependencies begin at whole keys or end at elementary key attributes.

4) A functional dependency is said to be "multi-valued" in the following case: Given a set of (usually, all-key) attributes X, Y, and Z, X is involved in a multi-valued dependency if and only if the set of Y values is only dependent on X. Similarly, Z will also be involved in a multi-valued dependency (is "multi-dependent") with X [Halpin].

5) A bit more explanation may be required here. Many texts, when detailing 4NF, will use an example set that contains obviously unrelated facts. For example, one text uses Person, the Languages they speak, and Sports they play [Halpin]. This often leads to the rebuttal, "Well, I’d never have lumped languages and sports in the same table." True enough. Thus, I have attempted to form a table with a more likely mistake in it. Therefore, this example may initially be a bit confusing, as one tends to associate instruments with a music style. For example, have you ever heard a Steel Guitar played in any form of music other than Country/Western? Perhaps not, but the fact remains that a good steel Guitar player could attempt belt out Beethoven’s Ninth Symphony, if they wanted to (shudder at the thought!).

References

In addition to the above references, the author would like to thank (in no particular order) Dr. Terry Halpin, A.J. Durham, Mike Bisek, Pat Hallock, Dr. Gordon Everest, and Nicole Birdsall for providing valuable feedback about this article while it was still in draft format. Any remaining errors are, of course, my own.

Internet Resources

"Visio" is either a registered trademark or a trademark of Visio Corp. in the United States and/or other countries. "Oracle" is either a registered trademark or a trademark of Oracle Corp. in the United States and/or other countries. "Microsoft", "Microsoft Chat", and "Internet Explorer" are either registered trademarks or trademarks of Microsoft Corp. in the United States and/or other countries.

Scot A. Becker is a software consultant and the founder of Orthogonal Software Corporation. He is also a certified ORM consultant and trainer, a certified Visio trainer, and former Editor of the Journal of Conceptual Modeling.  

Contact Information:

Scot A. Becker
Orthogonal Software Corporation
scot@orthogonalsoftware.com

www.orthogonalsoftware.com

© Copyright, 1998-2004 InConcept (Information Conceptual Modeling, Inc.) All Rights Reserved. Privacy Statement.
ISSN: 1533-3825