June
1999
Issue: 9
Journal of Conceptual Modeling
www.inconcept.com/jcm
Data
Schema Normalization
By Scot A. Becker
Preface
This paper is being simultaneously published in the June, 1999 issue of the Journal of Conceptual Modeling (JCM) and as a Virtual FoxPro User Group (VFUG) "lecture" (see the Internet Resources section at the end of this paper).
A VFUG lecture is a combination of a pre-published paper followed by an internet-based (chat) Q & A session to summarize the paper, provide additional information, and answer questions. In addition, during the Q & A session I will also discuss some other topics that are a bit outside of the scope of this paper. Such topics may include de-normalization, which normal forms are the most important (in practice), and a bit more about ORMs role in normalization.
All VFUG members and JCM readers are invited (and encouraged!) to join us for the Q&A session on Thursday, June 10, 1999.
Server information, times, and software instructions are in the footnotes below [1].
Click here for an (edited) transcript of the online lecture.
Abstract
The purpose of this white paper is to outline the various normal forms used in ("manual") data schema normalization. This paper begins with a discussion as to why a database designer is concerned with normalization. Immediately following is a discussion of functional dependencies and an introduction to the eight normal forms. This paper is closed with a discussion as to how normalization fits in with the data modeling technique, Object-Role Modeling (ORM). The reader should be familiar with basic database concepts (such as tables, columns, rows, and keys) before reading this paper.
Introduction
The main reasons that a designer normalizes data structures include:
Semantic grouping of related elements, reduction of redundant values in tuples (one or more columns) which can cause insert, delete and update anomalies.
Reduction of null values in tuples.
Disallow spurious tuples (incorrect join combinations of data values due to improper structures).
Further, normalization tends to aid in the discovery process, allows more precise capture of business logic, and aids in model management and integration (even across the entire Enterprise).
Functional Dependencies
The main root of a normalization technique centers on functional dependencies [2]. This simply means that the values of a component "Y" in a relationship depend on, or are determined by, the value of a component "X". Conversely, one could state that the values of X determine the values of Y (as a side note, this functional dependency is also denoted: X ® Y). Further, a functional dependency is invalid if we have two records with the same "X" value but different "Y" values.
For example, in many data structures, a Social Security Number (SSN) determines a Person, which is commonly (but usually not uniquely) referenced by a NAME. Thus, we say that NAME is functionally dependent on SSN. Keep in mind that the attributes X and Y may be composite structures (more than one attribute or, in other words, column).
The root of normalization centers on ensuring that non-key attributes of a table are functionally dependent on the key of the same table.
The Normal Forms
A given data structure can be at one of several levels, or stages of completeness, of normalization. These stages are known as normal forms. The eight normal forms are First Normal Form, Second Normal Form, Third Normal Form, Elementary Key Normal Form, Boyce-Codd Normal Form, Fourth Normal Form, Fifth Normal Form, and Project-Join Normal Form.
First Normal Form
First Normal Form (1NF) is now generally considered part of the formal definition of a relation. Historically, 1NF was intended to disallow multi-valued attributes. 1NF dictates that the domains (allowable values) of attributes must include only atomic (simple, indivisible) values and that any given value of an instance of an attribute must be a single value from the domain of that attribute. In short, a given cell of a column in a table can contain only one value.
The following table violates 1NF because the second row contains more than one value in the COLOR column:
|
CAR |
COLOR |
|
123 ABC |
Blue |
|
321 CBA |
Black |
Figure 1: This table violates 1NF.
To ensure a table is in 1NF, one simply needs to decompose grouped attributes into separate rows (or in some case, tables). The following representation of the above data is in 1NF (and others):
|
CAR |
COLOR |
|
123 ABC |
Blue |
|
321 CBA |
Black |
|
321 CBA |
Grey |
|
321 CBA |
White |
Figure 2: This table is in (at least) 1NF.
Second Normal Form
Second Normal Form (2NF) is based on the concept of "full" functional dependency. A functional dependency, X ® Y, is a full functional dependency if removal of any attribute from X means that the dependency does not hold anymore. For example, Given a table that tracks hours (HOURS) a given employee (SSN) devotes to a given project (PROJNUM), we note that HOURS is functionally dependent on the combination of SSN and PROJNUM because a given employee can work on more than one project. Removal of either SSN or PROJNUM from the functional dependency results in an incorrect relationship. For example, if we were to remove SSN from the previous functional dependency, we are left with HOURS and PROJNUM, but we dont know which SSN worked those HOURS! Thus, we say SSN, PROJNUM is a full functional dependency of HOURS.
A table is in 2NF if that table is in 1NF and every non-prime (is not involved in a primary key of the table) attribute in that table is fully functionally dependent on the primary key of the table.
For example, the following table is in 1NF but is not in 2NF because PNAME and PLOCATION are dependent on only part of the primary key (PROJNUM and SSN). Likewise, ENAME is also only dependent on SSN. (Primary keys will be denoted in this document by underlining the included columns.)
Employee_Project_Table
|
SSN |
PROJNUM |
HOURS |
ENAME |
PNAME |
PLOCATION |
Figure 3: This table violates 2NF.
To correct this schema, we need to create additional tables and decompose the partial dependencies into these new tables, as in figure 4.
(Please note that, for simplicity, these diagrams will not denote the resulting referential integrity, such as foreign keys, that would need to be added to these decomposed schemas.)
Employee_Table
|
SSN |
ENAME |
Project_Table
|
PROJNUM |
PNAME |
PLOCATION |
Hours_Table
|
SSN |
PROJNUM |
HOURS |
Figure 4: The table from Figure 3 has been decomposed to form a 2NF schema.
Third Normal Form
Third Normal Form (3NF) is based on the concept of "transitive dependency". A transitive dependency can be loosely defined as a dependency that does not involve the primary key. For example, in the table below, we see that while all elements have functional dependencies on the key, SSN, there also exist other, transitive, dependencies. Namely, DEPTNAME is dependent on DEPTNUM.
Employee_Department_Table
|
SSN |
ENAME |
BDATE |
DEPTNUM |
DEPTNAME |
Figure 5: This table violates 3NF.
A table is in 3NF if it is in 2NF and no non-key attributes are dependent on other non-key attributes.
We can decompose the above table into 3NF by creating a second table for department. Thus, the following structure is in 3NF:
Employee_Table
|
SSN |
ENAME |
BDATE |
DEPTNUM |
Department_Table
|
DEPTNUM |
DEPTNAME |
Figure 6: The table from Figure 5 has been decomposed to form a 3NF schema.
There is a subtle difference between 2NF and 3NF. In 2NF, we were concerned about non-key fields being dependent on subsets of the key. In 3NF, we are concerned about non-key fields being dependent on other non-key fields. Another way to say this has been nicely summarized as: any non-key field must be " Dependent on the key, the whole key, and nothing but the key." [Kent]
Elementary Key Normal Form
Elementary Key Normal Form (EKNF) is a subtle enhancement on 3NF (by definition, EKNF tables are also in 3NF) that most often occurs when there is more than one unique composite key (more than one column) which overlap (one or more columns are involved in both keys) in a table 3. Such cases can cause redundant information in the overlapping column(s). For example, in the following table, lets assume that a subject title (SUBJECTTITLE) is also a unique identifier for a given subject in the following table:
Enrollment_Table
|
STUDENTNUM |
SUBJECTCODE |
SUBJECTTITLE |
|
1 |
CS100 |
ER |
|
1 |
CS114 |
ORM |
|
2 |
CS114 |
ORM |
Figure 7: This table, although it is in 3NF, violates EKNF.
The primary key of the above table is the combination of STUDENTNUM and SUBJECTCODE. However, we can also see a (non-primary) uniqueness constraint (alternate key) that should span the STUDENTNUM and SUBJECTTITLE columns as well. The above schema could result in update and deletion anomalies because values of both SUBJECTCODE and SUBJECTTITLE tend to be repeated for a given subject.
The following schema is a decomposition of the above table in order to satisfy EKNF:
Subject_Table
|
SUBJECTCODE |
SUBJECTTITLE |
|
CS100 |
ER |
|
CS114 |
ORM |
Enrollment_Table
|
STUDENTNUM |
SUBJECTCODE |
|
1 |
CS100 |
|
1 |
CS114 |
|
2 |
CS114 |
Figure 8: The table from Figure 7 has been decomposed to form an EKNF schema.
For reasons that will become obvious in the following section, ensuring a table is in EKNF is usually skipped, as most designers will move directly on to Boyce-Codd Normal Form after ensuring that a schema is in 3NF. Thus, EKNF is included here only for reasons of historical accuracy and completeness.
Boyce-Codd Normal Form
Like EKNF, the only time a table is in 3NF but is not in Boyce-Codd Normal form (BCNF) is when the table contains two or more candidate keys that overlap. Beyond that, there is only a subtle difference between EKNF and BCNF, which I will outline below.
Consider the same example we used to illustrate EKNF, but we now add a column (GRADE) to denote a students grade received in the course. (Further, for illustrative simplicity, lets assume that a student can only take a course once.)
Enrollment_Grade_Table
|
STUDENTNUM |
SUBJECTCODE |
SUBJECTTITLE |
GRADE |
|
1 |
CS100 |
ER |
C |
|
1 |
CS114 |
ORM |
A |
|
2 |
CS114 |
ORM |
A |
Figure 9: This table violates BCNF.
We see here that the GRADE column is dependent only on a given enrollment pair and that the keys are now elementary 3 (which satisfies EKNF). However, SUBJECTTITLE is dependent on SUBJECTCODE. Since the key of the table is STUDENTNUM and SUBJECTCODE, we decompose this structure into the following two tables which satisfy BCNF:
Course_Table
|
SUBJECTCODE |
SUBJECTTITLE |
|
CS100 |
ER |
|
CS114 |
ORM |
Enrollment_Grade_Table
|
STUDENTNUM |
SUBJECTCODE |
GRADE |
|
1001 |
CS100 |
C |
|
1001 |
CS114 |
A |
|
1002 |
CS114 |
A |
Figure 10: The table from Figure 9 has been decomposed to form a BCNF schema.
One may note that this would also have happened to solve our EKNF problem in the previous section. For that very reason, most designers seldom worry about EKNF and move straight on to BCNF.
Fourth Normal Form
The final normal forms are concerned with multi-valued facts. We can also note that they are concerned with composite keys, as they tend to minimize the number of fields involved in a composite key.
A table is in Fourth Normal form if it is in BCNF and all functional dependencies are "single valued". Another way to state this is to say that a table cannot contain two or more independent "multi-valued" 4 facts [Kent]. By "independent", we mean to say that there is no direct connection between the two (or more) multi-valued facts. This vague definition is better handled by example. In the following table (in BCNF, since it is entirely composed of attributes involved in the key), we record people (NAME), instruments they play (INSTRUMENT), and music styles (MUSICSTYLE) they play.
|
NAME |
INSTRUMENT |
MUSICSTYLE |
|
Hallock |
Piano |
Classical |
|
Hallock |
French Horn |
Classical |
|
Hallock |
Kazoo |
Blues |
|
Barden |
Trumpet |
Jazz |
|
Hallock |
Piano |
Blues |
Figure 11: This table violates 4NF.
We see that redundancy occurs because a given person (NAME) can play more than one INSTRUMENT and play more than one MUSICSTYLE (the fact that Hallock plays the Piano is repeated, as is the fact that he plays the Blues and Classical). Further, this table seems to suggest a link between instruments and music styles. Can Hallock play Blues with a French Horn 5? (Yes, and if you know Hallock, you know he plays the blues with anything, including spoons!)
In other words, we see that there are two independent multi-valued facts in the above table. The first is that a person (NAME) can play more than one INSTRUMENT while the second is that a person (NAME) can play more than one MUSICSTYLE. These facts are independent because these two facts have no bearing on each other. Decomposing this table into two tables (below) solves the problem.
Plays_Table
|
NAME |
INSTRUMENT |
|
Hallock |
Piano |
|
Hallock |
French Horn |
|
Hallock |
Kazoo |
|
Barden |
Trumpet |
Styles_Table
|
NAME |
MUSICSTYLE |
|
Hallock |
Classical |
|
Hallock |
Blues |
|
Barden |
Jazz |
Figure 12: The table in Figure 11 has been decomposed into a 4NF schema.
One should note that 4NF only applies to tables with three or more attributes (it eliminates overlapping multi-valued dependencies, which, by definition, require three or more attributes) and only when all attributes compose the primary key of the table.
Fifth Normal Form and Project Join Normal Form
Cases where a table is in 4NF but is not in Fifth Normal Form (5NF) are extremely rare. Further, Project Join Normal Form (PJNF) is a slightly stronger (although this is debated) case of 5NF, and in virtually all cases it can be treated as an equivalent. Therefore, PJNF is included here for completeness.
As in 4NF, 5NF considerations apply only to tables with three or more attributes, all of which comprise the primary key.
The formal definition of 5NF and PJNF requires that we must first define a "projection". A projection of a table is a subset of the total number of columns with no duplicate rows. For example, the following table:
Trains_Table
|
EMPLOYEE |
CLASSTYPE |
COMPANY |
|
Hallock |
ORM |
Visioâ |
|
Hallock |
UML |
Visio |
|
Hallock |
ER |
Visio |
|
Hallock |
Diagramming |
Visio |
|
Becker |
ER |
Oracleâ |
|
Becker |
ORM |
Visio |
|
Becker |
UML |
Visio |
|
Becker |
ER |
Visio |
|
Becker |
Diagramming |
Visio |
Figure 13: This table, we shall see later, can violate 5NF.
has the following projections:
Trains1
|
EMPLOYEE |
CLASSTYPE |
|
Hallock |
ORM |
|
Hallock |
UML |
|
Hallock |
ER |
|
Hallock |
Diagramming |
|
Becker |
ER |
|
Becker |
ORM |
|
Becker |
UML |
|
Becker |
Diagramming |
Trains2
|
EMPLOYEE |
COMPANY |
|
Hallock |
Visio |
|
Becker |
Visio |
|
Becker |
Oracle |
Trains3
|
CLASSTYPE |
COMPANY |
|
ORM |
Visio |
|
UML |
Visio |
|
ER |
Visio |
|
Diagramming |
Visio |
|
ER |
Oracle |
Figure 14: Projections of the table in Figure 13.
The training table in Figure 13 may or may not be in 5NF depending on the business rules. Say we have to enforce the rule:
An EMPLOYEE trains a CLASSTYPE for a COMPANY if and only if an EMPLOYEE trains a CLASSTYPE, the EMPLOYEE trains for a COMPANY, and the COMPANY the EMPLOYEE trains for makes a tool that implements the CLASSTYPE the EMPLOYEE trains.
If we enforce the above rule the table in Figure 13 is not in 5NF and must be reduced to three tables represented by the above projections of the original table.
To achieve 5NF, one checks all-key tables for decompositions whose joins result in the same information. A cautionary note, however, is that such decompositions can lead to a loss of constraint knowledge. For example, in the above case, we need to create database code to handle the specified rule between an EMPLOYEE, the CLASSTYPEs they train, and the COMPANY who makes the tool that implements the CLASSTYPE.
The root concept behind 4NF, 5NF, and PJNF is that the tables not in these normal forms can be derived from simpler, more fundamental relationships. [Simsion] Further, 5NF does not differ from 4NF unless there are other rules (symmetric constraints) that dictate correct data population [Kent]. Lastly, 5NF differs from 4NF in that the fact combinations we are concerned with are no longer independent from each other (due to the semantic constraints) [Kent].
Normalization and Data Modeling
The previous discussion centers on what I call "manual normalization". As you may now realize, this normalization technique can be tricky and difficult to explain (the fact that I used five references, not counting my own, to help me explain this topic is a good indication of this). I prefer to use a data modeling technique that, in addition to many, many other benefits, also happens to completely normalize my data structures with little extra effort on my part.
This technique, called Object-Role Modeling (ORM), centers around objects playing roles. We dont worry about entities (tables) and attributes (columns) at the level that ORM is concerned with (called the conceptual level). Because of this, dependencies and symmetric constraints are rooted out right away, and the resulting (generated) logical and physical schemas are in "optimal normal form" (assuming the fact types are "elementary" [Becker]). The details behind ORM are beyond the scope of this paper, but I encourage the reader to explore this technique further. In particular, the reader should reference "Normalization and ORM" [Becker] and see the Internet Resources listed in the last section of this document.
Summary
It is important to note that while the normal forms are subsets of each other (for example, a schema in 3NF is in 2NF by definition) it is not necessary to move from one to the other in a sequential fashion. In most cases, initial schemas already meet the criteria for 1NF and 2NF. Further, as we have seen, many structures that are decomposed to meet the requirements of 3NF also happen to meet the requirements of EKNF, BCNF, 4NF, 5NF, and PJNF with no additional decompositions. For this reason, many data modelers strive to ensure that their designs meet BCNF and only worry about ensuring the higher normal forms in special circumstances. Further, the reader will note that authors seldom delve into discussions of EKNF, 4NF, 5NF, and PJNF, as they are often met by the earlier normal forms (up to BCNF).
Further Reading
The reader is encouraged to explore these topics further. More information on normalization and Object-Role Modeling (ORM) can be found in the following References and Internet Resources.
Foot Notes
1) Primary IRC server: 200.37.94.194 (the most reliable way) or www.continental.edu.pe Port: 6667
In the event that we have problems with the primary server, please connect to us.sorcery.net, Port 6667.
Chat Room: #VFUG
Time: 19:00 UTC (GMT) (some time zone translations are below)
Europe: Most of Europe including España, France, Germany, etc. 7 PM
(19:00)
Europe: Great Britain, Portugal, Canary Islands at 6 PM (18:00)
Americas: (US & Canada): 3 PM EDT (15:00), 2 PM CDT (14:00), 1 PM MDT (13:00), and 12
PM PDT (12:00 Noon)
Philippines- Manila: Friday 2 AM (02:00)
IRC Software:
There are many different companies producing IRC software. One you may already have installed is Microsoftâ Chatä (often comes packaged with Microsoft Internet Explorerä ). If you dont already have it, you can download it from the Microsoft website at http://www.microsoft.com/msdownload/iebuild/chat25_win32/en/chat25_win32.htm.
Once installed and started, select Room | Connect. Then, enter 200.37.94.194 (no quotes) in the "Server" box and #VFUG (no quotes) in the "Go to chat room" box. On the "Personal Info" tab, fill out as much information as you desire (nickname is mandatory).
Lastly, click "OK" to connect to the server.
2) A functional dependency is formally defined as follows: A functional dependency, denoted by X ® Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation instance r of R. The constraint states that, for any two tuples t1 and t2 in r such that t1[X] = t2[X], that we must also have t1[Y] = t2[Y] [Elmasri, et al.].
3) A more formal definition of EKNF requires some additional definitions. First, a functional dependency, X ® Y, is said to be "trivial" if and only if X contains (is the same as) Y. The same functional dependency is said to be "full" (as opposed to "partial") if and only if there is a functional dependency from part of X to part of Y (where X and Y are composite attributes, or, in other words, have more than one column). Now, we can define an "elementary functional dependency" as a full, non-trivial functional dependency. Further, a key is an "elementary key" if there exists some elementary functional dependency from it to some other attribute in the table. We can now say a table is in EKNF if and only if all of its elementary functional dependencies begin at whole keys or end at elementary key attributes.
4) A functional dependency is said to be "multi-valued" in the following case: Given a set of (usually, all-key) attributes X, Y, and Z, X is involved in a multi-valued dependency if and only if the set of Y values is only dependent on X. Similarly, Z will also be involved in a multi-valued dependency (is "multi-dependent") with X [Halpin].
5) A bit more explanation may be required here. Many texts, when detailing 4NF, will use an example set that contains obviously unrelated facts. For example, one text uses Person, the Languages they speak, and Sports they play [Halpin]. This often leads to the rebuttal, "Well, Id never have lumped languages and sports in the same table." True enough. Thus, I have attempted to form a table with a more likely mistake in it. Therefore, this example may initially be a bit confusing, as one tends to associate instruments with a music style. For example, have you ever heard a Steel Guitar played in any form of music other than Country/Western? Perhaps not, but the fact remains that a good steel Guitar player could attempt belt out Beethovens Ninth Symphony, if they wanted to (shudder at the thought!).
Becker, Scot A., Normalization and ORM, The Journal of Conceptual Modeling (Issue 4, August 1998, http://www.inconcept.com/JCM/August1998/becker.html)
Elmasri and Navathe, Fundamentals of Database Systems, Second Edition (Chapter 12)
Halpin, Terry, Conceptual Schema and Relational Database Design, Second Edition (Section 10.2)
Kent, William, A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of the ACM, 26(2), February, 1983, 120-125
Reingruber and Gregory, The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models (Chapter 8)
Simsion, Graeme, Data Modeling Essentials: Analysis, Design, and Innovation (Chapters 2 and 7)
In addition to the above references, the author would like to thank (in no particular order) Dr. Terry Halpin, A.J. Durham, Mike Bisek, Pat Hallock, Dr. Gordon Everest, and Nicole Birdsall for providing valuable feedback about this article while it was still in draft format. Any remaining errors are, of course, my own.
The Journal of Conceptual Modeling
InConcept, http://www.inconcept.com
Object Role Modeling: The Official Site for Conceptual Data Modeling
The Virtual FoxPro User Group
"Visio" is either a registered trademark or a trademark of Visio Corp. in the United States and/or other countries. "Oracle" is either a registered trademark or a trademark of Oracle Corp. in the United States and/or other countries. "Microsoft", "Microsoft Chat", and "Internet Explorer" are either registered trademarks or trademarks of Microsoft Corp. in the United States and/or other countries.
![]()
Scot A. Becker is a software consultant and the founder of Orthogonal Software Corporation. He is also a certified ORM consultant and trainer, a certified Visio trainer, and former Editor of the Journal of Conceptual Modeling.
Contact Information:
Scot A. Becker
Orthogonal Software Corporation
scot@orthogonalsoftware.com
www.orthogonalsoftware.com
![]()
© Copyright, 1998-2004 InConcept
(Information Conceptual Modeling, Inc.) All
Rights Reserved. Privacy Statement.
ISSN: 1533-3825