Battle of the Modeling Techniques

By Warren Keuffel
DBMS, August 1996

A Look at the Three Most Popular Modeling Notations for Distilling the Essence of Data.

For a response to this article, see "What About ORM?"


Modeling the world around us is an inherently human activity, whether it is building ships in bottles, crafting dolls and their houses, or drawing a map. The process of creating a model is an attempt to capture the essence of things both concrete and abstract, to make order of the chaos inherent in the world around us. It is no different for those of us who work within the abstract world of computer systems - in order to understand and control a system's size and complexity, we must reduce it to a model that we can fit our brains around. This article tracks the development of three modeling notations that computer professionals have developed for distilling the essence of data: entity relationship, Nijssen's Information Analysis Methodology, and Semantic Object Modeling.

In the Beginning!

Entity relationship (ER) is the grandfather of the other two data modeling methods. Developed by Peter Chen in the 1970s, ER modeling was first created to provide a diagramming notation that could directly map to a relational database schema. In the ensuing years, it has taken on many disparate flavors, each emphasizing an individual developer's priorities for modeling data. Nijssen's Information Analysis Methodology (NIAM) was developed by G.M. Nijssen 10 years later to emphasize the importance of describing data in the rigorous, testable statements that a scientific environment required. Semantic Object Modeling (SOM), developed by David Kroenke during the same period of time as Nijssen's development of NIAM, emphasizes the importance of considering how people conceptualize data in an everyday setting, and applying that knowledge to create models designed to be "intuitive" for people who do not necessarily understand what goes on under the hood of an information system. In the discussion that follows, I attempt to explain the features of each of the modeling methods by using the original words of each method's creator.

Entities. Attributes. Relationships. Their definitions have become increasingly sophisticated as we refine our knowledge of data modeling and pursue the "best" method (a pursuit that, like the pursuit of the "best" operating system, depends on a host of factors - some objective, some subjective). Chen's original monograph on ER modeling (The Entity-Relationship Approach to Logical Database Design, QED Information Sciences, 1977) assumes that we all know what an entity is: "An entity is a 'thing' which can be distinctly identified." Chen continues, "There are many 'things' in the real world. It is the responsibility of the database designer to select the entity types which are most suitable for his/her company." Relationships receive similarly helpful definitions: "Relationships may exist between entities." Chen includes another ever-so-helpful admonition that it is the job of the database designer to select the appropriate relationships for the model. Attributes, similarly, are described without reference to how they are discovered.

Although Chen's ER work is considered seminal to the development of data modeling as an analytic activity, he was tightly focused on developing a graphical notation that would help database designers better understand the problem space. He was not focused on how to discover the best representation of the users' needs. Indeed, the last half of the monograph is devoted to explaining how to translate the conceptual schema that the ER diagram represents into the physical incarnation.

Chen's notation style for representing cardinality (that is, how to symbolize one-to-one, one-to-many, and many-to-many relationships between entities) has probably spawned more controversy, trivial as the debate is, than any fundamental characteristic of the notation. Chen was content to represent cardinality with simple numbers, or "n" to indicate many, but many of the people who adopted and improved upon the Chen notation added symbols at either end of the line representing the relationship. The most popular such version is found in Information Engineering, in which crow's feet and crossbars are used to indicate cardinality.

Second Wave

If Peter Chen represents the alpha of the ER movement, Tom Bruce can be considered the prophet of the current ER omega. Bruce's book on IDEF1X information models (Designing Quality Databases with IDEF1X Information Models, Dorset House, 1992) is generally regarded as the preeminent guidebook to what is arguably the most widely used of the current flavors of ER diagramming. IDEF1X notation is supported by several data modeling and CASE tools, including Logic Works' ERwin. Bruce writes that, in IDEF1X ER modeling, entities are "abstractions of real-world things." He goes on to say that an entity is "any distinguishable person, place, thing, event, or concept about which information is kept." Attributes are the properties of the entities, and relationships represent the "connections, links, or associations" between entities. Still, despite the increasing specificity and despite the pages of details describing how to model a wide variety of specific situations, Bruce can offer no more advice on the fundamental problems of discovering and classifying the individual entities, attributes, and relationships than could Chen.

A simple data modeling problem will help demonstrate how an ER diagram is developed. The following common example serves as a baseline to which you can compare other data modeling approaches. Consider the popular department, building, and employee problem. You have three entities: department, building, and employee. Possible attributes of department are DepartmentName and DepartmentID; possible attributes of building are BuildingName, BuildingID, and BuildingAddress; possible attributes of employee are EmployeeName, EmployeeID, EmployeeDepartment, and EmployeeHireDate. The relationships among the three entities can be defined in noun/verb format as: "Department consists of zero or more Employees," "Employees belong to one and only one Department," and "Employees work in one and only one Building." These three entities, their attributes, and the relationships between them are illustrated in Figure 1.
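
The same example can be written down as a small Python sketch. The entity and attribute names follow the text; the use of dataclasses, the sample values, and the helper employees_of are assumptions made purely for illustration and are not part of any ER notation. The EmployeeDepartment attribute is represented here by a reference to a Department object.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Department:
    department_id: int
    department_name: str


@dataclass(frozen=True)
class Building:
    building_id: int
    building_name: str
    building_address: str


@dataclass(frozen=True)
class Employee:
    employee_id: int
    employee_name: str
    employee_hire_date: date
    # "Employees belong to one and only one Department": exactly one reference.
    department: Department
    # "Employees work in one and only one Building": exactly one reference.
    building: Building


def employees_of(department, employees):
    """'Department consists of zero or more Employees': the result may be empty."""
    return [e for e in employees if e.department == department]


# Hypothetical sample data.
finance = Department(10, "Finance")
research = Department(20, "Research")
admin = Building(1, "Administration", "1 Main Street")
larry = Employee(1234, "Larry", date(1996, 8, 1), finance, admin)

staff = [larry]
finance_staff = employees_of(finance, staff)    # one employee
research_staff = employees_of(research, staff)  # zero employees is allowed
```

The "one and only one" cardinalities show up as required single-valued fields, while the "zero or more" side is a query that may return an empty list.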

Enter NIAM

Disbelief in the assumption that entities, attributes, and relationships would reveal themselves to the persistent analyst led to the development of NIAM, which the inventors call a "fact-based" approach to discovering entities, attributes, and relationships. As described in the "bible" of NIAM written by G.M. Nijssen and Terry Halpin (Conceptual Schema and Relational Database Design: A Fact-Oriented Approach, Prentice Hall, 1989), the nine steps an analyst follows to derive the design of a system are called the CSDP, for Conceptual Schema Design Procedure. (I should note that Halpin also authored Conceptual Schema and Relational Database Design [2nd Edition, Prentice Hall, 1995].) These steps are:

  1. Transform familiar information examples into elementary facts and apply quality checks.
  2. Draw a first draft of the conceptual schema diagram and apply a population check.
  3. Eliminate surplus entity types and common roles and identify any derived fact types.
  4. Add uniqueness constraints for each fact type.
  5. Check that fact types are of the right "arity."
  6. Add entity type, mandatory role, subtype, and occurrence frequency constraints.
  7. Check that each entity can be identified.
  8. Add equality, exclusion, subset, and other constraints.
  9. Check that the conceptual schema is consistent with the original examples, has no redundancy, and is complete.

To apply this method to the Employee, Building, and Department example introduced previously, you can begin by transforming what you know about this information into facts. For example:

Employee 1234 belongs to Department Finance
Employee 1234 works in Building Administration
Employee 1234 has the name Larry

and so on. The domain of the problem space, incidentally, is known in NIAM as the UoD, or Universe of Discourse. Within the UoD are any number of objects playing designated roles. The facts, such as the ones stated previously, are examples of two objects playing roles; the role describes the relationship between the two objects. By analyzing the sentences to determine how many "holes" they contain, we determine the "ary-ness" of the predicates, or verbs, for that sentence. For example, the first Employee example mentioned previously represents a binary predicate because there is a "hole" on each side of the verb into which an entity is inserted. Thus the binary predicate of the first example, "belongs to," is (entity) belongs to (entity). A predicate with three holes (a relationship connecting three entities) would be a ternary predicate, and a predicate with an indeterminate number of holes would be an n-ary predicate; the arity check in step five involves making sure that each fact is expressed in the right n-ary form. This conceptual framework, incidentally, is also called Object Role Modeling.
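
The idea of counting "holes" can be illustrated with a short Python sketch. Writing a hole as "..." inside a predicate template is an assumed convention, the function names are hypothetical, and the ternary predicate is an invented example; none of this is NIAM notation.

```python
HOLE = "..."  # assumed marker for a place where an object is inserted

ARITY_NAMES = {1: "unary", 2: "binary", 3: "ternary"}


def arity(predicate):
    """Return the number of holes in a predicate template."""
    return predicate.count(HOLE)


def arity_name(predicate):
    """Name the arity, falling back to 'n-ary' style for larger counts."""
    n = arity(predicate)
    return ARITY_NAMES.get(n, f"{n}-ary")


def fill(predicate, *objects):
    """Insert one object into each hole, refusing a mismatched count."""
    if len(objects) != arity(predicate):
        raise ValueError("number of objects must equal the predicate's arity")
    parts = predicate.split(HOLE)
    sentence = parts[0]
    for obj, rest in zip(objects, parts[1:]):
        sentence += obj + rest
    return sentence


belongs_to = "... belongs to ..."
# Hypothetical ternary predicate, invented for illustration only.
works_in_on = "... works in ... on ..."

fact = fill(belongs_to, "Employee 1234", "Department Finance")
```

Here fill rebuilds the first example fact from its binary predicate, and arity_name(works_in_on) reports the three-hole template as ternary.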

NIAM requires that the facts be expressed in a more rigorous manner than the examples given; thus the second example, "Employee 1234 works in Building Administration," must be rephrased as "The EMPLOYEE with number '1234' works in the BUILDING with name 'Administration'." You thus see in this sentence the beginnings of the entity, attribute, and relationship information that you need for building a model: the capitalized words (EMPLOYEE and BUILDING) are entities, the words that follow "with" (number and name) are attributes, the quoted words ('1234' and 'Administration') are specific instances or examples of those attributes, and the words "works in" represent the relationship.
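
Because the rephrased sentence follows a fixed pattern, its parts can be pulled out mechanically. The following Python sketch does so with a regular expression; the pattern, the function name parse_fact, and the dictionary keys are assumptions for illustration, and the pattern covers only binary facts worded exactly like the example.

```python
import re

# Assumed shape: The ENTITY with attribute 'value' <predicate> the ENTITY with attribute 'value'.
FACT_PATTERN = re.compile(
    r"The (?P<entity1>[A-Z]+) with (?P<attr1>\w+) '(?P<value1>[^']*)' "
    r"(?P<predicate>.+?) "
    r"the (?P<entity2>[A-Z]+) with (?P<attr2>\w+) '(?P<value2>[^']*)'\.?$"
)


def parse_fact(sentence):
    """Return the parts of a rigorously worded binary fact, or None if it does not fit."""
    match = FACT_PATTERN.match(sentence)
    return match.groupdict() if match else None


parsed = parse_fact(
    "The EMPLOYEE with number '1234' works in the BUILDING with name 'Administration'."
)
informal = parse_fact("Employee 1234 works in Building Administration")
```

The rigorous sentence yields two entities, two attributes, two values, and the predicate "works in", while the informal wording is rejected, which is a rough way of seeing why NIAM insists on the stricter form.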

However, you must complete several more steps before you can say you're finished. For example, you must perform quality checks to determine if the entities are well-identified or whether the facts can be further decomposed - that is, split into smaller atomic facts. You then draw a first draft of the conceptual schema diagram (which I'll return to shortly) and next perform other quality checks in which you make sure that the entity types are mutually exclusive (that is, you haven't represented the same entity by two different names). Finally, the remaining CSDP steps involve specifying constraints on the facts, such as uniqueness, arity (as described previously), entity type, mandatory role, subtype, and occurrence frequency constraints.
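
As a rough illustration of what a uniqueness constraint means, the Python sketch below checks a sample population of "belongs to" facts against the rule that each employee belongs to only one department. It is a simplified reading of the constraint, not the CSDP procedure itself; the function name and the sample population are hypothetical.

```python
from collections import defaultdict


def uniqueness_violations(population, role_index):
    """Return the values that occur more than once in the given role of a fact type."""
    seen = defaultdict(list)
    for fact in population:
        seen[fact[role_index]].append(fact)
    return {value: facts for value, facts in seen.items() if len(facts) > 1}


# Hypothetical population of (employee number, department name) facts.
belongs_to = [
    ("1234", "Finance"),
    ("5678", "Finance"),
    ("1234", "Research"),  # employee 1234 appears twice
]

# Uniqueness is required on the employee role (index 0) only.
employee_violations = uniqueness_violations(belongs_to, 0)
```

Employee 1234 is reported because it appears in two facts. Finance also appears twice, but in the department role, where no uniqueness constraint applies because a department may have many employees.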

The foregoing discussion does not pretend to provide a detailed tutorial on how to use NIAM. Rather, it is intended to illustrate how NIAM moves beyond the freedom and lack of semantic discipline that characterize the ER diagramming mode. However, some developers find the rigor of NIAM too confining and, indeed, when in an exploration mode, such rigor may be counterproductive. Nevertheless, when the CSDP rules of NIAM are followed faithfully, the resulting data models possess much more integrity than do models constructed without a semantic analysis component.

Asymetrix Corp.'s InfoModeler uses a proprietary implementation of NIAM called FORML (Formal Object Role Modeling Language). The package creates diagrams of facts automatically via drag-and-drop, and it also lets the user work directly with the diagram editor and create written facts from the constructs expressed diagrammatically.

Semantic Object Modeling

David Kroenke's SOM technique is characterized by a radically different kind of semantic analysis. (See Database Processing: Fundamentals, Design and Implementation, Prentice Hall, 1995.) Kroenke, who cut his teeth during the early years of the microcomputer revolution developing Rbase at Microrim, later moved over to Wall Data and has further developed his ideas about SOM in Wall Data's Salsa product (he is now chief technologist of Wall Data's Salsa Business Unit). Rather than focusing attention on a rigorous exploration of the problem space, Kroenke bids developers to focus their attention on how the user perceives the data. At its core, SOM resembles ER modeling in its terminology, with similar vaguely articulated definitions of entities, attributes, and relationships. What sets SOM apart from ER modeling is Kroenke's addition of the concept of group attributes. Group attributes can themselves be entities (objects in Kroenke's parlance, although he takes care to differentiate SOM objects from "real" objects).

Returning to my Department, Building, and Employee example, Kroenke would have us realize that although Employee exists as an object, it also exists as a group attribute within the Department object. That relationship is somewhat analogous to the primary-key/foreign-key relationship found in ER modeling. However, the distinction - subtle as it may be - is that from the user's point of view, when looking at the Department object, the employees that belong to that department logically belong within the Department object and not as separate entities connected with a line that you must follow to grasp the conceptual relationship between the two entities. Likewise, employees are group attributes within the Building object.
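
A minimal Python sketch of this idea follows. It assumes that a group attribute can be modeled as a list of objects held inside the containing object; the class layout and sample values are illustrative and are not taken from Kroenke's notation or from Salsa.

```python
from dataclasses import dataclass, field


@dataclass
class Employee:
    employee_id: int
    employee_name: str


@dataclass
class Department:
    department_name: str
    # Employee appears inside Department as a group attribute.
    employees: list = field(default_factory=list)


@dataclass
class Building:
    building_name: str
    # The same Employee objects also appear inside Building.
    employees: list = field(default_factory=list)


# Employee exists as an independent object...
larry = Employee(1234, "Larry")

# ...and also within the objects that contain it.
finance = Department("Finance", [larry])
admin = Building("Administration", [larry])
```

Someone looking at finance sees its employees directly inside it, with no line to follow to another entity, yet finance.employees[0] and admin.employees[0] are the same independent Employee object.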

Figure 2 illustrates the three objects: Department, Building, and Employee. Note that the left-hand window contains a catalog of predefined objects that you can drag onto the work surface; I created the Department and Employee objects that way. I then dragged the title bar of the Employee object into the attributes area of the Department object and dropped it. Thus, you can see the Employee as an attribute of the Department, and also as an independent object. Finally, this figure illustrates how to create an object for which no definition exists. In an empty area of the canvas, an area was outlined, which Salsa then made into a generic object (note the label "object"). The next step was to rename it "Building" and to populate it with attributes.

Analysis

The major problem I find with SOM is that there is no simple way to ascertain which objects would be affected by a change to an object that participates as an attribute in other objects. Thus, without the visual clues that you have on an ER diagram, with the lines between entities, there is no easy and fast way to see that the Employee entity participates in a relationship with two other entities. Also, when viewing the container objects, it is not intuitively clear which attributes are links to other objects.
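
The missing information is essentially a reverse lookup, which can be sketched in Python under some assumptions: the model is represented as a hypothetical dictionary from object name to attribute names, and an attribute counts as a link when its name matches another object. This is not a description of how Salsa stores or displays a model.

```python
# Hypothetical representation of the example model.
model = {
    "Department": ["DepartmentName", "DepartmentID", "Employee"],
    "Building": ["BuildingName", "BuildingID", "BuildingAddress", "Employee"],
    "Employee": ["EmployeeName", "EmployeeID", "EmployeeHireDate"],
}


def containers_of(model, target):
    """List the objects that hold the target object as an attribute."""
    return sorted(name for name, attrs in model.items() if target in attrs)


def object_links(model, name):
    """List the attributes of an object that are themselves objects."""
    return [attr for attr in model[name] if attr in model]


affected = containers_of(model, "Employee")
links = object_links(model, "Department")
```

Here affected names the two objects touched by a change to Employee, the fact an ER diagram shows with two lines, and links picks out which Department attributes point at other objects.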

On a more positive note, SOM brings a form of hierarchical decomposition to data modeling. Data models, particularly ER diagrams, have always been far more unwieldy than process-oriented models such as data flow diagrams (DFDs). DFDs often begin with a very high-level context diagram from which subordinate diagrams are created, each representing a portion of the superordinate diagram, in a process that continues until the DFD's activities can be better described in a written mini-specification. ER diagrams offer no similar way to organize complexity, but SOM does. Consider that in a large system you can represent the whole enterprise with one object labeled "Enterprise" or "Company." Within those mega-objects appear the familiar Department and Building objects. Within the Department and Building objects appears the Employee object. Within the Employee objects appear the Address and Phone Number objects. Granted that each of the objects mentioned appears as an independent object in addition to its role as an attribute within the container object, you still have the ability to "collapse" a whole data model into one all-encompassing object that can be peeled like an onion to reveal the layers of objects below the surface.
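
The onion-peeling idea can be sketched in Python as a nested structure viewed to a chosen depth. The nested dictionary and the function name peel are assumptions for illustration; the object names come from the example in the text.

```python
# Each object maps to the objects it contains; an empty dict has nothing inside.
hierarchy = {
    "Enterprise": {
        "Department": {"Employee": {"Address": {}, "Phone Number": {}}},
        "Building": {"Employee": {"Address": {}, "Phone Number": {}}},
    }
}


def peel(objects, depth, indent=0):
    """Return indented object names, descending at most `depth` layers."""
    lines = []
    for name, children in objects.items():
        lines.append("  " * indent + name)
        if depth > 0:
            lines.extend(peel(children, depth - 1, indent + 1))
    return lines


collapsed = peel(hierarchy, 0)  # the whole model as one object
one_layer = peel(hierarchy, 1)  # Enterprise plus what it directly contains
```

At depth 0 the entire model collapses to the single Enterprise object; each additional layer of depth reveals the objects one level further in.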

Which to Choose?

The choice of which modeling technique to use is often based on preference and prior experience. However, when no preference controls the decision, you can attempt to make a semi-objective choice. Clearly, the Chen-derived ER diagramming method enjoys wide user acceptance, which means that there will be a greater likelihood that other developers will understand your design notation and will not require education to use it. On the other hand, Chen-based models are not "provably correct," so there is a greater chance of ambiguity and error sneaking into your analysis. The lack of rigor cuts both ways, however, because when you are in exploration or thinking mode and using the ER diagram as a doodling surface to capture your thoughts, you are not hindered by the necessity of thinking each relationship through completely at the moment of conception.

The rigor of NIAM's CSDP technique stands in stark contrast to the freedom of the ER diagram notation. For individuals who are more comfortable with word-oriented, left-brain thinking patterns, the fact-based approach may be more comfortable. Weighing in against the use of this modeling notation is the lack of support by tools other than Asymetrix's InfoModeler and the need to educate those who have not been exposed to object role modeling.

Finally, Kroenke's SOM offers the developer the unique ability to hide details in objects of increasing generality. Yet, as with NIAM, there is a lack of third-party tools other than Wall Data's Salsa. However, consider situations in which users must be in control of their own data, but educating them in more traditional data modeling methods would be overkill. The psychological research underlying SOM could create a comfortable environment for such users, whose need for a formal modeling language must be balanced against their need for a notation that the nontechnical end user can comprehend and use effectively.

Warren Keuffel is an independent software engineer and technology writer based in Salt Lake City, Utah. You can email him at 76702.525@compuserve.com.