CPSC 333 --- Lecture 7 --- Monday, January 22, 1996

                           Data Dictionaries

A *data dictionary* is an organized listing of all the data elements
that are pertinent to a system, with precise, rigorous
definitions. These help to ensure that both users (or "customers" or
"clients") and developers of systems have a common understanding of
all the inputs, outputs, components of stores, and intermediate
calculations used by a system.

"Warning:" Once again, this is a tool that "almost everybody" uses,
but that "almost nobody" defines in exactly the same way as anyone
else. Pressman introduces "data dictionaries" in both the "Beginner's
Guide" and "Practitioner's Approach." However, he introduces all the
data dictionary notation he might need "all at once;" I will be
introducing notation "as I need it" --- introducing the syntax needed
to define components of an entity relationship diagram now, and then
introducing additional syntax later, as I need it to define components
of data flow diagrams, etc.

In all cases a "data dictionary" is a sequence of definitions for data
elements that are part of the requirements, design, and test
specifications for a system as well as part of the implemented
"system" itself.

In order to make the "dictionary" easier to use --- and, also to
ensure that nothing is ever defined more than once --- these
definitions should be sorted, in alphabetical order (for the terms
that are being defined).

Data Dictionary Symbols --- that are needed to define components
                            of an entity relationship diagram:

 =   - "is composed of" --- this separates the term being defined
       (which is to the left of "=") from its definition (which
       is on the right)

 +   - "aggregation" used in definitions of (instances of) entities
       to define the entity as the "aggregation" of its attributes

 []  - used to enclose options if several alternatives are available.
       For this course (at least, for now) we'll *only* use these
       to specify "enumerated types" if these are needed to define
       types of attributes. (Other uses will be seen after data
       flow diagrams have been introduced.)

 |   - separator for the alternatives that are enclosed by "[]"

 @   - should appear before the name of every attribute that is
       part of the *primary key* in a definition for an entity,
       weak entity, associative object, or relationship

 * * - these enclose "comments"

There is no standard way to identify "constants," and there is
no "standard" set of basic "data types" that can be assumed.

For this course, we will enclose constants within double
quotes - " "

*For the moment* the only "basic types" we'll need are:

   - integer
   - string (a character string)
   - real (a real number)

In fact, we might find that only the first two are used (to start).
More of these may be introduced as needed.

The data dictionary for an entity relationship diagram must include a
definition for every attribute of anything in the diagram, as well as
every entity, weak entity, relationship, and associative object in the
diagram.


Definitions of Attributes: There are only two cases to be considered.

 a) The attribute has one of the "basic" data types.
 b) The attribute has an "enumerated" data types.

If the normalization rules have been obeyed, no other cases should
arise.

Case a) The attribute has one of the "basic" data types.

 In this case, the attribute's definition should have the general
 form

   <name of attribute> = <name of basic data type>

 and should be supplemented with one or comments.

 Comments for attributes should

  - say which entity (or entities) or associative object(s) this
    is an attribute of

  - (Possibly) say whether the attribute is part of the primary
    key

  - Provide upper and lower bounds on possible values of the
    attribute, if the attribute's type is "integer" or "real"
    --- or bounds on the length of the string if the basic
    type is "string," if these bounds exist (and are known)

  - Provide a bound on the "precision" of values that must be
    represented if the attribute's type is "real." For example,
    this might be a specification that two decimal digits of
    accuracy are required (if, for example, the attribute
    represents a cost in "dollars and cents")

  - Provide additional constraints on possible values
       
     ... in the last three cases, this information is needed
     if the system will be required to ensure that these constraints
     are obeyed and/or if the information might be helpful in
     designing and implementing the "data subsystem" ...

  - State a "unit of measure" (eg., feet, inches, metres,...)
    if the attribute has type "integer" or "real" is a measurement
    of something

  - Provide a *brief* statement of what the attribute represents
    *IF NECESSARY* --- Ideally this should be "clear" from the
    attribute's name, so that this kind of comment wouldn't be
    necessary

 Examples: The attributes "ID number" and "name" of "Student" in
 the student information system "might" have the following
 definitions.

  ID number = integer

   * key attribute of the entity Student *
   * should be exactly seven digits long *

  name = string

   * non-key attribute of the entity Student                 *
   * includes the student's full name --- first name,        *
   *  additional names or initials (if known), and last name *

 If, at some point, bounds on the lengths of names to be represented
 are discovered, then this could be specified as one of the comments
 in the definition for "name" as well.

Case b) The attribute has an "enumerated type"

Use the symbols for specifying options - [, |, ] - and literals
to define the enumerated type on the right hand side of the
definition of the attribute.

Comments should say which entity has this attribute, (possibly)
whether this is a key attribute, and should give a brief description
of what the attribute represents if necessary (again, this should
ideally be clear from the attribute's name).

Example: The entity "Student" in the ORIGINAL version of the
information system had an attribute "status" which might have the
following definition.

  status = [ "registered" | "passed" ]

   * non-key attribute of the entity Student *

In later versions of the information system, this is how the attribute
"grade" of the associative object "Course Credit" would be defined.


Definitions of Entities: Define an entity as an "aggregate" of its
attributes. In the definition, put an "@" before the name of each
attribute that is part of the entity's primary key.

The order in which attributes are listed isn't important, but (I
think) the definition is more readable if all the "key" attributes are
listed before any of the "non-key" attributes.

Comments should be used to identify the data item being defined as an
"Entity" and to give more information about what this data item
represents (if this is necessary).

Example: A definition for the version of "Student" that appears in the
very *first* version of the "student information system" would be:

 Student = @ID number + name + status

  * Entity; represents students who are currently registered in *
  * or who have passed the course that this system remembers    *
  * information about                                           *


Definitions of Relationships: These will resemble definitions for
entities --- relationships are defined as "aggregates" of attributes.

The attributes in the primary keys of all the entities connected by the
relationship should be included in the definition. As discussed in the
handout, "More About Entity Relationship Diagrams," the "primary key"
of a binary relationship will either be the same as the primary key
of one of the entities involved in the relationships, or will include
all the attributes in the primary keys for both --- it depends on the
"type" of the relationship.

In general (for 3-ary and n-ary relationships) the primary key for the
relationship be some subset of the attributes listed in the definition
of the relationship (that is, the attributes in the related entities'
primary keys). No two instances of the "relationship" should have the
same values for the attributes included in the key, and this key
should be "minimal" in the sense that you shouldn't be able to remove
any attributes from this set and still have a "key" for the
relationship.

*IF AT ALL POSSIBLE," this key should include all the attributes that
are in the primary keys for some *subset of the entities* that are
connected by the relationship.

Usually, you'll be dealing with binary relationships, and it will be
sufficient to follow the directions in the handout, "More About Entity
Relationship Diagrams," in order to define primary keys for
relationships.

Use comments to identify the item being defined as a "relationship,"
to name the entities it connects, to provide information about the
relationship's "type," and to give more information about what the
relationship "represents" if this isn't clear from the relationship's
name.

Consider, for example, the relationship "has passed" introduced in the
*second* version of the Student Information System. This was a
("plain") binary relationship between the entities "Student" and
"Course." The primary key of "Student" included a single attribute,
"ID number." The primary key of "Course" included a pair of
attributes, "discipline code" and "course number." Using the
description of the system, it is possible to decide that this
relationship has type "Mc:Mc" so that it is necessary to include the
attributes in *both* these primary keys in order to define a primary
key for the relationship. Therefore, the following data dictionary
definition would be correct:

  has passed = @ID number + @discipline code + @course number

   * Relationship between the entities, "Student" and "Course" *
   * Type Mc:Mc --- Each instance of "Student" may have passed *
   *  zero or more instances of "Course," and each instance of *
   *  "Course" may have been passed by zero or more instances  *
   *  of "Student"                                             *


EXCEPTIONAL CASE: Problems arise when a relationship connects an
instance of an entity to other instance(s) of the same entity.

Clearly, values for the key attributes of more than one instance
of this entity need to included in an instance of this kind of
"relationship," in order to model this kind of connection. Therefore
it is necessary to choose a new name for each "copy" of each attribute
that must be included and include all these in the definition of the
relationship. In order for the data dictionary to be complete, the new
names of the copies must be defined as well.

To define a new name for a copy of an attribute, you can list the
original name of the attribute on the right hand side, as the new
name's definition.

Consider the "prerequisite" relationship introduced in order to solve
Question 2(a) on Lab Exercise #1: This relationship has type Mc:Mc and
is a binary relationship that connects the entity "Course" to
itself. Suppose (as was the case for this lab exercise) that this is
being added to the entity relationship diagram for "Version 4" of the
system, so that "Course" is an entity whose primary key is "discipline
code" and "course number" (as above). Then the following definitions
could be used , assuming that the attributes for the entity "Course"
have already been defined (and providing that invention of new word,
"postrequisite," is acceptable --- since it *is* new, it certainly
*should* be explained in the comments):

  postrequisite course number = course number

   * Used in the relationship "is a prerequisite of," which    *
   *  connects the entity "Course" to itself.                  *
   * This refers to the course number of the course whose      *
   *  prerequisite is being defined, rather than to the course *
   *  number for the prerequisite itself                       *


  postrequisite discipline code = discipline code

   * Used in the relationship "is a prerequisite of," which    *
   *  connects the entity "Course" to itself.                  *
   * This refers to the discipline code of the course whose    *
   *  prerequisite is being defined, rather than to the        *
   *  discipline code for the prerequisite itself              *


  prerequisite course number = course number

   * Used in the relationship "is a prerequisite of," which    *
   *  connects the entity "Course" to itself.                  *
   * This refers to the course number for the prerequisite     *
   *  course itself, rather than for the course for which a    *
   *  prerequisite is being defined                            *


  prerequisite discipline code = discipline code

   * Used in the relationship "is a prerequisite of," which    *
   *  connects the entity "Course" to itself.                  *
   * This refers to the discipline code for the prerequisite   *
   *  course itself, rather than for the course for which a    *
   *  prerequisite is being defined                            *


  is a prerequisite of = @prerequisite discipline code
                         + @prerequisite course number
                         + @postrequisite discipline code
                         + @postrequisite course number

   * Relationship between the entity "Course" and itself           *
   * Type Mc:Mc - Each instance of "Course" can be a prerequisite  *
   *  for zero or more other instances of "Course," and each       *
   *  instance of "Course" can have zero or more other instances   *
   *  of "Course" as prerequisites                                 *

Note that, as the above example shows, you can extend a definition
over several lines if this improves readability.

Definitions of Weak Entities: Recall that each of these is "connected
to" or "associated with" another entity, and that it is necessary to
include the attributes in the primary key of the associated entity as
part of the primary key for the weak entity. These attributes must be
listed (as key attributes) in the definition of the weak entity; it
may be helpful to use comments to distinguish between the other
entity's "borrowed" attributes and the weak entity's "own" attributes.

For example, if the entity "Course" and weak entity "Course Section"
are as described for versions 3 and 4 of the student information
system (discussed in the Monday, Jan. 15 lecture) and the attributes
for both have already been defined, then a correct definition for
"Course Section" would be

  Course Section = @discipline code + @course number
                    + @term + @year + @section number
                    + location + day of week + start time
                    + duration

   * Weak Entity, associated with the entity "Course"       *
   *  Key attributes borrowed from "Course:" discipline     *
   *                                  code, course number   *
   * Represents a section of a course (in which students    *
   *  may be registered), under the assumption that each     *
   *  course section has a single weekly meeting           *


Definitions of Associative Objects: We've only considered one "kind"
of associative object (corresponding the following first "case") so
far. In fact, two kinds of associative objects can be useful.

 Case 1: (If there are k "entities" connected) --- each combination
  of k instances of the connected entities can be involved in *at
  most one* instance of the associative object at a time

  In this first case, a primary key for the associative object can
  be found by ignoring the associative object's own attributes,
  identifying the "type" of the relationship that would be obtained
  by throwing these attributes away, and then using the "primary
  key" that would be found for this relationship (using the
  instructions that have already been given) as the primary key
  of the associative object.

  Example: Consider the associative object "Course Credit" which
   connects entities "Student" and "Course" in version 4 of the
   student information system, and which has an attribute "grade."
   Assuming "grade," and and the entities "Student" and "Course"
   have been defined (as well as all their attributes), the
   following definition for "Course Credit" would be sufficient.

    Course Credit = @ID number + @discipline code + @course number
                    + grade

     * Associative object connecting the entities "Student" and *
     *  "Course"                                                *
     * Relationship Type Mc:Mc - Each instance of "Student"     *
     *  can have a "Course Credit" for zero or more instances   *
     *  of "Course," and there may be zero more instances of    *
     *  "Student" that have a "Course Credit" for each          *
     *  of "Course"                                             *
     * Note: Students can repeat courses after having passed    *
     *  them; the "Course Credit" for a given student on a      *
     *  given course should represent the highest grade earned  *
     *  by that student for that course                         *

 Case 2: (Again, assuming k "entities" are connected) --- it is
  possible for a combination of k instances of the connected
  entities to be connected by *several* instances of the
  associative object at once!

  One Example: Consider a "Medical Information System" that maintains
  information about different types of medication that can be
  prescribed (perhaps including the medication name, maximal allowable
  dosage, etc.,) and also about patients of the doctor(s) using the
  system. "Medication" and "Patient" might both be entities in the
  ERD for this system.

  It will probably be necessary to keep track of prescriptions of
  medication to patients. Since the "quantity" of the medication
  prescribed and the "prescription date" are probably both data
  items that must be remembered, "Prescription" will likely be
  an associative object connecting "Patient" and "Medication".

  It is possible to prescribe the same medication to the same
  patient more than once. It is also quite probable that a history
  of prescriptions should be maintained (since the amount of a drug
  consumed over some period of time is probably important, just as
  the maximum quantity consumed all at once is).

  Under these circumstances, it is probably necessary for the system
  to be able to remember more than once instance of a "Prescription"
  that connects a given (instance of a) "Patient" to a given
  (instance of) "Medication."

  To find a "primary key" in a case like this, start by ignoring this
  fact, and, again, determine a "relationship type;" used this
  to produce a first attempt at a primary key.

  If this has been done correctly, then values of the attributes
  in the "candidate" primary key will determine precisely which
  combination of k instances of entities are "involved in" (or
  "connected by") the given instance of the relationship. It will be
  necessary to extend this "preliminary primary key" using one or more
  of the associative object's "own" attributes, in order to produce
  a primary key that can be used to distinguish among *all* instances
  of this associative object.

  Continuing the "Medical Information System" example: The
  associative object "Patient" almost certainly has type "Mc:Mc"
  --- each instance of "Patient" can be prescribed zero or more
  instances of "Medication," and each instance of "Medication"
  can be prescribed to zero or more instances of "Patient."
  Therefore the "first candidate" for the primary key for
  "Prescription" should include all the attributes in the primary
  key for "Patient," *and" all the attributes in the primary key
  for "Medication."

  It is likely that two (or more) prescriptions of the same
  medication cannot be made to the same patient on any given date.
  If this is true then the "Prescription's" own attribute,
  "prescription date," could be added to the "candidate" primary
  key, in order to form a correct primary key for this associative
  object.

  NOTE: I suspect that this second kind of associative object is
  "unusual." Therefore, it would be a good idea to identify each
  associative object that is of this second "kind" by including
  this information in the comments for this kind of associative
  object's definition. Then, the first "kind" of associative
  object could safely could be assumed as a "default."

  Alternatively, you could include even more comments in your
  data dictionary, by explicitly saying (in comments) which
  "kind" of associative object you have, every time an
  associative object is defined.

Definitions of Supertypes and Subtypes: Define the supertype in the
same way as you would define an entity (stating that this is a
"supertype," and possibly listing all its subtypes, in the
definition's comments).

Define the subtype by listing the attributes that must be *added to*
those of the supertype in order to form it (and, of course,
identifying the supertype as well). State that you are defining a
subtype (and identify the supertype) in the comments for the
definition. Note that, if the supertype and all its subtypes have the
same primary key (as was required in the definitions given in this
course's lecture notes), then you won't be adding any "key" attributes
to the supertype in order to form the subtype.

For example: If the supertype "Course" and subtypes "Pass/Fail Course"
and "Graded Course" are described for "Version 5" of the Student
Information System (in the lecture notes for Monday, January 15 ---
"lecture_04") --- and if all attributes of this supertype and these
subtypes have already been defined --- then the definitions for the
supertype and its subtypes could be as follows.

 Course = @discipline code + @course number + title

  * Supertype, with subtypes "Graded Course" and "Pass/Fail Course" *


 Graded Course = Course + weight

  * Subtype of the supertype "Course"                               *
  * Represents courses for which students can receive grades        *


 Pass/Fail Course = Course

  * Subtype of the supertype "Course"                               *
  * Represents courses that students can only "pass" or "fail,"     *
  *  so that letter grades are not earned for these                 *


What to Include?

A complete data dictionary (corresponding to an ERD) will include all
the definitions needed to define the "type" of every attribute,
entity, relationship, weak entity, associative object, supertype and
subtype in the entity relationship diagram --- *and nothing more*.


"Wrapping Things Up"

In the second lecture, a number of "analysis principles" and
desirable properties of (informal) models for requirements were
stated.

Entity relationship diagrams and data dictionaries *do* satisfy at
least some of these principles, and meet some of these goals:

 - Each can be modeled one of the "view" or "aspects" of
   requirements that were identified. Entity relationship diagrams
   provide specifications of "information structure," while
   data dictionaries can be used to specify "information content."

 - If redundant data is not included in the ERD (as has been
   discussed) then the ERD models *essential requirements* for the
   system. *Implementation requirements* can be incorporated later
   on, and this could lead to additions to the ERD

 - Each provides a model that is relatively easy to create, easy
   to understand, and also easy to change.

It isn't immediately clear that an entity relationship diagram
provides a "hierarchical" or "partioned" view of requirements.
However, that it can be argued that a combination of an ERD, list of
attributes of every entity (etc.) in the diagram, *and* complete set
of data dictionary definitions *does* provide this --- especially if a
reasonably powerful "CASE Tool" is used to create, and inspect this
information:

 - The entity relationship diagram provides a system-wide, but
   extremely general picture of "stored data" requirements,
   without providing much detail

 - (Using a CASE Tool) one can "focus attention" on part of this
   --- possibly on a single entity, relationship, associative
   object, etc. At this point it is possible to view to a more
   "detailed view," by obtaining a list of the attributes which
   are the components of the entity (etc.) being constructed
   --- or by seeing a (slightly more detailed) data dictionary
   definition of the entity instead

 - In either case, one can continue to move "down" the hierarchy
   by focusing on a single attribute. Now, even more detail can
   be obtained by moving to this attribute's data dictionary
   definition.

 - Of course, you can also move "back up" toward the ERD any time
   you want to (either by using an index to flip from one part
   of a hard copy of a specification to another, or by using
   one of the features provided by a CASE tool).

 - ... and, you can move "horizontally" through this model as well
   as "vertically," by switching from one attribute of an entity
   to another attribute of the same entity (at one level of detail),
   or by following chains of entities, relationships, etc., in the
   ERD (at a higher level).

It is also possible to begin construction of this model early on, when
few details of system requirements are known. The process of
constructing the model helps you to discover which additional
information is required. It is then easy to "extend" the model by
incorporating this additional information as it becomes available.

*Unfortunately:*

 - We've had to resort to "comments" --- which are based on a
   "natural language" (English) rather than a formal one
   (like first order logic) in order to include "detailed" or
   "unusual" requirements. Any information specified using a
   "natural language" is inherently ambiguous/imprecise.

 - It's already been admitted, repeatedly, that different authors
   of "software engineering" texts all define entity relationship
   diagrams, data dictionaries, etc., in slightly different ways.
   The more detailed or unusual the requirement is (that you wish
   to specify), the more likely it becomes that you must use a
   "nonstandard" part of the model in order to document it!

Some developers using these tools have tried to "cope" with this by
developing and documenting their own "in house" standards, including
conventions for documenting "unusual" requirements. However, as more
and more of these are added, it becomes more and more difficult to
remember what these conventions are, and to follow them correctly.

For this reason (and others) it *can* desirable to use different tools
that can be used to specify a wider range of types of requirements in
a precise way. This is possible (or, at least, plausible) if a tool is
based on a formal language, rather than a natural language like
English and/or a set of pictures. These tools will be discussed later
in the term.