CPSC 333 --- Lecture 7 --- Monday, January 22, 1996 Data Dictionaries A *data dictionary* is an organized listing of all the data elements that are pertinent to a system, with precise, rigorous definitions. These help to ensure that both users (or "customers" or "clients") and developers of systems have a common understanding of all the inputs, outputs, components of stores, and intermediate calculations used by a system. "Warning:" Once again, this is a tool that "almost everybody" uses, but that "almost nobody" defines in exactly the same way as anyone else. Pressman introduces "data dictionaries" in both the "Beginner's Guide" and "Practitioner's Approach." However, he introduces all the data dictionary notation he might need "all at once;" I will be introducing notation "as I need it" --- introducing the syntax needed to define components of an entity relationship diagram now, and then introducing additional syntax later, as I need it to define components of data flow diagrams, etc. In all cases a "data dictionary" is a sequence of definitions for data elements that are part of the requirements, design, and test specifications for a system as well as part of the implemented "system" itself. In order to make the "dictionary" easier to use --- and, also to ensure that nothing is ever defined more than once --- these definitions should be sorted, in alphabetical order (for the terms that are being defined). Data Dictionary Symbols --- that are needed to define components of an entity relationship diagram: = - "is composed of" --- this separates the term being defined (which is to the left of "=") from its definition (which is on the right) + - "aggregation" used in definitions of (instances of) entities to define the entity as the "aggregation" of its attributes [] - used to enclose options if several alternatives are available. For this course (at least, for now) we'll *only* use these to specify "enumerated types" if these are needed to define types of attributes. (Other uses will be seen after data flow diagrams have been introduced.) | - separator for the alternatives that are enclosed by "[]" @ - should appear before the name of every attribute that is part of the *primary key* in a definition for an entity, weak entity, associative object, or relationship * * - these enclose "comments" There is no standard way to identify "constants," and there is no "standard" set of basic "data types" that can be assumed. For this course, we will enclose constants within double quotes - " " *For the moment* the only "basic types" we'll need are: - integer - string (a character string) - real (a real number) In fact, we might find that only the first two are used (to start). More of these may be introduced as needed. The data dictionary for an entity relationship diagram must include a definition for every attribute of anything in the diagram, as well as every entity, weak entity, relationship, and associative object in the diagram. Definitions of Attributes: There are only two cases to be considered. a) The attribute has one of the "basic" data types. b) The attribute has an "enumerated" data types. If the normalization rules have been obeyed, no other cases should arise. Case a) The attribute has one of the "basic" data types. In this case, the attribute's definition should have the general form = and should be supplemented with one or comments. Comments for attributes should - say which entity (or entities) or associative object(s) this is an attribute of - (Possibly) say whether the attribute is part of the primary key - Provide upper and lower bounds on possible values of the attribute, if the attribute's type is "integer" or "real" --- or bounds on the length of the string if the basic type is "string," if these bounds exist (and are known) - Provide a bound on the "precision" of values that must be represented if the attribute's type is "real." For example, this might be a specification that two decimal digits of accuracy are required (if, for example, the attribute represents a cost in "dollars and cents") - Provide additional constraints on possible values ... in the last three cases, this information is needed if the system will be required to ensure that these constraints are obeyed and/or if the information might be helpful in designing and implementing the "data subsystem" ... - State a "unit of measure" (eg., feet, inches, metres,...) if the attribute has type "integer" or "real" is a measurement of something - Provide a *brief* statement of what the attribute represents *IF NECESSARY* --- Ideally this should be "clear" from the attribute's name, so that this kind of comment wouldn't be necessary Examples: The attributes "ID number" and "name" of "Student" in the student information system "might" have the following definitions. ID number = integer * key attribute of the entity Student * * should be exactly seven digits long * name = string * non-key attribute of the entity Student * * includes the student's full name --- first name, * * additional names or initials (if known), and last name * If, at some point, bounds on the lengths of names to be represented are discovered, then this could be specified as one of the comments in the definition for "name" as well. Case b) The attribute has an "enumerated type" Use the symbols for specifying options - [, |, ] - and literals to define the enumerated type on the right hand side of the definition of the attribute. Comments should say which entity has this attribute, (possibly) whether this is a key attribute, and should give a brief description of what the attribute represents if necessary (again, this should ideally be clear from the attribute's name). Example: The entity "Student" in the ORIGINAL version of the information system had an attribute "status" which might have the following definition. status = [ "registered" | "passed" ] * non-key attribute of the entity Student * In later versions of the information system, this is how the attribute "grade" of the associative object "Course Credit" would be defined. Definitions of Entities: Define an entity as an "aggregate" of its attributes. In the definition, put an "@" before the name of each attribute that is part of the entity's primary key. The order in which attributes are listed isn't important, but (I think) the definition is more readable if all the "key" attributes are listed before any of the "non-key" attributes. Comments should be used to identify the data item being defined as an "Entity" and to give more information about what this data item represents (if this is necessary). Example: A definition for the version of "Student" that appears in the very *first* version of the "student information system" would be: Student = @ID number + name + status * Entity; represents students who are currently registered in * * or who have passed the course that this system remembers * * information about * Definitions of Relationships: These will resemble definitions for entities --- relationships are defined as "aggregates" of attributes. The attributes in the primary keys of all the entities connected by the relationship should be included in the definition. As discussed in the handout, "More About Entity Relationship Diagrams," the "primary key" of a binary relationship will either be the same as the primary key of one of the entities involved in the relationships, or will include all the attributes in the primary keys for both --- it depends on the "type" of the relationship. In general (for 3-ary and n-ary relationships) the primary key for the relationship be some subset of the attributes listed in the definition of the relationship (that is, the attributes in the related entities' primary keys). No two instances of the "relationship" should have the same values for the attributes included in the key, and this key should be "minimal" in the sense that you shouldn't be able to remove any attributes from this set and still have a "key" for the relationship. *IF AT ALL POSSIBLE," this key should include all the attributes that are in the primary keys for some *subset of the entities* that are connected by the relationship. Usually, you'll be dealing with binary relationships, and it will be sufficient to follow the directions in the handout, "More About Entity Relationship Diagrams," in order to define primary keys for relationships. Use comments to identify the item being defined as a "relationship," to name the entities it connects, to provide information about the relationship's "type," and to give more information about what the relationship "represents" if this isn't clear from the relationship's name. Consider, for example, the relationship "has passed" introduced in the *second* version of the Student Information System. This was a ("plain") binary relationship between the entities "Student" and "Course." The primary key of "Student" included a single attribute, "ID number." The primary key of "Course" included a pair of attributes, "discipline code" and "course number." Using the description of the system, it is possible to decide that this relationship has type "Mc:Mc" so that it is necessary to include the attributes in *both* these primary keys in order to define a primary key for the relationship. Therefore, the following data dictionary definition would be correct: has passed = @ID number + @discipline code + @course number * Relationship between the entities, "Student" and "Course" * * Type Mc:Mc --- Each instance of "Student" may have passed * * zero or more instances of "Course," and each instance of * * "Course" may have been passed by zero or more instances * * of "Student" * EXCEPTIONAL CASE: Problems arise when a relationship connects an instance of an entity to other instance(s) of the same entity. Clearly, values for the key attributes of more than one instance of this entity need to included in an instance of this kind of "relationship," in order to model this kind of connection. Therefore it is necessary to choose a new name for each "copy" of each attribute that must be included and include all these in the definition of the relationship. In order for the data dictionary to be complete, the new names of the copies must be defined as well. To define a new name for a copy of an attribute, you can list the original name of the attribute on the right hand side, as the new name's definition. Consider the "prerequisite" relationship introduced in order to solve Question 2(a) on Lab Exercise #1: This relationship has type Mc:Mc and is a binary relationship that connects the entity "Course" to itself. Suppose (as was the case for this lab exercise) that this is being added to the entity relationship diagram for "Version 4" of the system, so that "Course" is an entity whose primary key is "discipline code" and "course number" (as above). Then the following definitions could be used , assuming that the attributes for the entity "Course" have already been defined (and providing that invention of new word, "postrequisite," is acceptable --- since it *is* new, it certainly *should* be explained in the comments): postrequisite course number = course number * Used in the relationship "is a prerequisite of," which * * connects the entity "Course" to itself. * * This refers to the course number of the course whose * * prerequisite is being defined, rather than to the course * * number for the prerequisite itself * postrequisite discipline code = discipline code * Used in the relationship "is a prerequisite of," which * * connects the entity "Course" to itself. * * This refers to the discipline code of the course whose * * prerequisite is being defined, rather than to the * * discipline code for the prerequisite itself * prerequisite course number = course number * Used in the relationship "is a prerequisite of," which * * connects the entity "Course" to itself. * * This refers to the course number for the prerequisite * * course itself, rather than for the course for which a * * prerequisite is being defined * prerequisite discipline code = discipline code * Used in the relationship "is a prerequisite of," which * * connects the entity "Course" to itself. * * This refers to the discipline code for the prerequisite * * course itself, rather than for the course for which a * * prerequisite is being defined * is a prerequisite of = @prerequisite discipline code + @prerequisite course number + @postrequisite discipline code + @postrequisite course number * Relationship between the entity "Course" and itself * * Type Mc:Mc - Each instance of "Course" can be a prerequisite * * for zero or more other instances of "Course," and each * * instance of "Course" can have zero or more other instances * * of "Course" as prerequisites * Note that, as the above example shows, you can extend a definition over several lines if this improves readability. Definitions of Weak Entities: Recall that each of these is "connected to" or "associated with" another entity, and that it is necessary to include the attributes in the primary key of the associated entity as part of the primary key for the weak entity. These attributes must be listed (as key attributes) in the definition of the weak entity; it may be helpful to use comments to distinguish between the other entity's "borrowed" attributes and the weak entity's "own" attributes. For example, if the entity "Course" and weak entity "Course Section" are as described for versions 3 and 4 of the student information system (discussed in the Monday, Jan. 15 lecture) and the attributes for both have already been defined, then a correct definition for "Course Section" would be Course Section = @discipline code + @course number + @term + @year + @section number + location + day of week + start time + duration * Weak Entity, associated with the entity "Course" * * Key attributes borrowed from "Course:" discipline * * code, course number * * Represents a section of a course (in which students * * may be registered), under the assumption that each * * course section has a single weekly meeting * Definitions of Associative Objects: We've only considered one "kind" of associative object (corresponding the following first "case") so far. In fact, two kinds of associative objects can be useful. Case 1: (If there are k "entities" connected) --- each combination of k instances of the connected entities can be involved in *at most one* instance of the associative object at a time In this first case, a primary key for the associative object can be found by ignoring the associative object's own attributes, identifying the "type" of the relationship that would be obtained by throwing these attributes away, and then using the "primary key" that would be found for this relationship (using the instructions that have already been given) as the primary key of the associative object. Example: Consider the associative object "Course Credit" which connects entities "Student" and "Course" in version 4 of the student information system, and which has an attribute "grade." Assuming "grade," and and the entities "Student" and "Course" have been defined (as well as all their attributes), the following definition for "Course Credit" would be sufficient. Course Credit = @ID number + @discipline code + @course number + grade * Associative object connecting the entities "Student" and * * "Course" * * Relationship Type Mc:Mc - Each instance of "Student" * * can have a "Course Credit" for zero or more instances * * of "Course," and there may be zero more instances of * * "Student" that have a "Course Credit" for each * * of "Course" * * Note: Students can repeat courses after having passed * * them; the "Course Credit" for a given student on a * * given course should represent the highest grade earned * * by that student for that course * Case 2: (Again, assuming k "entities" are connected) --- it is possible for a combination of k instances of the connected entities to be connected by *several* instances of the associative object at once! One Example: Consider a "Medical Information System" that maintains information about different types of medication that can be prescribed (perhaps including the medication name, maximal allowable dosage, etc.,) and also about patients of the doctor(s) using the system. "Medication" and "Patient" might both be entities in the ERD for this system. It will probably be necessary to keep track of prescriptions of medication to patients. Since the "quantity" of the medication prescribed and the "prescription date" are probably both data items that must be remembered, "Prescription" will likely be an associative object connecting "Patient" and "Medication". It is possible to prescribe the same medication to the same patient more than once. It is also quite probable that a history of prescriptions should be maintained (since the amount of a drug consumed over some period of time is probably important, just as the maximum quantity consumed all at once is). Under these circumstances, it is probably necessary for the system to be able to remember more than once instance of a "Prescription" that connects a given (instance of a) "Patient" to a given (instance of) "Medication." To find a "primary key" in a case like this, start by ignoring this fact, and, again, determine a "relationship type;" used this to produce a first attempt at a primary key. If this has been done correctly, then values of the attributes in the "candidate" primary key will determine precisely which combination of k instances of entities are "involved in" (or "connected by") the given instance of the relationship. It will be necessary to extend this "preliminary primary key" using one or more of the associative object's "own" attributes, in order to produce a primary key that can be used to distinguish among *all* instances of this associative object. Continuing the "Medical Information System" example: The associative object "Patient" almost certainly has type "Mc:Mc" --- each instance of "Patient" can be prescribed zero or more instances of "Medication," and each instance of "Medication" can be prescribed to zero or more instances of "Patient." Therefore the "first candidate" for the primary key for "Prescription" should include all the attributes in the primary key for "Patient," *and" all the attributes in the primary key for "Medication." It is likely that two (or more) prescriptions of the same medication cannot be made to the same patient on any given date. If this is true then the "Prescription's" own attribute, "prescription date," could be added to the "candidate" primary key, in order to form a correct primary key for this associative object. NOTE: I suspect that this second kind of associative object is "unusual." Therefore, it would be a good idea to identify each associative object that is of this second "kind" by including this information in the comments for this kind of associative object's definition. Then, the first "kind" of associative object could safely could be assumed as a "default." Alternatively, you could include even more comments in your data dictionary, by explicitly saying (in comments) which "kind" of associative object you have, every time an associative object is defined. Definitions of Supertypes and Subtypes: Define the supertype in the same way as you would define an entity (stating that this is a "supertype," and possibly listing all its subtypes, in the definition's comments). Define the subtype by listing the attributes that must be *added to* those of the supertype in order to form it (and, of course, identifying the supertype as well). State that you are defining a subtype (and identify the supertype) in the comments for the definition. Note that, if the supertype and all its subtypes have the same primary key (as was required in the definitions given in this course's lecture notes), then you won't be adding any "key" attributes to the supertype in order to form the subtype. For example: If the supertype "Course" and subtypes "Pass/Fail Course" and "Graded Course" are described for "Version 5" of the Student Information System (in the lecture notes for Monday, January 15 --- "lecture_04") --- and if all attributes of this supertype and these subtypes have already been defined --- then the definitions for the supertype and its subtypes could be as follows. Course = @discipline code + @course number + title * Supertype, with subtypes "Graded Course" and "Pass/Fail Course" * Graded Course = Course + weight * Subtype of the supertype "Course" * * Represents courses for which students can receive grades * Pass/Fail Course = Course * Subtype of the supertype "Course" * * Represents courses that students can only "pass" or "fail," * * so that letter grades are not earned for these * What to Include? A complete data dictionary (corresponding to an ERD) will include all the definitions needed to define the "type" of every attribute, entity, relationship, weak entity, associative object, supertype and subtype in the entity relationship diagram --- *and nothing more*. "Wrapping Things Up" In the second lecture, a number of "analysis principles" and desirable properties of (informal) models for requirements were stated. Entity relationship diagrams and data dictionaries *do* satisfy at least some of these principles, and meet some of these goals: - Each can be modeled one of the "view" or "aspects" of requirements that were identified. Entity relationship diagrams provide specifications of "information structure," while data dictionaries can be used to specify "information content." - If redundant data is not included in the ERD (as has been discussed) then the ERD models *essential requirements* for the system. *Implementation requirements* can be incorporated later on, and this could lead to additions to the ERD - Each provides a model that is relatively easy to create, easy to understand, and also easy to change. It isn't immediately clear that an entity relationship diagram provides a "hierarchical" or "partioned" view of requirements. However, that it can be argued that a combination of an ERD, list of attributes of every entity (etc.) in the diagram, *and* complete set of data dictionary definitions *does* provide this --- especially if a reasonably powerful "CASE Tool" is used to create, and inspect this information: - The entity relationship diagram provides a system-wide, but extremely general picture of "stored data" requirements, without providing much detail - (Using a CASE Tool) one can "focus attention" on part of this --- possibly on a single entity, relationship, associative object, etc. At this point it is possible to view to a more "detailed view," by obtaining a list of the attributes which are the components of the entity (etc.) being constructed --- or by seeing a (slightly more detailed) data dictionary definition of the entity instead - In either case, one can continue to move "down" the hierarchy by focusing on a single attribute. Now, even more detail can be obtained by moving to this attribute's data dictionary definition. - Of course, you can also move "back up" toward the ERD any time you want to (either by using an index to flip from one part of a hard copy of a specification to another, or by using one of the features provided by a CASE tool). - ... and, you can move "horizontally" through this model as well as "vertically," by switching from one attribute of an entity to another attribute of the same entity (at one level of detail), or by following chains of entities, relationships, etc., in the ERD (at a higher level). It is also possible to begin construction of this model early on, when few details of system requirements are known. The process of constructing the model helps you to discover which additional information is required. It is then easy to "extend" the model by incorporating this additional information as it becomes available. *Unfortunately:* - We've had to resort to "comments" --- which are based on a "natural language" (English) rather than a formal one (like first order logic) in order to include "detailed" or "unusual" requirements. Any information specified using a "natural language" is inherently ambiguous/imprecise. - It's already been admitted, repeatedly, that different authors of "software engineering" texts all define entity relationship diagrams, data dictionaries, etc., in slightly different ways. The more detailed or unusual the requirement is (that you wish to specify), the more likely it becomes that you must use a "nonstandard" part of the model in order to document it! Some developers using these tools have tried to "cope" with this by developing and documenting their own "in house" standards, including conventions for documenting "unusual" requirements. However, as more and more of these are added, it becomes more and more difficult to remember what these conventions are, and to follow them correctly. For this reason (and others) it *can* desirable to use different tools that can be used to specify a wider range of types of requirements in a precise way. This is possible (or, at least, plausible) if a tool is based on a formal language, rather than a natural language like English and/or a set of pictures. These tools will be discussed later in the term.