CPSC 333 --- Lecture 9 --- Friday, January 26, 1996

Supporting Models: Data Dictionaries

A data dictionary should be used to identify the "type" of each data flow and data store on a (set of) data flow diagram(s) for a system.

As was true for data dictionaries for ERDs, a data dictionary is a listing of definitions for a set of terms. The definitions should be listed in alphabetical order (for the terms being defined).

If both an ERD and a data flow diagram (or set of data flow diagrams) have been prepared for a system, then the data dictionary for the DFD(s) can be an extension of (that is, include) the data dictionary for the ERD.

Consider, for example, Version 1 of the Student Information System. The ERD for this system included a single entity --- "Student" --- with three attributes: ID number (the primary key), name, and status. The data dictionary for the ERD would be:

    ID number = integer
        * key attribute of the entity Student; should be exactly *
        * seven digits long *

    name = string
        * non-key attribute of the entity Student; includes the *
        * student's full name --- first name, additional names *
        * and initials (if known), and last name *

    status = [ "passed" | "registered" ]
        * non-key attribute of the entity Student *

    Student = @ID number + name + status
        * Entity; represents students who are currently *
        * registered in or who have passed the course this *
        * system keeps track of *

The data dictionary for the data flow diagram shown in the Wednesday, January 24 lecture for this system (and, in Postscript form, online in file "lecture_08_dfd.ps"), would extend this dictionary and would include definitions for the following nine terms:

    ID number             (attribute in ERD; data flow in DFD)
    name                  (attribute in ERD; data flow in DFD)
    passed students       (data flow in DFD)
    registered students   (data flow in DFD)
    status                (attribute in ERD; data flow in DFD)
    Student               (entity in ERD)
    student               (data flow in DFD)
    Students              (data store in DFD)
    students              (data flow in DFD)

All of the data
dictionary symbols that were introduced for dictionaries for ERDs are useful for data dictionaries for DFDs. Two additional bits of notation will be useful to define data flows and data stores, and one of these --- notation for definition of a *sequence* --- is needed to construct the above data dictionary:

If "T" has been defined, then {T} can be used (on the right hand side of definitions) to denote a sequence of values of "type T".

This can be used to define data stores that appear on data flow diagrams. Consider, for example, the data store "Students" given in the example data flow diagram; it corresponds to the same table of data as the entity "Student" in the system's ERD, and the data store can be defined as

    Students = { Student }
        * Data store *

The notation can be extended to show upper and lower bounds on the number of values that can occur in the sequence that is being defined. Suppose i and j are non-negative integers such that i is less than or equal to j. Then

    i{ Student }   --- defines a sequence of i or more "Student"'s

    { Student }j   --- defines a sequence of at most j "Student"'s
                       (possibly, a sequence of 0 "Student"'s)

    i{ Student }j  --- defines a sequence of between i and j
                       "Student"'s (inclusive)

If you are typesetting your data dictionary, or writing it by hand, then you can show the lower bound (i) and the upper bound (j) differently --- by making j a superscript of the right brace "}" and by making i a subscript of this brace. Then

               j
    { Student }
               i

would be written or typeset instead of i{Student}j. Please use one or the other of these conventions, but *not both*, in the same data dictionary.
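
The bounded sequence notation can be read as a simple membership test. The following is only a sketch in Python, not part of the data dictionary notation itself: the function names `is_sequence_of` and `is_student`, and the choice to represent a "Student" as a Python dictionary, are assumptions made for illustration.

```python
def is_sequence_of(values, is_member, lower=0, upper=None):
    """Check a list against the definition  lower{ T }upper.

    `is_member` decides whether one value is "of type T".
    `upper=None` means that no upper bound was given.
    """
    if len(values) < lower:
        return False
    if upper is not None and len(values) > upper:
        return False
    return all(is_member(v) for v in values)


def is_student(record):
    """Hypothetical test for  Student = @ID number + name + status."""
    return (
        isinstance(record, dict)
        and isinstance(record.get("ID number"), int)
        and isinstance(record.get("name"), str)
        and record.get("status") in ("passed", "registered")
    )


if __name__ == "__main__":
    ann = {"ID number": 1234567, "name": "Ann Lee", "status": "passed"}
    # Students = { Student } allows the empty sequence.
    print(is_sequence_of([], is_student))
    # 1{ Student }2 requires one or two Students.
    print(is_sequence_of([ann, ann], is_student, lower=1, upper=2))
```

Leaving both bounds at their defaults corresponds to the plain form { Student }.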

A complete data dictionary for the example data flow diagram (and ERD) for Version 1 of the Student Information System could be:

    ID number = integer
        * key attribute of the entity Student in ERD, and data *
        * flow in DFD; should be exactly seven digits long *

    name = string
        * non-key attribute of the entity Student in ERD, and *
        * data flow in DFD; includes the student's full name *
        * --- first name, additional names and initials (if *
        * known), and last name *

    passed students = { student }
        * data flow in DFD; includes all students in the data *
        * store Students for which the status is "passed" *

    registered students = { student }
        * data flow in DFD; includes all students in the data *
        * store Students for which the status is "registered" *

    status = [ "passed" | "registered" ]
        * non-key attribute of the entity Student in ERD, and *
        * data flow in DFD *

    Student = @ID number + name + status
        * Entity in ERD; represents students who are currently *
        * registered in or who have passed the course this *
        * system keeps track of *

    student = Student
        * data flow in DFD; corresponds to a single instance of *
        * the entity Student *

    Students = { Student }
        * data store in DFD corresponding to the entity Student *

    students = { student }
        * data flow in DFD *

Another useful piece of data dictionary notation (which wasn't needed to construct the data dictionary for the previous example) provides a way to indicate "optional" information. This data is enclosed in round brackets --- "(" and ")" --- in a definition.

Suppose, for example, that you wish to define a telephone number that can either be a local number or a long distance number (for use in North America, so that country codes aren't needed). Here is a set of three definitions that could be used to define "telephone number".

    area code = integer
        * should be exactly three digits long; first and third *
        * digit should be between 2 and 9 (inclusive), and second *
        * digit should be either 0 or 1 *

    local telephone number = integer
        * should be exactly seven digits long; first and second *
        * digits should be between 2 and 9 (inclusive) *

    telephone number = ( area code ) + local telephone number

Consistency between Entity Relationship Diagram and Data Flow Diagrams

Both the ERD and the DFDs provide a way to represent the "stored data" that the system must remember to function. The ERD doesn't really show anything else (but it does show "logical relationships" between different pieces of stored data clearly, and can be helpful later on, when you want to design the data stores for the system). The DFD(s) shows "stored data" using data stores.

If the ERD and DFDs are developed "completely independently" then there's a danger that the two models of "stored data" they provide will be contradictory. In order to prevent this, Edward Yourdon (one of the DFD and structured analysis "gurus") has proposed a simple "consistency rule" for ERDs and DFDs, which can be found in his book "Modern Structured Analysis":

Yourdon's Consistency Rule:

  - Every data store in the DFD or set of DFDs for a system must correspond to exactly one entity (but not supertype), relationship, associative object, or weak entity in the ERD, and each entity (but not supertype), relationship, associative object, and weak entity in the ERD should correspond to exactly one data store in the DFD(s)

If you follow this rule then developing data dictionary definitions for data stores is trivial, provided that you've already developed a data dictionary for the ERD: Simply use the "sequence" notation, and then refer to the definition for the entity (or relationship, etc.), as was done to define the data store "Students" from the object "Student" in the above data dictionary.
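
The consistency rule amounts to a one-to-one correspondence, so it can be checked mechanically. Here is a minimal sketch in Python. It assumes (these are illustrative choices, not part of Yourdon's method) that the ERD is given as a set of component names, with supertypes already left out, and that each data store is mapped by hand to the ERD component it is meant to represent, or to `None` if it has none. The function name `check_consistency` is hypothetical.

```python
def check_consistency(erd_components, store_to_component):
    """Return a list of violations of the one-to-one correspondence.

    erd_components     -- set of names of entities, relationships,
                          associative objects and weak entities
    store_to_component -- dict: data store name -> ERD component name
                          (or None when the store has no counterpart)
    """
    problems = []
    for store, component in store_to_component.items():
        if component not in erd_components:
            problems.append(
                f"data store {store!r} has no matching ERD component")

    # Count how many data stores claim each ERD component.
    used = [c for c in store_to_component.values() if c in erd_components]
    for component in sorted(erd_components):
        count = used.count(component)
        if count == 0:
            problems.append(
                f"ERD component {component!r} has no data store")
        elif count > 1:
            problems.append(
                f"ERD component {component!r} has {count} data stores")
    return problems


if __name__ == "__main__":
    # Version 1 of the Student Information System: one entity, one store.
    print(check_consistency({"Student"}, {"Students": "Student"}))
```

An empty list means the rule is satisfied. The check only reports mismatches; deciding whether a reported mismatch is acceptable is still a judgment for the analyst.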

There's a simple "naming convention" that you can use (if you wish to) in order to keep track of the parts of the ERD and DFDs:

  - On the ERDs, use a singular noun or noun phrase, with first letter capitalized, to name each entity, associative object, or weak entity

  - Use the plural form of the same name (again, with the first letter capitalized) as the name of the corresponding data store in the DFDs

  - On the ERDs, use a transitive verb or verb phrase, in lower case, to name a relationship

  - Choose a corresponding noun, and use the plural form of this noun with first letter capitalized, as the name of the corresponding data store on the DFDs

    For example: If you have a relationship called "is registered in" on your ERD, you might call the corresponding data store "Registrations"

If you do all this then every data store in the DFDs will be named by a plural form of a noun or noun phrase with the first letter capitalized. Most (perhaps all) data flows to and from the data store will "represent" either a single instance of the corresponding entity (or relationship, etc.) --- or some sequence of zero or more instances.

  - If a data flow represents a single instance of whatever part of the ERD corresponds to the data store, take the noun or noun phrase you used as the name for the data store. Make it singular, instead of plural, and begin it with a lower case letter (instead of the corresponding capital letter) --- and use that as the name for the data flow.

  - If, instead, the data flow corresponds to a sequence of zero or more of these instances, then use the plural form of the name, again with the first letter in lower case.

  - Try to avoid using the above names for anything else.

This naming convention was used in the above example.
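
The convention can be summarized as a small table of names derived from one noun. The sketch below, in Python, assumes that you supply both the singular and the plural form yourself (forming English plurals automatically is not attempted); the function name `convention_names` and the dictionary keys are made up for this illustration.

```python
def convention_names(singular, plural):
    """Derive the four names the convention uses for one noun.

    For an entity, associative object or weak entity, all four apply.
    For a relationship, the ERD name is the verb phrase you chose
    (for example "is registered in"), so ignore the "ERD name" entry.
    """
    def capitalized(text):
        return text[:1].upper() + text[1:]

    def lower_first(text):
        return text[:1].lower() + text[1:]

    return {
        "ERD name": capitalized(singular),
        "data store": capitalized(plural),
        "single-instance flow": lower_first(singular),
        "sequence flow": lower_first(plural),
    }


if __name__ == "__main__":
    for role, name in convention_names("student", "students").items():
        print(f"{role}: {name}")
```

For the example system this gives Student, Students, student and students, which are the names used in the data dictionary above.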

Exceptions to Yourdon's Consistency Rule:

  - Data stores that correspond to registers instead of tables of data: The entity relationship diagram models stored data that can be represented as a set of data tables; the number of instances of each entity, relationship, etc., in the entity relationship diagram is supposed to be "unbounded" --- or, at least, it should be possible for the number of instances to be quite large.

    However, it may be necessary for asynchronous processes in a system to share access to one (or a very small, fixed number of) piece(s) of information. It wouldn't be appropriate to model this in the ERD, and you'd probably use a "register" rather than a "data base" to maintain this information as part of an implemented system. It certainly *should* be shown on a data flow diagram --- as a "data store" that won't correspond to the same system's entity relationship diagram.

    Please *do* include data stores on DFDs that don't correspond to anything on ERDs, when it's necessary to do so in order to model this kind of information.

  - When entity relationship diagrams and data tables were covered, it was noted that you *don't* always need to have a separate data table for every entity, relationship, associative object, etc., in the ERD. (This wasn't discussed in the lectures, but was covered in one of the handouts distributed when ERDs were being described.) In particular, an entity *and* a relationship, such that each instance of the entity participates in *exactly one* instance of the relationship, can both be "implemented" using a single table.

    It's tempting to represent them using a single data store on the DFDs as well, and this would produce slightly less "cluttered" DFDs and process specifications. On the other hand, it would also (slightly) complicate the process of developing the definition of the data store.
    A definition of the form

        data store = { entity + relationship }

    wouldn't be quite right, because it suggests that each "instance" of the data store should contain two copies of the object's primary key (the "copy" that is part of the instance of the object, and the other "copy" that is part of the instance of the relationship). There is no "standard" (or, even "proposed") data dictionary syntax that deals with this problem. You could deal with this by explicitly naming all the attributes that are to be included, but this would (slightly) complicate the job of keeping the data dictionary consistent and up-to-date, because it would give you one more definition to search for and change every time the definition of the entity (or relationship) for the ERD was modified.

Since the data flow diagrams we'll be dealing with are small, and there's no "standard" way to deal with the resulting minor problems that arise when data stores represent more than one piece of an ERD, I'd prefer that you follow Yourdon's rule, and *not* use one data store to represent more than one piece of the ERD, when working on assignments for CPSC 333.

Leveled Sets of Data Flow Diagrams

In order to make them easy to understand, process specifications should be quite simple --- if you need more than two or three pages in order to write a single process specification down, then you should consider replacing the process that this corresponds to with several simpler processes (that communicate with each other in order to do a job), and then produce process specifications for each of the resulting simpler processes instead.

If you start with any nontrivial system and follow this rule, you will find that the number of processes you need to include will become quite large.

However, it's also true that any single data flow diagram should be easy to understand quickly. It's advisable not to include more than (approximately) *seven* processes and data stores on any single diagram.
Otherwise it will be difficult to read the diagram, and even harder to draw it neatly (and update it as system requirements change).

Therefore, it's necessary to use more than one diagram in order to represent any nontrivial system, if both these rules are to be followed. We will use a hierarchical, "tree" structure to organize the set of data flow diagrams that represent system requirements.

  - A *context diagram* is at the root of the tree. This diagram represents the entire system as a single process. It also shows the "terminators" that represent people and other systems that "our" system must communicate with, and the data flows that pass between the system and these terminators (in each direction).

  - Each "node" in the tree is a data flow diagram. Each "non-root" node is a data flow diagram that "expands" or "refines" *exactly one* of the processes (bubbles) in the data flow diagram at the node's parent.

  - Every process in every data flow diagram in the tree is "refined" or "specified" by either

      - exactly one lower level data flow diagram (at a node that is one of the children of the node containing this process); or,

      - exactly one process specification

    but never by *both* a lower level DFD *and* a process specification

Numbering Scheme:

The context diagram is the only "level 0" diagram in the set. It has a single process that doesn't really need to be numbered. You can give this process the number 0, or you can treat it as the *one* process in the entire set of DFDs that isn't required to have a process number (either is acceptable).

According to the above rules, the root node (containing the context diagram) will have exactly one child, since it included only one process, and that process is almost always too complicated to be described by a three-page process specification. This child is the only "level 1" diagram in the system.
If there are n processes in this diagram (for some positive integer n) then the process numbers used in this diagram should be 1, 2, ..., n. Of course, the process numbers should be distinct, so all these numbers will be used.

Now,

  - Each "child" node/DFD of any "level k" diagram will be considered to be a "level k+1" diagram

  - Suppose a diagram refines process number "x" in a higher level diagram (and that higher level diagram isn't the context diagram, since we've already covered this case). If the lower level diagram includes m processes, for m greater than or equal to 1 (and, for every case I can think of, for m greater than or equal to 2), then these processes should be given the process "numbers" x.1, x.2, ..., x.m.

So, a "process number" is actually a string of positive integers separated by periods. If k is bigger than or equal to one then the processes in a level k diagram will have numbers consisting of k integers (separated by k-1 periods) in the string.

For example, if there is a "process 3" in the level 1 diagram, and it's complicated enough to be refined by a data flow diagram instead of a process specification, then the processes in this lower level ("refining") data flow diagram will have numbers 3.1, 3.2, 3.3, 3.4, etc. If process 3.2 is refined by a data flow diagram then the processes in *that* diagram will have numbers 3.2.1, 3.2.2, 3.2.3, 3.2.4, ... and so on.

It's almost always necessary to give a "linear" order to the diagrams as well, because it's necessary to list the diagrams, one after another in some order, in a requirements specification. Therefore, we will also assign "Figure numbers" to the data flow diagrams in a leveled set (or "tree"). If there are m data flow diagrams in the set, then they will be given figure numbers 1, 2, ..., m.
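
The rule for process numbers is easy to state as code. The following Python sketch generates the process numbers for a refining diagram and recovers the level of a diagram from one of its process numbers. The function names `child_numbers` and `level_of` are assumptions made for illustration, as is the use of `None` to stand for the single (unnumbered) process in the context diagram.

```python
def child_numbers(parent, m):
    """Process numbers for the m processes in the diagram refining `parent`.

    `parent` is a process number such as "3" or "3.2", or None for
    the single process in the context diagram.
    """
    if m < 1:
        raise ValueError("a refining diagram needs at least one process")
    prefix = "" if parent is None else parent + "."
    return [f"{prefix}{i}" for i in range(1, m + 1)]


def level_of(process_number):
    """Level of the diagram that contains this process.

    A level k diagram uses numbers made of k integers
    separated by k-1 periods.
    """
    return len(process_number.split("."))


if __name__ == "__main__":
    print(child_numbers(None, 3))   # the level 1 diagram
    print(child_numbers("3", 4))    # the diagram refining process 3
    print(level_of("3.2.1"))
```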

I recommend the following scheme for assigning figure numbers:

  - The context diagram is "Figure 1"

  - Its (only) child is "Figure 2"

  - For k bigger than or equal to 0, include (by giving low figure numbers to) all the level k diagrams, before including (by numbering) any of the level k+1 diagrams

  - For data flow diagrams at the same level use a dictionary-like ordering in order to choose figure numbers. In particular, if one diagram refines process number a1.a2. ... .ak (for integers a1, a2, down to ak) and the other refines process number b1.b2. ... .bk, then

      - give the diagram for process number a1.a2. ... .ak the first (lower) figure number if a1 is strictly less than b1

      - give the diagram for process number b1.b2. ... .bk the first (lower) figure number if b1 is strictly less than a1

      - otherwise --- when a1 equals b1 --- apply this method "recursively," considering the numbers a2 ... ak and b2 ... bk, in order to decide which of the two diagrams gets the lower figure number

Note that, if you *did* arrange the diagrams in a tree with the context diagram at the root, then this would be the same as starting at the top (with the context diagram) and then moving down, numbering all the figures at one level before moving down to the next, and numbering all the diagrams in the same level in left-to-right order as you see them in the tree --- a "breadth first traversal" of the tree.

Conservation of Flow --- This is *much* more important than the numbering scheme, and in CPSC 333, no exceptions to the following rule will be allowed.

For all integers k, the set of input data flows and output data flows going into and out of a process X in a level k diagram must be *exactly* the same as the inputs coming into, and outputs going out of, the processes in the lower level diagram that refines X.

That is:

  - for every data flow into process X, there should be one or more data flows with the same name, coming from "the outside world" (or the rest of the system) into one or more of the processes in the lower level diagram that refines X

  - if there are two (or more) input flows coming into process X that have the same name then they will all have different "sources" (since you shouldn't *have* two or more data flows, with the same name, from the same source and to the same destination). In this case, each "data flow with name N from source S" into the process X should correspond to one or more data flows into the processes in the lower level diagram (that refines X), with the same name N *and* from the same source S.

    You shouldn't actually *draw* S on the lower level diagram, but it would be a good idea to write "from S" at the end of each arrow in the lower level diagram that corresponds to this data flow, so that it's clear precisely which incoming data flows in the lower level diagram correspond to which inputs coming into process X

  - the same rules apply to flows going out of X in the higher level diagram and the flows going from processes in the lower level diagram (that refines X) to "the outside world"

One amendment is necessary: You *might* have more than one data flow from process X that has the same name and goes to the same destination --- if that destination is a data store, and the process does more than one of the jobs of creating, modifying, and deleting instances of whatever the data store contains. In this case the two (or three) outgoing flows would all be "decorated" differently: You wouldn't have two (or more) of these for creation of the same data to the same store, etc. Treat each kind of data flow with the same name and to the same data store separately (just as you treated incoming data flows with the same name but from different sources separately, as described above).
Then (if you had enough time, as well as paper or blackboard space!) you could take the entire set of leveled data flow diagrams, start with the context diagram, and repeatedly "cut and paste", replacing a process with the lower level data flow diagram that refines it (and matching up the data flows using the information described above), until there were no more lower level data flow diagrams to include. At that point you'd have one *huge* data flow diagram representing the entire system, such that every process shown in the diagram had a process specification corresponding to it. Of course, the leveled set of (simple) data flow diagrams is *much* easier to read (and maintain) than the single huge one would be.
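
Conservation of flow can also be checked mechanically, at least in a simplified form. The Python sketch below compares the flows that cross the boundary of a process X in the higher level diagram with the flows that cross the boundary of the diagram refining X. It is only an illustration under several assumptions: flows are represented as (name, source, destination) triples, the names `Flow`, `boundary` and `conserves_flow` are made up, the terminator "Registrar" and the flow "valid student" in the example are hypothetical, and the "decorations" that distinguish creation, modification and deletion flows are ignored.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    name: str
    source: str
    destination: str


def boundary(flows, internal):
    """Flows that cross into or out of the set of `internal` nodes.

    Inputs are recorded as (name, source) pairs and outputs as
    (name, destination) pairs, so several lower level flows with the
    same name and the same outside end count as one.
    """
    inputs = {(f.name, f.source) for f in flows
              if f.source not in internal and f.destination in internal}
    outputs = {(f.name, f.destination) for f in flows
               if f.source in internal and f.destination not in internal}
    return inputs, outputs


def conserves_flow(parent_flows, process, child_flows, child_nodes):
    """True if the refining diagram has exactly the inputs and outputs
    of `process`. `child_nodes` are the processes (and any data stores
    local to the lower level diagram) drawn in the refining diagram."""
    return (boundary(parent_flows, {process})
            == boundary(child_flows, set(child_nodes)))


if __name__ == "__main__":
    parent = [Flow("student", "Registrar", "3"),
              Flow("student", "3", "Students")]
    child = [Flow("student", "Registrar", "3.1"),
             Flow("valid student", "3.1", "3.2"),
             Flow("student", "3.2", "Students")]
    print(conserves_flow(parent, "3", child, {"3.1", "3.2"}))
```

Recording the outside end of each flow plays the same role as writing "from S" at the end of an arrow in the lower level diagram.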