CPSC 333 --- Lecture 9 --- Friday, January 26, 1996

Supporting Models: Data Dictionaries

A data dictionary should be used to identify the "type" of each data
flow and data store on a (set of) data flow diagram(s) for a system.

As was true for data dictionaries for ERDs, a data dictionary is a
listing of definitions for a set of terms. The definitions should be
listed in alphabetical order (for the terms being defined).

If both an ERD and a data flow diagram (or set of data flow diagrams)
have been prepared for a system, then the data dictionary for the
DFD(s) can be an extension of (that is, include) the data dictionary
for the ERD.

Consider, for example, Version 1 of the Student Information
System. The ERD for this system included a single entity --- "Student"
--- with three attributes: ID number (the primary key), name, and
status. The data dictionary for the ERD would be:

  ID number = integer
  * key attribute of the entity Student; should be exactly *
  * seven digits long                                      *

  name = string
  * non-key attribute of the entity Student; includes the  *
  * student's full name --- first name, additional names   *
  * and initials (if known), and last name                 *

  status = [ "passed" | "registered" ]
  * non-key attribute of the entity Student                *

  Student = @ID number + name + status
  * Entity; represents students who are currently          *
  * registered in or who have passed the course this       *
  * system keeps track of                                  *

The data dictionary for the data flow diagram shown in the Wednesday,
January 24 lecture for this system (and, in Postscript form, online
in file "lecture_08_dfd.ps"), would extend this dictionary and
would include definitions for the following nine terms:

  ID number           (attribute in ERD; data flow in DFD)
  name                (attribute in ERD; data flow in DFD)
  passed students     (data flow in DFD)
  registered students (data flow in DFD)
  status              (attribute in ERD; data flow in DFD)
  Student             (entity in ERD)
  student             (data flow in DFD)
  Students            (data store in DFD)
  students            (data flow in DFD)

All of the data dictionary symbols that were introduced for
dictionaries for ERDs are useful for data dictionaries for DFDs. Two
additional bits of notation will be useful to define data flows and
data stores, and one of these --- notation for definition of a
*sequence* --- is needed to construct the above data dictionary: 

 If "T" has been defined, then {T} can be used (on the right
 hand side of definitions) to denote a sequence of values of "type T".

This can be used to define data stores that appear on data flow
diagrams. Consider, for example, the data store "Students" given in
the example data flow diagram; it corresponds to the same table of
data as the entity "Student" in the system's ERD, and the data store
can be defined as

  Students = { Student }
  * Data store           *

The notation can be extended to show upper and lower bounds on the
number of values that can occur in the sequence that is being
defined. Suppose i and j are non-negative integers such that i is less
than or equal to j. Then

   i{ Student }   --- defines a sequence of i or more "Student"'s
    { Student }j  --- defines a sequence of at most j "Student"'s
                       (possibly, a sequence of 0 "Student"'s)
   i{ Student }j  --- defines a sequence of between i and j
                       "Student"'s (inclusive)

If you are typesetting your data dictionary, or writing it by hand,
then you can show the lower bound (i) and the upper bound (j)
differently --- by making j a superscript of the right brace "}" and
by making i a subscript of this brace. Then

      _                               _  j
     |    __                           |
      \  |     |        |  _  _   |   /
      /   --  -+- | |  -| |_ | | -+-  \
     |      |  |  | | | | |_ | |  |    |
      -   --       -   -              -  i

would be written or typeset instead of i{Student}j.

Please use one or the other of these conventions, but *not both*, in the
same data dictionary.
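
The bounded sequence notation i{T}j can be read as a simple check on
the length and contents of a list. Here is a minimal sketch in Python;
the function name "is_bounded_sequence" and its parameters are made up
for this illustration and are not part of the data dictionary notation.

```python
from typing import Callable, Optional, Sequence


def is_bounded_sequence(
    values: Sequence[object],
    is_member: Callable[[object], bool],
    lower: int = 0,
    upper: Optional[int] = None,
) -> bool:
    """Check values against the dictionary notation lower{T}upper.

    The defaults (lower=0, no upper bound) correspond to plain {T}.
    is_member is a hypothetical test for "is a value of type T".
    """
    if lower < 0 or (upper is not None and upper < lower):
        raise ValueError("bounds must satisfy 0 <= lower <= upper")
    if len(values) < lower:
        return False
    if upper is not None and len(values) > upper:
        return False
    return all(is_member(v) for v in values)


def is_name(value: object) -> bool:
    # "name = string" in the example dictionary
    return isinstance(value, str)
```

For example, is_bounded_sequence(names, is_name, lower=1, upper=2)
plays the role of 1{ name }2, and leaving out both bounds gives
{ name }.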

A complete data dictionary for the example data flow diagram (and ERD)
for Version 1 of the Student Information System could be:

 ID number = integer
 * key attribute of the entity Student in ERD, and data    *
 * flow in DFD; should be exactly seven digits long        *

 name = string
 * non-key attribute of the entity Student in ERD, and     *
 * data flow in DFD; includes the student's full name      *
 * --- first name, additional names and initials (if       *
 * known), and last name                                   *

 passed students = { student }
 * data flow in DFD; includes all students in the data     *
 * store Students for which the status is "passed"         *

 registered students = { student }
 * data flow in DFD; includes all students in the data     *
 * store Students for which the status is "registered"     *

 status = [ "passed" | "registered" ]
 * non-key attribute of the entity Student in ERD, and     *
 * data flow in DFD                                        *

 Student = @ID number + name + status
 * Entity in ERD; represents students who are currently    *
 * registered in or who have passed the course this        *
 * system keeps track of                                   *

 student = Student
 * data flow in DFD; corresponds to a single instance of   *
 * the entity Student                                      *

 Students = { Student }
 * data store in DFD corresponding to the entity Student   *

 students = { student }
 * data flow in DFD                                        *
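
One way to see what these definitions say is to write them down as
types. The following Python sketch is only an illustration: the class
and function names are invented here, and it assumes that "seven digits
long" means an integer with no leading zeros.

```python
from dataclasses import dataclass
from typing import List

# status = [ "passed" | "registered" ]
STATUSES = ("passed", "registered")


@dataclass(frozen=True)
class Student:
    """Student = @ID number + name + status"""

    id_number: int  # key attribute
    name: str
    status: str

    def __post_init__(self) -> None:
        # Assumption: a seven-digit ID number has no leading zeros.
        if not 1_000_000 <= self.id_number <= 9_999_999:
            raise ValueError("ID number should be exactly seven digits long")
        if self.status not in STATUSES:
            raise ValueError("status should be 'passed' or 'registered'")


def passed_students(students: List[Student]) -> List[Student]:
    """passed students = { student }, restricted to status "passed"."""
    return [s for s in students if s.status == "passed"]


def registered_students(students: List[Student]) -> List[Student]:
    """registered students = { student }, restricted to "registered"."""
    return [s for s in students if s.status == "registered"]
```

Here a plain list of Student values stands in for the data store
"Students" (and for the data flow "students").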


Another useful piece of data dictionary notation (which wasn't needed
to construct the data dictionary for the previous example) provides a
way to indicate "optional" information. This data is enclosed in round
brackets --- "(" and ")" --- in a definition.

Suppose, for example, that you wish to define a telephone number that
can either be a local number or a long distance number (for use in
North America, so that country codes aren't needed). Here is a set of
three definitions that could be used to define "telephone number".

 area code = integer
 * should be exactly three digits long; first and third    *
 * digit should be between 2 and 9 (inclusive), and second *
 * digit should be either 0 or 1                           *

 local telephone number = integer
 * should be exactly seven digits long; first and second   *
 * digits should be between 2 and 9 (inclusive)            *

 telephone number = ( area code ) + local telephone number
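
The optional part of this definition maps naturally onto an argument
that may be left out. The sketch below follows the three definitions
above as written; the function names are hypothetical, and the numbers
are held as digit strings (an assumption made so that each digit is
easy to inspect).

```python
from typing import Optional

DIGITS = set("0123456789")
TWO_TO_NINE = set("23456789")


def is_area_code(digits: str) -> bool:
    """Three digits; first and third in 2..9, second is 0 or 1."""
    return (
        len(digits) == 3
        and set(digits) <= DIGITS
        and digits[0] in TWO_TO_NINE
        and digits[1] in "01"
        and digits[2] in TWO_TO_NINE
    )


def is_local_number(digits: str) -> bool:
    """Seven digits; first and second in 2..9."""
    return (
        len(digits) == 7
        and set(digits) <= DIGITS
        and digits[0] in TWO_TO_NINE
        and digits[1] in TWO_TO_NINE
    )


def is_telephone_number(local: str, area_code: Optional[str] = None) -> bool:
    """telephone number = ( area code ) + local telephone number"""
    if area_code is not None and not is_area_code(area_code):
        return False
    return is_local_number(local)
```

Leaving out area_code corresponds to omitting the bracketed part of
the definition.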


Consistency between Entity Relationship Diagram and Data Flow Diagrams

Both the ERD and the DFDs provide a way to represent the "stored
data" that the system must remember to function. The ERD doesn't
really show anything else (but it does show "logical relationships"
between different pieces of stored data clearly, and can be helpful
later on, when you want to design the data stores for the system). The
DFD(s) show "stored data" using data stores.

If the ERD and DFDs are developed "completely independently" then
there's a danger that the two models of "stored data" they provide
will be contradictory.

In order to prevent this, Edward Yourdon (one of the DFD and
structured analysis "gurus") has proposed a simple "consistency rule"
for ERDs and DFDs, which can be found in his book "Modern Structured
Analysis":

 Yourdon's Consistency Rule:
 - Every data store in the DFD or set of DFDs for a system must
   correspond to exactly one entity (but not supertype), relationship,
   associative object, or weak entity in the ERD, and each entity
   (but not supertype), relationship, associative object, and weak
   entity in the ERD should correspond to exactly one data store
   in the DFD(s)
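
The rule asks for a one-to-one correspondence, which is easy to check
mechanically once you have listed the pieces. In this sketch (all names
are hypothetical) the ERD is given as a list of component names, with
supertypes already left out, and each data store is mapped to the
component it is meant to represent.

```python
from collections import Counter
from typing import Dict, Iterable, List


def consistency_violations(
    erd_components: Iterable[str],
    store_to_component: Dict[str, str],
) -> List[str]:
    """List the ways a set of DFDs breaks the one-to-one rule."""
    components = set(erd_components)
    problems = []
    # Every data store must correspond to something in the ERD.
    for store, component in sorted(store_to_component.items()):
        if component not in components:
            problems.append(f"data store {store!r} matches nothing in the ERD")
    # Every ERD component must correspond to exactly one data store.
    usage = Counter(store_to_component.values())
    for component in sorted(components):
        if usage[component] == 0:
            problems.append(f"ERD component {component!r} has no data store")
        elif usage[component] > 1:
            problems.append(
                f"ERD component {component!r} has {usage[component]} data stores"
            )
    return problems
```

An empty result means that the two models agree about which pieces of
stored data exist.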

If you follow this rule then developing data dictionary definitions
for data stores is trivial, provided that you've already developed a
data dictionary for the ERD: Simply use the "sequence" notation, and
then refer to the definition for the entity (or relationship, etc.), as
was done to define the data store "Students" from the object "Student"
in the above data dictionary.

There's a simple "naming convention" that you can use (if you wish to)
in order to keep track of the parts of the ERD and DFDs:

 - On the ERDs, use a singular noun or noun phrase, with first letter
   capitalized, to name each entity, associative object, or weak
   entity

 - Use the plural form of the same name (again, with the first letter
   capitalized) as the name of the corresponding data store in the
   DFDs

 - On the ERDs, use a transitive verb or verb phrase, in lower case,
   to name a relationship

 - Choose a corresponding noun, and use the plural form of this
   noun with first letter capitalized, as the name of the
   corresponding data store on the DFDs

   For example: If you have a relationship called "is registered in"
   on your ERD, you might call the corresponding data store
   "Registrations"

 If you do all this then every data store in the DFDs will be named
 by a plural form of a noun or noun phrase with the first letter
 capitalized.

 Most (perhaps all) data flows to and from the data store will
 "represent" either a single instance of the corresponding entity
 (or relationship, etc.) --- or some sequence of zero or more
 instances.

 - If a data flow represents a single instance of whatever part
   of the ERD corresponds to the data store, take the noun or
   noun phrase you used as the name for the data store. Make it
   singular, instead of plural, and begin it with a lower case
   letter (instead of the corresponding capital letter) --- and
   use that as the name for the data flow.

 - If, instead, the data flow corresponds to a sequence of zero
   or more of these instances, then use the plural form of the
   name, again with the first letter in lower case.

 - Try to avoid using the above names for anything else.

This naming convention was used in the above example.
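
The convention is mechanical enough to write down as a small function.
This is a rough sketch only: the function names are invented, and the
default plural (just adding "s") is an assumption that fails for many
English nouns, so the plural form can be supplied explicitly.

```python
from typing import Dict, Optional


def lower_first(name: str) -> str:
    """Return name with its first letter in lower case."""
    return name[:1].lower() + name[1:]


def names_for(singular: str, plural: Optional[str] = None) -> Dict[str, str]:
    """Derive DFD names from a capitalized singular noun (phrase).

    For a relationship, pass the noun you chose for it, for example
    "Registration" for the relationship "is registered in".
    """
    if plural is None:
        plural = singular + "s"  # naive default; override when it is wrong
    return {
        "data store": plural,
        "single-instance flow": lower_first(singular),
        "sequence flow": lower_first(plural),
    }
```

For the entity "Student" this gives the data store "Students" and the
data flows "student" and "students", as in the example dictionary.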


Exceptions to Yourdon's Consistency Rule:

 - Data stores that correspond to registers instead of tables of data:
   The entity relationship diagram models stored data that can be
   represented as a set of data tables; the number of instances of
   each entity, relationship, etc., in the entity relationship diagram
   is supposed to be "unbounded" --- or, at least, it should be
   possible for the number of instances to be quite large.

   However, it may be necessary for asynchronous processes in a system
   to share access to one (or a very small, fixed number of) piece(s)
   of information. It wouldn't be appropriate to model this in the
   ERD, and you'd probably use a "register" rather than a "data base" to
   maintain this information as part of an implemented system. It
   certainly *should* be shown on a data flow diagram --- as a "data
   store" that won't correspond to the same system's entity
   relationship diagram.

   Please *do* include data stores on DFDs that don't correspond to
   anything on ERDs, when it's necessary to do so in order to model
   this kind of information.

 - When entity relationship diagrams and data tables were covered, it
   was noted that you *don't* always need to have a separate data
   table for every entity, relationship, associative object, etc.,
   in the ERD. (This wasn't discussed in the lectures, but was covered
   in one of the handouts distributed when ERDs were being described.)
   In particular, an entity *and* a relationship, such that
   each instance of the entity participates in *exactly one* instance
   of the relationship, can both be "implemented" using a single
   table. It's tempting to represent them using a single data store
   on the DFDs as well, and this would produce slightly less
   "cluttered" DFDs and process specifications.

   On the other hand, it would also (slightly) complicate the process
   of developing the definition of the data store. A definition of
   the form

             <data store> = { <object> + <relationship> }

   wouldn't be quite right, because it suggests that each "instance"
   of the data store should contain two copies of the object's
   primary key (the "copy" that is part of the instance of the
   object, and the other "copy" that is part of the instance of
   the relationship). There is no "standard" (or, even "proposed")
   data dictionary syntax that deals with this problem.

   You could deal with this by explicitly naming all the attributes
   that are to be included, but this would (slightly) complicate the
   job of keeping the data dictionary consistent and up-to-date,
   because it would give you one more definition to search for and
   change every time the definition of the entity (or relationship)
   for the ERD was modified.

   Since the data flow diagrams we'll be dealing with are small,
   and there's no "standard" way to deal with the resulting
   minor problems that arise when data stores represent more
   than one piece of an ERD, I'd prefer that you follow Yourdon's
   rule, and *not* use one data store to represent more than one piece
   of the ERD, when working on assignments for CPSC 333.


Leveled Sets of Data Flow Diagrams

In order to make them easy to understand, process specifications
should be quite simple --- if you need more than two or three pages in
order to write a single process specification down, then you should
consider replacing the process that this corresponds to with several
simpler processes (that communicate with each other in order to do a
job), and then produce process specifications for each of the
resulting simpler processes instead.

If you start with any nontrivial system and follow this rule, you will
find that the number of processes you need to include will become
quite large.

However, it's also true that any single data flow diagram should be
easy to understand quickly. It's advisable not to include more than
(approximately) *seven* processes and data stores on any single
diagram. Otherwise it will be difficult to read the diagram, and even
harder to draw it neatly (and update it as system requirements
change).

Therefore, it's necessary to use more than one diagram in order to
represent any nontrivial system, if both these rules are to be
followed.

We will use a hierarchical, "tree" structure to organize the set of
data flow diagrams that represent system requirements.

 - A *context diagram* is at the root of the tree. This diagram
   represents the entire system as a single process. It also shows
   the "terminators" that represent people and other systems that
   "our" system must communicate with, and the data flows that pass
   between the system and these terminators (in each direction).

 - Each "node" in the tree is a data flow diagram. Each "non-root"
   node is a data flow diagram that "expands" or "refines" *exactly
   one* of the processes (bubbles) that are in the data flow diagram
   at the node's parent.

 - Every process in every data flow diagram in the tree is "refined"
   or "specified" by either
    - exactly one lower level data flow diagram (at a node that is one
      of the children of the node containing this process); or,
    - exactly one process specification
   but never by *both* a lower level DFD *and* a process specification
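
The last rule ("exactly one of the two") can be checked by a short
script if you keep a list of your processes. A minimal sketch, with
hypothetical names, where processes are identified by their process
numbers written as strings:

```python
from typing import Dict, Iterable, Set


def refinement_problems(
    processes: Iterable[str],
    refined_by_dfd: Set[str],
    has_process_spec: Set[str],
) -> Dict[str, str]:
    """Report processes that have both kinds of refinement, or neither."""
    problems = {}
    for process in processes:
        has_dfd = process in refined_by_dfd
        has_spec = process in has_process_spec
        if has_dfd and has_spec:
            problems[process] = "both"
        elif not has_dfd and not has_spec:
            problems[process] = "neither"
    return problems
```

An empty dictionary means that every process listed is refined by
exactly one lower level diagram or exactly one process specification.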


Numbering Scheme:

The context diagram is the only "level 0" diagram in the set. It has
a single process that doesn't really need to be numbered. You can
give this process the number 0, or you can treat it as the *one*
process in the entire set of DFDs that isn't required to have a
process number (either is acceptable).

According to the above rules, the root node (containing the context
diagram) will have exactly one child, since it includes only one
process, and that process is almost always too complicated to be
described by a three-page process specification.

This child is the only "level 1" diagram in the system. If there are n
processes in this diagram (for some positive integer n) then the process
numbers used in this diagram should be 1, 2, ..., n. Of course, the
process numbers should be distinct, so all these numbers will be used.

Now,

 - Each "child" node/DFD of any "level k" diagram will be considered
   to be a "level k+1" diagram

 - Suppose a diagram refines process number "x" in a higher level
   diagram (and that higher level diagram isn't the context diagram,
   since we've already covered this case). If the lower level diagram
   includes m processes, for m greater than or equal to 1 (and, for
   every case I can think of, for m greater than or equal to 2), then
   these processes should be given the process "numbers"
   x.1, x.2, ..., x.m.

   So, a "process number" is actually a string of positive integers
   separated by periods. If k is bigger than or equal to one then
   the processes in a level k diagram will have numbers consisting
   of k integers (separated by k-1 periods) in the string.

   For example, if there is a "process 3" in the level 1 diagram,
   and it's complicated enough to be refined by a data flow diagram
   instead of a process specification, then the processes in this
   lower level ("refining") data flow diagram will have numbers
   3.1, 3.2, 3.3, 3.4, etc. If process 3.2 is refined by a data flow
   diagram then the processes in *that* diagram will have numbers
   3.2.1, 3.2.2, 3.2.3, 3.2.4, ... and so on.
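
The numbering scheme can be summarized in a few lines of Python. This
is a sketch with made-up function names; process numbers are held as
strings, and None stands for the (unnumbered) context diagram process.

```python
from typing import List, Optional


def child_process_numbers(parent: Optional[str], count: int) -> List[str]:
    """Numbers for the processes in the diagram that refines parent."""
    if count < 1:
        raise ValueError("a diagram needs at least one process")
    prefix = "" if parent is None else parent + "."
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def diagram_level(process_number: str) -> int:
    """Level of the diagram that contains the given process."""
    return len(process_number.split("."))
```

So the processes refining process 3.2 are numbered 3.2.1, 3.2.2, and
so on, and each of them sits in a level 3 diagram.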

It's almost always necessary to give a "linear" order to the diagrams
as well, because it's necessary to list the diagrams, one after
another in some order, in a requirements specification. Therefore, we
will also assign "Figure numbers" to the data flow diagrams in a
leveled set (or "tree"). If there are m data flow diagrams in the set,
then they will be given figure numbers 1, 2, ..., m.

I recommend the following scheme for assigning figure numbers:

 - The context diagram is "Figure 1"
 - Its (only) child is "Figure 2"
 - For k bigger than or equal to 0, include (by giving low figure
   numbers to) all the level k diagrams, before including (by
   numbering) any of the level k+1 diagrams
 - For data flow diagrams at the same level use a dictionary-
   like ordering in order to choose figure numbers. In particular,
   if one diagram refines process number a1.a2. ... .ak (for
   integers a1, a2, down to ak) and the other refines process number
   b1.b2. ... .bk, then

    - give the diagram for process number a1.a2. ... .ak the first
      (lower) figure number if a1 is strictly less than b1

    - give the diagram for process number b1.b2. ... .bk the first
      (lower) figure number if b1 is strictly less than a1

    - otherwise --- when a1 equals b1 --- apply this method
      "recursively," considering the numbers a2 ... ak and
      b2 ... bk, in order to decide which of the two diagrams
      gets the lower figure number

Note that, if you *did* arrange the diagrams in a tree with the
context diagram at the root, then this would be the same as starting
at the top (with the context diagram) and then moving down, numbering
all the figures at one level before moving down to the next, and
numbering the diagrams in the same level in left-to-right order as you
see them in the tree --- a "breadth first traversal" of the tree.
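
The recommended figure numbering amounts to sorting the refined
processes first by the length of their process numbers (which gives the
level) and then by the integers themselves. Here is a minimal sketch;
the function name and the labels it produces are invented for this
illustration.

```python
from typing import Dict, Iterable


def assign_figure_numbers(refined_processes: Iterable[str]) -> Dict[str, int]:
    """Map each diagram to a figure number.

    refined_processes lists the numbers of the processes that are
    refined by a lower level diagram (not by a process specification).
    """
    figures = {"context diagram": 1, "level 1 diagram": 2}
    keys = sorted(
        (tuple(int(part) for part in number.split("."))
         for number in refined_processes),
        # Shorter numbers (higher level diagrams) first, then compare
        # the integers position by position.
        key=lambda parts: (len(parts), parts),
    )
    for offset, parts in enumerate(keys):
        label = "refinement of process " + ".".join(str(p) for p in parts)
        figures[label] = 3 + offset
    return figures
```

The parts are compared as integers rather than as text, so that the
diagram for process 10 comes after the diagram for process 3.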


Conservation of Flow --- This is *much* more important than the
numbering scheme, and in CPSC 333, no exceptions to the following rule
will be allowed.

For all integers k, the set of input data flows and output data flows
going into and out of a process X in a level k diagram must be *exactly*
the same as the inputs coming into, and outputs going out of, the
processes in the lower level diagram that refines X.

That is: 

 - for every data flow into process X, there should be one or more
   data flows with the same name, coming from "the outside world" (or
   the rest of the system) into one or more of the processes in the
   lower level diagram that refines X

 - if there are two (or more) input flows coming into process X that
   have the same name then they will all have different "sources"
   (since you shouldn't *have* two or more data flows, with the same
   name, from the same source and to the same destination).

   In this case, each "data flow with name N from source S" into
   the process X should correspond to one or more data flows into
   the processes in the lower level diagram (that refines X), with
   the same name N *and* from the same source S. You shouldn't actually
   *draw* S on the lower level diagram, but it would be a good idea
   to write "from S" at the end of each arrow in the lower level
   diagram that corresponds to this data flow, so that it's clear
   precisely which incoming data flows in the lower level diagram
   correspond to which inputs coming into process X

 - the same rules apply to flows going out of X in the higher level
   diagram and going from processes in the lower level diagram to
   "the outside world" in the lower level diagram that refines X

   One amendment is necessary: You *might* have more than one data
   flow from process X that has the same name and goes to the same
   destination --- if that destination is a data store, and the
   process does more than one of the jobs of creating, modifying,
   and deleting instances of whatever the data store contains. In this
   case the two (or three) outgoing flows would all be "decorated"
   differently: You wouldn't have two (or more) of these for creation
   of the same data to the same store, etc. Treat each kind of data flow
   with the same name and to the same data store separately
   (just as you treated incoming data flows with the same name but
   from different sources separately, as described above).
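
Conservation of flow is a comparison between two sets, so it can be
checked mechanically. In the sketch below (all names are hypothetical)
a flow is described by its direction relative to process X, its name,
the source or destination at its other end, and an optional
"decoration" for flows to a data store.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Flow(NamedTuple):
    direction: str        # "in" or "out", relative to process X
    name: str
    endpoint: str         # the source (for "in") or destination (for "out")
    decoration: str = ""  # e.g. "create", "modify" or "delete"


def balance_problems(
    parent_flows: Iterable[Flow],
    child_external_flows: Iterable[Flow],
) -> Tuple[List[Flow], List[Flow]]:
    """Compare the flows of X with those of the diagram that refines X.

    Returns (flows missing from the lower level diagram, flows that
    appear only in the lower level diagram). Several lower level flows
    may match the same higher level flow, so sets are compared.
    """
    parent = set(parent_flows)
    child = set(child_external_flows)
    return sorted(parent - child), sorted(child - parent)
```

Both lists are empty exactly when the two diagrams are balanced in the
sense described above.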

Then (if you had enough time, as well as paper or blackboard space!)
you could take the entire set of leveled data flow diagrams, start
with the context diagram, and repeatedly "cut and paste", replacing a
process with the lower level data flow diagram that refines it (and
matching up the data flows using the information described above),
until there were no more lower level data flow diagrams to include.
At that point you'd have one *huge* data flow diagram representing the
entire system, such that every process shown in the diagram had a
process specification corresponding to it.

Of course, the leveled set of (simple) data flow diagrams is *much*
easier to read (and maintain) than the single huge one would be.