CPSC 333 --- Lecture 8 --- Wednesday, January 24, 1996

Data Flow Diagrams and Structured Analysis

References: Pressman's "Beginner's Guide," pp. 74--82
            Pressman's "Practitioner's Guide," Chapter 7

Note: I won't be following either of these very closely.

Recall that, when principles of requirements analysis were discussed, three
"views" of requirements that could be analysed and modeled separately were
introduced: information structure, information content, and information flow.
Entity relationship diagrams and data dictionaries can be used to model
information structure and information content, respectively. Data flow
diagrams are useful for modeling information flow --- the way data is
transformed as it moves through a system.

Supporting models that augment data flow diagrams:
- Process Specifications
- Data Dictionaries

We'll start with "regular" data flow diagrams, which don't include "control
information;" *augmented data flow diagrams,* which also represent control
information, will be introduced later.

Components of Data Flow Diagrams:

1) Processes
   - receive inputs and generate outputs
   - have no memory
   - represented using *circles* on data flow diagrams, which are labeled by
     a *process number* and a *process name*

2) Data Flows
   - data transferred between processes, stores, and terminators
   - represented using *arrows* on data flow diagrams, which are labeled by
     a *flow name*

3) Data Stores
   - *internal* data storage areas
   - represented using *rectangles with the right side missing,* labeled by
     a *store name*

4) Terminators
   - *external* people or systems that the system being modeled communicates
     with, by receiving input or returning output
   - represented using *rectangles,* labeled by a *terminator name*

Valid and Invalid Data Flows:

- Since data stores are "passive," it doesn't make sense to have a data flow
  directly from one data store to another --- neither could initiate the
  communication.

- Since data stores maintain data in a logical "internal" format, while
  terminators expect an "external" format, it also doesn't make sense to
  have a data flow directly between a data store and a terminator (in either
  direction) --- at the least, a "format conversion" process would have to
  come between.

- Since the only communication we should model is communication the system
  could be expected to know about (and react to), and since there's no way
  the system could be expected to know about communication between two
  terminators, it doesn't make sense to include a data flow from one
  terminator to another (or to itself), either.

- Since processes have no memory, it *also* doesn't make sense to have data
  flows between two *asynchronous* processes --- that is, processes that
  aren't guaranteed to be active at the same time (perhaps because they
  respond to different external events).

Therefore, it *only* makes sense to have data flows
- between a process and a terminator (in either direction),
- between a process and a data store (in either direction), or
- between two processes that *are* guaranteed to be running at the same
  time, or where the data flow from the "first" process "activates" the
  second process.
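Since these rules depend only on what kinds of components the two endpoints
of a flow are, they can be checked mechanically. The following is a minimal
sketch in Python (not part of the course material); the kind names and the
function name are illustrative assumptions, not standard DFD terminology.

    # A minimal sketch of the "valid data flow" rules above, assuming
    # each DFD component is tagged with one of three kinds. The names
    # PROCESS, STORE, TERMINATOR, and flow_is_valid are illustrative.

    PROCESS, STORE, TERMINATOR = "process", "data store", "terminator"

    def flow_is_valid(source_kind, target_kind):
        """A data flow is valid only if at least one endpoint is a
        process: store-to-store, store/terminator, and terminator-to-
        terminator flows are all ruled out by the arguments above."""
        if source_kind == PROCESS or target_kind == PROCESS:
            # Note: a process-to-process flow is valid only if the two
            # processes run at the same time, or the flow "activates"
            # the second process; that condition isn't visible from the
            # component kinds alone, so it isn't checked here.
            return True
        return False

    assert flow_is_valid(PROCESS, STORE)         # process writes a store
    assert flow_is_valid(TERMINATOR, PROCESS)    # terminator supplies input
    assert not flow_is_valid(STORE, TERMINATOR)  # needs a conversion process
    assert not flow_is_valid(STORE, STORE)       # stores are passive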
Data Flows that Modify Data Stores

While this notation isn't completely "standard," it is useful to introduce
notation for data flows that modify the contents of data stores.

- If a new "record" is being added to a data store, then this should be
  shown using an arrow with a cross hatch near the arrowhead, from the
  process making the addition to the data store to which the new record is
  being added. You can assume that the process will be "informed" of the
  problem if this is an attempt to add a new record when another record with
  the same primary key already exists: it isn't necessary to show a data
  flow from the data store back to the process in order to record that this
  error checking happens.

- If an existing "record" is being modified, then this should be shown using
  a "regular" arrow (that is, the same symbol as usual) from the process
  making the change to the data store that includes the record to be
  modified. To keep the data flow diagram reasonably simple, it's acceptable
  to label the arrow with the name of a complete record, even though you may
  only want to change the value of one (or a few) attributes, and not every
  non-key attribute that the record has. You may assume in this case that
  the process will be informed if this is an attempt to "modify" an instance
  when there isn't already a record in the data store with the same primary
  key (so that the process is really trying to create a new record, rather
  than modifying an existing one). Again, you shouldn't include an arrow
  from the data store back to the process to record this error checking.

- If an existing "record" is being deleted, then this should be shown using
  an arrow with an X near the arrowhead, from the process deleting the
  record to the data store from which the record will be deleted. For this
  case (only) it's really only necessary for a process to supply the values
  for the attributes in the primary key of the record to be deleted. I don't
  mind if you label the arrow with the name of a complete record in spite of
  this. You may assume in this case that the process will be informed if
  this is an attempt to delete a record that doesn't already exist --- that
  is, if there is no record in the data store matching the given primary
  key. Once again, you shouldn't include an arrow from the data store back
  to the process to represent this error checking.

Data Flows Representing Reads from Data Stores

If you want to read a single record from a data store, then it's (generally)
necessary to supply the values for all the attributes in the store's primary
key. A process might also want the set of (all) records in a data store
satisfying some criterion --- perhaps with some given value for some non-key
attribute, or perhaps something more complicated than that.

For these reads, it's only necessary to show an arrow from the data store
back to the process (returning the record or records satisfying the "search"
criteria). Label this with either the name of a single record (if a primary
key was supplied, and a single record is expected), or the plural form of
this name, if more than one record might be returned by the data store.
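The error checking that each kind of store access is assumed to perform can
be summarized in code. Here is a minimal Python sketch; the class and method
names are illustrative assumptions, and a real data store is a part of the
modeled system, not a dictionary.

    # A minimal sketch of the implicit error checking described above,
    # assuming records are dictionaries and the primary key is a single
    # attribute. The names DataStore, add, modify, delete, and select
    # are illustrative, not part of any standard DFD notation.

    class DataStore:
        def __init__(self, key_attribute):
            self.key_attribute = key_attribute
            self.records = {}            # primary key -> record

        def add(self, record):
            """Crosshatched arrow: fails if the primary key is in use."""
            key = record[self.key_attribute]
            if key in self.records:
                raise KeyError("a record with this primary key exists")
            self.records[key] = record

        def modify(self, record):
            """Regular arrow: fails if no record has this primary key."""
            key = record[self.key_attribute]
            if key not in self.records:
                raise KeyError("no existing record has this primary key")
            self.records[key] = record

        def delete(self, key):
            """Arrow with an X: only the primary key must be supplied."""
            if key not in self.records:
                raise KeyError("no record matches the given primary key")
            del self.records[key]

        def select(self, criterion):
            """A read returning all records satisfying a criterion."""
            return [r for r in self.records.values() if criterion(r)]

None of these checks appear as extra arrows on the diagram; as the next
section explains, they are documented in process specifications instead.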
Why Model Communication with Data Stores This Way (Leaving Some Possible
Data Flows Out)?

The data flows that I've said can be left out are "understood" or
"implicit." Details such as the error checking that the process expects the
data store to perform, or the search criteria used by the process when
looking up information in the data store, can be included in the process's
"process specification."

If the extra data flows are left out, then the only data flows leading out
of data stores represent attempts to read records, and the only data flows
leading into data stores represent attempts to create, modify, or delete
records --- and, then, *read only* data stores and *write only* data stores
can be identified just by glancing at the data flow diagram.

This is useful, because "data stores" on DFDs should correspond to tables of
data that can grow or shrink, and be "arbitrarily large," just as objects in
ERDs do. Therefore (when "essential requirements" are being modeled) it
wouldn't make sense to include read only stores. It also wouldn't make sense
to include "write only" data stores, since these data stores are supposed to
be completely *internal* to the system. If an internal data store is "write
only" then, as far as the outside world is concerned, the data store could
be removed completely without changing the system's behaviour --- so there's
no reason for a "write only" data store to exist.

It's also true that the additional data flows for error checking, etc.
(which shouldn't be included) would clutter a diagram that's likely to be
complicated anyway --- and, since the error checking (etc.) is understood to
happen, these data flows wouldn't provide any new information.

Finally, an Example

A plausible data flow diagram for version 1 of the "Student Information
System"
- will be drawn on the blackboard during the lecture
- will be made available as a Postscript file. You should be able to preview
  this using "ghostview," but it will probably be slow!

Comments on the Example:

- It omits some data flows that might sensibly be included:

  - additional feedback to the "Instructor" to report successful completion
    of requested activities, or to explain why an action wasn't performed
    (by reporting any errors that were found)

  - a data flow (labeled "student") from the store "Students" back to each
    of processes 2 and 3, to represent reading the information necessary to
    perform complete error checking. In particular, it is probably necessary
    (or, at least, desirable) for Process 2 to check that the record that's
    about to be modified (by changing status to "passed") is for a student
    who's currently "registered" --- in order to detect and report attempts
    to "pass" students who've passed the course already. For Process 3, it
    is probably undesirable to "withdraw" students who've already passed, so
    it would be necessary to read a record from the data base, to make sure
    that the "status" is "registered," in order to prevent this from
    happening. (A small sketch of these checks follows these comments.)

- There may also be additional operations that might plausibly be included.
  For example,

  - it might be desirable for there to be a way to give the system a part of
    a student's name, and have the system provide a list of the ID numbers
    and names of students who "match" this partial information.

  - it might be desirable for the instructor to be able to change the
    registration status of a student from "passed" back to "registered," in
    order to correct a previous mistake.

  - it might be desirable for there to be a way to delete "passed" students
    from the data base --- perhaps even to clear out the system data base
    completely.

Note that the DFD doesn't include any "control" or "sequencing" information.
You could think of several, or even all, of the processes as being active at
once; this diagram certainly doesn't give an order in which the processes
are required to do their work.
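The "read before modify" checks just suggested for Processes 2 and 3 might
look as follows. This is a minimal Python sketch under stated assumptions:
student records have a "status" attribute, "students" maps ID numbers to
records, and withdrawal deletes the record; none of this is fixed by the
example DFD itself.

    # An illustrative sketch of the error checking described above for
    # Processes 2 and 3; the record layout and function names are
    # assumptions made for this example only.

    def pass_student(students, id_number):
        """Process 2: mark a registered student as having passed."""
        student = students.get(id_number)
        if student is None:
            return "error: no student with this ID number exists"
        if student["status"] != "registered":
            return "error: this student has already passed the course"
        student["status"] = "passed"
        return "ok"

    def withdraw_student(students, id_number):
        """Process 3: withdraw a student, but never one who has passed."""
        student = students.get(id_number)
        if student is None:
            return "error: no student with this ID number exists"
        if student["status"] != "registered":
            return "error: a student who has passed cannot be withdrawn"
        del students[id_number]
        return "ok"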
Note, as well, that it isn't necessary for a "process" to receive all the
inputs represented by incoming data flows, or produce all the outputs
corresponding to outgoing flows, every time the process is active. All
*possible* inputs and outputs should be shown. It's also possible that more
than one piece of information might be received or sent along any of the
data flows connected to a process. The *process specification* for a
("bottom level") process should be helpful when you want to figure out "how"
the data flows connected to a process are used.

Final note (for now): In CPSC 333, we will include a "process number" for
each process shown on a data flow diagram --- even though Pressman doesn't,
in the examples given in his books. The numbering scheme to be used when
creating these numbers will be explained in the next lecture.

Supporting Models: Process Specifications

If you are representing a system using a *single* data flow diagram, then
there should be a "process specification" for each one of the processes
shown on the diagram. If you are representing a system using a "leveled set"
of data flow diagrams, then there should be a process specification for each
"bottom level" process on the diagrams. Leveled sets of DFDs will be
discussed during the next lecture.

A process specification should include:

- The process name and number, which should match the name and number given
  for the process on the data flow diagram(s) (so that it's clear which
  process this process specification corresponds to)

- A list of all process inputs and outputs, preferably with their sources
  and destinations listed as well (to make the process specification as
  "self contained" as possible)

- All conditions on inputs that are *assumed* to be true but that are not
  checked by the process. In a sense, we "don't care" what happens if the
  process is invoked and these conditions aren't met: the process could do
  "anything at all" and still satisfy its specification. In practice, of
  course, the process *shouldn't* be called unless these conditions are
  satisfied!

- All error conditions that this process is expected to check for and
  report. Unlike the conditions mentioned just above, the process *is*
  responsible for checking for, and handling, the error conditions
  documented here.

- Some kind of rule that can be used to decide whether a given set of
  outputs would be correct for a given set of inputs. This *should* include
  a description of how the error conditions mentioned above will be handled
  by the process. This *can* be an algorithm for generating the outputs ---
  but it doesn't need to be.

For example, suppose we had a process called "Decrement," which took a
single integer x as input, and produced a single integer y as output. The
assertion

    x = y + 1

(which is either "true" or "false" once integer values have been chosen for
x and y) gives a rule that will allow you to decide whether a value of y
would be correct as an output when the process received a given value of x
as input. *Instead* of this, you could include an extremely simple
algorithm:

    y := x - 1

*Avoid* using an algorithm (and give a rule instead) if such a rule is easy
to describe, or *especially* if there are *several* different algorithms
that software designers and implementers could choose in order to do the job
later on, and you don't want to "bias" the choice of algorithm now by
including one in the process specification.
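The distinction can be made concrete in code: a rule is a check over an
(input, output) pair, while an algorithm produces the output. A minimal
Python sketch of the "Decrement" example (the function names are
illustrative):

    def is_correct_output(x, y):
        """The rule: y is a correct output for input x exactly when
        x = y + 1. Any algorithm whose outputs satisfy this rule meets
        the specification."""
        return x == y + 1

    def decrement(x):
        """One algorithm satisfying the rule; others would do as well."""
        return x - 1

    assert is_correct_output(5, decrement(5))

Stating only the rule leaves designers and implementers free to choose any
algorithm whose outputs satisfy it.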
Sometimes, though, the easiest (and, perhaps, only straightforward) way to
describe an acceptable output is to give an algorithm that computes it.

Example: A Process Specification for Process 1 on the Example DFD

Process 1: Register Student

Input:  ID number, from "Instructor"
        name, from "Instructor"

Output: student, a new record added to the data store "Students"

Conditions assumed to be true: (none)

Error Conditions:
1) Syntactically incorrect ID number or name (or both)
2) ID number is in use (that is, there is already an instance of a "student"
   in the data store "Students" that has the same ID number)

Relationships between inputs and outputs:

If none of the above error conditions occur:
a) student.ID number is the same as the input ID number
b) student.name is the same as the input name
c) student.status is "registered"

If error condition (1) occurs:

    if the ID number is syntactically incorrect, then
        Send an error message to the instructor, saying that the ID number
        is syntactically incorrect (and reminding the instructor of the
        correct syntax for the ID number)
    end if
    if the name is syntactically incorrect, then
        Send an error message to the instructor, saying that the name is
        syntactically incorrect (and reminding the instructor of the
        correct syntax for the name)
    end if

If error condition (2) occurs:

    Send an error message to the instructor, saying that it isn't possible
    to register this student because a student with the same ID number is
    already registered in, or has already passed, the course.

Notes:

- You can use the "record" or "structure" syntax shown above (to refer to
  the components of the output "student") to refer to the components of a
  "complex" data item in general.

- Instead of the above three "relationships," given for the case that errors
  haven't been detected, the process specification could have included the
  (pseudo-code) algorithm

      student.ID number := ID number
      student.name := name
      student.status := "registered"
      Add the new instance "student" to the data store "Students"

  This algorithm could be further expanded, to show when and how the "error
  conditions" are checked for and reported. (A sketch of such an expansion
  appears after these notes.)

- The error messages "alluded to" in the above specification should, at some
  point, be decided on --- and then documented in the requirements
  specification (*possibly* by listing them in the process specification).

  There *are* guidelines for the formation of error messages. Among other
  things, they should be informative, polite, and constructive. They should
  also be *consistent*: if the same error condition is checked in several
  processes then, ideally, *exactly* the same error message should be used
  in all cases (and not several slightly different wordings, with one used
  in each process). Error messages for different conditions should also use
  consistent wordings, be reported using a consistent interface, and so on.
  A small amount of additional material about this is covered in CPSC 451;
  more than that is included in CPSC 481.
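As an illustration (and only that), here is one way the expanded algorithm
for Process 1 might look as running code. This is a minimal Python sketch;
the syntax rules assumed below (a six-digit ID number, a nonempty name) are
invented for the example, since the real rules would come from the data
dictionary.

    # An illustrative expansion of the Register Student algorithm, with
    # the error checking made explicit. "students" maps ID numbers to
    # records; a list of messages for the instructor is returned.

    def register_student(students, id_number, name):
        errors = []
        if not (id_number.isdigit() and len(id_number) == 6):
            # Error condition (1), assumed ID number syntax
            errors.append("the ID number is syntactically incorrect; "
                          "it should be a string of six digits")
        if name.strip() == "":
            # Error condition (1), assumed name syntax
            errors.append("the name is syntactically incorrect; "
                          "it must not be empty")
        if errors:
            return errors
        if id_number in students:
            # Error condition (2)
            return ["it isn't possible to register this student: a "
                    "student with the same ID number is already "
                    "registered in, or has already passed, the course"]
        # No errors: create the new record, exactly as relationships
        # (a)--(c) in the specification require.
        students[id_number] = {"ID number": id_number,
                               "name": name,
                               "status": "registered"}
        return ["ok"]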