TREES ===== Many natural relationships are hierarchical in nature. Hierarchies can express one to many relationships instead of the one to one relationships possible with single list structures. There are many natural examples of hierachical relationships * parent to children - one parent can have one or more children * company to employees * UNIX file system. These kinds of relationships are often represented visually as tree structures. It is conventional to show the bottom of the tree at the top and the branches pointing downward but that is certainly not a requirement and in many cases it is easier to provide output of a tree so that the tree is shown with the root (bottom) at the left and the tree growing toward the right. Each item in the tree is called a node and the relationships between nodes are represented by arcs or edges connecting the nodes together. The nomenclature of trees is a hybrid of terms about a biological tree, the parent-child relationship and graphs ( a more general expression of relationships). The following processes define a tree: 1. Create - create the structure for an empty tree 2. Traverse - traverse a tree in pre-, in-, or post-order 3. Insert - insert a node or key into the tree 4. Delete - delete a node or key in a tree 5. Search - search for a value in a tree 6. Update - change the value of a node 7. Characteristic - determine the size, height, average path length 8. Relatives - determine the parent/sibling(s) of a current node 9. Empty - determine if any nodes exist in a tree or subtree 10. Clear - remove values from the nodes Root, parent, child, sibling, ancestor, descendent -------------------------------------------------- Because of the one to many relationships which form the basis of trees there will be a single node from which all others derive. This node is called the ROOT. Designating a specific node as the root results in an ORIENTED tree. In computing applications, all trees are usually considered as oriented trees, that is some node is designated as root and all relationships are oriented around this designation. The root is also the ANCESTOR of all other nodes and the PARENT of those nodes immediately below it, which consequently are called its CHILDREN and its DESCENDENTS. Descendents include all nodes below a particular parent. Children of the same parent are called SIBLINGS. Paths ----- Because of the parent-child relationship, edges between nodes are considered to be directional running from parent to child. From this, the idea of a path which is a sequence of edges which can be traversed always moving in the direction parent to child is derived. A given node is an ancestor to all nodes which can be reached on paths from that node and in the same way those nodes that can be reached are descendents. Path length, height, depth, level --------------------------------- The path length between two nodes is the number of edges that must be traversed to get from one to the other. It can be seen that there is only one path between any two nodes and so only one possible path length. The number of nodes on the path between the root of the tree and its most distant descendent is called the height of the tree. Another label applied to nodes is the level number. Sometimes level number is counted in terms of the number of nodes on a path from the root to the given node while other time the count is the number of edges, i.e. one less than the number of nodes. If nodes are counted then the maximum level number is the same as the height and the level of the root is 1. If, instead we count edges then the maximum level is one less than the height and the level of the root is zero. Generally the second definition will be used so that the height and level are different. Formal Definition of a Tree --------------------------- 1. A single node by itself is a tree. It is also the root of the tree. 2. If n is a node and T1,T2,...Tk are trees with roots n1,n2,..nk then we can make a new tree by making n the parent of n1,n2,..nk. In this tree, n is the root and T1,T2,...Tk are subtrees. Note that this is a formal definition of a tree -- any kind of tree. All trees are defined recursively. Binary tree, leaf, internal and external nodes ---------------------------------------------- A tree in which no node may have more than two children is called a BINARY TREE. A node which has no children is a LEAF since it is found on the outside of the tree. Nodes with children are INTERNAL NODES. In some applications we wish to consider that leaf nodes and internal nodes with only one child have imaginary children filling empty positions. These nodes are called EXTERNAL NODES and such a tree is an EXTENDED BINARY TREE. (For example, in the sorting film, the tree sort used external nodes to indicate that the value in the node has been promoted upward.) As well as the parent-child relationship, an ordering may exist between siblings, for example, the left child must be smaller than the right child, or the left child is the first sibling. If it does then the tree is said to be ORDERED. In the case of an ordered binary tree the two children are distinguished as the left subtree and the right subtree. P P / \\ / \\ C1 C2 C2 C1 The two trees shown above are IDENTICAL if the tree is unordered but are DISTINCT if the tree is ordered. For ordered binary trees the children must be drawn as shown to indicate which is the left child and which is the right, i.e. a node with a single child is always drawn to the left or to the right of its parent depending on whether it is larger or smaller. Binary Search Tree ------------------ A common form of the binary tree found in computer applications is the binary search tree. This is an ordered tree in which the following relationship holds: at any position in the tree all nodes less than the current node lie to the left and all nodes greater than the current node lie to the right. or left child < parent < right child It is clear that many arrangements of a given set of nodes could be made, all of which would satisfy the requirements for a binary search tree. Note that a binary tree does not have this relationship and there may be duplicate values. In a binary search tree, the relationship must hold for all parents and children and DUPLICATES ARE NOT ALLOWED. Minimum height Trees -------------------- As will be seen later, the performance of algorithms which operate on binary trees are generally dependent on two factors: the number of nodes in the tree and the height of the tree. For a fixed number of nodes it is normal to minimise the height of the tree. Due to the binary nature of the tree the minimum height for any given number of nodes can be predicted by examining trees in which all internal nodes have two children and all leaves are at the same level: level height # # wrt level Total T wrt Ht 0 1 * 1 2^0 1 2^1 - 1 1 2 * * 2 2^1 3 2^2 - 1 2 3 * * * * 4 2^2 7 2^3 - 1 Therefore, # nodes at a level = 2^l where l is the level maximum # of nodes = 2^h - 1 where h is the height minimum height of a tree = ceil (lg (n+1)), where ceil is the integer value rounded UP. Thus, if a tree has 7 nodes, then ceil(lg (7+1)) is 3. In this case all of the nodes which could exist are present and therefore the tree is as compact as it can be and is of minimum height. A tree which is minimum height and has the maximum number of nodes is called a FULL tree. If there were 8 nodes then height would be ceil(lg (8+1)). The log function produces a value greater than 3.0 . If a tree with 8 nodes were a full tree there would have to be more than 3 levels. Note, as well, that in the absence of any information about the shape of the tree, the height that is calculated assumes that the tree is of minimum height. Implementation of Binary Trees ------------------------------ The basic element of a binary tree is the node. This normally requires at least three fields; two to hold pointers to the two subtrees and one to hold the information represented by the node. Non-linked tree A non-linked implementation is possible since a simple relationship exits between the number of nodes at each level. Each successive level in the tree can have twice as many nodes as the level above. If the nodes of a binary tree were to be stored as elements in an array then the position of children of any given node could be found from the simple relation: for node at position P the left child is at 2P and the right child is at 2P +1 Similarly the parent of a given node could also be determined directly. This is not possible in linked implementations. An array implementation would be a very efficient implementation of a full or nearly full tree. As the number of nodes was further and further from the value required for a full tree, this implementation would suffer severely from space wastage. Reorganisation of the tree would also incur considerable data movement. There are occaisions when the array implementation is superior, for example in heaps which will be discussed later. Linked trees A more common representation for a tree is a linked structure which is a special case of the linked list. Just as with a linked list it is possible to represent a tree in an array using indices as pointers or by dynamic allocation of nodes and actual pointers. When dynamic allocation is available this is the natural choice since it avoids problems of wasted space and unnecessary data movement. Operations on the Tree Data Type -------------------------------- Building the Tree Algorithms to build a tree will depend upon the type of tree being constructed, i.e. on the nature of the parent child relationship and upon the possible ordering of siblings. The following algorithms will be considered for building a binary search tree. Algorithm 1 - Building a binary search tree from random input. The first node becomes the root and then new nodes are added as leaves. The search is linear down one path in the tree, moving left or right at each node depending on the relative magnitudes of the new key and the current node. For example, using the following data 4, 7, 5, 2, 1, 3, 6 the tree produced would look as follows: Array implementation ------------------------------------------------ | | | | | | | | | | | | | | 4 | 2 | 7 | 1 | 3 | 5 | | | | | | 6 | | | | | | | | | | | | | | ------------------------------------------------ Linked implementation 4 / \ / \ / \ 2 7 / \ / / \ / 1 3 5 \ \ 6 It can be seen that if the data presented to this algorithm is ordered, the result will be equivalent to a singly linked list. Since we do not generate all of the node positions in the linked implementation, the linked list will have n nodes using n nodes in ascending order. However, for the array implementation, an array of length 2^(n-1) would be required. Sometimes random data can have the same effect on the shape of the tree. This is called a DEGENERATE TREE. Performance of Algorithm Building the tree involves two steps -- searching for the position and then inserting the node. For random data, the search is likely to be lg n since the tree will likely be close to minimum height. Since n nodes must be inserted, the algorithm will be n lg n. Unfortunately, this is not the worst case. If the data to be inserted is in ascending order (or descending), then the worst case is n(n+1)/2 or n squared. This is due to the potential search of all n nodes for every insertion. Algorithm 2 Since it is possible to get a degenerate tree which is really a linear linked list, a different algorithm will be necessary to avoid this occurrence. This algorithm depends upon the inherent mathematical relationships between nodes in a full binary tree and their position in the tree. A full tree is one which contains the maximum possible number of nodes for its height. If the nodes in such a tree are numbered by their position, the numbers at any given level are such that the highest power of 2 which will divide the position number equally represents the height of the node above the leaves. Remember that there are no duplicates in a BST. If we consider building a BST from ordered data, this algorithm uses the position in the ordered list to determine the position in the BST produced. Thus in the tree below, the integers refer to the position in the ordered list and not the value of the data in the node. 8 divisor : 2^3 4 12 divisor : 2^2 2 6 10 14 divisor : 2^1 1 3 5 7 9 11 13 15 divisor : 2^0 The tree will be constructed left to right and it will be necessary to remember the last node currently entered on each level since this node could be incompletely linked into the tree. This can be achieved by a small array of pointers. A 20 element array is sufficient for a tree of 2^20 nodes. IT IS IMPORTANT TO REMEMBER THAT THIS ALGORITHM REQUIRES THE INPUT DATA TO BE SORTED IN ASCENDING ORDER. Main points of algorithm Adding nodes to tree *When a new node is to be entered its left child, if one exists, will already be in the tree and will be the last node entered at the previous level hence this link can be set immediately. * The right child of a new node will not yet be in the tree so its right pointer will be set to nil. * A new node may be the right child of a node in the tree at the next highest level. This is so if the right child of the last node at the next level is currently nil. Trace for 6 nodes The insertion is based on the position in the list and it is the position that is divided by 2^n to determine the level. In this way, there are no empty subtrees (except for the last node entered in some cases). The data is as follows: data: 4 5 7 9 14 17 position: 1 2 3 4 5 6 The pointer array points to the node position as shown in the diagram previously. The highest power of 2 that will divide evenly into the position is 2^0. Therefore the first node will go at the far left in the leaf (or 0) level. After the first node is placed in the tree, the situation is as follows: ------- | | ------- | | ------- | 1 | 4 ------- Since 2 is divisible by 2^1, the next node will be at level 1 above the leaves. Since there are no nodes currently in that level, the node below must be a left child ------- | | ------- | 2 | 5 ------- / | 1 | 4 ------- 3 is only divisible by 2^0, so it also goes at the leaf (0) level. We also check the node the next level up. If it has no right child, then this node must by a right child. ------- | | ------- | 2 | 5 ------- / \ | 3 | 4 7 ------- 4 is divisible by 2^2, so it goes in level 2. Since it is the first entry in that level, the last node entered in level 1 must be a left child. ------- | 4 | 9 ------- / | 2 | 5 ------- / \ | 3 | 4 7 ------- Node 5 is at the 0 level and since the last node entered at the next level has a right child there are no tree pointers to set. ------- | 4 | 9 ------- / | 2 | 5 ------- / \ | 5 | 4 7 14 ------- Node 6 goes in level 1 and since the last node entered has a right child, this one has a left child at position 5. Checking the node above we determine that node 4 needs a right child. ------- | 4 | 9 ------- / \ | 6 | 5 17 ------- / \ / | 5 | 4 7 14 ------- If another was entered, it would go at the 0 level and node 6 would get a right child. The basic insertion assumes that the tree will be full. If this is not the case then unconnected subtrees may exist due to the failure to make appropriate right children connections. Therefore, to ensure that the tree is complete, and before assigning a root, scan from the root (highest level) down for nil right children. When found, look below for any "lastnode" not in left subtree. If there are any they must be dangling subtrees. These must be connnected as right children to the appropriate upper node. This would have occurred if we had stopped with node 5. Node 5 would have to become the right child of node 4. Performance of Algorithm This is an O(n) algorithm and so it is relatively efficient in constructing the tree ( it is not possible to construct a tree in less than O(n) steps). Since the algorithm is based upon the structure of the full tree it will always form a minimum height tree but not necessarily of optimum shape. If another node were to be added to a full tree it would become the new root and the resulting tree would be very unbalanced since all nodes below the root would be in the left subtree of the root. This is not really as bad as it sounds since the insertion will have increased the path length of nodes by one. However, the average search will not be (lg (n+1)) - 1. It will be lg (n+1). Other methods for constructing a minimum height tree from ordered data will be shown later and be seen to be less efficient but capable of producing a closer to optimum shape.