TREES
=====
Many natural relationships are hierarchical in nature. Hierarchies can
express one to many relationships instead of the one to one relationships
possible with single list structures. There are many natural examples of
hierachical relationships
* parent to children - one parent can have one or more children
* company to employees
* UNIX file system.
These kinds of relationships are often represented visually as tree
structures. It is conventional to show the bottom of the tree at the top
and the branches pointing downward but that is certainly not a requirement
and in many cases it is easier to provide output of a tree so that the tree
is shown with the root (bottom) at the left and the tree growing toward the
right. Each item in the tree is called a node and the relationships
between nodes are represented by arcs or edges connecting the nodes
together. The nomenclature of trees is a hybrid of terms about a biological
tree, the parent-child relationship and graphs ( a more general expression
of relationships).
The following processes define a tree:
1. Create - create the structure for an empty tree
2. Traverse - traverse a tree in pre-, in-, or post-order
3. Insert - insert a node or key into the tree
4. Delete - delete a node or key in a tree
5. Search - search for a value in a tree
6. Update - change the value of a node
7. Characteristic - determine the size, height, average path length
8. Relatives - determine the parent/sibling(s) of a current node
9. Empty - determine if any nodes exist in a tree or subtree
10. Clear - remove values from the nodes
Root, parent, child, sibling, ancestor, descendent
--------------------------------------------------
Because of the one to many relationships which form the basis of trees there
will be a single node from which all others derive. This node is called the
ROOT. Designating a specific node as the root results in an ORIENTED tree.
In computing applications, all trees are usually considered as oriented
trees, that is some node is designated as root and all relationships are
oriented around this designation. The root is also the ANCESTOR of all
other nodes and the PARENT of those nodes immediately below it, which
consequently are called its CHILDREN and its DESCENDENTS. Descendents
include all nodes below a particular parent. Children of the same parent
are called SIBLINGS.
Paths
-----
Because of the parent-child relationship, edges between nodes are considered
to be directional running from parent to child. From this, the idea of a
path which is a sequence of edges which can be traversed always moving in
the direction parent to child is derived. A given node is an ancestor to
all nodes which can be reached on paths from that node and in the same way
those nodes that can be reached are descendents.
Path length, height, depth, level
---------------------------------
The path length between two nodes is the number of edges that must be
traversed to get from one to the other. It can be seen that there is only
one path between any two nodes and so only one possible path length. The
number of nodes on the path between the root of the tree and its most
distant descendent is called the height of the tree. Another label
applied to nodes is the level number. Sometimes level number is counted in
terms of the number of nodes on a path from the root to the given node while
other time the count is the number of edges, i.e. one less than the number
of nodes. If nodes are counted then the maximum level number is the same as
the height and the level of the root is 1. If, instead we count edges then
the maximum level is one less than the height and the level of the root is
zero. Generally the second definition will be used so that the height and
level are different.
Formal Definition of a Tree
---------------------------
1. A single node by itself is a tree. It is also the root of the tree.
2. If n is a node and T1,T2,...Tk are trees with roots n1,n2,..nk then we
can make a new tree by making n the parent of n1,n2,..nk. In this tree, n is
the root and T1,T2,...Tk are subtrees.
Note that this is a formal definition of a tree -- any kind of tree. All
trees are defined recursively.
Binary tree, leaf, internal and external nodes
----------------------------------------------
A tree in which no node may have more than two children is called a BINARY
TREE. A node which has no children is a LEAF since it is found on the
outside of the tree. Nodes with children are INTERNAL NODES. In some
applications we wish to consider that leaf nodes and internal nodes with
only one child have imaginary children filling empty positions. These nodes
are called EXTERNAL NODES and such a tree is an EXTENDED BINARY TREE.
(For example, in the sorting film, the tree sort used external nodes to
indicate that the value in the node has been promoted upward.)
As well as the parent-child relationship, an ordering may exist between
siblings, for example, the left child must be smaller than the right child,
or the left child is the first sibling. If it does then the tree is said to
be ORDERED. In the case of an ordered binary tree the two children are
distinguished as the left subtree and the right subtree.
P P
/ \\ / \\
C1 C2 C2 C1
The two trees shown above are IDENTICAL if the tree is unordered but are
DISTINCT if the tree is ordered. For ordered binary trees the children must
be drawn as shown to indicate which is the left child and which is the
right, i.e. a node with a single child is always drawn to the left or to
the right of its parent depending on whether it is larger or smaller.
Binary Search Tree
------------------
A common form of the binary tree found in computer applications is the
binary search tree. This is an ordered tree in which the following
relationship holds:
at any position in the tree all nodes less than the current node lie to
the left and all nodes greater than the current node lie to the right.
or
left child < parent < right child
It is clear that many arrangements of a given set of nodes could be made,
all of which would satisfy the requirements for a binary search tree.
Note that a binary tree does not have this relationship and there may be
duplicate values. In a binary search tree, the relationship must hold for all
parents and children and DUPLICATES ARE NOT ALLOWED.
Minimum height Trees
--------------------
As will be seen later, the performance of algorithms which operate on
binary trees are generally dependent on two factors: the number of nodes in
the tree and the height of the tree. For a fixed number of nodes it is
normal to minimise the height of the tree. Due to the binary nature of the
tree the minimum height for any given number of nodes can be predicted by
examining trees in which all internal nodes have two children and all leaves
are at the same level:
level height # # wrt level Total T wrt Ht
0 1 * 1 2^0 1 2^1 - 1
1 2 * * 2 2^1 3 2^2 - 1
2 3 * * * * 4 2^2 7 2^3 - 1
Therefore,
# nodes at a level = 2^l where l is the level
maximum # of nodes = 2^h - 1 where h is the height
minimum height of a tree = ceil (lg (n+1)), where ceil is the integer
value rounded UP.
Thus, if a tree has 7 nodes, then ceil(lg (7+1)) is 3. In this case all of
the nodes which could exist are present and therefore the tree is as compact
as it can be and is of minimum height. A tree which is minimum height and
has the maximum number of nodes is called a FULL tree. If there were 8
nodes then height would be ceil(lg (8+1)). The log function produces a
value greater than 3.0 . If a tree with 8 nodes were a full tree there
would have to be more than 3 levels. Note, as well, that in the absence of
any information about the shape of the tree, the height that is calculated
assumes that the tree is of minimum height.
Implementation of Binary Trees
------------------------------
The basic element of a binary tree is the node. This normally requires at
least three fields; two to hold pointers to the two subtrees and one to hold
the information represented by the node.
Non-linked tree
A non-linked implementation is possible since a simple relationship exits
between the number of nodes at each level. Each successive level in the
tree can have twice as many nodes as the level above. If the nodes of a
binary tree were to be stored as elements in an array then the position of
children of any given node could be found from the simple relation:
for node at position P the left child is at 2P and the right child is at 2P +1
Similarly the parent of a given node could also be determined directly.
This is not possible in linked implementations. An array implementation
would be a very efficient implementation of a full or nearly full tree. As
the number of nodes was further and further from the value required for a
full tree, this implementation would suffer severely from space
wastage. Reorganisation of the tree would also incur considerable data
movement. There are occaisions when the array implementation is superior,
for example in heaps which will be discussed later.
Linked trees
A more common representation for a tree is a linked structure which is a
special case of the linked list. Just as with a linked list it is possible
to represent a tree in an array using indices as pointers or by dynamic
allocation of nodes and actual pointers. When dynamic allocation is
available this is the natural choice since it avoids problems of wasted
space and unnecessary data movement.
Operations on the Tree Data Type
--------------------------------
Building the Tree
Algorithms to build a tree will depend upon the type of tree being
constructed, i.e. on the nature of the parent child relationship and upon
the possible ordering of siblings. The following algorithms will be
considered for building a binary search tree.
Algorithm 1 - Building a binary search tree from random input.
The first node becomes the root and then new nodes are added as leaves. The
search is linear down one path in the tree, moving left or right at each
node depending on the relative magnitudes of the new key and the current node.
For example, using the following data
4, 7, 5, 2, 1, 3, 6
the tree produced would look as follows:
Array implementation
------------------------------------------------
| | | | | | | | | | | | |
| 4 | 2 | 7 | 1 | 3 | 5 | | | | | | 6 |
| | | | | | | | | | | | |
------------------------------------------------
Linked implementation
4
/ \
/ \
/ \
2 7
/ \ /
/ \ /
1 3 5
\
\
6
It can be seen that if the data presented to this algorithm is ordered,
the result will be equivalent to a singly linked list. Since we do not
generate all of the node positions in the linked implementation, the linked
list will have n nodes using n nodes in ascending order. However, for the
array implementation, an array of length 2^(n-1) would be required.
Sometimes random data can have the same effect on the shape of the
tree. This is called a DEGENERATE TREE.
Performance of Algorithm
Building the tree involves two steps -- searching for the position and then
inserting the node. For random data, the search is likely to be lg n since
the tree will likely be close to minimum height. Since n nodes must be
inserted, the algorithm will be n lg n. Unfortunately, this is not the worst
case. If the data to be inserted is in ascending order (or descending), then
the worst case is n(n+1)/2 or n squared. This is due to the potential search
of all n nodes for every insertion.
Algorithm 2
Since it is possible to get a degenerate tree which is really a linear
linked list, a different algorithm will be necessary to avoid this
occurrence. This algorithm depends upon the inherent mathematical
relationships between nodes in a full binary tree and their position in the
tree. A full tree is one which contains the maximum possible number of
nodes for its height. If the nodes in such a tree are numbered by their
position, the numbers at any given level are such that the highest power of
2 which will divide the position number equally represents the height of the
node above the leaves. Remember that there are no duplicates in a BST. If
we consider building a BST from ordered data, this algorithm uses the
position in the ordered list to determine the position in the BST produced.
Thus in the tree below, the integers refer to the position in the ordered
list and not the value of the data in the node.
8 divisor : 2^3
4 12 divisor : 2^2
2 6 10 14 divisor : 2^1
1 3 5 7 9 11 13 15 divisor : 2^0
The tree will be constructed left to right and it will be necessary to
remember the last node currently entered on each level since this node
could be incompletely linked into the tree. This can be achieved by a small
array of pointers. A 20 element array is sufficient for a tree of 2^20
nodes.
IT IS IMPORTANT TO REMEMBER THAT THIS ALGORITHM REQUIRES THE INPUT DATA TO
BE SORTED IN ASCENDING ORDER.
Main points of algorithm
Adding nodes to tree
*When a new node is to be entered its left child, if one exists, will
already be in the tree and will be the last node entered at the previous
level hence this link can be set immediately.
* The right child of a new node will not yet be in the tree so its right
pointer will be set to nil.
* A new node may be the right child of a node in the tree at the next
highest level. This is so if the right child of the last node at the
next level is currently nil.
Trace for 6 nodes
The insertion is based on the position in the list and it is the position
that is divided by 2^n to determine the level. In this way, there are no
empty subtrees (except for the last node entered in some cases). The data
is as follows:
data: 4 5 7 9 14 17
position: 1 2 3 4 5 6
The pointer array points to the node position as shown in the diagram
previously.
The highest power of 2 that will divide evenly into the position is 2^0.
Therefore the first node will go at the far left in the leaf (or 0) level.
After the first node is placed in the tree, the situation is as follows:
-------
| |
-------
| |
-------
| 1 | 4
-------
Since 2 is divisible by 2^1, the next node will be at level 1 above the
leaves. Since there are no nodes currently in that level, the node below
must be a left child
-------
| |
-------
| 2 | 5
------- /
| 1 | 4
-------
3 is only divisible by 2^0, so it also goes at the leaf (0) level. We also
check the node the next level up. If it has no right child, then this node
must by a right child.
-------
| |
-------
| 2 | 5
------- / \
| 3 | 4 7
-------
4 is divisible by 2^2, so it goes in level 2. Since it is the first entry
in that level, the last node entered in level 1 must be a left child.
-------
| 4 | 9
------- /
| 2 | 5
------- / \
| 3 | 4 7
-------
Node 5 is at the 0 level and since the last node entered at the next level
has a right child there are no tree pointers to set.
-------
| 4 | 9
------- /
| 2 | 5
------- / \
| 5 | 4 7 14
-------
Node 6 goes in level 1 and since the last node entered has a right child,
this one has a left child at position 5. Checking the node above we
determine that node 4 needs a right child.
-------
| 4 | 9
------- / \
| 6 | 5 17
------- / \ /
| 5 | 4 7 14
-------
If another was entered, it would go at the 0 level and node 6 would get a
right child.
The basic insertion assumes that the tree will be full. If this is not the
case then unconnected subtrees may exist due to the failure to make
appropriate right children connections. Therefore, to ensure that the tree
is complete, and before assigning a root, scan from the root (highest level)
down for nil right children. When found, look below for any "lastnode" not
in left subtree. If there are any they must be dangling subtrees. These
must be connnected as right children to the appropriate upper node. This
would have occurred if we had stopped with node 5. Node 5 would have to
become the right child of node 4.
Performance of Algorithm
This is an O(n) algorithm and so it is relatively efficient in constructing
the tree ( it is not possible to construct a tree in less than O(n) steps).
Since the algorithm is based upon the structure of the full tree it will
always form a minimum height tree but not necessarily of optimum shape. If
another node were to be added to a full tree it would become the new root
and the resulting tree would be very unbalanced since all nodes below the
root would be in the left subtree of the root. This is not really as bad as
it sounds since the insertion will have increased the path length of nodes
by one. However, the average search will not be (lg (n+1)) - 1. It will be
lg (n+1). Other methods for constructing a minimum height tree from ordered
data will be shown later and be seen to be less efficient but capable of
producing a closer to optimum shape.