3         Feed-Forward Neural Networks

3.1      Basic Architecture

A layered feed-forward network consists of a certain number of layers, and each layer contains a certain number of units. There is an input layer, an output layer, and one or more hidden layers between the input and the output layer. Each unit receives its inputs directly from the previous layer (except for input units) and sends its output directly to units in the next layer (except for output units). Unlike the Recurrent network, which contains feedback information, there are no connections from any of the units to the inputs of the previous layers nor to other units in the same layer, nor to units more than one layer ahead. Every unit only acts as an input to the immediate next layer. Obviously, this class of networks is easier to analyze theoretically than other general topologies because their outputs can be represented with explicit functions of the inputs and the weights.   

Figure 31 Feed-forward neural network

An example of a layered network with one hidden layer is shown in Figure 3‑1 . In this network there are l inputs, m hidden units, and n output units. The output of the jth hidden unit is obtained by first forming a weighted linear combination of the l input values, then adding a bias,

                                                                                      (3.1)

where  is the weight from input i to hidden unit j in the first layer and  is the bias for hidden unit j. If we are considering the bias term as being weights from an extra input (3.1) can be rewritten to the form of,

                                                                                                  (3.2)

The activation of hidden unit j  then can be obtained by transforming the linear sum using an activation function

                                                                                                        (3.3)

The outputs of the network can be obtained by transforming the activation of the hidden units using a second layer of processing units. For each output unit k, first we get the linear combination of the output of the hidden units,

                                                                           (3.4)

Again we can absorb the bias and rewrite the above equation to,

                                                                                                 (3.5)

Then applying the activation function  to (3.5) we can get the kth output

                                                                                                     (3.6)    

Combining (3.2), (3.3), (3.5) and (3.6) we get the complete representation of the network as

                                                                          (3.7)

The network of Figure 3‑1 is a network with one hidden layer. We can extend it to have two or more hidden layers easily as long as we make the above transformation further.

One thing we need to note is that the input units are very special units. They are hypothetical units that produce outputs equal to their supposed inputs. No processing is done by these input units.

3.2      Representation Capability

The feed-forward networks provide a general framework for representing non-linear functional mapping between a set of input variables and a set of output variables. The representation capability of a network can be defined as the range of mappings it can implement when the weights are varied. Theories [5] [6] [7] show that:

1)      Single-layer networks are capable of representing only linearly separable functions or linearly separable decision domains.

2)      Two hidden layered networks can represent an arbitrary decision boundary to arbitrary accuracy with threshold activation functions and could approximate any smooth mapping to any accuracy with sigmoid activation functions.

3)      One hidden layered network can approximate arbitrarily well any functional continuous mapping from one finite-dimensional space to another, provided that the number of hidden units is sufficiently large. To be more precise, feed-forward networks with a single hidden layer and trained by least-squares are statistically consistent estimators of arbitrary square-integral regression functions if assumptions about samples, target noises, number of hidden units, and other factors are all met. Feed-forward networks with a single hidden layer using threshold or sigmoid activation functions are universally consistent estimators of binary classifications under similar assumptions.

3.3      Network Structure Design

Though theoretically there exists a network that can simulate a problem to any accuracy, there is no easy way to find it. To define an exact network architecture such as how many hidden layers should be used, how many units should there be within a hidden layer for a certain problem is always a painful job. Here we will give a brief discussion about the issues to be considered when we design a network.

3.3.1      Number of Hidden Layers

Because networks with two hidden layers can represent functions with any kind of shapes, there is no theoretical reason to use networks with more than two hidden layers. It has also been determined that for the vast majority of practical problems, there is no reason to use more than one hidden layer. Problems that require two hidden layers are only rarely encountered in practice. Even for problems requiring more than one hidden layer theoretically, most of the time, using one hidden layer performs much better than using two hidden layers in practice [1].  Training often slows dramatically when more hidden layers are used. There are several reasons why we should use as few layers as possible in practice:

1)      Most training algorithms for feed-forward network are gradient-based. The additional layer through which errors must be back propagated makes the gradient very unstable.  The success of any gradient-directed optimization algorithm is dependent on the degree to which the gradient remains unchanged as the parameters vary.

2)      The number of local minima increases dramatically with more hidden layers. Most of the gradient-based optimization algorithms can only find local minima, thus they miss the global minima.  Even though the training algorithm can find the global minima, there is a higher probability that after many time-consuming iterations, we will find ourselves stuck in a local minimum and have to escape or start over.

Of course, it is possible that for a certain problem, using more hidden layers of just a few units is better than using fewer hidden layers requiring too many units, especially for networks that need to learn a function with discontinuities. In general, it is strongly recommended that one hidden layer be the first choice for any practical feed-forward network design. If using a single hidden layer with a large number of hidden units does not perform well, then it may be worth trying a second hidden layer with fewer processing units.         

3.3.2      Number of Hidden Units

Another important issue in designing a network is how many units to place in each layer. Using too few units can fail to detect the signals fully in a complicated data set, leading to underfitting. Using too many units will increase the training time, perhaps so much that it becomes impossible to train it adequately in a reasonable period of time. A large number of hidden units might cause overfitting, in which case the network has so much information processing capacity, that the limited amount of information contained in the training set is not enough to train the network. 

The best number of hidden units depends on many factors – the numbers of input and output units, the number of training cases, the amount of noise in the targets, the complexity of the error function, the network architecture, and the training algorithm [5].

There are lots of  “rules of thumb” for selecting the number of units in the hidden layers [1] [5]:

·         - between the input layer size and output layer size

·         -  two third of the input layer size plus the output layer size

·        - less than twice  the input layer size

·        - squared input layer size multiplied by output layer size

Those rules can only be taken as a rough reference when selecting a hidden layer size. They do not reflect the facts well because they only consider the factor of the input layer size and output layer size but ignore other important factors that we previously mentioned.

In most situations, there is no easy way to determine the optimal number of hidden units without training using different numbers of hidden units and estimating the generalization error of each.  The best approach to find the optimal number of hidden units is trial and error. In practice, we can use either the forward selection or backward selection to determine the hidden layer size. Forward selection starts with choosing an appropriate criterion for evaluating the performance of the network. Then we select a small number of hidden units, like two if it is difficult to guess how small it is; train and test the network; record its performance. Next we slightly increase the number of hidden units; train and test until the error is acceptably small, or no significant improvement is noted, whichever comes first. Backward selection, which is in contrast with forward selection, starts with a large number of hidden units, and then decreases the number gradually [1][8]. This process is time-consuming, but it works well.        

3.4      Back-Propagation

Back-propagation is the most commonly used method for training multi-layer feed-forward networks. It can be applied to any feed-forward network with differentiable activation functions. This technique was popularized by Rumelhart, Hinton and Williams [9].

For most networks, the learning process is based on a suitable error function (Section 2.6), which is then minimized with respect to the weights and bias. If a network has differential activation functions, then the activations of the output units become differentiable functions of input variables, the weights and bias. If we also define a differentiable error function of the network outputs such as the sum-of-square error function, then the error function itself is a differentiable function of the weights. Therefore, we can evaluate the derivative of the error with respect to weights, and these derivatives can then be used to find the weights that minimize the error function, by either using the popular gradient descent or other optimization methods.  The algorithm for evaluating the derivative of the error function is known as back-propagation, because it propagates the errors backward through the network. 

3.4.1      Error Function Derivative Calculation

We considers a general feed-forward network with arbitrary differentiable non-linear activation functions and a differential error function.  

From Section 3.1, we know that each unit j is obtained by first forming a weighted sum of its inputs of the form,

                                                                                                     (3.8)    

 where zi is the activation of an unit, or input. We then apply the activation function

                                                                                                        (3.9)

Note that one or more of the variables zj in (3.8) could be an input, in which case we will denote it by xi. Similarly, the unit j in (3.9) could be an output unit, which we will denote by yk.

The error function will be written as a sum, over all patterns in the training set, of an error defined for each pattern separately,

                                                                                    (3.10)

where p indexes the patterns, Y is the vector of outputs, and W is the vector of all weights. Ep can be expressed as a differentiable function of the output variable yk.

The goal is to find a way to evaluate the derivatives of the error functions E with respect to the weights and bias. Using (3.10) we can express these derivatives as sums over the training set patterns of the derivatives for each pattern separately. Now we will consider one pattern at a time.

For each pattern, with all the inputs, we can get the activations of all hidden and output units in the network by successive application of (3.8) and (3.9). This process is called forward propagation or forward pass. Once we have the activations of all the outputs, together with the target values, we can get the full expression of the error function Ep.

Now consider the evaluation of the derivative of Ep with respect to some weight wji . Applying the chain rule can get the partial derivatives

                                                                                                (3.11)

where we define

                                                                                                          (3.12)

From equation (3.11) it is easy to see that the derivative can be obtained by multiplying the value of δ for the unit at the output end of the weight by the value of z for the unit at the input end.   Thus the task becomes to find the δj for each hidden and output unit in the network.

       For the output unit, δk is very straightforward,

                                                                                     (3.13)

For a hidden unit, δk is obtained indirectly. Hidden units can influence the error only through their effects on the unit k to which they send output connections so

                                                                                                (3.14)

The first factor is just the δk of unit k so

                                                                                        (3.15)

For the second factor we know that if unit j connects directly to unit k then , otherwise it is zero.  So we can get the following back-propagation formula,

                                                                                           (3.16)

which means that the values of δ for a particular hidden unit can be obtained by propagating the δ’s backwards from units later in the network, as shown in Figure 3‑2.  Recursively applying the equation gets the δ’s for all of the hidden units in a feed-forward network, no matter how many layers it has.

Figure 32 Backward propagation

3.4.2      Weight Adjustment with the Gradient Descent Method

Once we get the derivatives of the error function with respect to weights, we can use them to update the weights so as to decrease the error. There are many varieties of gradient-based optimization algorithms based on these derivatives.  One of the simplest of such algorithms is called gradient descent or steepest descent. With this algorithm, the weights are updated in the direction in which the error E decreases most rapidly, i.e., along negative gradient. The weight updating process begins with an initial guess for weights (which might be chosen randomly), and then generates a sequence of weights using the following formula,

                                                                                            (3.17)

where η is a small positive number called the learning rate, which is the step size we need to take for the next step.  Gradient descent only tells us the direction we will move to, but the step size or learning rate needs to be decided as well. Too low a learning rate makes the network learn very slowly, while too high a learning rate will lead to oscillation. One way to avoid oscillation for large η is to make the weight change dependent on the past weight change by adding a momentum term,

                                                                                         (3.18)

That is, the weight change is a combination of a step down the negative gradient, plus a fraction α of the previous weight change, where  and typically [6].

The role of the learning rate and the momentum term are shown in Figure 3‑3 [3]. When no momentum term is used, it typically takes a long time before the minimum is reached with a low learning rate (a), whereas for large learning rates the minimum may be never reached because of oscillation (b). When adding a momentum term, the minimum will be reached faster (c).

Text Box: a

Figure 33  The descent vs. learning rate and momentum

There are two basic weight-update variations: batch learning and incremental learning. With batch learning, the weights are updated over all the training data. It repeats the following loop: a) Process all the training data; b) Update the weights. Each such loop through the training set is called an epoch. While for incremental learning, the weights are updated for each sample separately. It repeats the following loop: a) Process one sample from the training data; b) Update the weights.

3.5      Other Optimization Algorithms

Though the gradient descent optimization method used in the standard back-propagation learning algorithm is widely used and proven very successful in many applications, it does suffer two problems:

1)      The convergence tends to be extremely slow

2)      Convergence to the global minimum is not guaranteed

Many researchers [6][7][10][11][12] have devised improvements to the standard gradient descent method such as dynamically modifying learning parameters or adjusting the steepness of the sigmoid function.

In appropriate circumstances, other optimization methods may be better than the gradient descent.  Many converge much faster than gradient descent in certain situations while others promise a higher probability of convergence to global minima [6].

One of the most often recommended optimization methods to replace the gradient descent is conjugate gradient descent [1][6][7], which is a direction set minimization method. Minimization along a direction d brings the function E to a place where its gradient is perpendicular to d.  Instead of following the gradient at every step, a set of n directions is constructed which are all conjugate to each other, such that minimization along one of these directions does not spoil the minimization along one of the earlier direction.

Gradient methods using second-derivatives (Hessian matrix), such as Newton's method, can be very efficient under certain conditions [6].  Where first-order methods use a local linear approximation of the error surface, second-order methods use a quadratic approximation. Because such methods use all the first and second order derivative information in exact form, local convergence properties are excellent. Unfortunately, they are often impractical because explicit calculations of the full Hessian matrix can be very expensive in large problems.

Some powerful, stochastic optimization methods such as simulated annealing [1][6] and genetic algorithms [1][6], which can overcome the local minima, have also been used successfully in a number of problems.



<< Previou Page   Index Page   Next Page >>

Copyright ©2000-2007 Zhanshou Yu. All Right Reserved.