4         Data Collection, Analysis and Processing

One of the most important components in the success of any neural network solution is the data. The quality, availability, reliability, repeatability, and relevance of the data used to develop and run the system are critical to its success. Even a primitive model can perform well if the input data has been processed in such a way that it clearly reveals the important information. On the other hand, even the best model cannot help us much if the necessary input information is presented in a complex and confusing way.

Figure 4‑1 Data processing

Data processing starts with data collection and analysis, followed by pre-processing, after which the data is fed to the neural network. Finally, if necessary, post-processing transforms the outputs of the network into the required outputs (Figure 4‑1). This chapter discusses some of the most important considerations involved in processing data for neural networks.

4.1      Types of Variables

Variables can be roughly divided into two categories based on their properties [1][7]:

1)      Categorical Variables

Categorical variables do not have a natural ordering: relationships like “greater than” or “less than” do not apply to them. Some of them have non-numerical values that must be transformed into numerical values before they can be used as input variables. For example, a variable called “color type”, which can take on the values “red”, “green”, and “yellow”, is a categorical variable. Sex is a categorical variable too. Numerical data can also be categorical; zip codes and telephone area codes are classic examples.

Categorical variables can be presented to the network with the 1-of-c encoding scheme, which uses as many units as there are values that the variable can take on. Exactly one of the units is turned on according to the value of the variable, and all the other units are turned off. The “color type” example above requires three input variables, with the three colors represented by the input values (1,0,0), (0,1,0) and (0,0,1).
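The 1-of-c scheme can be sketched in a few lines of Python. This is only a minimal illustration: the function name one_of_c and the list COLORS are hypothetical names chosen here, and the order of the list decides which unit stands for which color.

```python
def one_of_c(value, categories):
    """Return a 1-of-c vector: 1 at the position of `value`, 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Assumed category order; it fixes the meaning of each input unit.
COLORS = ["red", "green", "yellow"]

for color in COLORS:
    print(color, one_of_c(color, COLORS))
```

Rejecting unknown values explicitly is a design choice of this sketch; silently returning an all-zero vector would hide data errors from the validity checks discussed later in the chapter.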

Another way to encode a categorical variable is to map all of its possible values onto one continuous input variable. For example, “red”, “green”, and “yellow” could be represented as 0.0, 0.5, and 1.0. The drawback of this method is that it imposes an artificial ordering on the data that does not exist. For variables with a large number of categories, however, it can dramatically decrease the number of input units.

2)      Ordinal Variables

Ordinal variables have a natural ordering. Such data can simply be transformed directly into corresponding values of a continuous variable, either with or without scaling.
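As a rough sketch of the ordinal case, each level can be replaced by its rank and, optionally, scaled into the range 0 to 1. The variable SIZES and both function names are hypothetical examples, not part of any particular toolkit.

```python
# Hypothetical ordinal variable, listed from lowest to highest level.
SIZES = ["small", "medium", "large", "extra large"]


def ordinal_rank(value, ordered_levels):
    """Unscaled encoding: the position of the level in its natural order."""
    return ordered_levels.index(value)


def ordinal_scaled(value, ordered_levels):
    """Scaled encoding: ranks spread evenly over [0, 1] (needs >= 2 levels)."""
    return ordinal_rank(value, ordered_levels) / (len(ordered_levels) - 1)


for size in SIZES:
    print(size, ordinal_rank(size, SIZES), ordinal_scaled(size, SIZES))
```

Evenly spaced codes assume that neighbouring levels are equally far apart, which may or may not be true for a given variable.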

4.2      Data Collection

The data collection plan typically consists of three tasks:

1)   Identifying the data requirement

The first thing to do when planning data collection is to decide what data we will need to solve the problem. In general, it will be necessary to obtain the assistance of some experts in the field. We need to know: a) What data are definitely relevant to the problem; b) What data may be relevant; c) What data are collateral. Both relevant and possibly relevant data should be considered as inputs to the application.

2)   Identifying data sources

The next step is to decide where the data will be obtained from. This will allow us to make realistic estimates of the difficulty and expense of obtaining it. If the application demands real-time data, these estimates should include an allowance for converting analogue data to digital form.

In some cases, it may be desirable to obtain data from a simulation of the real situation. This could be the case if the application is intended to monitor conditions which have health, safety or significant cost implications. Care must be taken to ensure that the simulation is accurate and representative of the real case.

3)   Determining the data quantity

It is important to make a reasonable estimation of how much data we will need to develop the neural network properly. If too little data is collected, it may not reflect the full range of properties that the network should be learning, and this will limit its performance with unseen data. On the other hand, it is possible to introduce unnecessary expense by collecting too much data. In general, the quantity of data required is governed by the number of training cases that will be needed to ensure the network performs adequately. The intrinsic dimensionality of the data and the required resolution are the main factors determining the number of training cases and, therefore, the quantity of data required.

It is vital to assess correctly the quality of the data that will be presented to the neural network. Often, the data will be less than perfect, and if the network is to perform correctly then it needs to be trained with a greater quantity of data than would be the case if high quality data were available.

4.3      Preliminary Data Analysis

There are two basic techniques which can be used to help us understand the data:

1)   Statistical analysis

Neural networks can be regarded as extensions of standard statistical techniques, so standard statistical tests can give us an idea of the performance the network is likely to achieve. In addition, statistical analysis can give useful clues to the defining features of the data. For example, if the data is divided into classes, a statistical test can indicate how well the different classes can be distinguished in the raw or pre-processed data.

2)   Data visualization

Plotting a graph of the data in a suitable format enables us to spot distinguishing features, such as kinks or peaks, which characterize the data. This will enable us to plan and, if practicable, test the pre-processing required to enhance those features.

Preliminary data analysis often combines both visualization and statistical tests in an iterative manner. Visualization gives an appraisal of the data, and ideas about the underlying patterns, while statistical analysis enables us to test those ideas.
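As one possible illustration of the statistical side of this analysis, the sketch below computes the statistic underlying Welch's t-test for a single feature measured in two classes. The choice of this statistic, the function name welch_t, and the sample values are all assumptions made for the example; a complete test would also convert the statistic into a p-value.

```python
import math
import statistics


def welch_t(class_a, class_b):
    """Difference of class means divided by its estimated standard error."""
    var_a = statistics.variance(class_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(class_b)
    std_err = math.sqrt(var_a / len(class_a) + var_b / len(class_b))
    return (statistics.fmean(class_a) - statistics.fmean(class_b)) / std_err


# Made-up feature values for two classes.
feature_in_class_a = [1.0, 2.0, 3.0]
feature_in_class_b = [7.0, 8.0, 9.0]

t = welch_t(feature_in_class_a, feature_in_class_b)
print(f"t statistic: {t:.3f}")
```

A large absolute value suggests that this feature alone already separates the two classes reasonably well; a value near zero suggests that it does not, at least not through its mean.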

4.4      Data Preparation

When the raw data has been collected, it may need converting into a more suitable format. At this stage, we should do the following:

1)   Data validity checks

Data validity checks will reveal any patently unacceptable data that, if retained, would produce poor results. A simple data range check is an example of validity checking. For example, if we have collected oven temperature data in degrees centigrade, we would expect values in the range 50 to 400. A value of, say, -10, or 900, is clearly wrong.
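A range check of this kind might look as follows. The limits 50 and 400 come from the oven example above; the function name and the sample readings are made up for illustration.

```python
def split_by_range(readings, low=50.0, high=400.0):
    """Separate readings inside [low, high] from those outside it."""
    valid, rejected = [], []
    for reading in readings:
        if low <= reading <= high:
            valid.append(reading)
        else:
            rejected.append(reading)
    return valid, rejected


# Made-up oven temperatures in degrees centigrade.
readings = [180.0, 220.5, -10.0, 350.0, 900.0]
valid, rejected = split_by_range(readings)
print("valid:", valid)
print("rejected:", rejected)
```

Keeping the rejected values, rather than dropping them silently, makes it possible to look for patterns in the faulty data afterwards.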

If there is a pattern to the distribution of faulty data (for example, if most of it was collected on a Monday morning), try to diagnose the cause. Depending on the nature of the fault, we may need to discard the data or make allowances for its shortcomings. If the data contains undesirable deterministic components such as trends or seasonal variation, they should be removed first [13].
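One common way to remove a trend, assuming it is linear and the samples are equally spaced, is to fit a straight line by least squares and subtract it. The sketch below shows that idea only; it is not necessarily the procedure described in [13], and the function name is hypothetical.

```python
def remove_linear_trend(series):
    """Subtract the least-squares straight line from an equally spaced series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    t_mean = (n - 1) / 2  # mean of the time indices 0 .. n-1
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


# Made-up series: a pure linear trend, so the residuals should be about zero.
series = [2.0 + 0.5 * t for t in range(10)]
residuals = remove_linear_trend(series)
print(max(abs(r) for r in residuals))
```

Seasonal variation would need a different treatment, such as subtracting the average value for each position in the seasonal cycle.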

2)   Partitioning data

Partitioning is the process of dividing the data into training sets, validation sets, and test sets. Training sets are used to actually update the weights in a network; validation sets are used to decide the architecture of the network; test sets are used to examine the final performance of the network. The primary concerns should be to ensure that: a) the training set contains enough data, with a suitable distribution, to adequately demonstrate the properties we wish the network to learn; b) there is no unwarranted similarity between data in different data sets.
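A simple random partition could be sketched as below. The 60/20/20 split, the fixed seed, and the function name are arbitrary assumptions for the example, not recommendations from the text.

```python
import random


def partition(cases, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the cases and cut them into training, validation and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test


train, validation, test = partition(range(100))
print(len(train), len(validation), len(test))
```

Random shuffling alone does not guard against unwarranted similarity between the sets. If, for instance, several cases come from the same source, it is usually safer to assign whole groups of related cases to a single set.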

4.5      Data Pre-Processing

Theoretically, a neural network could be used to map the raw input data directly to the required output data. In practice, however, it is nearly always beneficial, and sometimes critical, to apply pre-processing to the input data before it is fed to a network. There are many techniques and considerations relevant to data pre-processing. Pre-processing can vary from simple filtering (as in time-series data) to complex processes for extracting features from image data. Since the choice of pre-processing algorithms depends on the application and the nature of the data, the range of possibilities is vast. However, the aims of pre-processing algorithms are often very similar, namely [1][5][7]:

1) Transform the data into a form suited to the network inputs - this can often simplify the processing that the network has to perform and lead to faster development times. Such transformations may include:

·        Apply a mathematical function (such as a logarithm or square) to an input;

·        Encode textual data from a database;

·        Scale data so that it has a zero mean and a standard deviation of one;

·        Take the Fourier transform of a time-series.
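Two of the transformations listed above, scaling to zero mean and unit standard deviation and applying a logarithm, can be sketched with the Python standard library. The function names and sample values are illustrative assumptions; the population standard deviation is used here, although the sample standard deviation is an equally common choice.

```python
import math
import statistics


def standardize(values):
    """Scale values to zero mean and a standard deviation of one."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("a constant input cannot be scaled to unit deviation")
    return [(v - mean) / std for v in values]


def log_transform(values):
    """Apply the natural logarithm; every value must be positive."""
    return [math.log(v) for v in values]


# Made-up input values with mean 5 and population standard deviation 2.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(raw)
print(scaled)
print(log_transform([1.0, 10.0, 100.0]))
```

In practice the mean and standard deviation would normally be computed on the training set only and then reused unchanged for the validation and test sets.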

2) Select the most relevant data - This may include simple operations such as filtering or taking combinations of inputs to optimize the information content of the data. This is particularly important when the data is noisy or contains irrelevant information. Careful selection of relevant data will make networks easier to develop and improve their performance on noisy data.

3) Minimize the number of inputs to the network - Reducing the dimensionality of the input data and minimizing the number of inputs to the network can simplify the problem. In some situations - for example in image processing - it is simply impossible to apply all the inputs to the network. In an application to classify cell types from microscope images, each image may contain a quarter of a million pixels: clearly, it would not be feasible to use that many inputs. In this case, the pre-processing might compute some simple parameters such as area and length/height ratio, which would then be used as inputs to the network. This process is called feature extraction [7].
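The kind of feature extraction described above can be illustrated on a tiny binary image. The sketch assumes the object has already been separated from the background, so that object pixels are 1 and all others 0, and it takes "length" and "height" to mean the width and height of the object's bounding box; both are assumptions of this example.

```python
def extract_features(mask):
    """Return (area, length/height ratio) for the object in a 0/1 image."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask contains no object pixels")
    area = sum(sum(row) for row in mask)  # number of object pixels
    height = rows[-1] - rows[0] + 1       # bounding-box height
    length = cols[-1] - cols[0] + 1       # bounding-box width
    return area, length / height


# Made-up 4 x 6 image containing one 2 x 4 rectangular object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
area, ratio = extract_features(mask)
print(area, ratio)
```

Here 24 pixel values are reduced to two inputs; for an image with a quarter of a million pixels the reduction is correspondingly larger.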

4.6      Data Post-Processing

Post-processing covers any process that is applied to the output of the network. As with pre-processing, it is entirely dependent on the application and may include detecting when a parameter exceeds an acceptable range, or using the output of a network as one input to a rule-based processor. Sometimes it is just the reverse process of data pre-processing.
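As a small sketch of two such steps, the code below reverses a zero-mean, unit-deviation scaling and then flags outputs that fall outside an acceptable range. The scaling constants, the limits, and the function names are all assumed values chosen for illustration.

```python
def unscale(outputs, mean, std):
    """Reverse a zero-mean, unit-deviation scaling applied before training."""
    return [o * std + mean for o in outputs]


def out_of_range(values, low, high):
    """Return the positions of values that fall outside [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


# Assumed network outputs in scaled units, with assumed scaling constants.
network_outputs = [-1.5, 0.0, 2.5]
temperatures = unscale(network_outputs, mean=200.0, std=100.0)
alarms = out_of_range(temperatures, low=50.0, high=400.0)
print(temperatures, alarms)
```

The list of flagged positions could then be passed on to a rule-based processor or an alarm handler, depending on the application.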




Copyright ©2000-2007 Zhanshou Yu. All Rights Reserved.