
Unlike the probabilistic classifier, neither the ANN nor the SVM showed a decay in performance as the size of the input vector increased (see Error: Reference source not found). More precisely, the performance increases meaningfully from 1 to 10 input features and tends to flatten once the size of the input vector exceeds 10. On the other hand, the probabilistic classifier performs better than the ANN and the SVM when the size of the input vector is small.

The performance of the ANN as a function of the number of neurons shows a different behavior for the three class partitions. While for the 2-class partitions the performance is nearly constant for more than 3 neurons, the 3-class and 4-class partitions improve markedly up to 10 neurons (see Error: Reference source not found, Error: Reference source not found, and Error: Reference source not found).

We tried different assignments for the SVM parameters ε, σ and C in order to find the configuration with the highest efficiency. As we can see from Error: Reference source not found, when we keep ε and C constant (ε=0.001 and C=1000, for example), the SVM performance for a high number of input features depends on σ and reaches a maximum when σ=1, corresponding to an optimum trade-off between SVM generalization capability (large values of σ) and model accuracy with respect to the training data (small values of σ). The value of σ corresponding to this trade-off decreases to 0.1 for lower values of the input vector size, reflecting the fact that the generalization capability is less important when the training set is more representative. If we keep σ and C constant (σ=1 and C=1000, for example), the best performance is achieved when ε is close to 0 and the allowed training error is minimized (see Error: Reference source not found). From this observation, by abductive reasoning, we could conclude that the input noise level is low [82]. In accordance with this behavior, the performance of the network improves when the parameter C increases from 0.01 to 1000 (see Error: Reference source not found). Since the results tend to flatten for values of C greater than 1000, the parameter C was set equal to 1000.
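To make this kind of parameter sweep concrete, the following is a minimal sketch of a grid search over ε, σ and C using scikit-learn's SVR with an RBF kernel; the synthetic data, the value grids and the scoring are placeholders, not the setup used in this thesis. Note that scikit-learn parameterizes the RBF kernel with gamma rather than σ, with gamma = 1/(2σ²).

```python
# Hedged sketch of the epsilon/sigma/C grid search described above (not the thesis code).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic stand-in for the real feature matrix and target.
X = rng.normal(size=(300, 10))
y = X[:, 0] + 0.1 * rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def sigma_to_gamma(sigma):
    # scikit-learn's RBF kernel is exp(-gamma * ||x - x'||^2),
    # so gamma = 1 / (2 * sigma**2) for a Gaussian kernel of width sigma.
    return 1.0 / (2.0 * sigma ** 2)

best = None
for C in (0.01, 1, 100, 1000):
    for sigma in (0.1, 1.0, 10.0):
        for eps in (0.001, 0.01, 0.1):
            model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=sigma_to_gamma(sigma))
            model.fit(X_train, y_train)
            mse = mean_squared_error(y_test, model.predict(X_test))
            if best is None or mse < best[0]:
                best = (mse, C, sigma, eps)

print("best MSE %.4f with C=%g, sigma=%g, epsilon=%g" % best)
```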

From the computational complexity point of view, the ANN and the SVM provide a more concise model of the density, and so reduce the computational overheads associated with the Parzen window, which requires the storage of all the training data [50].

6.5: Conclusion

In this Chapter we have shown how information-theoretic methods for feature selection can be used efficiently for the assessment of protein three-dimensional models.

This first application allowed us to test the effectiveness of the feature selection algorithm described in Chapter 4.

Since our approach does not depend on the method used to predict the protein structure, it can be used with all the main tertiary structure prediction algorithms. In particular, it can be combined with a traditional consensus procedure for the meta-prediction of protein structure.

Future work will include the analysis of different methods to evaluate the quality of the models, including both alignment-dependent and alignment-independent measures [51]. Furthermore, we will make use of a larger database with a set of protein models representative of the Protein Data Bank (PDB) of known protein structures [52]. Finally, the model assessment algorithm will be applied as the final step of a protein three-dimensional-structure meta-predictor, and its performance will be compared with that of a traditional consensus procedure.

Chapter 7: An Introduction to Time Series Analysis

A time series is a collection of observations made sequentially through time [53]. Examples occur in a variety of fields, ranging from economics to engineering, and methods of analysing time series constitute an important area of statistics. The meaning of modeling data via time series methods is the following:

Time series analysis accounts for the fact that data points taken over time may have an internal structure that should be accounted for [54].

A time series is said to be continuous when observations are made continuously through time. A discrete time series is an ordered sequence of values of a variable at equally spaced time intervals. Time series models are used for two purposes [54]:

Obtain an understanding of the underlying forces and structure that produced the observed data

Fit a model and proceed to forecasting, monitoring or even feedback and feedforward control.

Time Series Analysis is used for many applications such as:

Economic Forecasting

Weather Forecasting

Process and Quality Control

The term "univariate time series" refers to a time series that consists of single (scalar) observations recorded sequentially over equal time increments. Throughout this thesis we will deal with univariate series. Univariate time series can be investigated using a great variety of models. The most general Box-Jenkins model [55] includes difference operators, autoregressive terms, moving average terms, seasonal difference operators, seasonal autoregressive terms, and seasonal moving average terms. As with modeling in general, however, only necessary terms should be included in the model. There are three primary stages in building a univariate time series model.

Model Identification

Model Estimation

Model Validation

Model identification is the analytical process during which the model that best fits the given time series is chosen and its parameters are optimized. The first step in developing a Box-Jenkins model is to determine whether the series is stationary. During the identification and estimation phase it can be useful to perform a spectral analysis of the time series in order to find out whether the chosen sampling frequency is high enough to represent the time series correctly. Furthermore, it may be necessary to adjust the optimum sampling frequency during potential transients in the data.

Estimating the parameters for the Box-Jenkins models is a quite complicated non-linear estimation problem. The main approaches to fitting Box-Jenkins models are non-linear least squares and maximum likelihood estimation. Maximum likelihood estimation is generally the preferred technique.

Model diagnostics for Box-Jenkins models is similar to model validation for non-linear least squares fitting. That is, the error term is assumed to follow the assumptions for a stationary univariate process. The residuals should be white noise (or independent when their distributions are normal), drawn from a fixed distribution with constant mean and variance. If the Box-Jenkins model is a good model for the data, the residuals should satisfy these assumptions. Many mathematical tools can be used to verify whether the whiteness assumptions are satisfied; one of the most widely used is Anderson's whiteness test (see Chapter 8). If the whiteness assumptions are not satisfied, we need to fit a more appropriate model: we go back to the model identification step and try to develop a better one. Hopefully the analysis of the residuals can provide some clues towards a more appropriate model.
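As an illustration of this three-stage workflow, the sketch below fits a small ARIMA model with statsmodels and then checks the residuals for whiteness. It is a generic example on synthetic data rather than any model used later in this thesis, the order (1, 0, 1) is only a placeholder, and the residual check uses the Ljung-Box test as a stand-in for the Anderson test discussed in Chapter 8.

```python
# Hedged sketch: a generic Box-Jenkins cycle (identification, estimation, diagnostics).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)

# Synthetic AR(1) series standing in for a real univariate time series.
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# Estimation: fit a candidate ARMA(1,1) model (the order is a placeholder;
# in practice it comes from the identification step).
result = ARIMA(y, order=(1, 0, 1)).fit()
print(result.summary())

# Diagnostics: Ljung-Box test on the residuals. Large p-values are consistent
# with white-noise residuals, i.e. the model captures the serial dependence.
print(acorr_ljungbox(result.resid, lags=[10, 20], return_df=True))
```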

In the following chapters we will use time series analysis for regression purposes: we will try to predict the future values of given time series. Prediction in the context of time series analysis usually means prediction of future values of a primary series of interest from past observations of the primary series itself (predictand) as well as, when feasible, from past observations of covariate series (predictors). An example is prediction in autoregressive moving average (ARMA) series, where the nature of statistical dependence within and between series enables successful forecasting from past observations, perhaps after some differencing or other linear filtering. Nonlinear alternatives include neural network prediction models, where the input series and the primary output series are connected by a sequence of layers, each containing several nodes. In many cases a time series may not follow a generalized linear model or any ARMA model, and may be observed irregularly with gaps of missing data. In addition, the data may be governed by a skewed distribution far from Gaussian, which is often the case with economic and geophysical data. These problems, namely very short time series, data gaps, irregularly observed time series, non-Gaussian data, and incorporation of covariate data, can be alleviated surprisingly well by a Bayesian approach centered on the estimation of the predictive density: the conditional density of the value to be predicted given the observed data. This approach can also be used in the pre-processing phase as a method for selecting the relevant data (input features) for the forecasting goal (see Chapter 4).

In this thesis the time series are related to measurements of physical quantities, such as meteorological data and air pollutant concentrations. The regression models adopted for such series are developed using machine learning techniques and tools such as Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs). Machine learning methods give an implicit identification and estimation of the model underlying the time series analyzed and can be trained effectively, in the case of SVMs by means of quadratic programming. In fact, as seen in Chapter 1, machine learning is concerned with the development of algorithms and techniques that allow computers to "learn" and attempts to eliminate the need for human intuition and effort in the analysis of the data. The next two Chapters are applications of the machine learning based data mining system depicted in Chapter 5 to time series analysis.

Chapter 8: Air pollution concentrations forecasting algorithms

Compliance with European laws concerning urban and suburban air pollution requires the analysis and implementation of automatic operating procedures in order to prevent the risk that the principal air pollutants exceed the alarm thresholds. The aim of the analysis is the medium-term forecasting of the air pollutants' mean and maximum values by means of actual and forecasted meteorological data. Critical air pollution events frequently occur where the geographical and meteorological conditions do not permit an easy circulation of air and a large part of the population moves frequently between distant places of a city. These events require drastic measures such as the closing of schools and factories and the restriction of vehicular traffic. Forecasting such phenomena up to two days in advance would allow more effective measures to be taken to safeguard citizens' health.

In all cases in which we can assume that the air pollutant emission and dispersion processes are stationary, it is possible to solve this problem by means of statistical learning algorithms that do not require an explicit prediction model. The definition of a prognostic dispersion model is necessary when the stationarity conditions are not verified. This may happen, for example, when one needs to forecast the variation in air pollutant concentrations due to a large change in the emissions of a source or to the presence of a new source, or when a prediction is needed in an area with no measurement points.

Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) have often been used as prognostic tools for air pollution [56] [57] [58]. Even if we refer to these approaches as black-box methods, inasmuch as they are not based on an explicit model, they have generalization capabilities that make their application to non-stationary situations possible.

In particular, the combination of the predictions of a set of models to improve the final prediction represents an important research topic, known in the literature as stacking. A general formalism that describes this technique can be found in [59]. The approach consists in iterating a procedure that combines measurement data and data obtained by means of prediction algorithms, in order to use them all as the input to a new prediction algorithm. This technique was used in [60], where the prediction of the maximum ozone concentration 24 hours in advance, for the urban area of Lyon (France), is implemented using a set of nonlinear models identified by different SVMs. The choice of the proper model is based on the meteorological conditions (geopotential label), and the forecasting of the mean ozone concentration for a specific day is carried out for each model taking as input variables the maximum ozone concentration and the maximum air temperature observed on the previous day, together with the maximum forecasted air temperature for that day.

The first step of the implementation of a prognostic neural network or SVM is the selection of the best subset of features that are going to be used as the input to the forecasting tool.

The method we used for feature selection is the algorithm with backward eliminations described in Chapter 4. The analysis described in the following was performed on the hourly data of the principal air pollutants (SO2, NO, NO2, NOx, CO, O3, PM10) and meteorological parameters (air temperature, relative humidity, wind velocity and direction, atmospheric pressure, solar radiation and rain) measured by a station located in the urban area of the city of Goteborg (Sweden) and relative to the time period 01/04÷10/05 [61]. A brief introduction to the problem of air pollutants and the reasons why such pollutants are dangerous for human health precedes the description of the data analysis method.

8.1: Air pollutants

One of the biggest problems of urban areas is air pollution. Air pollution arises from the adverse effects on the environment of a variety of contaminants emitted into the atmosphere by natural and man-made processes. Due to heavy vehicular traffic and the possible presence of industrial areas, pollutants can often be found in the urban air of a city at concentrations higher than the alarm levels fixed by law. The prediction of a pollution episode is therefore of fundamental importance to safeguard the health of citizens. A list of the principal air pollutants and their main characteristics follows [62].

Sulphur Dioxide (SO2)

Sulphur dioxide is an acidic gas which combines with water vapour in the atmosphere to produce acid rain. Both wet and dry deposition have been implicated in the damage and destruction of vegetation and in the degradation of soils, building materials and watercourses. SO2 in ambient air can also affect human health, particularly in those suffering from asthma and chronic lung diseases.

The principal source of this gas is power stations burning fossil fuels which contain sulphur. Major SO2 problems now only tend to occur in cities in which coal is still widely used for domestic heating, in industry and in power stations. As many power stations are now located away from urban areas, SO2 emissions may affect air quality in both rural and urban areas. The last 40 years have seen a decline in coal burning (domestic, industrial and in power generation). As a result, ambient concentrations of this pollutant in the most developed countries have decreased steadily over this period. Even moderate concentrations may result in a fall in lung function in asthmatics. Tightness in the chest and coughing occur at high levels, and lung function of asthmatics may be impaired to the extent that medical help is required. Sulphur dioxide pollution is considered more harmful when particulate and other pollution concentrations are high.

Nitrogen Oxides (NO and NO2)

Nitrogen oxides are formed during high temperature combustion processes from the oxidation of nitrogen in the air or fuel. The principal source of nitrogen oxides - nitric oxide (NO) and nitrogen dioxide (NO2), collectively known as NOx - is road traffic, which is responsible for approximately half the emissions in Europe. NO and NO2 concentrations are therefore greatest in urban areas where traffic is heaviest. Other important sources are power stations, heating plants and industrial processes. Nitrogen dioxide can irritate the lungs and lower resistance to respiratory infections such as influenza. Continued or frequent exposure to concentrations that are typically much higher than those normally found in the ambient air may cause increased incidence of acute respiratory illness in children.

Carbon Monoxide (CO)

Carbon monoxide (CO) is a toxic gas which is emitted into the atmosphere as a result of combustion processes, and is also formed by the oxidation of hydrocarbons and other organic compounds. In European urban areas, CO is produced almost entirely (90%) from road traffic emissions. It survives in the atmosphere for a period of approximately one month but is eventually oxidised to carbon dioxide (CO2). This gas prevents the normal transport of oxygen by the blood, which can lead to a significant reduction in the supply of oxygen to the heart, particularly in people suffering from heart disease.

Ozone (O3)

Ground-level ozone (O3), unlike the other pollutants mentioned, is not emitted directly into the atmosphere, but is a secondary pollutant produced by the reaction between nitrogen dioxide (NO2), hydrocarbons and sunlight. Ozone levels are not as high in urban areas (where high levels of NO are emitted from vehicles) as in rural areas. Sunlight provides the energy to initiate ozone formation; consequently, high levels of ozone are generally observed during hot, still, sunny summertime weather. Ozone irritates the airways of the lungs, increasing the symptoms of those suffering from asthma and lung diseases.

Particulate (PM10)

Airborne particulate matter varies widely in its physical and chemical composition, source and particle size. PM10 particles (the fraction of particulates in air of very small size (<10 µm)) are of major current concern, as they are small enough to penetrate deep into the lungs and so potentially pose significant health risks. Larger particles, meanwhile, are not readily inhaled and are removed relatively efficiently from the air by sedimentation. The principal source of airborne PM10 matter in European cities is road traffic emissions, particularly from diesel vehicles. Fine particles can be carried deep into the lungs where they can cause inflammation and a worsening of the condition of people with heart and lung diseases. In addition, they may carry surface-absorbed carcinogenic compounds into the lungs.

In the following table the primary and secondary sources of air pollution are shown for each pollutant [63]. The limit values fixed by the law are reported in Appendix 1.

[Table columns: Air Pollutant | Primary Sources | Secondary Sources]

8.2: Features Selection

The first step of the analysis was the selection of the most useful features for the prediction of each of the targets relative to the air pollutant concentrations. For each air pollutant the target was chosen to be the mean value over 24 hours, measured every 4 hours (corresponding to 6 intervals per day). The complete set of features on which the selection was made consisted, for each of the available parameters (air pollutants, air temperature, relative humidity, atmospheric pressure, solar radiation, rain, wind speed and direction), of the maximum and minimum values and the daily averages of the previous three days, to which the measurement hour and the day of the week were added. Thus the initial set of features, for each air pollutant, included 130 features (a sketch of this feature construction is given below). A separate set of data was excluded from this analysis and used as the test set. In particular, the method described in Chapter 4 was applied to the selection of the best subset of features useful for the prediction of the average daily concentration of PM10 in the city of Goteborg. In fact we observed, from the data, that this concentration was several times above the limit value for the safeguard of human health (50 µg/m3). PM10 levels were split into three classes. These classes were selected so that they contained the same number of training samples; each class thus referred to approximately 33% of the training set samples. The Markov blanket size was set equal to 3. The best subset of 16 features turned out to be the following:

Average concentration of PM10 in the previous day.

Maximum hourly value of the ozone concentration one, two and three days in advance.

Maximum hourly value of the air temperature one, two and three days in advance.

Maximum hourly value of the solar radiation one, two and three days in advance.

Minimum hourly value of SO2 one and two days in advance.

Average concentration of the relative humidity in the previous day.

Maximum and minimum hourly value of the relative humidity in the previous day.

Average value of the air temperature three days in advance.

The results can be explained considering that PM10 is partly primary, emitted directly into the atmosphere, and partly secondary, that is, produced by chemical-physical transformations that involve different substances, such as SOx, NOx, VOCs and NH3, and that determine its production and/or removal [64].
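To make the feature construction concrete, the following is a minimal sketch of the kind of lagged statistics described above for a single parameter: daily mean, maximum and minimum over each of the previous three days, plus the day of the week, with the PM10 target discretized into three equally populated classes. The column names, the synthetic data and the pandas layout are assumptions for illustration, not the code used for the thesis analysis.

```python
# Hypothetical sketch of the lagged-feature construction and the tercile target.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed input: an hourly series of PM10 measurements (the real data set also
# contains the other pollutants and the meteorological parameters).
idx = pd.date_range("2004-01-01", periods=24 * 365, freq="h")
pm10 = pd.Series(rng.gamma(shape=2.0, scale=15.0, size=len(idx)), index=idx, name="PM10")

# Daily statistics, shifted by 1..3 days to obtain the "previous days" features.
daily = pm10.resample("D").agg(["mean", "max", "min"])
features = pd.concat(
    {f"PM10_{stat}_lag{lag}": daily[stat].shift(lag)
     for lag in (1, 2, 3) for stat in ("mean", "max", "min")},
    axis=1,
)
features["weekday"] = features.index.dayofweek

# Target: average daily PM10 concentration, split into three equally populated classes.
target = pd.qcut(daily["mean"], q=3, labels=[0, 1, 2])

data = features.join(target.rename("pm10_class")).dropna()
print(data.head())
```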

The available data did not allow the analysis of the other air pollutants, because the limit values fixed by Italian law are not exceeded in the available Swedish measurement data. Nevertheless, the feature selection algorithm gave other interesting results. The best subsets of 8 features for some significant air pollutant targets are reported below.

Maximum hourly concentrations of Nitrogen Dioxide (NO2):

Maximum hourly concentration of the NO2 in the previous day.

Average concentration of the ozone concentration in the previous day.

Maximum hourly value of the ozone concentration one, two, three days in advance.

Maximum hourly value of the air temperature in the previous day.

Minimum hourly value of the air temperature in the previous day.

Maximum hourly value of the wind direction in the previous day.

Maximum hourly concentrations of Sulphur Dioxide (SO2):

Minimum hourly concentration of SO2 in the previous day.

Maximum hourly value of the ozone concentration one, two, three days in advance.

Maximum hourly value of the air temperature one, two, three days in advance.

Minimum hourly value of the humidity in the previous day.

Average daily concentrations of Carbon Monoxide (CO):

Minimum hourly concentration of SO2 in the previous day.

Average concentration of CO in the previous day.

Maximum hourly value of the ozone concentration in the previous day.

Maximum hourly value of the air temperature one, two, three days in advance.

Minimum hourly value of the atmospheric pressure in the previous day.

Mean value of solar radiation in the previous day.

The following paragraphs are about the forecasting of the average daily concentration of PM10.

8.3: Forecasting when the concentrations are above the limit values for the protection of human health.

We used the machine learning tools illustrated in Chapter 5. The neural networks were trained on a representative subset of the data used for the feature selection algorithm. We used a subset of the first two years of data: one measurement sample every three samples, after leaving out one sample every five of the original data. In this way we reduced the computational time of the adopted machine learning algorithms while obtaining a subset of data as representative as that used for the feature selection. In fact, such a subset includes a sufficient number of samples from all the 6 daily intervals into which the measurement data were divided by our analysis. The test set consisted of the data not used for the feature selection algorithm. Since the number of training samples above the maximum threshold for the PM10 concentration was much lower than the number of samples under that threshold, the training of the networks was performed by giving more weight to the kind of samples present in smaller numbers. As we can see from Error: Reference source not found and Error: Reference source not found, the ANN performance, both for the samples under the threshold and for the samples above the threshold, increases as the number of input features increases. More precisely, the performance increases meaningfully from 2 to 8 input features and tends to flatten when the size of the input vector is greater than 8. The best subset of 8 features is the following:

Average concentration of PM10 in the previous day.

Maximum hourly value of the ozone concentration one, two and three days in advance.

Maximum hourly value of the air temperature in the previous day.

Maximum hourly value of the solar radiation one, two and three days in advance.

Selecting this set of 8 features as input to the ANN, the best results are obtained with a neural network having 18 neurons in the hidden layer. Table 2 displays the results obtained with 5115 samples of days under the threshold and 61 samples of days above the threshold. It can be noted that the probability of a false alarm is very low (0.82%), while the capability to forecast when the concentrations are above the threshold is about 80%.
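The weighting of the rarer above-threshold days mentioned above can be approximated in several ways; one simple option is sketched below, oversampling the minority class before training a small feed-forward network with scikit-learn. The 18-neuron hidden layer matches the text; everything else, including the synthetic data and the threshold rule, is a placeholder.

```python
# Hedged sketch: oversample the rare above-threshold days before training an ANN.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the 8 selected input features and the binary target
# (1 = daily mean PM10 above the 50 ug/m3 threshold, 0 = below).
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 2.2).astype(int)  # rare positives

# Replicate the minority-class samples so that both classes weigh roughly the same.
pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
reps = max(1, len(neg) // max(1, len(pos)))
idx = np.concatenate([neg, np.repeat(pos, reps)])
rng.shuffle(idx)

clf = MLPClassifier(hidden_layer_sizes=(18,), max_iter=1000, random_state=0)
clf.fit(X[idx], y[idx])
print("training accuracy:", clf.score(X, y))
```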

We tried different assignments for the SVM parameters ε, σ and C in order to find the configuration with the highest performance. As we can see from Error: Reference source not found, when we keep ε and C constant (ε=0.001 and C=1000), the SVM performance on the samples above the threshold, for a high number of input features, depends on σ and reaches a maximum when σ=1, corresponding to an optimum trade-off between SVM generalization capability (large values of σ) and model accuracy with respect to the training data (small values of σ). The value of σ corresponding to this trade-off decreases to 0.1 for lower values of the input vector size (Error: Reference source not found) and for samples below the threshold (Error: Reference source not found), reflecting the fact that the generalization capability is less important when the training set is more representative. If we keep σ and C constant (Error: Reference source not found: σ=1 and C=1000; Error: Reference source not found: σ=0.1 and C=1000), the best performance is achieved when ε is close to 0 and the allowed training error is minimized. From this observation, by abductive reasoning, we could conclude that the input noise level is low [73]. In accordance with this behavior, the performance of the network improves when the parameter C increases from 1 to 1000 (see Error: Reference source not found). Since the results tend to flatten for values of C greater than 1000, the parameter C was set equal to 1000. We can set σ equal to 0.1 in order to achieve a number of false alarms lower than in the case when σ is equal to 1. The best performance of the SVM corresponding to ε=0.001, σ=0.1 and C=1000 is achieved using as input features the best subset of 8 features previously defined. The probability of a false alarm is very low (0.13%), while the capability to forecast when the concentrations are above the threshold is about 80%. On the other hand, when σ is set equal to 1 we can maximize the performance of the SVM on the samples above the threshold. The best performance of the SVM corresponding to ε=0.001, σ=1 and C=1000 is achieved using as input features the best subset of 11 features. In this case the probability of a false alarm is higher than in the previous one (0.96%), but the capability to forecast when the concentrations are above the threshold is nearly 90%.

Error: Reference source not found shows a comparison of the performance of the SVM and the ANN (18 neurons in the hidden layer) as a function of the number of input features, for σ equal to 0.1 and 1 respectively (ε=0.001 and C=1000).

8.4: Future activities

The training of the ANN and SVM will be improved with stacking techniques, using as inputs both the measurements and the forecasted values of the selected features. Since for some pollutants the meteorological conditions are very important in the generation process, different neural networks will be trained for each geopotential condition [65]. The analysis will be completed by extending the forecasting capability of the machine learning algorithm to areas with no measurement points, by means of the optimization of a multi-source Gaussian dispersion model. It would be interesting to carry out the same kind of analysis described in this Chapter for PM10 also for the other air pollutants. Finally, it would be useful to carry out the same kind of analysis also for the city of Turin; to do so we may use the data from ARPA Piedmont.

Chapter 9: Weather Nowcasting Algorithms

Weather forecast systems are among the heaviest systems of equations that a computer must solve [66]. A very large amount of data, coming from satellites, ground stations and sensors located around our planet, provides every day information that must be used to foresee the weather situation in the next hours and days all around the world. Weather reports give forecasts for the next 24, 48 and 12 hours for wide/mesoscale areas [67]. Today they are quite reliable, but it is not unusual that in a restricted region the conditions suddenly change. Snow, ice and fog are very dangerous events that can occur on the roads, causing serious problems for road safety. A different strategy is presented here. "Nowcast" [68] means a forecast restricted to the next three hours in a very limited area. This new approach, where the data fusion is performed with soft-computing techniques, is far less time consuming than the algorithms used in traditional wide-area weather forecasts. Nowcasting does not replace the traditional weather report, but the possibility of having an accurate estimate of the weather conditions in the next three hours can be a useful tool in several situations, such as ice on the road, SMS-based weather information for mobile phone network subscribers and so on. The work described in this chapter is strongly related to enhancements to the NEMEFO nowcasting system [83], developed at the Polytechnic of Turin. It uses data sampled every 15 minutes by a meteorological station and forecasts the evolution of these data over the next three hours.

Meteorological stations measure data useful to evaluate the weather conditions of a specific area. The machine learning system we use to nowcast the future meteorological conditions must consider hundreds of thousands of data samples recorded and stored over the last years. The reliability of the prediction depends on the quantity of available data. The data mining system uses this "knowledge" to estimate the weather trend over the next three hours. However, it would be useless to train the system on the whole data set: this set must be analyzed to extract only the most significant information, which can then be used to foresee the temperature, the pressure, the humidity and so on. Our approach is based on the backward selection algorithm described in Chapter 4. Each target has been split into three classes. These classes have been selected so that they contain the same number of training samples; each class thus refers to approximately 33.33% of the training set samples. The Markov blanket size has been set equal to 3. The data set on which the selection algorithm is performed corresponds to the first half of the available data (measurement data from 01/01/05 to 30/06/06). The results of the analysis are described below for the most interesting and critical case: forecasting three hours in advance. For each available climatic variable (see Error: Reference source not found) we chose as input features the hourly values three, four, five hours and one day before the prediction time, the gradient and the second derivative of the variables three hours before the prediction time, and the sine and cosine of the prediction time. Thus we had 50 features for each climatic variable to be predicted (a sketch of this feature construction is given after the lists below). The best subset of 16 features for the soil temperature and the relative humidity turned out to be:

Soil temperature:

Measured soil temperature one day before the prediction time;

Gradient and second derivative of the soil temperature three hours before the prediction time;

Measured air temperature three and five hours before the prediction time

Second derivative of the air temperature three hours before the prediction time;

Measured relative humidity three, four, five hours and one day before the prediction time;

Gradient of relative humidity three hours before the prediction time;

Gradient and second derivative of solar radiation three hours before the prediction time;

Sine and cosine of the prediction time;

Relative humidity:

Measured relative humidity three, five hours and one day before the prediction time;

Gradient of relative humidity three hours before the prediction time;

Measured air temperature three, four, five hours and one day before the prediction time;

Measured soil temperature three hours and one day before the prediction time

Gradient of soil temperature three hours before the prediction time;

Measured solar radiation three hours and one day before the prediction time;

Gradient and second derivative of solar radiation three hours before the prediction time;

Cosine of the prediction time;
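As referred to above, the per-variable feature construction (lagged hourly values, first and second differences three hours before the prediction time, and the sine/cosine encoding of the prediction time) can be sketched as follows for a single climatic variable. The synthetic soil-temperature series, the hourly resampling and the column names are assumptions for illustration, not the NEMEFO code.

```python
# Hypothetical sketch of the per-variable feature construction for nowcasting.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed input: one climatic variable sampled every 15 minutes.
idx = pd.date_range("2005-01-01", periods=4 * 24 * 60, freq="15min")
soil_temp = pd.Series(
    10 + 5 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(scale=0.3, size=len(idx)),
    index=idx, name="soil_temp",
)

hourly = soil_temp.resample("h").mean()
lead = 3  # we predict the value three hours ahead of the issue time

feats = pd.DataFrame(index=hourly.index)
for lag in (3, 4, 5, 24):                            # hours (and one day) before prediction time
    feats[f"soil_temp_lag{lag}"] = hourly.shift(lag)
feats["grad_3h"] = hourly.diff().shift(lead)         # first difference, 3 hours before
feats["curv_3h"] = hourly.diff().diff().shift(lead)  # second difference, 3 hours before
feats["sin_hour"] = np.sin(2 * np.pi * feats.index.hour / 24)
feats["cos_hour"] = np.cos(2 * np.pi * feats.index.hour / 24)

target = hourly  # value of the variable at the prediction time
data = feats.join(target.rename("target")).dropna()
print(data.head())
```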

9.1: A spectral analysis of meteorological data for weather forecast applications

It is very important to have significant data as input to the data mining system. During the pre-processing phase for the meteorological forecaster it may be necessary to perform a spectral analysis of the meteorological variables in order to adjust the sampling frequency. In fact, like the great majority of natural signals, climatic signals are not stationary, and they undergo transients that include a large range of frequencies in a short period of time. The Fourier Transform is not sufficient to process this kind of signal, because all information on the time location of a given frequency is lost in the analytical process. The complexity of climate variability on all time scales requires the use of several refined tools to unravel its primary dynamics from observations. The analysis of the temporal series has shown that climatic variations are extremely irregular in the time-space domain. This feature makes them difficult to forecast unless suitable mathematical tools are used. The aim of this Paragraph is to describe how to analyze and represent climatic signals in order to foresee their future short-time evolution for weather nowcasting. A spectral analysis of the meteorological data is described. More precisely, the purpose of the spectral analysis is the determination of the optimum sampling frequency for a set of non-stationary climatic signals and the description of an algorithm for the automatic adjustment of such a frequency.

A great variety of applications, in technological fields very different from each other, require the processing of signals that represent, in general, the trend of the physical quantities under analysis. In many cases such signals are electric signals: bioengineering, telecommunications, mechanics and the study of seismic phenomena are some examples of scientific fields in which the information to be processed (the heart beat, television images, a mechanical vibration, a seismic movement) is first transduced (by means of suitable transducers) into an electric signal and then processed in this form. With the spread of computers and microprocessors, signals are often represented as sequences of numerical values that, in the continuous or discrete time domain, can be analyzed with numerical techniques and mathematically modeled as random processes (see [69] and [70]).

Climatic signals in the meteorological field can be analyzed with the same techniques. These signals can be represented by expansions on bases of functions that are ideally complete, non-redundant and orthogonal or orthonormal. In other words, we can represent a time-continuous signal (the same procedure is easily extended to discrete-time signals) by means of a sequence (possibly a finite one) of coefficients (see [69] and [71]). This representation may be exact or only approximate, and the corresponding basis shall be chosen as a trade-off between accuracy and simplicity. Given a finite-energy real signal x(t), its Fourier Transform X(f) can be interpreted as a particular kind of scalar product:

$$X(f) = \int_{-\infty}^{+\infty} x(t)\, e^{-j 2\pi f t}\, dt .$$

The functions $e^{j 2\pi f t}$ are an infinite set of complex orthogonal functions; in fact, it is easy to show that $\int_{-\infty}^{+\infty} e^{j 2\pi f_1 t}\, e^{-j 2\pi f_2 t}\, dt = \delta(f_1 - f_2)$.

We assume to have a sampled signal x(nT0), where fs = 1/T0 is the sampling frequency; we further assume that x(nT0) is meaningfully different from zero only in the interval $0 \le n T_0 < T$.

So we can build a periodic version of the signal x(t):

$$x_p(t) = \sum_{k=-\infty}^{+\infty} x(t - kT) = \frac{1}{T} \sum_{k=-\infty}^{+\infty} X\!\left(\frac{k}{T}\right) e^{j 2\pi k t / T},$$

where $X(f)$ is the amplitude spectrum of $x(t)$ and B = 1/T0 is another way to write the sampling frequency. We have N = T/T0 samples on which we calculate the transform. The coefficients of the Discrete Fourier Transform (DFT) of the sequence $x(nT_0)$, for $k = 0, 1, \dots, N-1$, can be written in the following way:

$$X_k = \sum_{n=0}^{N-1} x(nT_0)\, e^{-j 2\pi k n / N}.$$

The Fourier Transform can be adjusted to analyze only a small portion of the signal at a time: this technique is called windowing of the signal (Short-Time Fourier Transform, STFT) [72]. The STFT represents the signal as a bidimensional function of time and frequency. It provides information concerning when and at which frequencies an event present in the analyzed signal happened. However, the accuracy of this information is bounded by the window size. The disadvantage of this technique is that, once the width of the window is chosen, it remains the same for all frequencies. This affects the accuracy with which we can identify in time and frequency possible transients present in the signal. Outside the realm of sinusoidal signals there are bases more suitable than the Fourier basis for the analysis of transients and non-stationary signals. Such bases have to be local in time and frequency and simple to obtain. The wavelet analysis represents a windowing technique with variable dimensions: it allows a long time interval to be used when we want information about the low-frequency content of the signal, and shorter intervals when we want to investigate its high-frequency content. We can define the Continuous Wavelet Transform (CWT) of a signal x(t) as follows:

$$W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^*\!\left(\frac{t - b}{a}\right) dt .$$

The parameter a is called the scaling factor and is strictly greater than zero; b is the time translation. The prototype function $\psi(t)$ is called the mother wavelet; it must have zero mean and good localization in time, so it oscillates in a damped way and dies out like a little wave, which is why it is called a wavelet. An acceptable wavelet shall have a zero mean value and pass-band characteristics. Many methods and results of meteorological data spectral analysis already exist in the literature ([73-77]), but their aim is the study of the trends of global/medium-scale meteorological data in the medium and long term (on the time scale), while the analysis described in this thesis is useful for local weather nowcasting applications (see [68]). The spectral analysis described in this thesis was carried out in two steps. First, a Fourier analysis (Fast Fourier Transform, FFT [70], a much faster algorithm to compute the DFT) of the climatic data was performed. The outcome of the Fourier analysis was the determination of the optimum sampling frequency over the whole observation interval. Then many questions arose: does the sampling frequency have to be held at its optimum value all day? In which periods of time is each signal more stationary? In which others does it undergo fast transients? Is it possible to find an algorithm for the automatic adjustment of the sampling frequency? To answer these questions, the wavelet analysis (Discrete Wavelet Transform, DWT, see [71], [72]) of the same data was carried out in order to detect potential drifts and transients with daily periodicity.

9.1.1: Climatic Signals

A meteorological station and a sensor for air temperature and humidity measurements have been installed at the Department of Electronics of the Polytechnic of Turin. A data acquisition unit is connected to the sensors, enabling the recording of the relevant values as shown in Error: Reference source not found. The data used for the spectral analysis refer to the period 09/15/2003-02/17/2004. The following sections show the results of the spectral analysis for air and soil temperature and humidity.

9.1.2: Fourier Analysis Results

We performed the analysis on overlapping windows centered at the noon of each day. The parameters of the Fourier Transform were chosen as shown below:

Number of window samples: N=256 (approximately 2.6 days);

Sampling frequency: fs = 1/(15 min) = 1/900 Hz = 1.111·10⁻³ Hz;

Estimated bandwidth of the data: B = 5.555·10⁻⁴ Hz;

Frequency resolution: fs/N = 4.34·10⁻⁶ Hz.

First of all, the Fourier amplitude spectrum was calculated for each signal on each day in which all the data were significant (in the proper range of values). The bandwidth (B) of a signal was defined as the interval of frequencies in which 99.9% of the signal power (calculated as the sum of the coefficients of the periodogram) is located. The power density spectrum of each signal was truncated at the minimum of two values: the 99.9% power value and the value for which the signal rebuilt from the truncated spectrum, compared to the original one, showed an absolute error equal to a given threshold, Err_abs, determined by the precision of the corresponding sensor (the Err_abs thresholds are shown in Error: Reference source not found for air temperature, soil temperature and humidity).

The highest daily values of such bandwidth were assumed to be correlated to the optimum sampling frequency fs (they are shown in Error: Reference source not found for air temperature, soil temperature and humidity).

Note that, by the Nyquist criterion, the bandwidth cannot be greater than half the sampling frequency (fs = 1.111·10⁻³ Hz). So, if the optimum sampling frequency assumes a value very close to the measurement frequency, it is assumed that the corresponding parameter is undersampled. Then the average periodogram over the useful days was calculated. It has to be noted that the principal peak of the average periodogram for specific signals, like air and soil temperature and humidity, is on the fourth coefficient at f = 3·(1.111·10⁻³)/256 Hz = 1.3·10⁻⁵ Hz (see Error: Reference source not found, Error: Reference source not found and Error: Reference source not found). This value indicates a frequency of about one cycle per day and is in agreement with the daily pseudo-periodicity of the signals. Other signals with non-negligible high-frequency components, such as wind velocity and direction, rain and solar radiation, have broader spectra and should be considered as undersampled. On the other hand, pressure behaves nearly as a constant and has a spectrum with a high peak at the origin and values close to zero almost everywhere else. Even if the long-term high-frequency power content of the climatic data is small, the short-term high-frequency power content could affect the forecasting capability of the nowcasting tools [56] if no time-dependent adjustment of the sampling frequency is performed. Consequently, the wavelet analysis was applied to the climatic data.
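A minimal sketch of the bandwidth estimate just described: compute a 256-sample periodogram at the 15-minute sampling rate and find the smallest frequency below which 99.9% of the power lies. The synthetic daily-cycle signal is a placeholder for the measured data, and the mean is removed here so that the estimate reflects the time-varying content (an assumption of this sketch, not necessarily of the original analysis).

```python
# Hedged sketch of the 99.9%-power bandwidth estimate from a 256-sample window.
import numpy as np

fs = 1.0 / 900.0          # sampling frequency: one sample every 15 minutes
N = 256                   # window length (about 2.6 days)

rng = np.random.default_rng(0)
t = np.arange(N) / fs
# Synthetic stand-in for air temperature: daily cycle plus noise.
x = 10 + 5 * np.sin(2 * np.pi * t / 86400.0) + rng.normal(scale=0.2, size=N)

X = np.fft.rfft(x - x.mean())              # DC removed (assumption of this sketch)
freqs = np.fft.rfftfreq(N, d=1.0 / fs)
power = np.abs(X) ** 2

# Bandwidth: smallest frequency below which 99.9% of the power is located.
cum = np.cumsum(power) / power.sum()
bandwidth = freqs[np.searchsorted(cum, 0.999)]
print("estimated 99.9%-power bandwidth: %.3e Hz" % bandwidth)
```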

9.1.3: Wavelet Analysis Results

The Matlab-compatible software package Wavelab [78] was used to carry out the wavelet analysis. The analysis was performed using the Haar orthonormal basis (see [72]); in fact, we were interested in the time location of dynamical transients, so the analysis with the highest time resolution had to be used (see [72]). The output of the wavelet transform is organized into a number of levels set by the user. We chose 5 levels of analysis. Each level corresponds to a given subband of the original signal and has its own time resolution. The output coefficients of the wavelet transform represent the energy components of the input signal in the subband identified by the level of the transform (if B is the band of the input signal: subband6 = B/2-B for level L=1, subband5 = B/4-B/2 for level L=2, and so on) and in the time interval identified with an accuracy equal to the time resolution, Tr, of the level (Tr = 30 min for level L=1, Tr = 1 h for level L=2, etc.). Error: Reference source not found shows the data structure of the wavelet analysis described in this thesis.

The number of samples for the transform must be a power of 2. It was chosen to be 512 = 2⁹. The samples were grouped as follows:

Nday are the 96 samples belonging to the day under analysis, while Nprec (160) and Nsucc (256) are respectively the look-behind and look-forward samples necessary to calculate the wavelet transform at the edges of each day. In this case the coarsest level of the transform, in the worst case, has valid coefficients also close to the edges of the time interval considered. Error: Reference source not found, Error: Reference source not found and Error: Reference source not found are the graphs relative to the air temperature, which, like soil temperature and humidity, showed the presence of well-localized transients during the day in the preliminary analysis of the data. Such signals may need a fine adjustment of the sampling frequency during the transients' time intervals in order to be properly analyzed. As the pictures show, the low-frequency power content of the signal (in subband1 and subband2) is two orders of magnitude greater than the power content in subbands 3, 4 and 5 (f ≤ 2.77·10⁻⁴ Hz), and the power content in these subbands is ten times greater than that in subband6.

From the pictures we can observe that the DWT coefficients are significant only at low frequencies (f ≤ 2.77·10⁻⁴ Hz). This frequency band is lower than half the optimum sampling frequency calculated before. Consequently there is no need to adjust the sampling frequency during the transients. Error: Reference source not found shows the results of the wavelet analysis for air and soil temperature and humidity: air and soil temperature have their energy peaks (non-stationary dynamical behaviour) at 8 a.m. (principal peak) and 4 p.m. (secondary peak), and humidity has its peak at 8 a.m.

One of the many possible algorithms for the automatic adjustment of the sampling frequency for each climatic datum with transients at a certain peak hour could be for example:

fs ← optimum sampling frequency from the Fourier analysis at all times, except during the transient hour, when fs ← twice the bandwidth of the transient calculated with the wavelet analysis.
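The subband-energy view used above can be reproduced with PyWavelets in place of Wavelab; the sketch below performs a 5-level Haar DWT on 512 samples at 15-minute spacing and prints the energy per level, which is the quantity inspected to localize the daily transients. The synthetic signal and the artificial transient are assumptions for illustration.

```python
# Hedged sketch: 5-level Haar DWT and per-subband energies with PyWavelets.
import numpy as np
import pywt

rng = np.random.default_rng(0)
# 512 samples at 15-minute spacing (160 look-behind + 96 for the day + 256 look-ahead).
n = 512
t = np.arange(n) * 900.0
x = 10 + 5 * np.sin(2 * np.pi * t / 86400.0) + rng.normal(scale=0.2, size=n)
# Add a short synthetic "morning transient" so there is something to localize.
x[200:210] += 3.0

coeffs = pywt.wavedec(x, "haar", level=5)  # [approximation, detail_5, ..., detail_1]
labels = ["approximation"] + [f"detail level {L}" for L in range(5, 0, -1)]
for name, c in zip(labels, coeffs):
    print(f"{name:16s} energy = {np.sum(c ** 2):10.2f}  ({len(c)} coefficients)")
```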

9.1.4: Conclusions

In this Paragraph a particular spectral analysis and representation of climatic signals, based on the Fourier and wavelet transforms, have been discussed. A successful strategy for the automatic adjustment of the sampling frequency for the set of climatic data has been presented, and the approach is generally applicable to sets of non-stationary signals. The outcome of the Fourier analysis was the determination of the optimum sampling frequency for each climatic signal under analysis over the whole observation interval. The wavelet analysis gave some hints on the time localization of the daily transients of each signal and on the determination of an algorithm for the adjustment of the optimum sampling frequency during the day. Further work may include the seasonal analysis of the data of interest over a number of years and the analysis of a historical database to investigate the possible presence of non-stationary variability (seasonal, annual, astronomical, and so on).

9.2: The Forecasting Engine

Once the statistical procedure has chosen the best predictors and the optimum sampling frequency has been set, a forecasting engine is required to predict future values. We used the machine learning tools described in Chapter 5 to forecast the next values of temperature and relative humidity.

Firstly we trained the ANNs and SVM with the data used for the features selection (half of the available data) and we used the remaining part of the available measurement data as test set. The ANNs performances for soil temperature and relative humidity are shown in Error: Reference source not found and Error: Reference source not found respectively.

As we can see from Error: Reference source not found, the ANN performances for the soil temperature improve significantly for high numbers of hidden neurons and high numbers of input features. The results for relative humidity are less accurate because the behaviour of this parameter is more chaotic and it assumes a wider range of values; still, the ANN performances improve for a high number of input features. The results for the SVM are of the same order of magnitude as those for the best ANNs. The machine learning methods give results that are not as good as expected. A comparison of measured and forecasted data shows that, when the measured data exhibit non-stationary behavior, there is a delay in the forecasted data with respect to the measured data that is comparable to the forecasting interval (see Error: Reference source not found and Error: Reference source not found).

As an optimization step, we can check whether the residuals between measured and forecasted data are the realization of a white Gaussian stochastic process. If this is the case, the forecasting engine is an optimal forecasting tool; if not, the forecasting method can be improved. Anderson's whiteness test [79] allows one to decide whether a stochastic process X(·) is a white process by means of the statistical analysis of a realization x(1), x(2), …, x(N) of the process; as far as identification is concerned, the whiteness test can be used to:

Verify through the analysis of the residuals the correctness of a model that has been identified and its prediction error: if the residuals are a white process (with zero mean) then the identified model is a good description of the time series/dynamic system under analysis;

Estimate from a single realization of the process if the stochastic component of the corresponding time series is a white process (once the possible deterministic components have been removed).

Anderson's whiteness test is therefore a statistical test whose null hypothesis is that the stochastic process X(·) is a white process [80]. Certain characteristics of the estimator provided by the coefficients of the autocorrelation function of the process are obtained on the basis of the null hypothesis. In particular, it is studied how the first M coefficients of the autocorrelation function (for n = 0, 1, 2, …, M) tend to zero, for n ≠ 0, as the number of data N tends to infinity. If the estimates of the coefficients of the autocorrelation function of the data x = (x(1), x(2), …, x(N)) satisfy these characteristics, the null hypothesis is accepted; otherwise the null hypothesis is rejected and the opposite hypothesis is accepted. The test is based on the idea that, if X(·) is a white process (with zero mean for simplicity's sake), then

$$\hat{r}(n) = \frac{1}{N} \sum_{t=1}^{N-n} x(t)\, x(t+n)$$

is a consistent estimator of the covariance function of the process X(·), which for a white process equals the quadratic mean value for n = 0 and is zero elsewhere. Thus $\hat{r}(n) \to 0$ for $n \ne 0$ and $\hat{r}(0) \to E[X(t)^2]$ as $N \to \infty$.

So it is to be expected that, for N sufficiently large, the estimate $\hat{r}(n)$ based on the data x = (x(1), x(2), …, x(N)) assumes values close to zero for every n ≠ 0. Anderson's statistical test makes this idea quantitative; to remove the scale factor, the estimate of the normalized covariance function $\hat{\rho}(n) = \hat{r}(n)/\hat{r}(0)$ is considered. If the process X(·) is a white process, then:

$$E[\hat{\rho}(n)] \to 0, \qquad N \,\mathrm{Var}[\hat{\rho}(n)] \to 1, \qquad E[\hat{\rho}(n_1)\,\hat{\rho}(n_2)] \to 0$$

for n, n1, n2 > 0, n1 ≠ n2. So for N sufficiently high:

the random variable $\sqrt{N}\,\hat{\rho}(n)$ is distributed as a Gaussian variable with zero mean and unit variance (standard Gaussian); so the probability α that the modulus of $\sqrt{N}\,\hat{\rho}(n)$ is greater than β is given by $\alpha = 2\,[1 - \Phi(\beta)]$, where Φ is the standard Gaussian cumulative distribution function.

The random variables $\sqrt{N}\,\hat{\rho}(n_1)$ and $\sqrt{N}\,\hat{\rho}(n_2)$ are uncorrelated.

The whiteness test can thus be reformulated as a Gaussianity test. The whiteness test can be based on the above characteristic for every single n ≠ 0, following the steps below:

The value of β corresponding to a certain value of α is calculated.

The sequence of data x = (x(1), x(2), …, x(N)) is used to calculate $\sqrt{N}\,\hat{\rho}(n)$.

It is verified whether the modulus of $\sqrt{N}\,\hat{\rho}(n)$ is less than β. If it is, the whiteness test is said to have been passed by the sequence of data with a significance level of 100·α % and the null hypothesis is accepted.

In Anderson's test all the estimates $\hat{\rho}(n)$ for n = 1, 2, …, M are considered at the same time.

Therefore Anderson’s whiteness test can be executed in the following four steps:

The significance parameter α is fixed and the corresponding value of β is calculated.

The estimates $\hat{\rho}(n)$ for n = 1, 2, …, M are calculated on the given sequence of data x = (x(1), x(2), …, x(N)).

The number Mout of values of n such that $\sqrt{N}\,|\hat{\rho}(n)| > \beta$ is calculated.

If $M_{out} \le \alpha M$, then the whiteness test has been passed by the sequence of data with a significance level of 100·α % and the null hypothesis is accepted; otherwise the null hypothesis is rejected (a code sketch of this procedure is given below).

Here α (0 < α < 1) is the significance parameter. For the whiteness test as described above:

The probability of deciding that X(·) is not white when it actually is tends to α as N → ∞. This probability decreases when α decreases, but so does the probability of detecting that X(·) is not white when it is indeed not white.

The probability of deciding that X(·) is white when it is not cannot be calculated, because the distribution of $\hat{\rho}(n)$ is not known when X(·) is not white.

Typical values of some of the parameters are the following:

α = 0.05 with the corresponding β = 1.96; M has to be chosen between 5 and N/4 (the estimate of the covariance function based on N data is inaccurate for n close to N).

During the validation of the neural meteo forecasting system the following values of the parameters have been chosen:

α =0.05; β =1.96; M (indicated in the following as τmax) has been chosen according to the results of the validation test as shown in the following paragraph.
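A direct implementation of the four-step procedure above is short; the sketch below follows the description in this section (normalized autocorrelation estimates, threshold β from the standard Gaussian, acceptance when the fraction of out-of-bound coefficients does not exceed α) and is illustrated on a white-noise sequence. It is a reading of the text, not the validation code used for the thesis.

```python
# Hedged sketch of Anderson's whiteness test as described in this section.
import numpy as np
from scipy.stats import norm

def anderson_whiteness_test(x, alpha=0.05, m=20):
    """Return (is_white, m_out) for the residual sequence x."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # work with zero-mean residuals
    n = len(x)
    beta = norm.ppf(1.0 - alpha / 2.0)    # e.g. alpha = 0.05 -> beta = 1.96
    r0 = np.dot(x, x) / n
    rho = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(1, m + 1)]) / r0
    m_out = int(np.sum(np.sqrt(n) * np.abs(rho) > beta))
    return m_out <= alpha * m, m_out

rng = np.random.default_rng(0)
residuals = rng.normal(size=10000)        # white sequence: the test should pass
print(anderson_whiteness_test(residuals, alpha=0.05, m=20))
```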

9.3: Validation Procedure of the Machine Learning Forecaster

First, the historical record of the climatic data measurements was considered together with the corresponding forecasted values 1, 2 and 3 hours in advance; for each parameter under analysis it was possible to retrieve a file with the raw data in the following format, each row containing all the data concerning a certain hour of a certain day, structured as shown below:

The first preprocessing step on each of the raw data files was the gathering of the data related only to consecutive days for which the data set is complete (a complete data set consists of all 96 daily measurements taken at the sampling rate of one every quarter of an hour). For each parameter this first step yielded what is stated in the following table:

Second, it was verified how well the whiteness test estimates the randomness of a sequence of residuals [81]: a number of pseudorandom sequences was generated, each of length 10000, which is comparable to the length of the available data. The whiteness test was then carried out on the pseudorandom sequences in a cycle of 1000 iterations, counting the number of iterations in which the pseudorandom sequences were classified as white; the number of autocorrelation coefficients on which the test was carried out (τmax) was varied between 15 and 250 in steps of 5. It was observed that the number of times the pseudorandom sequences were classified as white has a local maximum at τmax = 20. Another cycle was therefore launched, with τmax ranging between 18 and 24, to detect the global maximum. As shown in the following table, this maximum is exactly at τmax = 20.

Here W is the number of times per cycle (1000 iterations) that the sequences are classified as white and NW is the number of times per cycle that the whiteness test gives a negative result (not white sequence). Therefore τmax = 20 was adopted. The values of the parameters for the test on the available data are the following:

α =0.05; β =1.96 and τmax =20.

The results of the whiteness test for each climatic variable are given in the following table:

The results of Anderson's whiteness test show that the sequence of residuals of the machine learning forecasting system cannot be considered the realization of a white Gaussian stochastic process; therefore the forecasting system adopted in this analysis is not optimal and can be improved during the identification phase of the fundamental parameters on which it is modeled. Other whiteness tests can be adopted and run on the available data to better understand why the system's residuals are not white. One reason may be that the climatic data series cannot be modeled as a stationary process on a daily basis, but the problem can be investigated further. Moreover, it could be interesting to verify whether the adoption of different forecasting systems in parallel, with a winner-takes-all output decision algorithm, could improve the performance of our system.

As a first step, we tried to circumvent the problem of the delay in the forecasted data by means of a further selection of the training data. More precisely, we selected a different training set for each prediction hour. This training set consists of the measurement values corresponding only to the hour at which the prediction is issued, in the days preceding the value to predict. In other words, if, for example, it is 7 a.m. and we want to forecast the soil temperature at 10 a.m., we select the feature values corresponding only to 7 a.m. for all the days preceding the day of the prediction, instead of using the feature values corresponding to all the hours of all the days contained in the available measurement data and preceding the prediction day, as in the previous case. The optimal subset of features included all those selected by the information-theoretic selection algorithm except those related to the time of measurement and forecasting. The test set included five days, chosen as one every ten days contained in the set of measurements once the training set was removed from the available data. The ANN performances are shown in Error: Reference source not found and Error: Reference source not found.
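A sketch of this per-hour training-set selection is given below: given the full data set, the training samples for a prediction issued at a given hour keep only the rows whose timestamp hour equals that hour, on the days preceding the prediction day. The column name, the synthetic data and the DataFrame layout are assumptions for illustration.

```python
# Hypothetical sketch of the per-hour training-set selection described above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2005-01-01", periods=24 * 200, freq="h")
data = pd.DataFrame(
    {"soil_temp": 10 + 5 * np.sin(2 * np.pi * idx.hour / 24)
                  + rng.normal(scale=0.3, size=len(idx))},
    index=idx,
)

issue_time = pd.Timestamp("2005-06-01 07:00")  # e.g. prediction issued at 7 a.m.

# Keep only the rows measured at the same hour of day, on days before the prediction day.
mask = (data.index.hour == issue_time.hour) & (data.index.normalize() < issue_time.normalize())
train = data[mask]
print(train.tail())
```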

The best ANN performances improve by one order of magnitude with respect to the previous case. In fact, as we can see from Error: Reference source not found and Error: Reference source not found, the delay between measured and forecasted data is no longer present with the new training set. Moreover, the two curves representing measured and forecasted data almost overlap, showing mean squared errors of 0.339 and 0.5 for soil temperature and relative humidity respectively (over 24 hours), a result comparable to that of state-of-the-art competing nowcasting software [66].

We also investigated the performance of the SVM, taking as input the 14 features used as input to the ANNs. We tried different assignments for the SVM parameters ε, σ and C in order to find the configuration with the highest performance. As we can see from Error: Reference source not found and Error: Reference source not found, when we keep C constant (C = 10000), the SVM performance reaches a maximum when σ=0.1 and ε=0.001, corresponding to a satisfactory model accuracy with respect to the training data (small values of σ). From the observation that the best performance is achieved when ε is close to 0, by abductive reasoning we could conclude that the input noise level is low [82]. In accordance with this behaviour, the performance of the network improves when the parameter C increases from 1 to 10000 (see Error: Reference source not found and Error: Reference source not found). Since the results tend to flatten for values of C greater than 10000, the parameter C was set equal to 10000.

9.4: Conclusions

Short-term forecasting (nowcasting) is an interesting topic in meteorological forecasting. The possibility of having an accurate estimate of the weather conditions in the next three hours can be useful in several situations, such as ice on the road, SMS-based weather information for mobile phone network subscribers and so on. The application of a very powerful data mining system to weather nowcasting has been shown in this Chapter. Its performance is comparable to, or even better than, that of competing software. It could be useful to develop a consensus procedure between parallel systems implementing different machine learning methods in order to improve the overall performance. Moreover, the weather nowcaster described in this Chapter might be used to integrate the forecasts of a network of stations to produce a weather forecast for a larger region.


