
The **ground truth data** is the data obtained by direct observation, i.e., by a reference station, in contrast to data obtained by inference. In our case, this is generally the ozone concentration measured by official reference stations, expressed in $\mu$g/m$^{3}$.
The **sensor data** is the data taken by the sensors. In our case, there are three kinds of sensors: metal-oxide sensors, whose data are measured in k$\Omega$; temperature sensors, whose data are measured in ºC; and relative humidity sensors, whose data are measured in %.

In the following, we summarize one of the methods implemented for the calibration of the Captor nodes.

Each CAPTOR node is deployed on the roof of a reference station for a period of at least 3 weeks. In general, calibrating a sensor means approximating the true value Y by a function f(X):

\begin{equation} \tag{1} Y=f(X)+\varepsilon \end{equation}

where f is a fixed but unknown function, X is a vector of p predictors (input variables) and $\varepsilon$ is a random error term, independent of X and distributed as a zero-mean Gaussian random variable with variance $\sigma^{2}$, i.e., $\varepsilon \sim N(0, \sigma^{2})$. Fitting eq (1) is referred to as regressing Y on X (or Y onto X). In order to find a regression of the data, we may consider linear combinations of fixed non-linear functions of the input variables, of the form:

$$f(X)=\beta _{0}+\sum _{i=1}^{p}\beta _{i}\varphi _{i}(X)$$

where the $\varphi_{i}(X)$ are known as basis functions, $\beta_{0}$ is the intercept (offset) and the $\beta_{i}$ (i=1,…,p) are the regression coefficients. The most basic model is a Multivariate Linear Regression (MLR), in which each basis function is linear with respect to $X_{i}$, i.e., $\varphi_{i}(X)=X_{i}$:

\begin{equation} \tag{2} Y=\beta _{0}+\beta _{1}X_{1}+\cdots +\beta _{p}X_{p}+\varepsilon =\sum _{i=0}^{p}\beta _{i}X_{i}+\varepsilon =\beta X+\varepsilon \end{equation}

That is, Y is approximated by a linear combination of the predictors. Note that Y and each $X_{i}$ (i=1,…,p) have dimension N, the size of the sample set ($X_{i}, Y \in \mathbb{R}^{N}$), and that $X \in \mathbb{R}^{N \times (p+1)}$ once X has been extended with a vector $X_{0}$ of 1's, so that the intercept $\beta_{0}$ is integrated into $\beta$. More complicated basis functions may be used, such as powers of the predictors, $\varphi_{i}(X)=X_{i}^{j}$, or polynomial functions of several features. In the calibration process for the CAPTOR project, the Multivariate Linear Regression (MLR) is used. In order to regress Y on X, the coefficients $\beta$ have to be approximated by estimates $\beta'$. Our goal is to obtain coefficient estimates $\beta'$ such that the linear model of eq (2) fits the available data well, that is, $y \approx \beta' X$. In other words, we want the coefficients $\beta'$ for which the resulting line is as close as possible to the N data points. We refer to James et al.$^{1}$ for ways of finding $\beta'$, e.g., by minimizing the least squares criterion. Let $y_{i}' = \beta' x_{i}$ be the prediction of $y_{i}$ based on the value of $x_{i}$. The difference between the original value $y_{i}$ and the estimated value $y_{i}'$ is called the residual, $e_{i} = y_{i} - y_{i}'$. We define the Residual Sum of Squares (RSS) as:

\begin{equation} \tag{3} \mathrm{RSS}=e_{1}^{2}+\ldots+e_{n}^{2}=\sum _{i=1}^{n}(y_{i}- y_{i}')^{2}=\sum _{i=1}^{n}(y_{i}- \beta' x_{i})^{2} \end{equation}

How close are the estimates $\beta'$ to the true $\beta$? The standard errors of $\beta'$ depend on the variance $\sigma^{2}$ of the error $\varepsilon$, which is unknown. It can be estimated from the data through the Residual Standard Error (RMSE), defined as the square root of the RSS divided by the residual degrees of freedom: RMSE=(RSS/(n-p-1))$^{1/2}$, where n is the number of samples and p is the number of predictors.
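The least-squares fit and the two quality measures can be sketched with NumPy. This is a minimal illustration on synthetic data, not the project's calibration code. The function name `fit_mlr` and the generated data are assumptions made for the example, and the residual standard error divides the RSS by the n-(p+1) degrees of freedom of a model with p predictors and an intercept.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp (hypothetical helper).

    X has shape (n, p). A column of ones is prepended so that the
    intercept b0 is estimated together with the other coefficients.
    Returns (beta, rss, rse).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])           # extended matrix, shape (n, p + 1)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes the RSS
    residuals = y - X1 @ beta
    rss = float(residuals @ residuals)              # residual sum of squares
    rse = float(np.sqrt(rss / (n - p - 1)))         # residual standard error
    return beta, rss, rse


# Synthetic data for illustration only (not CAPTOR measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.0, 0.7, 0.4, 0.15])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=200)

beta, rss, rse = fit_mlr(X, y)
print("estimated coefficients:", beta)
print("RSS:", rss, "RSE:", rse)
```

With a noise standard deviation of 0.1, the estimated coefficients should land close to `true_beta` and the residual standard error close to 0.1.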

In this section, we focus on CAPTOR nodes that use metal-oxide sensors. Let us assume that the data set for calibration has size N and that each of the M ozone sensors is independent of the others. The data consist of:

- The reference station data, $Y \in \mathbb{R}^{N}$.
- The ozone (O3) data captured by each of the M ozone sensors, $X_{1}=X \in \mathbb{R}^{N}$.
- The Relative Humidity (RH) data captured by the sensor, $X_{2}=HR \in \mathbb{R}^{N}$.
- The Temperature (T) data captured by the sensor, $X_{3}=T \in \mathbb{R}^{N}$.

The MLR model used is then:

\begin{equation} Y = \beta _0 + \beta _1 X_1 + \beta _2 X_2 + \beta _3 X_3 + \varepsilon \end{equation}

where we have renamed $X_{1}=X$ (ozone), $X_{2}=HR$ (Relative Humidity) and $X_{3}=T$ (Temperature) for convenience. In order to calibrate a CAPTOR node with a single ozone sensor, we proceed as follows:

1. Divide the data set of size N into two sets: the training set of size $N_{1}$ and the test (or validation) set of size $N_{2}$.

2. Obtain the $\beta'$ by minimizing the least squares criterion over the training set, and obtain the training RMSE as a quality parameter from the RSS of the training set.

3. Predict $y' = \beta_0' + \beta_1' X + \beta_2' HR + \beta_3' T$, where $X, HR, T \in \mathbb{R}^{N_{2}}$ are the data of the validation set.

4. Obtain the validation RMSE from the RSS of the validation set.

At the end of the process, each of the M individual sensors of a CAPTOR node has been calibrated. The question is then which one best represents the CAPTOR node: the sensor with the lowest validation RMSE is taken as the reference sensor for that node. Once the ozone sensor is calibrated, the ozone concentration can be predicted from new values using the following formula, where $X_\mathrm{cal}$ is the calibrated value and the subscript new denotes newly collected, uncalibrated data:

\begin{equation} X_\mathrm{cal} = \beta'_0 + \beta'_1 X_\mathrm{1new} + \beta'_2 X_\mathrm{2new} + \beta'_3 X_\mathrm{3new} \end{equation}
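The calibration procedure (random split, one fit per ozone sensor, choice of the sensor with the lowest validation error) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the delivered script. The names `calibrate_node` and `rmse`, the 35% validation fraction and the synthetic data are all hypothetical, and the error is computed here as the plain root mean squared error of the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def calibrate_node(sensors, rh, temp, ref, test_size=0.35, seed=0):
    """Fit one MLR per ozone sensor and pick the best one (hypothetical helper).

    sensors: array of shape (N, M), one column per ozone sensor.
    rh, temp, ref: arrays of shape (N,).
    Returns (best_index, models, validation_rmse).
    """
    indices = np.arange(len(ref))
    # Shuffled split into training and validation samples.
    train, val = train_test_split(
        indices, test_size=test_size, random_state=seed, shuffle=True
    )
    models, validation_rmse = [], []
    for m in range(sensors.shape[1]):
        # Predictors in the order used in the text: ozone sensor, RH, T.
        X = np.column_stack([sensors[:, m], rh, temp])
        model = LinearRegression().fit(X[train], ref[train])
        models.append(model)
        validation_rmse.append(rmse(ref[val], model.predict(X[val])))
    best_index = int(np.argmin(validation_rmse))
    return best_index, models, validation_rmse


# Synthetic node with three sensors of different noise levels (illustration only).
rng = np.random.default_rng(1)
n = 400
ref = rng.uniform(20, 160, size=n)
rh = rng.uniform(20, 90, size=n)
temp = rng.uniform(5, 35, size=n)
noise_levels = [30.0, 2.0, 15.0]
sensors = np.column_stack(
    [2.0 * ref - 0.5 * rh + 1.5 * temp + rng.normal(scale=s, size=n)
     for s in noise_levels]
)

best, models, val_rmse = calibrate_node(sensors, rh, temp, ref)
print("validation RMSE per sensor:", val_rmse)
print("reference sensor index:", best)
```

In this synthetic setup the second sensor (index 1) has the least noise, so it should be the one selected as reference sensor.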

The software tool to calibrate the sensors of CAPTOR nodes has been developed by UPC and is written in Python. To use the software, the user needs to install Python (open source) and the following Python modules:

- numpy
- sklearn (scikit-learn), including sklearn.utils
- csv (included in the Python standard library)
- matplotlib

These modules are used for loading the data, plotting and mathematical manipulation. The software is described in the following.

Let us assume that the CAPTOR node has 5 ozone sensors, 1 temperature sensor and 1 relative humidity sensor. The software requires the data to be provided as a CSV file in the following format:

date; RefSt; S1; S2; S3; S4; S5; T; RH

where:

- date: the date on which the sample was taken, in UTC (Coordinated Universal Time), e.g. dd-mm-yyyy T hh:mm:ss or dd/mm/yy hh:mm:ss.
- RefSt: the ozone value of the reference station at which the node has been placed.
- S1; S2; S3; S4; S5: the values taken by the five ozone sensors (resistance values).
- T: the value of the temperature sensor.
- RH: the value of the relative humidity sensor.

For example, here are five samples taken from a node:

date; RefSt; S1; S2; S3; S4; S5; T; RH
2017-09-24T04:30:48;69.8937;62.3520;59.6113;47.8217;370.3207;16.03;75.27
2017-09-24T05:00:49;86.5677;89.2967;98.8177;76.0747;575.8170;15.63;76.00
2017-09-24T05:30:51;100.3943;114.8940;133.6520;90.8980;1090.0657;15.00;76.00
2017-09-24T06:00:52;110.8643;131.2980;157.1250;103.0783;3017.3080;15.00;76.00
2017-09-24T06:30:54;109.9107;130.9013;148.3290;96.0383;5246.5073;15.00;75.23
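A file in this semicolon-separated format can be read with the standard `csv` module. The sketch below is only an illustration: the function `read_captor_csv` and its `date_format` parameter are hypothetical, and the two sample rows are taken from the CAP-17001_Cal1.txt example given later on this page. The check on the number of fields rejects rows that do not match the header.

```python
import csv
import io
from datetime import datetime

# Two rows in the documented format (values from the CAP-17001_Cal1.txt example).
SAMPLE = """date; RefSt; S1; S2; S3; S4; S5; T; RH
08/05/2017 10:30;73;53.2878;77.8722;100.3167;51.1944;32.6811;33;29.33
08/05/2017 11:00;96;100.3553;103.436;293.4037;130.7197;34.7047;29.87;28.43
"""


def read_captor_csv(handle, date_format="%d/%m/%Y %H:%M"):
    """Read a semicolon-separated calibration file (hypothetical helper).

    Returns the header names and a list of dicts, one per sample.
    The date format is an assumption and must match the file being read.
    """
    reader = csv.reader(handle, delimiter=";", skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    rows = []
    for line_number, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_number}: expected {len(header)} fields, got {len(fields)}"
            )
        record = {header[0]: datetime.strptime(fields[0].strip(), date_format)}
        for name, value in zip(header[1:], fields[1:]):
            record[name] = float(value)
        rows.append(record)
    return header, rows


header, rows = read_captor_csv(io.StringIO(SAMPLE))
print(header)
print(rows[0]["RefSt"], rows[0]["S4"], rows[0]["T"], rows[0]["RH"])
```

To read a file on disk, the same function could be called with `open(path, newline="")` instead of the in-memory sample.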

The Python software obtains the calibration coefficients and plots the data, the normalized data and scatterplots. A library with several functions has been created for plotting the data. The code for the whole process can be found in appendix A.

The input of the software is a file in CSV format (see section 3.1).

The output of the calibration software is a set of files written to a specified folder:

Scatterplots with the normalized reference station data on the x-axis and the normalized ozone sensor data on the y-axis. The scatterplot allows us to see how linear the data is.

Figure 1. Scatterplot of sensor S4

Plots with the raw data of every uncalibrated ozone sensor (resistances) and the reference station. This kind of plot allows us to see whether the sensor data follows the same patterns as the reference station. For example, these plots allow us to identify changes of scale, gaps in the data or peaks that reveal a malfunction in some of the sensors.

Figure 2. Data Set of Captor 17001.

Plots with the normalized data of every ozone sensor and the reference station. All the data (sensors and reference station) are normalized with respect to their mean and standard deviation. For example, for an ozone sensor whose data is stored in the vector $x=(x_{1},…,x_{N})$, with N the number of samples, the normalized variable is defined as $x_\mathrm{norm} = (x-\mu)/\sigma$, where $\mu$ is the mean of vector x and $\sigma$ is its standard deviation. These plots make it easier to visualize the temporal behaviour of the uncalibrated data with respect to the reference station data. The plot is very similar to the previous one, but all the data is on a similar scale, which allows a better understanding of the data.

Figure 3. Data Set Normalized of Captor 17001.
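The normalization described above, and its inverse, can be written in a few lines of NumPy. This is a minimal sketch: the function names are hypothetical, the four S4 resistance values come from the CAP-17001_Cal1.txt example on this page, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not say which estimator the software uses.

```python
import numpy as np


def normalize(x):
    """Return (x_norm, mu, sigma) with x_norm = (x - mu) / sigma."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()  # population standard deviation (assumption)
    return (x - mu) / sigma, mu, sigma


def denormalize(x_norm, mu, sigma):
    """Invert the normalization: x = x_norm * sigma + mu."""
    return np.asarray(x_norm, dtype=float) * sigma + mu


# S4 resistance values from the first rows of the calibration file example.
s4 = np.array([51.1944, 130.7197, 235.4683, 257.5347])
s4_norm, mu, sigma = normalize(s4)

print("normalized:", s4_norm)
print("recovered:", denormalize(s4_norm, mu, sigma))
```

After normalization every series has zero mean and unit standard deviation, which is why the sensors and the reference station can be drawn on the same scale.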

Plots with the calibrated data of every sensor: the calibrated ozone concentrations are plotted for the training set, the validation set and the whole data set. The training and validation data sets are shuffled, that is, their samples are taken at random from the data set. For this reason, the data in these plots shows peaks. In the plot with the whole data set, on the other hand, the samples are re-ordered and plotted in chronological order.

Figure 4. Calibrated Ozone for Training of sensor S4 Captor 17001.

Figure 5. Calibrated Ozone for Validation of sensor S4 Captor 17001.

Figure 6. Calibrated Ozone for the whole data set of sensor S4 Captor 17001.

File with the RMSE of all the sensors: for a Captor node, this file allows us to choose which of the 5 sensors performs best. In the case of a Raptor node there is only one electrochemical sensor. The file stores the size of the data set and the RMSE for the training set, the validation set and the whole data set of a Captor node. For example:

17001
[857, 479, 378]
[ 14.33349493 12.63236135 10.81323125 10.10156376 16.32232404]
[ 14.41200886 12.80010649 10.99627148 10.10755182 16.36424206]
[ 14.34305967 12.68454743 10.8754736 10.08647158 16.31219212]

where the 1st row identifies the Captor node, the 2nd row gives the size of the data set and the sizes of the training and validation sets, the 3rd row gives the training RMSE, the 4th row gives the validation RMSE and the 5th row gives the RMSE of the whole set. In this case, we would choose S4 as the best sensor, since it gives the lowest RMSE among the 5 metal-oxide sensors.
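The selection can be reproduced directly from the numbers in the example. The snippet below is a small sketch with hypothetical variable names. It copies the validation RMSE row shown above and picks the sensor with the smallest value.

```python
import numpy as np

# Values copied from the example file for node 17001.
sizes = [857, 479, 378]  # whole set, training set, validation set
validation_rmse = np.array(
    [14.41200886, 12.80010649, 10.99627148, 10.10755182, 16.36424206]
)
sensor_names = ["S1", "S2", "S3", "S4", "S5"]

# The reference sensor is the one with the lowest validation RMSE.
best_sensor = sensor_names[int(np.argmin(validation_rmse))]
print(best_sensor)  # S4
```

The training and validation sizes add up to the size of the whole data set (479 + 378 = 857).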

Normalized coefficients of the Multivariate Linear Regression (MLR) algorithm: the regression algorithm works with normalized data, so the coefficients yield normalized calibrated data. This data has to be denormalized using the mean and the standard deviation calculated previously and stored in a file (see below).

For a Captor node:

# Each row has the betas of each sensor (S1 to S5). columns are offset, beta_O3, beta_HR, beta_T
0.000000 0.689965 0.426913 0.159453
0.000000 0.739328 0.352376 0.136598
0.000000 0.724662 0.439091 0.225301
0.000000 0.678176 0.398302 0.148486
0.000000 0.200369 0.662802 -0.046232

That is, for sensor S1, β0’=0.0000, β1’=0.689965, β2’=0.426913, β3’=0.159453, and so on.

A file with the mean and standard deviation of each sensor. These data are necessary to denormalize the calibrated sensor data.

For a Captor node:

1st row have the means of the Ref Station, S1 to S5, T and RH
91.400990 310.376556 232.357651 12.320598 275.475542 1774.450494 25.174752 38.079059
2nd row have the std of the Ref Station, S1 to S5, T and RH
48.014360 141.201602 112.268139 3.328627 135.665164 2609.624402 5.231312 21.591499

The values needed later to predict the ozone are those of RefSt, S4 (the chosen sensor), T and RH, i.e. the 1st, 5th, 7th and 8th values of each row. To obtain new calibrated data, we perform the following operations:

1. Normalize the new data using the stored vectors of means and standard deviations. Only the values used in the prediction need to be normalized: for Captor nodes, the ozone values of the chosen sensor (S4 in the previous example), T and RH.

2. Predict the normalized calibrated ozone by applying the MLR prediction formula to the normalized data: $X_{CalNorm} = \beta'_0 + \beta'_1 X_{1new} + \beta'_2 X_{2new} + \beta'_3 X_{3new}$.

3. Denormalize the normalized calibrated value using the mean and standard deviation of the reference station:

\begin{equation} X_{Cal} = X_{CalNorm} \cdot \sigma_{RefStatO3} + \mu_{RefStatO3} \end{equation}
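The three operations can be sketched as follows, using the S4 coefficients and the means and standard deviations shown in the example files above. This is an illustration, not the code of Plot_Vol_concentrations.py: the function name `calibrated_ozone` is hypothetical, and it assumes that both example files come from the same calibration run. Note the column order: the coefficient file lists beta_HR before beta_T, while the mean/std file lists T before RH.

```python
import numpy as np

# S4 row of the coefficient file: offset, beta_O3, beta_HR, beta_T.
BETAS_S4 = np.array([0.000000, 0.678176, 0.398302, 0.148486])

# (mean, std) pairs taken from the mean/std file example.
REF_ST = (91.400990, 48.014360)   # reference station ozone
S4 = (275.475542, 135.665164)     # chosen ozone sensor
T = (25.174752, 5.231312)         # temperature
RH = (38.079059, 21.591499)       # relative humidity


def calibrated_ozone(s4_value, rh_value, temp_value):
    """Normalize, predict and denormalize one new sample (hypothetical helper)."""
    # Step 1: normalize each input with its own stored mean and std.
    x = np.array([
        1.0,                             # multiplies the offset
        (s4_value - S4[0]) / S4[1],      # ozone sensor
        (rh_value - RH[0]) / RH[1],      # relative humidity
        (temp_value - T[0]) / T[1],      # temperature
    ])
    # Step 2: normalized prediction with the MLR coefficients.
    y_norm = float(BETAS_S4 @ x)
    # Step 3: denormalize with the reference station mean and std.
    return y_norm * REF_ST[1] + REF_ST[0]


# A sample equal to the stored means maps to the reference station mean.
print(calibrated_ozone(S4[0], RH[0], T[0]))
```

Because the offset is zero in normalized space, an input sitting exactly at the stored means yields the mean ozone concentration of the reference station, which is a convenient sanity check for the column ordering.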

The following software is delivered:

- Calib_CAPTOR.py: Python module with the library used by the other two files.
- MLR_CAPTOR-Calib_1-node-cmd.py: Python program for calibrating sensors.
- Plot_Vol_concentrations.py: Python program for predicting values.

4.1. Calibration of sensors

Assuming that the file with data is CAP-17001_Cal1.txt, with 5 ozone sensor columns (S1 to S5, of which 4 are calibrated in this example), 1 temperature sensor and 1 relative humidity sensor:

date; RefSt; S1; S2; S3; S4; S5; T; RH
08/05/2017 10:30;73;53.2878;77.8722;100.3167;51.1944;32.6811;33;29.33
08/05/2017 11:00;96;100.3553;103.436;293.4037;130.7197;34.7047;29.87;28.43
08/05/2017 11:30;94;268.2847;88.789;397.3427;235.4683;216.161;22.1;35.73
08/05/2017 12:00;100;375.3243;106.526;389.12;257.5347;1118.9013;23.13;35.43
….

1. Execute the following command to learn how to use the script:

python MLR_CAPTOR-Calib_1-node-cmd.py -h

usage: MLR_CAPTOR-Calib_1-node-cmd.py [-h] [-f CAPTOR_FILE] [-n CAPTOR_NAME]
                                      [-p PLACE_ER] [-s SENSORS_C] [-t TOTAL_SEN]

Calibrate a Captor Node with 5 Ozone Sensors

optional arguments:
  -h, --help            show this help message and exit
  -f CAPTOR_FILE, --captor_file CAPTOR_FILE
                        file (including path) with the CSV data
  -n CAPTOR_NAME, --captor_name CAPTOR_NAME
                        Name of the Captor
  -p PLACE_ER, --Place_ER PLACE_ER
                        Name of the Reference Station
  -s SENSORS_C, --sensors_c SENSORS_C
                        Sensors to calibrate, e.g. 134, 1=s1, 3=s3, and 4=s4
  -t TOTAL_SEN, --TOTAL_SEN TOTAL_SEN
                        Total number of sensors between the Ref Station and the Temp/HR sensor

2. Execute the following command to calibrate sensors 1, 2, 3 and 4 (sensor s5 is not calibrated in this example):

python MLR_CAPTOR-Calib_1-node-cmd.py -f CAP-17001_Cal1.txt -n C17001 -t 5 -p PR -s 1234

The script produces 4 scatterplots, 4 training plots, 4 testing plots and 4 plots for the whole calibration data (one per sensor), as well as a plot of the normalized data. It also produces 2 files, one with the coefficients and one with the means and standard deviations.

Finally, it produces a file called QoI-metricsC17001.csv with the RMSE of the calibration. In the Test RMSE row, select the sensor with the lowest RMSE. For example, if the file contains:

CAPTOR NAME: ;C17001;Sensors to calibrate;[1, 2, 3, 4]
Size of Data;Size of Training Data;Size of Test Data
856;556;299
Train RMSE
15.0111;12.7920;10.2722;9.0728
Test RMSE
15.2282;12.8262;10.4780;9.0548
All Data RMSE
15.0617;12.7817;10.3266;9.0506
Train R2
0.8496;0.8908;0.9296;0.9451
Test R2
0.8595;0.9003;0.9335;0.9503
All Data R2
0.8533;0.8943;0.9310;0.9470

Sensor s4 has the lowest test RMSE (9.0548).
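For readers who want to see how options like the ones in the help output can be declared, the following sketch rebuilds the calibration script's command line with `argparse`. It is not the source of MLR_CAPTOR-Calib_1-node-cmd.py. The option names and help strings are copied from the help output, while the integer type of `-t` and the helper `sensor_list`, which turns the string `1234` into `[1, 2, 3, 4]`, are assumptions.

```python
import argparse


def build_parser():
    """Declare the options shown in the script's help output (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="MLR_CAPTOR-Calib_1-node-cmd.py",
        description="Calibrate a Captor Node with 5 Ozone Sensors",
    )
    parser.add_argument("-f", "--captor_file",
                        help="file (including path) with the CSV data")
    parser.add_argument("-n", "--captor_name", help="Name of the Captor")
    parser.add_argument("-p", "--Place_ER", help="Name of the Reference Station")
    parser.add_argument("-s", "--sensors_c",
                        help="Sensors to calibrate, e.g. 134, 1=s1, 3=s3, and 4=s4")
    parser.add_argument("-t", "--TOTAL_SEN", type=int,  # integer type is an assumption
                        help="Total number of sensors between the Ref Station "
                             "and the Temp/HR sensor")
    return parser


def sensor_list(digits):
    """Turn a string such as '134' into [1, 3, 4] (hypothetical helper)."""
    return [int(ch) for ch in digits]


# Same arguments as in the calibration command of step 2.
args = build_parser().parse_args(
    ["-f", "CAP-17001_Cal1.txt", "-n", "C17001", "-t", "5", "-p", "PR", "-s", "1234"]
)
sensors = sensor_list(args.sensors_c)
print(args.captor_file, args.captor_name, args.Place_ER, args.TOTAL_SEN, sensors)
```

The resulting list matches the `Sensors to calibrate;[1, 2, 3, 4]` entry of the QoI-metricsC17001.csv example.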

Assuming that file CAP-17001_Vol.txt has the raw data measured during the campaign:

Date; S1; S2; S3; S4;S5; Temp; RH
31/05/2017 12:00:35;57.6523;123.4613;379.3270;185.9597;32.7923;22.00;35.37
31/05/2017 12:30:37;75.4007;129.4997;397.0947;203.3357;33.0903;22.97;34.40
31/05/2017 13:00:39;98.3527;121.5243;369.3643;197.3640;45.3057;24.87;33.23
31/05/2017 13:30:40;92.3877;120.6137;374.9873;190.4703;37.9113;25.97;32.40
31/05/2017 14:00:42;115.7570;121.9630;375.3777;204.3810;48.7137;27.07;31.27
….

1. Execute the following command to learn how to use the script:

python Plot_Vol_concentrations.py -h

usage: Plot_Vol_concentrations.py [-h] [-f CAPTOR_FILE] [-n CAPTOR_NAME]
                                  [-s BESTSENSOR] [-t TOTAL_SEN]

Calibrate a Captor Node with 5 Ozone Sensors

optional arguments:
  -h, --help            show this help message and exit
  -f CAPTOR_FILE, --captor_file CAPTOR_FILE
                        file (including path) with the CSV data
  -n CAPTOR_NAME, --captor_name CAPTOR_NAME
                        Name of the Captor (same as in calibration file)
  -s BESTSENSOR, --BestSensor BESTSENSOR
                        Sensor chosen in the calibration process, s1=1, s2=2, s3=3, s4=4
  -t TOTAL_SEN, --TOTAL_SEN TOTAL_SEN
                        Total number of sensors between the Ref Station and the Temp/HR sensor

2. Execute the following command to obtain the calibrated values (assuming that sensor s4 has been selected as the best sensor):

python Plot_Vol_concentrations.py -f CAP-17001_Vol.txt -n C17001 -t 5 -s 4

The script produces a file called C17001-cal.txt with the calibrated values:

date; calibrated value
31/05/2017 12:00:35;103.48304935799813
31/05/2017 12:30:37;106.08392240489202
31/05/2017 13:00:39;109.16029536948636
31/05/2017 13:30:40;113.30527396475918
….

It also produces a png file called C17001calibrated-s4.png with the calibrated values plotted.

offsprings/calibration/software.txt · Last modified: 2018/09/24 12:22 by roger
