Predicting temperature curve based on fast kNN local linear estimation of the conditional distribution function

Environmental Science

Introduction

Over the past decades, the volume and complexity of collected data have grown rapidly, and with them the sample sizes and the number of covariates. This growth has created significant potential and demand for scientific and technological innovation. Increased storage capacity, more powerful computers, the proliferation of surveillance systems, and improved sensors are the technological advances that have favored the emergence of this kind of data. Such data are now commonly used in many fields of application, such as astronomy, biology, climatology, ecology, chemistry, economics, medicine and the engineering sciences. Standard statistical models based on discretizing such data on a finite grid suffer from certain drawbacks. The first is the curse of dimensionality, which arises when the discretization grid is large. The second is that, under this transformation, the original data lose characteristics such as their functional nature, correlation structure, and heteroscedasticity or homoscedasticity. In particular, the asymptotic behavior of the constructed data depends on the sampling scheme used.

Recall that all of these defects can substantially reduce the efficiency of multivariate approaches in big data analysis. To overcome these problems, a new approach in modern statistics, called functional data analysis, has recently been developed. This approach works in the natural space of the data, which makes it possible to exploit all of the available information. In this branch of modern statistics, local linear method (LLM) estimation has been widely studied. It is motivated by its small bias (see Fan & Gijbels (1996) for the univariate framework, and Baíllo & Grané (2009) for the Nonparametric Functional Data Analysis (NPFDM) setup in functional statistics). Barrientos-Marin, Ferraty & Vieu (2010) studied the LLM-estimation of the nonparametric operator with a Banach-valued explanatory variable. We cite Berlinet, Elamine & Mas (2011) for an alternative LLM-estimator constructed by inverting the local variance-covariance matrix of the functional variable. Concerning the LLM-estimation of the conditional cumulative distribution function (CCDF), the first result was stated by Laksaci, Rachdi & Rahmani (2013). Later, Demongeot et al. (2014) made precise the least square error of the LLM-estimator of the CCDF-model. These previous studies used the kernel local linear method. This paper, in contrast, focuses on CCDF-estimation with a new weighting approach obtained by combining local linear fitting with the k-Nearest Neighbors (kNN) method. Indeed, the kNN method has received growing attention in nonparametric functional statistics. In particular, the first results on kNN-LLM estimation were obtained by Chikr-Elmezouar et al. (2019), who studied conditional density estimation using the local linear method under kNN smoothing and established the almost complete consistency of the resulting estimator. Recently, Almanjahie et al. (2021) considered the estimation of the robust regression function using the kNN method and proved the consistency of the constructed estimator uniformly in the number of neighbors. We also mention Laksaci, Ould-Said & Rachdi (2021) for the kNN estimation of the quantile regression. For more recent results on kNN-LLM estimation in functional statistics, we cite Rachdi et al. (2021), who treated the case where the response variable is missing at random and the regressor is functional.

On the other hand, estimation based on the kNN method has several advantages over the Nadaraya–Watson algorithm (see Burba, Ferraty & Vieu (2009) for more discussion of the motivations for this approach). In this paper, we benefit from the advantages of both kNN weighting and LLM-fitting by combining the two algorithms to provide a fast and efficient estimator of the CCDF. Specifically, we combine these approaches to construct two estimators of the CCDF and of the shortest conditional modal interval (SCMI) predictive regions using mixing functional time series. Notice also that the smoothing parameter in the k-Nearest Neighbors method varies randomly with the observations. This feature makes these estimators fast and accurate in practice, because the smoothing parameter is selected according to the nature of the data. To highlight the smoothing parameter selection issue, we compare several cross-validation rules for selecting the best number of neighbors. The practical value of both estimators is illustrated on meteorological data; specifically, the constructed estimators are used to predict the yearly temperature curve in central Europe.

This paper is structured as follows. The two kNN-LLM estimators of the CCDF are constructed in the “Methods” section. The “Results and discussions” section is devoted to a discussion of the proposed estimators. A simulation study assessing the performance of the proposed estimators is presented in the “Simulation study” section. The performance of the constructed estimators in temperature prediction on real data is examined in the “Real-data application” section. Our conclusion and remarks for further research are presented in the last section.

Methods

In this section, we construct two new estimators by combining the local linear approach with the kNN smoothing method.

The fast kNN-LLM of the CCDF

Let $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$ be a stationary sequence of random vectors distributed as $(X,Y)$ and valued in $\mathcal{F}\times\mathbb{R}$, where $\mathcal{F}$ is a separable metric space with metric $d$. Let $N_x$ be a neighborhood of a fixed curve $x\in\mathcal{F}$, on which we suppose that the conditional cumulative distribution function (CCDF) $F(\cdot|x)$ has a continuous conditional density $f(\cdot|x)$.

This estimation procedure is based on the definition of CCDF, as conditional expectation:

$$F(y|x)=\mathbb{E}\left[\mathbb{1}_{\{Y\le y\}}\,\middle|\,X=x\right],$$

where $\mathbb{1}_A$ denotes the indicator function of the set $A$. In the LLM technique, we approximate $F(y|x)$ locally in $N_x$ using

$$\forall x_0\in N_x,\quad F(y|x_0)=a_{yx}+b_{yx}\,d(x_0,x)+o(d(x,x_0)). \tag{1}$$

So, the kNN-LLM estimator of the CCDF $F(y|x)$ is obtained by estimating $a_{yx}$ and $b_{yx}$ in (1) as the minimizers of

$$\min_{a,b}\sum_{i=1}^{n}\left(\mathbb{1}_{\{Y_i\le y\}}-a-b\,d(X_i,x)\right)^2\mathrm{Ker}\left(\frac{d(x,X_i)}{\mathbb{H}_k}\right),$$

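where $\mathrm{Ker}$ is the kernel function and $\mathbb{H}_k=\min\left\{h\in\mathbb{R}^+ : \sum_{i=1}^{n}\mathbb{1}_{B(x,h)}(X_i)=k\right\}$, with $B(x,h)$ the ball of center $x$ and radius $h$. As in Barrientos-Marin, Ferraty & Vieu (2010), differentiating this criterion shows that $\hat a$ and $\hat b$ are the solutions of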

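$${}^tL\,\mathrm{KER}\left(\Upsilon-L\begin{pmatrix}\hat a\\ \hat b\end{pmatrix}\right)=0,$$

where

$${}^tL=\begin{pmatrix}1&1&\cdots&1\\ d(X_1,x)&d(X_2,x)&\cdots&d(X_n,x)\end{pmatrix},$$

$$\mathrm{KER}=\mathrm{diag}\left(\mathrm{Ker}\left(\frac{d(x,X_1)}{\mathbb{H}_k}\right),\mathrm{Ker}\left(\frac{d(x,X_2)}{\mathbb{H}_k}\right),\ldots,\mathrm{Ker}\left(\frac{d(x,X_n)}{\mathbb{H}_k}\right)\right)$$

and

$${}^t\Upsilon=\left(\mathbb{1}_{\{Y_1\le y\}},\ldots,\mathbb{1}_{\{Y_n\le y\}}\right).$$

It follows that

$$\begin{pmatrix}\hat a\\ \hat b\end{pmatrix}=\left({}^tL\,\mathrm{KER}\,L\right)^{-1}\left({}^tL\,\mathrm{KER}\,\Upsilon\right).$$

Hence,

$$\hat a=(1,0)\left({}^tL\,\mathrm{KER}\,L\right)^{-1}\left({}^tL\,\mathrm{KER}\,\Upsilon\right).$$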
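As a sketch of how this matrix expression turns into an explicit formula, the 2 × 2 system can be solved by hand. The shorthand $d_i$, $K_i$, $Z_i$, $S_r$ and $T_r$ below is introduced only for this illustration and is not notation from the original derivation.

$$d_i=d(X_i,x),\quad K_i=\mathrm{Ker}\left(\frac{d_i}{\mathbb{H}_k}\right),\quad Z_i=\mathbb{1}_{\{Y_i\le y\}},\quad S_r=\sum_{i=1}^{n}K_i d_i^{\,r},\quad T_r=\sum_{i=1}^{n}K_i d_i^{\,r}Z_i.$$

$${}^tL\,\mathrm{KER}\,L=\begin{pmatrix}S_0&S_1\\ S_1&S_2\end{pmatrix},\qquad {}^tL\,\mathrm{KER}\,\Upsilon=\begin{pmatrix}T_0\\ T_1\end{pmatrix}.$$

$$\hat a=\frac{S_2T_0-S_1T_1}{S_0S_2-S_1^2}=\frac{\sum_{i,j=1}^{n}d_i(d_i-d_j)K_iK_jZ_j}{\sum_{i,j=1}^{n}d_i(d_i-d_j)K_iK_j}.$$

The last equality follows by expanding the products of sums. It shows that the intercept is a ratio of double sums over pairs of observations, with the response indexed by the second summation index.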

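Finally, the fast kNN-LLM estimator of the CCDF $F(y|x)$ is

$$\widehat F(y|x)=\hat a_{yx}=\frac{\sum_{i,j=1}^{n}\beta_{ij}\,\mathbb{1}_{\{Y_j\le y\}}}{\sum_{i,j=1}^{n}\beta_{ij}},$$

where

$$\beta_{ij}=d(X_i,x)\left(d(X_i,x)-d(X_j,x)\right)\mathrm{Ker}\left(\mathbb{H}_k^{-1}d(x,X_i)\right)\mathrm{Ker}\left(\mathbb{H}_k^{-1}d(x,X_j)\right).$$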
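The fast estimator can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the quadratic kernel is assumed to be K(u) = 1.5(1 − u²) on [0, 1), the distances d(X_i, x) are assumed to be precomputed, and the fallback to a weighted mean when the local design is degenerate is an added safeguard.

```python
import numpy as np

def quadratic_kernel(u):
    """Assumed quadratic kernel K(u) = 1.5 * (1 - u^2) on [0, 1), zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where((u >= 0) & (u < 1), 1.5 * (1.0 - u ** 2), 0.0)

def fast_knn_llm_cdf(y, dists, Y, k):
    """Intercept of the kernel-weighted local linear fit of 1{Y_i <= y} on d(X_i, x)."""
    dists = np.asarray(dists, dtype=float)
    Y = np.asarray(Y, dtype=float)
    h = np.sort(dists)[k - 1]              # kNN bandwidth: k-th smallest distance to x
    if h <= 0:
        raise ValueError("the k-th smallest distance must be positive")
    K = quadratic_kernel(dists / h)        # the k-th neighbor itself gets weight 0 here
    Z = (Y <= y).astype(float)             # indicator responses
    S0, S1, S2 = K.sum(), (K * dists).sum(), (K * dists ** 2).sum()
    T0, T1 = (K * Z).sum(), (K * dists * Z).sum()
    if S0 <= 0:
        raise ValueError("no neighbor receives positive weight; increase k")
    denom = S0 * S2 - S1 ** 2              # equals the double sum of the beta_ij weights
    if denom <= 1e-12 * S0 * S2:
        return T0 / S0                     # assumption: weighted-mean fallback if degenerate
    return (S2 * T0 - S1 * T1) / denom
```

The closed form avoids the O(n²) double sum while giving the same value. Note that a local linear estimate is not guaranteed to lie in [0, 1], so a caller may want to clip the result before using it as a probability.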

The smooth kNN-LLM of the CCDF

An alternative estimation of CCDF is built by treating the function F(·|x) as a conditional expectation, i.e.,

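$$\mathbb{E}\left[H\left(\ell^{-1}(y-Y_i)\right)\middle|X_i=x\right]\to F(y|x)\quad\text{as }\ell\to 0,$$

where $H$ is a cumulative distribution function and $\ell=\ell_n$ is a sequence of positive real numbers. This idea was first proposed by Fan & Gijbels (1996) in the nonfunctional setup. Accordingly, the smooth kNN-LLM estimator of the CCDF is obtained by estimating the operators $a_{yx}$ and $b_{yx}$ of formula (1) as the solution of

$$\min_{(a,b)\in\mathbb{R}^2}\sum_{i=1}^{n}\left(H\left(\ell_l^{-1}(y-Y_i)\right)-a-b\,d(X_i,x)\right)^2\mathrm{Ker}\left(\frac{d(x,X_i)}{\mathbb{H}_k}\right),$$

where $\ell_l=\min\left\{\ell\in\mathbb{R}^+ : \sum_{i=1}^{n}\mathbb{1}_{(y-\ell,\,y+\ell)}(Y_i)=l\right\}$. The smooth kNN-LLM estimator of the CCDF $F(y|x)$ is then given explicitly by

$$\widetilde F(y|x)=\frac{\sum_{i,j=1}^{n}\beta_{ij}\,H\left(\ell_l^{-1}(y-Y_j)\right)}{\sum_{i,j=1}^{n}\beta_{ij}}.$$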
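The smooth variant differs from the fast one only in the response that is regressed. The sketch below is illustrative and rests on assumptions: the text does not specify H, so the CDF of the Epanechnikov density on [−1, 1] is used as a stand-in; the bandwidth in the response direction is approximated by the l-th smallest value of |Y_i − y|; and all function names are hypothetical.

```python
import numpy as np

def quadratic_kernel(u):
    """Assumed quadratic kernel K(u) = 1.5 * (1 - u^2) on [0, 1), zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where((u >= 0) & (u < 1), 1.5 * (1.0 - u ** 2), 0.0)

def smooth_H(u):
    """Assumed H: CDF of the Epanechnikov density 0.75 * (1 - t^2) on [-1, 1]."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return 0.5 + 0.75 * u - 0.25 * u ** 3

def smooth_knn_llm_cdf(y, dists, Y, k, l):
    """Local linear intercept with smoothed responses H((y - Y_i) / ell_l)."""
    dists = np.asarray(dists, dtype=float)
    Y = np.asarray(Y, dtype=float)
    h = np.sort(dists)[k - 1]                  # kNN bandwidth in the curve direction
    ell = np.sort(np.abs(Y - y))[l - 1]        # kNN bandwidth in the response direction
    if h <= 0 or ell <= 0:
        raise ValueError("both kNN bandwidths must be positive")
    K = quadratic_kernel(dists / h)
    Z = smooth_H((y - Y) / ell)                # smooth replacement for the indicator
    S0, S1, S2 = K.sum(), (K * dists).sum(), (K * dists ** 2).sum()
    T0, T1 = (K * Z).sum(), (K * dists * Z).sum()
    if S0 <= 0:
        raise ValueError("no neighbor receives positive weight; increase k")
    denom = S0 * S2 - S1 ** 2
    if denom <= 1e-12 * S0 * S2:
        return T0 / S0                         # assumption: weighted-mean fallback
    return (S2 * T0 - S1 * T1) / denom
```

Because H is continuous, the resulting estimate is a continuous function of y, in contrast to the step-function behaviour of the indicator-based estimator.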

Results and discussions

On the impact of this contribution

It is well known that the CCDF plays a pivotal role in nonparametric statistical modeling. Indeed, the nonparametric estimation of this model is an essential step for several nonparametric models, including the conditional density, the conditional quantile functions and the conditional hazard. In the prediction setting, the CCDF allows the construction of various predictive regions or, more specifically, predictive intervals. We mention, for instance, the shortest conditional modal interval (SCMI), the conditional percentile interval and the maximum conditional density region (MCDR) (see De Gooijer & Gannoun (2000) for their definitions). This diversity of applications highlights the importance of the CCDF, which completely characterizes the conditional law of the random variables under consideration. As mentioned in the bibliographical discussion of the Introduction, the CCDF model has been widely studied in NPFDM. The novelty of the present work lies in estimating the CCDF by combining two fundamental approaches: the kNN and the LLM. This combination yields an attractive estimator that inherits the advantages of both methods. Indeed, it is well known that the LLM improves the bias of the classical kernel method (CKM), while kNN weighting offers a sophisticated procedure for choosing the smoothing parameter. The parameter is selected locally, with respect to the vicinity of the conditioning point, which adapts better to the topological structure of the data. Such adaptation is essential in nonparametric functional data analysis, where the efficiency of the estimators is tied to the data structure through the concentration property of the probability measure of the functional variable (see Ferraty & Vieu (2006)). Nevertheless, establishing the convergence rate of the kNN-LLM estimators is more complicated than in the case considered by Laksaci, Rachdi & Rahmani (2013). In our case, the smoothing parameter is a random variable, whereas it is a deterministic scalar in the classical situation. This difficulty is compounded in the dependent case, which is more general and more realistic. In conclusion, the principal axes of this contribution are: (1) the conditional distribution function as a pivotal model for various nonparametric conditional models, (2) the estimation method, which is a new procedure even in the nonfunctional case (as far as we know, there is no work on CCDF estimation combining the LLM with the kNN) and (3) the functional time series case as a generalization of the independent case. To emphasize the usefulness of the present contribution for prediction, we discuss in the following section how to predict a future real characteristic of a continuous-time process given its past.

Functional time series prediction

Recall that nonparametric prediction is considered the most important application of functional nonparametric data analysis. In particular, functional time series can be built from a continuous-time process. Indeed, consider a real-valued continuous-time process $(S_t)_{t\in[0,b)}$. From $S_t$ we construct $n$ functional random variables $(X_i)_{i=1,\ldots,n}$ defined by

$$\forall t\in[0,b),\quad X_i(t)=S_{n^{-1}((i-1)b+t)}.$$

Therefore, if our aim is to predict a future value $Y=S_{t_0}$ at a fixed point $t_0=b+s$ given $(S_t)_{t\in[0,b)}$, we define a sequence of the variable of interest $Y$ by

$$Y_i=S_{n^{-1}ib+s},\quad i=1,\ldots,n.$$

Thereafter, we construct our predictor (conditional median, conditional quantile or conditional mode) using the observations $(X_i,Y_i)_{i=1,\ldots,n-1}$. However, since a predictive region or, more specifically, a predictive interval is often more informative than a single-point prediction, we focus on this kind of prediction. Formally, for all $\zeta\in(0,1)$, the interval/region is defined as a set $I_\zeta\subset\mathbb{R}$ satisfying

$$\mathbb{P}\left(Y_n\in I_\zeta\,|\,X_n\right)=1-\zeta.$$

As mentioned in the above section, one of the main features of the CCDF is that it makes it possible to construct several predictive regions $I_\zeta$. The efficiency of each prediction interval is assessed by the length of the set $I_\zeta$ and by whether the true value lies in $I_\zeta$. It is well documented that the SCMI is the narrowest of all predictive regions with the same coverage probability (see De Gooijer & Gannoun (2000)). The SCMI was introduced by Tong & Yao (1995) and is obtained by

$$[A_{1-\zeta},B_{1-\zeta}]=\arg\min_{c,d}\left\{\mathrm{Leb}[c,d]\;\middle|\;F(d|x)-F(c|x)\ge 1-\zeta\right\},$$

where $\mathrm{Leb}(\cdot)$ denotes the Lebesgue measure. Using the CCDF estimators, we approximate the SCMI by

$$[A_{1-\zeta}(X_n),B_{1-\zeta}(X_n)]=\arg\min_{c,d}\left\{\mathrm{Leb}[c,d]\;\middle|\;\ddot F(d|X_n)-\ddot F(c|X_n)\ge 1-\zeta\right\},$$

where $\ddot F$ means $\widehat F$ or $\widetilde F$. The easy implementation of this approximation is studied and discussed in the “Real-data application” section.
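One simple way to approximate the SCMI numerically is to evaluate the estimated CCDF on a grid of response values and search for the shortest grid interval that reaches the required coverage. The sketch below is only an illustration of that idea: the function name is hypothetical, and forcing the estimated CCDF to be monotone and to lie in [0, 1] before the search is an added assumption, since a local linear estimate need not satisfy either property.

```python
import numpy as np

def shortest_interval(grid, cdf_values, zeta):
    """Shortest [c, d] with endpoints on `grid` such that F(d) - F(c) >= 1 - zeta.

    `grid` must be sorted increasingly. Returns None if no interval on the
    grid reaches the required coverage.
    """
    grid = np.asarray(grid, dtype=float)
    # Assumption: clip and monotonize the estimated CCDF before searching.
    F = np.maximum.accumulate(np.clip(cdf_values, 0.0, 1.0))
    best = None
    j = 0
    for i in range(len(grid)):
        j = max(j, i)
        # F is nondecreasing, so the smallest admissible j never moves left.
        while j < len(grid) and F[j] - F[i] < 1.0 - zeta:
            j += 1
        if j == len(grid):
            break
        if best is None or grid[j] - grid[i] < best[1] - best[0]:
            best = (grid[i], grid[j])
    return best
```

For a skewed conditional law the result is not the equal-tailed interval. For example, with an exponential CDF and ζ = 0.1 the search returns an interval starting at zero rather than at the 5% quantile.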

Simulation study

In this simulation study, we examine the behaviour of the estimators with respect to the degree of dependence in the data. More precisely, our aim is to compare, on finite samples, the efficiency of the estimators $\widehat F(y|x)$ and $\widetilde F(y|x)$. To do this, we use the fact that m-dependent variables are α-mixing, and we generate n functional variables as follows.

First, we draw n + m − 1 independent functional variables

$$S_j(t)=2\cos(tW_{1j})+0.2\,W_{2j}\quad\text{for } j=1,\ldots,n+m-1,$$

where $W_{1j}$ and $W_{2j}$ are uniformly distributed on $[0,\pi/4]$. Next, we simulate the m-dependent functional variables defined by

$$X_i(t)=\sum_{j=i}^{i+m-1}S_j(t).$$

The curves $X_i$ are discretized on a common grid of 100 equispaced points in (0, 2π). Figure 1 shows the functional variables in the strong dependence case, m = 4.
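The data-generating mechanism can be sketched as follows. This is an illustrative reading of the formulas as printed above, not the authors' simulation code: the function name, the seed handling and the exact placement of the 100 grid points strictly inside (0, 2π) are assumptions.

```python
import numpy as np

def simulate_m_dependent_curves(n, m, n_grid=100, seed=0):
    """Generate n discretized m-dependent curves X_i = S_i + ... + S_{i+m-1}."""
    rng = np.random.default_rng(seed)
    # Assumption: equispaced points strictly inside (0, 2*pi).
    t = np.linspace(0.0, 2.0 * np.pi, n_grid + 2)[1:-1]
    W1 = rng.uniform(0.0, np.pi / 4.0, size=n + m - 1)
    W2 = rng.uniform(0.0, np.pi / 4.0, size=n + m - 1)
    # Independent base curves S_j(t) = 2 cos(t W1_j) + 0.2 W2_j, one per row.
    S = 2.0 * np.cos(np.outer(W1, t)) + 0.2 * W2[:, None]
    # Moving sums over m consecutive base curves induce m-dependence.
    X = np.stack([S[i:i + m].sum(axis=0) for i in range(n)])
    return t, X
```

Curves X_i and X_j share at least one base curve only when |i − j| < m, which is what makes the sequence m-dependent; increasing m therefore strengthens the dependence.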

Figure 1: A sample of 100 curves.

For the response variable we consider four regression models:

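$$\text{Model M1:}\quad Y_i=5\int_0^{2\pi}\log\left((4X_i(t))^2+2\right)dt+\varepsilon_i,$$

$$\text{Model M2:}\quad Y_i=\int_0^{2\pi}\exp(X_i(t))\,dt+1.5\int_0^{2\pi}\exp(X_i(t))\,dt+\varepsilon_i,$$

$$\text{Model M3:}\quad Y_i=\int_0^{2\pi}\left(5\log\left((4X_i(t))^2+2\right)\right)dt+\int_0^{2\pi}\exp(X_i(t))\,dt+1.5\,\delta_i\int_0^{2\pi}\exp(X_i(t))\,dt,$$

$$\text{Model M4:}\quad Y_i=\int_0^{2\pi}\exp(X_i(t))\,dt+1.5\int_0^{2\pi}\exp(X_i(t))\,dt+5\,\delta_i\int_0^{2\pi}\log\left((4X_i(t))^2+2\right)dt,$$

where $\varepsilon_i\sim N(0,0.25)$ (resp. $\delta_i\sim\mathrm{Exp}(2)$). Note that, under these models and given X = x, the CCDF of Y is explicitly determined by the distributions of $\varepsilon_i$ and $\delta_i$, which yields the theoretical CCDF $F(y|x)$.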

We now briefly specify the parameters involved in both estimators: the kernel K, the semi-metric d, and the numbers k and/or l of neighbors. For this numerical study, we use a quadratic kernel supported on (0, 1) and the L2 metric, and the numbers of neighbors k and l are chosen using the cross-validation criterion

$$\sum_{j=1}^{n}\left(F(Y_j|X_j)-\bar F_{-j}(Y_j|X_j)\right),$$

where $\bar F_{-j}$ denotes the leave-one-curve-out estimate based on $\widetilde F$ or $\widehat F$.

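The performance of the two estimators is examined by comparing their average absolute errors:

$$\mathrm{AE}(\widetilde F)=\frac{1}{n}\sum_{i=1}^{n}\left|F(Y_i|X_i)-\widetilde F(Y_i|X_i)\right|\quad\text{and}\quad \mathrm{AE}(\widehat F)=\frac{1}{n}\sum_{i=1}^{n}\left|F(Y_i|X_i)-\widehat F(Y_i|X_i)\right|.$$

Thus, to examine the behavior of both estimators with respect to the level of dependence, Fig. 2 plots the curves of $\mathrm{AE}(\widetilde F)$ and $\mathrm{AE}(\widehat F)$ against the values of m.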

Figure 2: $\mathrm{AE}(\widetilde F)$ (dotted line) and $\mathrm{AE}(\widehat F)$ (continuous line).

It can be seen that both errors increase substantially with m. Furthermore, in models M3 and M4 (heteroscedastic case), the estimator $\widehat F$ outperforms $\widetilde F$, whereas in models M1 and M2 (homoscedastic case), $\widetilde F$ is significantly better than $\widehat F$.

Real-data application

In this section, we show the applicability of the proposed estimators to a real data example. To do so, we consider the problem of predicting the monthly average temperature one year ahead. For this purpose, we consider the same data set used by Laksaci, Lemdani & Said (2011), which is available at https://www.met.hu/en/eghajlat/magyarorszag_eghajlata/eghajlati_adatsorok/Debrecen/adatok/napi_adatok/index.php. These data were collected at the Debrecen station, Hungary (northern latitude 47°35′44″ and eastern longitude 21°38′43″). They are monthly measurements (1,200 months = 100 years) from 1901 to 2000, and can be viewed as a continuous process denoted by St. As noted in the previous section, from St we construct n + 1 = 100 curves (Xi(t)), i = 1,…, n + 1, where Xi denotes the average temperature curve observed during the 12 months of the ith year. The process (St) and the curves (Xi)i are plotted in Figs. 3 and 4.
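To make the construction of the curves concrete, the following minimal sketch cuts a monthly series into yearly curves and pairs each curve with the value of a given month in the following year, which is the pairing Yij = Xi+1(j) used later in this section. The function names are hypothetical, and the actual Debrecen data are not loaded here.

```python
import numpy as np

def yearly_curves(monthly_series, months_per_year=12):
    """Cut a monthly series into one row per year (each row is a curve X_i)."""
    s = np.asarray(monthly_series, dtype=float)
    if s.size % months_per_year != 0:
        raise ValueError("series length must be a multiple of months_per_year")
    return s.reshape(-1, months_per_year)

def lagged_pairs(curves, month):
    """Return (X_i, Y_i) with Y_i = X_{i+1}(month), for a 1-based month index."""
    curves = np.asarray(curves, dtype=float)
    if not 1 <= month <= curves.shape[1]:
        raise ValueError("month is out of range")
    # Each curve except the last is a regressor; the response is the value
    # of the chosen month in the following year.
    return curves[:-1], curves[1:, month - 1]
```

With 1,200 monthly values this yields 100 curves and, for each month, 99 regressor/response pairs.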

Figure 3: Mean temperature by year.

Figure 4: Monthly mean temperature.

Of course, the efficiency of the proposed predictive interval is closely connected to the choice of parameters in the estimator of the conditional distribution function. For the real data example, we compare the estimators $\widetilde F$ and $\widehat F$ in the SCMI estimation (with ζ = 0.1). For the computational study, we use the same kernel K, that is, the quadratic kernel on (0,1). This kernel is well suited to this type of nonparametric approach: it is commonly used in nonparametric functional statistics and satisfies the technical assumptions of the theoretical development of this kind of model. Concerning the choice of the metric d, we point out that it is closely related to the nature of the functional variable and its smoothness. The Principal Component Analysis (PCA) metric is more suitable for this type of discontinuous functional regressors. For the choice of k (or h), we use the same cross-validation method as De Gooijer & Gannoun (2000), which is based on the criterion

$$\mathrm{CV}=\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{n}\left(\mathbb{1}_{\{Y_j\le Y_i\}}-\widehat F_{-j}(Y_i|x_j)\right)^2.$$

This criterion is optimised over the same subsets of k (or h) proposed by Rachdi & Vieu (2007), that is, {5,10,20,…,45}.
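A brute-force version of this selection rule can be sketched as follows. It is an illustration under assumptions rather than the procedure actually used: it takes a precomputed matrix D of pairwise distances between curves (so any metric, including a PCA-based one, can be plugged in), it uses the fast kNN local linear estimate with an assumed quadratic kernel, it reads the indicator in the criterion as 1{Y_j ≤ Y_i}, and the function names are hypothetical.

```python
import numpy as np

def quadratic_kernel(u):
    """Assumed quadratic kernel K(u) = 1.5 * (1 - u^2) on [0, 1), zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where((u >= 0) & (u < 1), 1.5 * (1.0 - u ** 2), 0.0)

def local_linear_cdf(y, dists, Y, k):
    """Fast kNN local linear CCDF estimate (weighted-mean fallback if degenerate)."""
    h = np.sort(dists)[k - 1]
    K = quadratic_kernel(dists / h)
    Z = (Y <= y).astype(float)
    S0, S1, S2 = K.sum(), (K * dists).sum(), (K * dists ** 2).sum()
    T0, T1 = (K * Z).sum(), (K * dists * Z).sum()
    denom = S0 * S2 - S1 ** 2
    if denom <= 1e-12 * S0 * S2:
        return T0 / S0 if S0 > 0 else np.nan
    return (S2 * T0 - S1 * T1) / denom

def cv_score(D, Y, k):
    """Leave-one-curve-out criterion for a given number of neighbors k."""
    D = np.asarray(D, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    total = 0.0
    for j in range(n):
        keep = np.arange(n) != j                  # drop curve j from the sample
        d_j, Y_rest = D[j, keep], Y[keep]
        for i in range(n):
            est = local_linear_cdf(Y[i], d_j, Y_rest, k)   # estimate of F(Y_i | x_j)
            total += (float(Y[j] <= Y[i]) - est) ** 2
    return total / n

def select_k(D, Y, candidates):
    """Return the candidate k with the smallest criterion, plus all scores."""
    scores = {k: cv_score(D, Y, k) for k in candidates}
    return min(scores, key=scores.get), scores
```

The double loop costs on the order of n³ kernel evaluations per candidate, which is acceptable for a sample of about 100 curves but would need vectorizing for larger samples.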

To determine the SCMI predictive interval for the whole curve of the last year (i = 100) of this sample given the functional covariate $X_{99}$, we use the first 98 curves as a training sample. Then, we predict the CCDF given $X_{99}$ by $\widehat F(\cdot|X_{99}^*)$ and $\widetilde F(\cdot|X_{99}^*)$, where $X_{99}^*$ is the nearest curve to $X_{99}$ in the training sample $(Y_i^j,X_i)_{i=1}^{99}$ with $Y_i^j=X_{i+1}(j)$, for each fixed month j = 1,…,12. Figure 5 displays the results. The dashed curve represents the observed data and the solid curves represent the estimated lower and upper bounds of the SCMI predictive interval. We observe that $\widetilde F$ performs significantly better than $\widehat F$, in the sense that it has a smaller average interval length (M.L = 1.23 versus M.L = 1.97 for $\widehat F$). Of course, this gain is not surprising.

Figure 5: Results obtained by smooth kNN-LLM and fast kNN-LLM estimators.

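In the second illustration, we emphasize the importance of the kNN-LLM estimation of the CCDF for the construction of the SCMI predictive interval by comparing the two kNN-LLM estimators, $\widetilde F$ and $\widehat F$, to their competitors constructed by the kNN-Nadaraya–Watson (kNN-NW) method. More precisely, we compare the estimators $\widetilde F$ and $\widehat F$ to the smooth ($\widetilde F_1(y|x)$) and fast ($\widehat F_1(y|x)$) kNN-NW estimators, where

$$\widetilde F_1(y|x)=\frac{\sum_{i=1}^{n}\mathrm{Ker}\left(\frac{d(x,X_i)}{\mathbb{H}_k}\right)H\left(\ell_l^{-1}(y-Y_i)\right)}{\sum_{i=1}^{n}\mathrm{Ker}\left(\frac{d(x,X_i)}{\mathbb{H}_k}\right)}$$

and

$$\widehat F_1(y|x)=\frac{\sum_{i=1}^{n}\mathrm{Ker}\left(\frac{d(x,X_i)}{\mathbb{H}_k}\right)\mathbb{1}_{\{Y_i\le y\}}}{\sum_{i=1}^{n}\mathrm{Ker}\left(\frac{d(x,X_i)}{\mathbb{H}_k}\right)}.$$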
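For comparison with the local linear versions, both kNN-NW competitors reduce to a kernel-weighted average of the responses. The sketch below is illustrative only: the function names are hypothetical, the quadratic kernel form and the choice of H (the Epanechnikov CDF) are assumptions, and passing `l=None` selects the fast (indicator) version.

```python
import numpy as np

def quadratic_kernel(u):
    """Assumed quadratic kernel K(u) = 1.5 * (1 - u^2) on [0, 1), zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where((u >= 0) & (u < 1), 1.5 * (1.0 - u ** 2), 0.0)

def smooth_H(u):
    """Assumed H: CDF of the Epanechnikov density on [-1, 1]."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return 0.5 + 0.75 * u - 0.25 * u ** 3

def knn_nw_cdf(y, dists, Y, k, l=None):
    """kNN Nadaraya-Watson CCDF estimate: fast if l is None, smooth otherwise."""
    dists = np.asarray(dists, dtype=float)
    Y = np.asarray(Y, dtype=float)
    h = np.sort(dists)[k - 1]                    # kNN bandwidth in the curve direction
    if h <= 0:
        raise ValueError("the k-th smallest distance must be positive")
    K = quadratic_kernel(dists / h)
    if K.sum() <= 0:
        raise ValueError("no neighbor receives positive weight; increase k")
    if l is None:
        Z = (Y <= y).astype(float)               # fast version: indicator responses
    else:
        ell = np.sort(np.abs(Y - y))[l - 1]      # kNN bandwidth in the response direction
        if ell <= 0:
            raise ValueError("the response bandwidth must be positive")
        Z = smooth_H((y - Y) / ell)              # smooth version
    return (K * Z).sum() / K.sum()
```

Unlike the local linear estimates, this weighted average always lies in [0, 1], but it carries the larger bias of a local constant fit.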

Of course, in order to conduct a comprehensive comparison, we must treat the four estimators under the same conditions. For this reason, we use the same kernel, the same metric and the same selection method for the optimal number of neighbors k. As in the first illustration, this comparative analysis is performed over the 12 functional time series $(Y_i^j,X_i)_{i=1}^{100}$ with $Y_i^j=X_{i+1}(j)$, for each fixed month j = 1,…,12. For each fixed j, we randomly split the functional time series into two parts: a learning sample (70 observations) and a test sample (30 observations). Next, we compute the SCMI predictive intervals for the curves of the test sample using the estimators $\widetilde F$, $\widehat F$, $\widetilde F_1$ and $\widehat F_1$. The efficiency of the four estimators is evaluated using Nemenyi test plots for the average interval length (ML). The comparison results are displayed in Fig. 6.

Figure 6: Comparison of the average means of length intervals.

Once again, the conclusion is unsurprising. It confirms the statement made in the first illustration. More precisely, the local linear approach is more accurate than the Nadaraya–Watson method. This conclusion emphasizes the superiority of the local linear method over the classical kernel method, which has been shown in multivariate statistics through the bias term. On the other hand, it should be noted that all four functional approaches perform satisfactorily in this prediction context.

Conclusion

In this paper, we have studied the nonparametric estimation of the cumulative distribution function of a scalar response variable given a functional explanatory variable. Two new estimators are constructed by combining the local linear approach with kNN smoothing. The first is built by a fast algorithm that treats the conditional cumulative distribution as a classical regression of the indicator function. The second is obtained by integrating the double-kernel conditional density estimator, which gives a smooth estimator of the conditional cumulative distribution function. An empirical analysis is conducted to compare both estimators and to illustrate their easy implementation in practice. The finite-sample performance of the two estimators is assessed on both artificial and real data. The present contribution highlights the potential impact of the conditional distribution function as a pivotal model in functional data analysis. It is involved in various conditional nonparametric models, and its estimation is an indispensable preliminary step for estimating numerous nonparametric functional models. For instance, we have focused on the prediction problem in this paper, and we have constructed two predictors (a single-point predictor and a region predictor). The artificial data analysis shows that both estimators are satisfactorily efficient in different scenarios of regression data analysis, including the homoscedastic case, the heteroscedastic case, mixture models, and light-tailed or heavy-tailed conditional distributions. In conclusion, functional data analysis through the conditional distribution has a significant impact in practice, and the proposed estimators are fast, very easy to implement, and perform well in prediction. Finally, let us note that the present contribution opens several prospects for future research. For example, it would be very interesting to compare the efficiency of our approach to other functional models such as robust regression and relative regression. Such models make it possible to reduce the sensitivity of the kNN approach to noisy data, missing values, and outliers.
