Dependence structure analysis of multisite river inflow data using vine copulaCEEMDAN based hybrid model
 Published
 Accepted
 Received
 Academic Editor
 Jianhua Xu
 Subject Areas
 Statistics, Computational Science, Natural Resource Management, Ecohydrology, Spatial and Geographic Information Science
 Keywords
 Canonicalvine, Pair copula construction, Complete ensembe empirical mode decomposition with adaptive noises
 Copyright
 © 2020 Nazir et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 2020. Dependence structure analysis of multisite river inflow data using vine copulaCEEMDAN based hybrid model. PeerJ 8:e10285 https://doi.org/10.7717/peerj.10285
Abstract
Several datadriven and hybrid models are univariate and not considered the dependance structure of multivariate random variables, especially the multisite river inflow data, which requires the joint distribution of the same river basin system. In this paper, we proposed a Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) Vine copulabased approach to address this issue. The proposed hybrid model comprised on two stages: In the first stage, the CEEMDAN is used to extract the high dimensional multiscale features. Further, the multiple models are used to predict multiscale components and residuals. In the second stage, the residuals obtained from the first stage are used to model the joint uncertainty of multisite river inflow data by using Canonical Vine. For the application of the proposed twostep architecture, daily river inflow data of the Indus River Basin is used. The proposed twostage methodology is compared with only the first stage proposed model, Vector Autoregressive and copulabased Autoregressive Integrated Moving Average models. The four evaluation measures, that is, Mean Absolute Relative Error (MARE), Mean Absolute Deviation (MAD), NashSutcliffe Efficiency (NSE) and Mean Square Error (MSE), are used to observe the prediction performance. The results demonstrated that the proposed model outperforms significantly with minimum MARE, MAD, NSE, and MSE for two case studies having significant joint dependance. Therefore, it is concluded that the prediction can be improved by appropriately modeling the dependance structure of the multisite river inflow data.
Introduction
In the past few decades, reliable prediction of rivers inflow has gained popularity in all waterrelated departments because of their crucial role in the reservoir, irrigation management, water planning, risk evaluation and flood controlling (Porporato & Ridolfi, 2001; Jandhyala, Liu & Fotopoulos, 2009; Di, Yang & Wang, 2014; Tiwari et al., 2017; Nazir et al., 2019). Johnston & Smakhtin (2014) reviewed the importance of river data and concluded that river inflow data is an indispensable component of water resources. For many rivers and water storage systems, a joint prediction of inflow at multisite, which not only accounts for the inflow characteristics of individual streams but also consider their intersite correlations, is necessary for planning water resources and flood control (Wang & Robertson, 2011). Several singlesite and multisite models have been developed to predict river inflow (Sharma & O’Neill, 2002; Wu & Chau, 2010; Wang & Robertson, 2011; Oyebode, Otieno & Adeyemo, 2014; Devia, Ganasri & Dwarakish, 2015; Wang et al., 2018). Among singlesite models, traditional statistical models that include Autoregressive (AR), Moving Averages (MA), Autoregressive Integrated Moving Average (ARIMA), Autoregressive Conditional Heteroscedasticity (ARCH) and Generalized ARCH (GARCH) have been efficiently used to predict river inflow data (Ghimire, 2017). However, these are commonly used methods but have some drawbacks that these are unable to capture nonlinear, nonstationary, and interdependence characteristics of timeseries data such as river inflow data (Box, 1970; Sharma & O’Neill, 2002; Wu & Chau, 2010; Oyebode, Otieno & Adeyemo, 2014; Devia, Ganasri & Dwarakish, 2015; Wang et al., 2018). The limitations of such nonstationary and nonlinearity led to the emergence of a new paradigm named as datadriven or intelligence models (Wu & Chau, 2010; Oyebode, Otieno & Adeyemo, 2014; Das & Ghosh, 2017).
Several datadriven approaches have been recognized as useful tools to model complex nonstationary and nonlinear river inflow data. For example, KNearest Neighbors, model tree (Oyebode, Otieno & Adeyemo, 2014), computational intelligence (Das & Ghosh, 2017), Genetic Algorithm, Support Vector Machine, Neural Networks (NN) includes Artificial Neural Network (ANN) and Artificial Intelligence (Tiwari et al., 2017). These datadriven models can learn complex behavior, which is an inherent part of river inflow data, without considering any assumption about data. ElShafie, Taha & Noureldin (2007) discussed that river inflow forecasting is an essential procedure for proper water operation. They proposed an adaptiveneuro fuzzy inference system to forecast the monthly inflow data. Wu & Chau (2010) explained in their review that the datadriven approaches performed better than the traditional statistical models to predict the nonlinear data. However, datadriven models may suffer an overfitting problem and are sensitive to parameter selection (Nazir et al., 2019). Moreover, datadriven models ignored the timevarying or multiscale characteristics of time series data. Several hybridization methods have been proposed to extract multiscale or timevarying information from timeseries data. These methods can be combined datadriven models with some datapreprocessing data methods, that is, Wavelet Analysis (WA), Empirical Mode Decomposition (EMD), Ensemble EMD (EEMD) and Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) (Ji, Lu & Tang, 2012; Karthikeyan & Kumar, 2013; Kang et al., 2017). The extracted multiscale information is further used as input in datadriven models to efficiently predict complicated timeseries data (Di, Yang & Wang, 2014; Panigrahi & Behera, 2017). Quality of prediction can be improved by independent modeling of these multiscale components when these multiscale components were modeled independently (Karthikeyan & Kumar, 2013; Nazir et al., 2019). Di, Yang & Wang (2014) introduced a fourstage hybrid model combining EMD/EEMD with the radial basis function of NN methods. They found that their hybrid model performs better than the conventional single time series models. Nazir et al. (2019) also proposed a hybrid model comprised of WA/EMDCEEMDAN as a data preprocessing technique to model the inflow data.
However, all traditional statistical, datadriven and hybrid models are only useful to deal with nonlinear, nonstationary, and multiscale data. They did not model the dependance structure of the multisite rivers inflow data, which requires the joint distribution of the same river basin system. Medda & Bhar (2019) deals with the comparison of singlesite and multisite streamflow prediction models. Their study revealed that the crosscorrelation between multisite rivers enhances the performance of streamflow predictions. Therefore, failure to incorporate such multisite dependance in predicting rivers inflow may produce an inaccurate and unreliable prediction. Several models have also been developed for modeling multisite rivers inflow data (Ledolter, 1978; Porporato & Ridolfi, 2001; Sharma & O’Neill, 2002; Jandhyala, Liu & Fotopoulos, 2009; Hao & Singh, 2013; Aghakouchak, 2014; Aranda & GarcíaBartual, 2018). Jandhyala, Liu & Fotopoulos (2009) proposed a Gaussian formulation to detect changes in mean rivers flow by considering the six rivers dependance structure which flows in the same region. They reached on the conclusion that if the assumption of temporal independence is satisfied, their proposed multivariate framework performs reasonably well in predicting average rivers flow.
Dependence between sites can be modeled by copulas (Min & Czado, 2010; Aghakouchak, 2014; Liu et al., 2015; Balistrocchi et al., 2017; Aranda & GarcíaBartual, 2018; Addo, Chanda & Metcalfe, 2018; Wang et al., 2018). Hao & Singh (2013) have proposed the maximum entropy copula for multisite streamflow simulation and shown a reasonable agreement with observed streamflow. Aranda & GarcíaBartual (2018) used the copula theory for the probabilistic modeling of rivers flow. However, the use of Copula is restricted to bivariate dependance only. For modeling the high dimensional data, vines including Regular Vine (RVine), Canonical Vine (CVine) and Drawable Vine (DVine) have been introduced (Czado & Aas, 2013; Liu et al., 2015; Bedford & Cooke, 2001; Bedford, Daneshkhah & Wilson, 2016). Vines are used for building a highdimensional probabilistic dependance structure, which may be comprised on the product of simple bivariate and conditional bivariate distribution functions. Bedford, Daneshkhah & Wilson (2016) approximated the complex joint uncertainty by using vines due to its flexibility of constructing highdimensional multivariate distributions in the hierarchy, which proves better than the other simple bivariate elliptical and Archimedean copulas. However, copula/vines have an underlying assumption that the time series data or a variable should be linear and stationary, which means there is no serial correlation between a time series variable and its lagged version, which usually does not meet in complex river inflow data (Sklar, 1959; Czado et al., 2009; Lee & Salas, 2011; Yusof, Kane & Yusop, 2013). Laux et al. (2011) proposed a hybrid model based on ARMAGARCH to deal with a linearity and stationery: Independent Identically Distributed (IID) requirement of the Copula. They first transformed the nonlinearity of complex data using the ARMAGARCH model, then the filtered residuals resulting from ARMAGARCH are used as input in Copula. They concluded that IID transformations through ARMAGARCH are imperative before modeling multivariate dependance structure between hydrological data to avoid bias induced due to volatile dependance. Singh et al. (2018) also developed a hybrid copulabased model for seasonal data. They incorporated ARMA with Copula to effectively filter out the interdependence, which are proved fruitful in accurately simulating pre and postmonsoon data. However, for multisite river inflow prediction, the inflow at individual river sites and dependance structure among different locations in the same river basin system, which involves nearby catchment characteristics of rivers, should preserve all statistical properties.
In summary, several datadriven and hybrid models have been developed which are univariate and do not recognize the dependency structure of multivariate random variables, particularly the multisite river inflow data, which needs the joint dissemination of the same river basin system (Nazir et al., 2019; Bedford, Daneshkhah & Wilson, 2016). In this study, we proposed two separate approaches CEEMDAN and CVine, to consider the dependance structure of multivariate random variables, especially the multisite river inflow data, while preserving all statistical properties: nonlinear, nonstationary, and multiscale.
Proposed Method
To predict the multisite rivers daily inflow data, a copulabased CEEMDAN hybrid model is proposed which works in twostages to model the multiscale and mutual dependance of multisite streams inflow as follows:
In the first stage: to get the independent residuals (which are used in the second stage to model the joint uncertainty); first, there was a need to model the multiscale components of time series data more appropriately. The CEEMDAN decomposition method is applied to extract high dimensional multiscale elements (Intrinsic Mode Function (IMF)) from each river’s inflow data separately. The dimensionality of these IMFs is reduced (CEEMDANR) by adding the respective high and low multiscale IMFs, respectively, except the first two IMFs and the last residual. Next, the Multiple Models (MM), that is, Group Method of Data Handling type Neural Network (GMDHNN) and the Revised form of GMDH type Neural Network (RGMDHNN) and ARIMA are selected to predict the IMFs and residuals (CEEMDANRMM) (Nazir et al., 2019). The residuals of this stage are further used as input to model the mutual dependance of multisite rivers inflow.
In the second stage: a CVine copula function is used to model mutual dependance of multisite rivers inflow of the same river basin system. The CVine is selected due to its functionality of selected a root node, which enhances the sum of pairwise dependance to this node. The schematic view of the proposed model is given in Fig. 1. For convenience, the proposed method is denoted as CVine based CEEMDANRMM method. The detail description of the proposed method is described as follows:
The CEEMDAN as a decomposition method
In this study, the CEEMDAN method introduced by Torres et al. (2011) is used to decompose the daily rivers inflow time series data, which is briefly described as follows:
1. The CEEMDAN add white Gaussian noises on the lines of EEMD, in original inflow data as follows: (1) $$y\left(t\right)=x\left(t\right)+{w}_{0}{n}^{j}\left(t\right)$$where $(j=1,2,\dots ,m$) mth ensemble, $x\left(t\right)$ is original inflow data and ${w}_{0}$ is the white noise amplitude. The first IMF is found in the usual way as in EEMD defined as: (2) $$\stackrel{~}{\mathrm{I}\mathrm{M}{\mathrm{F}}_{1}}={\sum}_{j=1}^{m}{\displaystyle \frac{\mathrm{I}\mathrm{M}{\mathrm{F}}_{j1}^{m}}{m}}$$
2. Compute the remainder of original inflow from the first IMF through the following equation: (3) $${r}_{1}\left(t\right)=x\left(t\right)\stackrel{~}{\mathrm{I}\mathrm{M}{\mathrm{F}}_{1}}$$
3. Add white noise in the remainder, which is calculated from Eq. (3) as ${r}_{1}\left(t\right)+{w}_{0}{n}^{j}\left(t\right)$ and decompose it to get second IMF as: (4) $$\stackrel{~}{\mathrm{I}\mathrm{M}{\mathrm{F}}_{2}}={\sum}_{j=1}^{m}{\displaystyle \frac{IM{F}_{j2}^{m}}{m}}$$
4. Repeat the steps of (2–3) until the process meets the stoppage criteria, and the remainder/residuals contain only one or two extremes. Finally, the remainder/residual is defined as: (5) $$R\left(t\right)=x\left(t\right){\sum}_{k=1}^{K}\stackrel{~}{IM{F}_{k}}$$where $k=\phantom{\rule{thickmathspace}{0ex}}1,2,..,K$.
After performing CEEMDAN, the next task is to reduce the dimension of multisubsequences (IMFs) (Nava, Matteo & Aste, 2018). To reduce the size of IMFs, excepting the first two IMFs and residuals, the remaining high and low multiscale components (IMFs) are added with each other respectively as follows: (6) $$x\left(t\right)=\mathrm{I}\mathrm{M}\mathrm{F}1+\mathrm{I}\mathrm{M}\mathrm{F}2+\sum _{J=1}^{s}\mathrm{I}\mathrm{M}{\mathrm{F}}_{J}+\sum _{k=s+1}^{K}\mathrm{I}\mathrm{M}{\mathrm{F}}_{k}+\mathrm{R}\mathrm{e}\mathrm{s}\mathrm{i}\mathrm{d}\mathrm{u}\mathrm{a}\mathrm{l}$$
Prediction step: In the prediction step, the decomposed IMF components and residual described in Eq. (6) are predicted. For that, two datadriven models (GMDHNN and its revised form RGMDHNN) and one traditional statistical model (ARIMA) are used (Nazir et al., 2019). A detailed description of two datadriven and one stochastic models is described in Box (1970) and Ahmadi, Mottaghitalaband & NarimanZadeh (2007). A brief description of GMDHNN is described as follows:
The IMF prediction derived by using GMDHNN: Except little applications in rivers inflow modeling, GMDHNN is known for many benefits attached to a wide range of areas. In comparison with ANN and other datadriven models, GMDHNN, which is a submodel of ANN, has many advantages: First, GMDHNN has been proved to be a useful tool for modeling of a complex and nonlinear system which is constructed to improve the explicit polynomial model by selforganizing (Ahmadi, Mottaghitalaband & NarimanZadeh, 2007). Second, the GMDHNN is useful in pairwise relationship considerations between all possible selected lagged inputs. All pairs are entered in a neuron to construct output. Further, an evaluation measure is used for neuron selection. The process is continued until the last layer. In the final segment, the only single bestpredicted neuron is selected. The only drawback of using GMDHNN is that it considers the relationship of two inputs while ignores the individual effects of each point. The Architecture GMDHNN (RGMDHNN), which is an improved form of GMDHNN, can be used to overcome this drawback problem, which can utilize twoinput relation as well as their individual effects. While the remaining procedure of RGMDHNN and GMDHNN is similar. The coefficients of all neurons are estimated with regularized least square estimation method as this method is robust with multicollinearity, which is the characteristic of time series data.
Copula theory and construction of CVine Copula
The Copula defined as ndimensional multivariate distribution function on a unit cube with uniform marginals. An extensive review on Copula includes (Czado et al., 2009; Lee & Salas, 2011; Aghakouchak, 2014; Bedford, Daneshkhah & Wilson, 2016; Bevacqua et al., 2017; Zhao et al., 2017; Yu, Zhang & Singh, 2018). It leads to a suggestion that several researchers agree on using Copula to model the nonlinear dependance in applications of finance, hydrology, and climatology. Sklar (1959) proposed a copula by establishing a link between marginal and multivariate distribution as let $F$ be the ndimensional distribution function and $F({x}_{i}$) be the marginal distribution of $X={\left[{X}_{1},{X}_{2},\dots .{X}_{n}\right]}^{T}$, then there exists a copula which defined as: (7) $$F\left({x}_{1},{x}_{2},\dots .{x}_{n}\right)=C\left(F\left({x}_{1}\right),F\left({x}_{2}\right),\dots F\left({x}_{n}\right)\right)$$ (8) $$c(F\left({x}_{1}\right),F\left({x}_{2}\right),.,F\left({x}_{n}\right)={\displaystyle \frac{d\left(C\left(F\left({x}_{1}\right),F\left({x}_{2}\right),\dots F\left({x}_{n}\right)\right)\right)}{dF\left({x}_{1}\right),F\left({x}_{2}\right),\dots F\left({x}_{n}\right)}}$$where $C$ is a cumulative multivariate distribution function, and $c$ is its density. The readers are advised to look (Nelsen, 2007) for the detailed study of copula theory. A wide variety of symmetrical, that is, Archimedean, elliptical, and asymmetrical copulas are introduced in the literature (Song & Kang, 2011; Almeida, Czado & Manner, 2016; Bevacqua et al., 2017; Wang et al., 2018; Yu, Zhang & Singh, 2018). The most widely used copula functions and their parameters are presented in Tables 1 and 2. Although the Copula is recognized as a powerful tool, it suffers a lack of flexibility when modeling the high dimensional data where complex dependencies exist among the different pairs of variables. Recently, this drawback is covered with the PCCs, also called vines. Specifically, the PCCs are a product of decomposition of bivariate Copula and conditional bivariate copula densities, where all selected bivariate copulas are chosen according to the requirement of their dependance structure (Aas et al., 2009). The initial work on PCCs is found in Kurowicka & Cooke (2006), and later its detail is provided in Yang et al. (2015). Further, it was extended by Bedford, Daneshkhah & Wilson (2016). They explored the case of Gaussian pair copula and called it as RVine. They demonstrated that the accurate specification of PCCs makes the multivariate distribution more useful, which is near to reality. The structure of the vine is comprised of connected trees. Different arrangements of vines are available, which is employed according to the requirement as RVine, CVine and DVine. The schematic view of CVine and DVine is presented in Fig. 2. Here in our study, the CVine copula structure is used to model the joint dependance structure of multisite rivers. The fourdimensional CVine copula density is expressed as follows (Allen, McAleer & Singh, 2017) for four variables: (9) $$f\left({x}_{1},..,{x}_{4}\right)=\prod _{i=1}^{4}f\left({x}_{i}\right)\prod _{k=1}^{3}\prod _{l=1}^{4k}{c}_{k,k+l1,..,k1}\left(F\left({x}_{k}{x}_{1},\dots ,{x}_{k1}\right),F\left({x}_{k+l}{x}_{1},\dots ,{x}_{k1}\right)\right),F\left({x}_{1},\dots ,{x}_{k1}\right))$$where index $k$ identifies trees and $i$ defines edges of all trees, ${c}_{k,k+11,..,k1}$ works according to its subscript and $F\left({x}_{}v\right)$ for $m$ dimensional vector $v$ presents the conditional distribution function, which is calculated as (Allen, McAleer & Singh, 2017): (10) $$F\left(xv\right)={\displaystyle \frac{d({c}_{xv{v}_{j}}\left(F\left(x{v}_{j}\right),F\left({v}_{j}{v}_{j}\right)\right)}{dF\left({v}_{j}{v}_{j}\right)}}$$where ${v}_{j}$ is an arbitrary component of $v$ and ${v}_{j}$ denotes the $\left(m1\right)$ dimensionl vector excluding ${v}_{j}$.
Copula  $\phi \left(\mathbf{u}\right)$  $C\left(u,v\right)$  Kendall’s τ  Parameter Range 

Clayton  $\frac{{t}^{\theta}1}{\theta}$  $C{\left({u}^{\theta}+{v}^{\theta}1\right)}^{1/\theta}$  $\frac{\theta}{\theta +2}$  $\theta >\left(0,\mathrm{\infty}\right)$ 
Gumbel  ${\left(lnt\right)}^{\theta}$  $\mathrm{e}\mathrm{x}\mathrm{p}({\left[ln{u}^{\theta}ln{v}^{\theta}\right]}^{{\displaystyle \frac{1}{\theta}}}$  $\frac{\theta 1}{\theta}$  $\theta >1$ 
Frank  ln$\left({\displaystyle \frac{{e}^{\theta t}1}{{e}^{\theta}1}}\right)$  ${\displaystyle \frac{1}{\theta}\left(1+{\displaystyle \frac{({e}^{\theta u}1)\left({e}^{\theta v}1\right)}{{e}^{\theta}1}}\right)}$  $1{\displaystyle \frac{4}{\theta}\left[1{D}_{1}{\left(\theta \right)}^{\ast}\right]}$  $\theta $> (−∞, ∞) 
Gaussian  ${\varphi}_{\rho}\left({\varphi}_{{u}_{1}}^{1},{\varphi}_{{u}_{2}}^{1}\right)$  $\frac{2}{\pi}\mathrm{a}\mathrm{r}\mathrm{c}\mathrm{s}\mathrm{i}\mathrm{n}\left(\rho \right)$  ρ > (−1, 1)  
Studentt  ${t}_{\rho ,\upsilon}\left({{t}_{\upsilon}}_{{u}_{1}}^{1},{{t}_{\upsilon}}_{{u}_{2}}^{1}\right)$  $\frac{2}{\pi}\mathrm{a}\mathrm{r}\mathrm{c}\mathrm{s}\mathrm{i}\mathrm{n}\left(\rho \right)$  ρ > (−1, 1), ν > 2 
Copula  $\phi \left(\mathbf{u}\right)$  τ  Parameter range 

(BB6)  ${\left(\mathrm{l}\mathrm{o}\mathrm{g}\left[1\phantom{\rule{thickmathspace}{0ex}}{\left(1t\right)}^{\theta}\right]\right)}^{\delta}$  $1+{\displaystyle \frac{4}{\delta \theta}{\int}_{0}^{1}(\mathrm{l}\mathrm{o}\mathrm{g}(1\phantom{\rule{thickmathspace}{0ex}}{\left(1t\right)}^{\theta})}$  $\theta \ge 1$ 
$\ast \left(1t\right)\left(1{\left(1t\right)}^{\theta}\right)))dt$  $\delta \ge 1$  
(BB7)  ${\left(1{\left(1t\right)}^{\theta}\right)}^{\delta}1$  $1+{\displaystyle \frac{4}{\theta \delta}{\oint}_{0}^{1}\begin{array}{c}({\left(1{\left(1t\right)}^{\theta}\right)}^{\delta +1}\\ \ast {\displaystyle \frac{{\left(1{\left(1t\right)}^{\theta}\right)}^{\delta}1}{{\left(1t\right)}^{\theta 1}}dt}\end{array}}$  $\theta \ge 1$ 
$\delta \ge 1$  
(BB8)  $\mathrm{l}\mathrm{o}\mathrm{g}\left[{\displaystyle \frac{1{\left(1\delta t\right)}^{\theta}}{1{\left(1\delta \right)}^{\theta}}}\right]$  $1+\frac{4}{\theta \delta}\underset{0}{\overset{1}{{\displaystyle \oint}}}\left(\mathrm{log}\left[{(1\delta t)}^{\theta 1}{(1\delta )}^{\theta}1\right](1t\delta )(1{(1t\delta )}^{\theta})\right)dt$  $\theta \ge 1$ 
$\delta >0$ 
Choosing pair copula families and estimation of parameters
There are several paircopula families, that is, Frank, Gumbel, Clayton, Gaussian and t as listed in Tables 1 and 2. The copula pair is typically chosen in each tree one by one according to different model selection criteria like Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and goodnessoffit test criteria. Aas et al. (2009) used AIC and BIC for the selection of bivariate pair of Copula for vine structure. However, care should be taken in selecting the bivariate copulas as the selection of a copula families in the tree for vine structure depends on the choice on the introductory level copulas due to the connected trees. In this study, the AIC, BIC (Mirbagherijam et al., 2015), and maximum loglikelihood methods are used as the most reliable selection criterion to select the possible pair of copulas and its conditionals. For the parameter estimation, the maximum likelihood method is employed for each couple of copulas and its conditional.
Marginal distribution
To proceed with Copula, the standard uniform distribution is required, which is the inverse transformation of the marginal distribution. According to previous studies, both parametric and nonparametric distributions are used Chen & Guo (2019) to get appropriate marginal distribution. To fit the firststage residuals resulted from CEEMDANRMM, Empirical, normal, and t distribution functions are used. The detailed description of normal, t and empirical distribution function is given in Chen & Guo (2019). The parameters of the marginal distribution are estimated through the maximum likelihood method and to verify the best fit distribution, Kolmogorov–Smirnov (K–S) test is used.
Simulation from CVine based conditional distribution
The simulation algorithm of CVine is defined as follows:
First four independent standard uniform random variables $\left({t}_{1},{t}_{2},{t}_{3},{t}_{4}\right)$ are generated. Then these values are used as probability levels to determine (${u}_{1},{u}_{2},{u}_{3},{u}_{4})$ through the following equations: (11) $${u}_{1}={t}_{1}$$ (12) $${u}_{2}=F({t}_{2}{u}_{1})$$ (13) $${u}_{3}=F({t}_{3}{u}_{1},{u}_{2})$$ (14) $${u}_{4}=F({u}_{4}{u}_{1},{u}_{2},{u}_{3})$$where the conditional distributional function is calculated through Eq. (10). Once ${u}_{i}$’s where i = 1, 2, 3, 4 are simulated, the corresponding ${x}_{i}$’s are calculated from the inverse normal CDFs.
Comparison of proposed model with benchmark models
The proposed model is compared with three benchmark models described as following:

First, the Vector Autoregressive (VAR) model of Ledolter (1978) is considered where the dependance structure of multisite rivers inflow is modeled directly without assuming their multiscale and joint probabilistic structure.

Second, the copulabased ARMA model of Singh et al. (2018) is selected for the comparison of the proposed model. The copulabased ARMA model only considers the joint dependance structure of multisite rivers inflow without assuming the multiscale characteristics into account.

Third, CEEMDANRMM, work of Nazir et al. (2019) is considered where only multiscale characteristics are modeled without considering the joint dependance structure among multisite river inflow.
Evaluation criteria
The prediction accuracy of the proposed model is evaluated using four evaluation measures such as MAD, Mean Absolute Relative Error (MARE), NashSutcliffe Efficiency (NSE) and Mean Square Error (MSE) are used to observe the prediction performance. The following are their Eq.’s from (15–18), respectively: (15) $$\mathrm{M}\mathrm{A}\mathrm{D}={\displaystyle \frac{1}{N}\sum _{t=1}^{N}\leftx\left(t\right)\hat{x}\left(t\right)\right}$$ (16) $$\mathrm{M}\mathrm{A}\mathrm{R}\mathrm{E}={\displaystyle \frac{100\mathrm{\%}}{N}{\sum}_{t=1}^{N}\left{\displaystyle \frac{x\left(t\right)\hat{x}\left(t\right)}{x\left(t\right)}}\right}$$ (17) $$\mathrm{N}\mathrm{S}\mathrm{E}=1{\displaystyle \frac{{\sum}_{t=1}^{N}{\left(\hat{x}\left(t\right)x\left(t\right)\right)}^{2}}{{\sum}_{t=1}^{N}{\left(x\left(t\right)\underset{\_}{x}\left(t\right)\right)}^{2}}}$$and (18) $$\mathrm{M}\mathrm{S}\mathrm{E}={\displaystyle \frac{1}{N}{\sum}_{t=1}^{N}{\left(x\left(t\right)\hat{x}\left(t\right)\right)}^{2}}$$
Where $x\left(t\right)$ is the original series of data and $\hat{x}\left(t\right)$ is the predicted series of data.
Case Study and Experimental Design
Being the most substantial irrigation, insensible water resource system, and source of power generation in Pakistan, Indus River Basin (IRB) is considered for the application of the proposed methodology. The timeseries data for the four rivers (River Indus, the River Jhelum, the River Chenab and the River Kabul) contributing significantly to the water system of IRB is utilized to validate the proposed methodology. These streams are generating severe flooding due to melting snow or glacier, and torrential monsoon precipitation in Pakistan. The 13% mountainous regions of the Upper Indus Basin (UIB) cover 13,680 km^{2} area of the glacier in Pakistan, which is significantly contributing to the IRB system. Pakistan suffered floods, almost one wave every three years from 1950 to 2011. Ali (2013) reported that $19 billion economic losses have occurred, total 109, 822 villages have damaged, and a total of 8,887 people have died due to floods in Pakistan. The only flood that occurred in 2010 caused $10 billion, the highest financial loss of Pakistan. Therefore, to assess the proposed model, it is appropriate to use rivers data of IRB as an illustrative case study.
The daily river inflow data set used in this study is comprised on 1st January 2015 to 19 November 2018. For the application of the proposed objective, the daily inflow of the Indus River at Tarbela with its two principal left and one right bank tributaries: Jhelum River at Mangla, Chenab River at Marala and Kabul River at Nowshera respectively are selected. The schematic view of rivers chosen is presented in Fig. 3. The daily inflow data is measured in 1000 CUSECS, which was acquired from the site of Pakistan Water and Power Development Authority (WAPDA).
Results
The firststage proposed model results: the first stage of the proposed strategy is comprised of modeling the multiscale features of each river inflow data to proceed with the multisite joint dependance. The proposed approach is applied to the daily inflow data set of four rivers, that is, Indus, Kabul, Jhelum and Chenab. The river inflow data for all case studies is first decomposed into multiscale IMFs through CEEMDAN. All four river’s inflow data is decomposed into nine IMFs and one residual (Nazir et al., 2019) as shown in Figs. 4 and 5 for Indus river inflow and Jhelum river inflow respectively. The white noise amplitude is set as 0.2 Di, Yang & Wang (2014), and a maximum number of ensemble members are selected as 1000. The dimension of extracted CEEMDAN based nine IMFs is further reduced (CEEMDANR) to save time and labor of modeling each IMF individually. To obtain CEEMDANR, except for the first two IMFs and the last residual, the remaining IMFs showing the same high and low multiscale components are added with each other, respectively. The first two IMFs are predicted alone as they both shown complex and the highest frequency, as depicted in Figs. 4 and 5, for Indus and Jhelum river, inflow data respectively. From the remaining seven IMFs, the first four IMFs and last three IMFs are added separately with each other as they showed the same high and low multiscale components. To predict the IMF1, IMF2, added high and low multiscale elements and residuals, GMDH, RGMDH and ARIMA models are applied, and the best one method with minimum MDE, MARE, and MSE is selected among all prediction methods. For prediction purpose, data is divided into 80% for training and while the second is 20% for testing. The training results of the first stage proposed model CEEMDANRMM for all case studies are presented in Table 3. The residuals from this stage model, CEEMDANRMM, are used as inputs in the secondstage model to get improved and final multisite rivers inflow prediction.
Rivers inflow  Model  MAD  NSE  MARE  MSE 

Indus inflow  VAR  4.9069  0.9899  0.0715  77.9964 
Jhelum inflow  3.7185  0.9042  0.0715  50.5069  
Chenab inflow  2.7334  0.9885  0.0976  28.4530  
Kabul inflow  4.6269  0.8849  0.1944  99.9116  
Indus inflow  ARIMACOP  4.3562  0.9915  0.0744  64.4888 
Jhelum inflow  3.6253  0.9158  0.1358  46.7833  
Chenab inflow  2.6468  0.9608  0.1043  24.9336  
Kabul inflow  4.7105  0.8844  0.1429  100.271  
Indus inflow  CEEMDANRMM  2.2145  0.9989  0.0652  8.0779 
Kabul inflow  1.2822  0.9967  0.0730  2.8825  
Chenab inflow  0.8694  0.9980  0.0576  1.2689  
Jhelum inflow  0.8664  0.9978  0.0454  1.2971  
Indus inflow  CVine based CEEMDANRMM  2.1771  0.9990  0.0770  7.8407 
Kabul inflow  0.9195  0.9978  0.0687  1.3767  
Chenab inflow  1.2826  0.9966  0.1458  2.8982  
Jhelum inflow  0.8985  0.9976  0.0691  1.4090 
The secondstage proposed CVine based CEEMDANRMM model results: to model the mutual dependance of multisite river inflow data, the residuals from the first stage proposed model, CEEMDANRMM is used. The accuracy of the first stage model is evaluated through the estimated residuals generated from CEEMDANRMM by using modified Qstatistic Ljung & Box (1978) and the Lagrange multiplier Evans & Patterson (1985) tests, which are used to check the serial correlation of estimated residuals. Both modified Qstatistics and Lagrange multiplier revealed that there is no autocorrelation in the estimated residuals for all case studies at the 5% level. The independent residuals from the first stage CEEMDANRMM model for the Indus and Jhelum rivers inflow are shown in Fig. 6. Before applying the CVine method, the correlation of residuals of multisite river inflow data needs to be estimated. For that reason, Kendall’s rank correlation measure is used. The estimated values of Kendall’s rank correlation are given in Table 4. From Table 4, it is depicted that there exists a correlation between all pairs of multisite river inflow data except for Jhelum and Chenab rivers inflow. Further, to proceed with CVine Copula, cumulative distribution functions are fitted using empirical, normal, and tdistribution functions. The appropriate distribution is verified according to the pvalue of the K–S test. The null hypothesis of the KS test is that the residuals follow a specified distribution here, we set normal and t distribution. The pvalue 0.200 against normal distribution, confirms that all residuals are determined appropriate because it is more significant than others distribution see Fig. 7. The advantage of the proposed firststage model CEEMDANRMM has also confirmed with the normal distribution that the errors confirm the assumption of IID as it can be seen from Fig. 7. To increase the prediction precision of river inflow data, this joint dependance structure of multisite rivers inflow is incorporated with CVine Copula. Different bivariate copula functions, as listed in Tables 1 and 2, are fitted to make the building block between pairs of rivers simultaneously of CVine Copula. The most appropriate fitted bivariate and conditional bivariate copula functions are selected based on the maximum loglikelihood, lower value of AIC and BIC. The selected CVine conditional structure with AIC and BIC values for the Indus and the Kabul is shown in Fig. 8, whereas for the Chenab and Jhelum it is shown in Fig. 9. For Jhelum and Chenab, it is cleared that the correlation is low for all pairs, as depicted in Fig. 9. The simulation for each river’s inflow data is done using Eqs. (11)–(13). Finally, the simulated values are added in the predicted values of the first stage (the CEEMDANRMM model) to get final values for all four case studies. The overall performance of the proposed model, CVine based CEEMDANRMM, are compared with benchmark models (VAR by Ledolter (1978), Copulabased ARMA Singh et al., 2018, and the firststage CEEMDANRMM (Nazir et al., 2019)) for all four case studies. The results of proposed and benchmark models are given in Table 3. The predicted river inflow data for Indus and Jhelum river inflow for proposed and benchmark models is depicted in Fig. 10 and for Kabul and Chenab is depicted in Fig. 11. From Table 3, it can be observed that the proposed twostage CVine based CEEMDANRMM model outperforms the other benchmark models for Indus and Kabul river inflow as shown in Table 3 with bold measure values. On the other hand, the rivers as Jhelum and Chenab showing insignificant relationships are better modeled only through the firststage proposed model (CEEMDANRMM) as compared to the twostage proposed model and other benchmark models as indicated in Table 3.
Rivers  Indus  Jhelum  Chenab  Kabul 

Indus  1.0000  0.4410  0.5441  0.7468 
Jhelum  1.0000  0.6794  0.5844  
Chenab  1.0000  0.6207  
Kabul  1.0000 
Discussion
The following discussion is inferred based on the training error presented in Table 3:
Overall comparison of the first stage and second stage proposed model: the overall performance of the proposed twostage CVine based CEEMDANRMM model shows prediction improvement on all other three methods listed in Table 3 MAD, NSE, MARE and MSE for the Indus and Kabul rivers inflow. However, for the Jhelum and the Chenab rivers inflow as they did not provide any significant correlation among pairs of variables as depicted in Fig. 9, only the firststage model (CEEMDANRMM; Nazir et al., 2019), provides satisfactory results for Jhelum and Chenab rivers inflow than all other existing work of Ledolter (1978) and Singh et al. (2018) and twostage novel CVine based CEEMDANRMM model. It can be observed from our results that by utilizing important information that is present in data, one can enhance the quality of complex hydrological time series data.
Comparison of existing benchmark models: for all four case studies, it can be observed from Table 4 that VAR performs poorly as compared with CEEMDANRMM and copulabased ARIMA model as it does not consider the multiscale characteristics of timevarying and nonlinear data. Moreover, by combining Copula with ARMA, the prediction performance of multisite rivers inflow is increased over a simple VAR model, as shown in Table 3 for all four case studies.
Overall it is concluded that for the significant correlation among rivers, our proposed CVine based CEEMDANRMM and for the nonsignificant association between rivers, our firststage proposed model CEEMDANRMM performs well over the works of Ledolter (1978) and Singh et al. (2018). It is concluded that the performance of multisite river inflow data can enhance by providing the maximum information which exists between complex multivariate time series data.
Conclusion
Prediction of multisite river inflow has become a hot topic for hydrological researchers today. In this study, the IRB of Pakistan has been selected for predicting the multisite river inflow by using its four main rivers: Indus, Kabul, Jhelum and Chenab. A novel CVine based CEEMDANRMM model is proposed to predict such multisite rivers inflow that considered its complex dependance structure. We found that the accuracy of prediction can be improved by appropriately modeling the dependance structure of the multisite river inflow data.
Further recommendations
In this paper, we proposed a CVine based CEEMDANRMM method to model multisite river inflow data, which proved fruitful over simple singlesite river inflow modeling by utilizing the dependance structure which exists between rivers. However, it is also seen that when there is no dependance between river inflow data, only the CEEMDANRMM model provides efficient results. Overall, these conclusions are applied to the river system studied and may be used for river systems with similar flow characteristics.