An improved framework to predict river flow time series data
 Published
 Accepted
 Received
 Academic Editor
 Matthew Wilson
 Subject Areas
 Statistics, Computational Science, Natural Resource Management
 Keywords
 Empirical Mode Decomposition (EMD), Wavelet Analysis (WA), Machine Learning (ML), Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and Empirical Bayes Threshold (EBT), Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and Empirical Bayes Threshold (EBT).
 Copyright
 © 2019 Nazir et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 2019. An improved framework to predict river flow time series data. PeerJ 7:e7183 https://doi.org/10.7717/peerj.7183
Abstract
Due to nonstationary and noise characteristics of river flow time series data, some preprocessing methods are adopted to address the multiscale and noise complexity. In this paper, we proposed an improved framework comprising Complete Ensemble Empirical Mode Decomposition with Adaptive NoiseEmpirical Bayesian Threshold (CEEMDANEBT). The CEEMDANEBT is employed to decompose nonstationary river flow time series data into Intrinsic Mode Functions (IMFs). The derived IMFs are divided into two parts; noisedominant IMFs and noisefree IMFs. Firstly, the noisedominant IMFs are denoised using empirical Bayesian threshold to integrate the noises and sparsities of IMFs. Secondly, the denoised IMF’s and noise free IMF’s are further used as inputs in datadriven and simple stochastic models respectively to predict the river flow time series data. Finally, the predicted IMF’s are aggregated to get the final prediction. The proposed framework is illustrated by using four rivers of the Indus Basin System. The prediction performance is compared with Mean Square Error, Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). Our proposed method, CEEMDANEBTMM, produced the smallest MAPE for all four case studies as compared with other methods. This suggests that our proposed hybrid model can be used as an efficient tool for providing the reliable prediction of nonstationary and noisy time series data to policymakers such as for planning power generation and water resource management.
Introduction
The economic development of any country is directly linked to the proper management of their water resources operations that can minimize the effects of various natural disasters such as floods and droughts. Therefore, river flow time series prediction is an imperative task, which plays a significant role for effective and appropriate water resource planning and management, early flood warning, irrigation, and hydropower generation (Yaseen et al., 2018a; Yaseen et al., 2018b). Several algorithms have been used for prediction and estimation for river flow time series data (Aichouri et al., 2015; Shathir & Saleh, 2016; AlMasudi, 2018; Gjika, Aurora & Arbesa, 2019). The BoxJenkins methodology technique (Box & Jenkins, 1970) is commonly used in the literature (Shathir & Saleh, 2016) because it can be used in a wide class of models i.e., Autoregressive (AR), Moving Averages (MA), Autoregressive Integrated Moving Averages Model (ARIMA), Seasonal ARIMA. But the problem with such statistical models is that they only considered the stationary and linear behavior of the process (Di, Yang & Wang, 2014). However, river flow time series data is nonlinear, nonstationary, multiscale and noisecorrupted (Di, Yang & Wang, 2014) as stochastic nature of several factors (e.g., rainfall, evaporation, and temperature). This complex nonstationary, multiscale and noisecorrupted characteristics make the prediction a challenging task. During the past decades, this issue has been addressed by developing some datadriven models i.e., Artificial Neural Network (ANN), model tree, Support Vector Machine (SVM), Adaptive InferenceBased Neural Network (AIBNN), Extreme Learning Machine (ELM) (Ali et al., 2018). Yaseen et al. (2018a) and Yaseen et al. (2018b) demonstrated the viability of ELM and enhanced version of ELM method to forecast river flow data as compared to other statistical models. However, the datadriven models ignore the timevarying and noise characteristics of river flow processes which deprives the researcher to efficiently predict such data. To address the drawback of such datadriven models, several hybrid models are introduced to extract the timevarying information and reduce noises which ultimately increase the prediction accuracy of datadriven models (Toth, Brath & Montanari, 2000; Di, Yang & Wang, 2014; Su et al., 2016; Kang et al., 2017; Hadi & Tombul, 2018; Yaseen et al., 2018a; Yaseen et al., 2018b). The hybrid model uses data preprocessing methods such as Singular Spectrum Analysis (SSA), Wavelet Analysis (WA), Empirical Mode Decomposition (EMD) and EmpiricalEMD (EEMD) with datadriven models also called intelligence models. An advantage of such data decomposition methods is that they used not only for decomposing the data into timefrequency components but also used to separate noises from data. Wu & Chau (2011) coupled data preprocessing techniques i.e., SSA with ANN to accurately model the rainfall and river flow data. Azadeh et al. (2011) demonstrated the ability to use the data preprocessing method to enhance the precision of datadriven models. They used various data processing techniques and reported that the processed nonlinear data is efficiently forecasted with simple statistical and intelligent models. Gjika, Aurora & Arbesa (2019) found that proper consideration of the volatility nature of nonlinear data through data preprocessing methods could improve the prediction quality. Santos et al. (2019) proposed a model by coupling WA and ANN techniques. They used WA to transform the daily flow time series data to enhance the precision of the ANN model. The data preprocessing methods used to decompose the nonstationary and nonlinear data into physical modes of timefrequency components (Han & Liu, 2009; Azadeh et al., 2011; Wang et al., 2015; Peng et al., 2017). The derived timefrequency components, also called multiscale components (Nazir et al., 2019), are predicted through a mixture of datadriven models accurately (Azadeh et al., 2011). Further, the timefrequency components, are filtered out by using appropriate thresholds. The reason for using a filter is to preserve significant features of original time series data while removing noises or sparsity from multiscale components. A growing body of research on denoising found that the extracted high and low multiscale components in different fields can be found by using linear and nonlinear thresholds. Moreover, many other methods of thresholds including the Stein Unbiased Risk Estimator (SURE) (Candes, SingLong & Trzasko, 2012; Hansen, 2017), the fixed and soft threshold (Di, Yang & Wang, 2014; Nazir et al., 2019), and minimax algorithms (Hansen, 2017) are also used to remove noises and to preserve important information from complex data. Peng et al. (2017) developed a novel hybrid model by employing an empirical wavelet transform estimator to remove the redundant noises from river flow data. Further, the denoised data are predicted through Particular Swarm Optimization basedANN model (PSOANN). They demonstrated the efficiency of their proposed model over simple PSOANN model without denoising. Di, Yang & Wang (2014) proposed a hybrid model by considering two data preprocessing methods i.e., EMD and WA with a soft and hard threshold to find the denoised timevarying information to decrease the complexity of hydrological series. Holzfuss & Kadtke (1993) suggested that the noise reduction methods coupled with the radial basis functions may enhance the quality of traditional statistical and datadriven models.
However, WA, EMD, and EEMDbased noise removal techniques have their own drawbacks in extracting the optimal multiscale components. WAbased hard and soft threshold first comprised of the choice of mother wavelet, which is subjectively selected among many wavelet basis functions. The subjective selection of mother wavelet may also cause errors which decrease the performance of hard and SURE thresholds. Moreover, the EMD is a purely datadriven technique, used for preprocessing data, which is effected through its own mathematical property of mode mixing problem resulting in spurious timefrequency information. To overcome the drawback of EMD, the EEMD is introduced which is an improved version of EMD to solve the mode mixing problem. The EEMD added Gaussian white noise successively to solve noiseassisted (Hadi & Tombul, 2018; Yaseen et al., 2018a; Yaseen et al., 2018b). Although an improved EEMD has proved useful for denoising hydrological time series data (Jiao, Guo & Ding, 2016), it also has some drawbacks that may deprive the researchers in extracting accurate multiscale components i.e., Intrinsic Mode Function (IMF) by simple averaging them. However, to cope with the simple averaging problem of EEMD, Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) have been proposed by Torres et al. (2011). The application of CEEMDAN is successfully applied to derive the timescale components of hydrological time series data (Antico, Schlotthauer & Torres, 2014). Johnstone & Silverman (2005) have considered an empirical Bayesian approach to threshold the multiscale components derived from WA. They argued that the empirical Bayesian method efficiently models the sparsity and noises of complex data by considering multiple priors for each level. One would hope such methods which possibly estimate thresholds that stably reflect the noises from sparse multiscale to enhance the accuracy and reliability of prediction performance of complex river flow data.
In this study, we aimed to improve denoising stage of the hybrid model by the novel framework i.e., CEEMDANEmpirical Bayesian Threshold estimator (EBT) estimator to optimally reduce noises from IMFs which are further used as inputs to get a precise prediction of river flow data which plays a decisive role in the accurate prediction. The principal motivation of choosing EBT is that it is purely a databased method which deals different level of noises efficiently because of some high multiscale components derived from CEEMDAN relatively sparse than the lower timescale components.
Proposed Method
The proposed method is comprised of four stages such as decomposed, denoised using novel threshold, prediction, and aggregation. The CEEMDAN method is used as a decomposition tool to handle the nonlinear and nonstationary data by extracting IMFs. The extracted IMFs are further divided into two parts; one part is comprised of noisiest IMFs which contains errors and sparsity and the second part is noise free IMFs. The noisiest IMF’s coupled with novel Empirical Bayesian Threshold (EBT) estimator to get a ride from noises and sparsity to denoise IMF’s. Then the denoised IMF’s are predicted through the complex datadriven model and noise free IMF’s are predicted through a simple stochastic model. Finally, all the predicted IMF’s are aggregated to get the final prediction. The proposed framework of deriving multiscale IMF’s through CEEMDAN coupled with optimal denoising method i.e., EBT plays a vital role in the accurate prediction of river flow data. To our convenience, the proposed strategy is labialized as; CEEMDAN (decomposed), EBT (denoised) and Multi Models MM (datadriven and stochastic model) i.e., CEEMDANEBTMM and the proposed scheme is illustrated in Fig. 1.
Decomposition stage
EMD: To decompose nonlinear and nonstationary data, the EMD method introduced by Huang et al. (1998) which decomposed data into IMF’s by satisfying two conditions as follows: (a) The number of zero crossings and extreme, from complete data, must be equal or differ to one at most; (b) At all levels, mean value of envelope must be zero.
The complete EMD procedure is defined as:

Find all local maxima and minima from data $y\left(t\right)$, ( t = 1, 2, .., N). Use cubic splines interpolation to find an upper envelope from maxima e_{max(t)} and lower envelope from minima e_{min(t)}.

Calculate the average of upper and lower envelope $m\left(t\right)=\left({e}_{max\left(t\right)}+{e}_{min\left(t\right)}\right)\u22152$. Take the deviations between original time series data and calculated envelope mean as:

(1)$g\left(t\right)=y\left(t\right)m\left(t\right)$

Match the requirements of $g\left(t\right)$which is defined in (a) and (b) as an IMF, if the conditions are satisfied then mark this g(t) as ith IMF.

In the next step replace original time series $y\left(t\right)$ by $r\left(t\right)=y\left(t\right)g\left(t\right)$, if g(t) does not meet the (a) and (b) then just replace $y\left(t\right)$ with $g\left(t\right)$.

The process of (1–4) is repeated until no IMF is being extracted from the residue $r\left(t\right).$
In the end, original time series data will be written as the summation of all extracted IMF’s and residue as: (2)$y\left(t\right)=\sum _{i}^{l}{g}_{i}\left(t\right)+r\left(t\right)$
where ${g}_{i}\left(t\right)$ is the ith IMF and r(t) is the trend of the signal. However, EMD techniques have two drawbacks. The one is the endpoint problem that is the extreme values of two ends of the series can’t be determined properly which distort the IMF’s and the other is mode mixing aliasing in which same IMF contains even more than two frequencies.
EEMD: To resolve the mode mixing problem of EMD, (Wu & Huang, 2004) proposed EEMD. The EEMD decompose the nonlinear signals into IMF’s as follows:

Add a white Gaussian noise series to the original data set as follows:

(3)${y}_{2}\left(t\right)=y\left(t\right)+n\left(t\right)$

Decompose the new ${y}_{2}\left(t\right)$ with EMD and obtains the IMFs.

Repeat step (a) and (b) mth time i.e., (j = 1,2,…,m) with different white noises to get IMF’s from new series.

Find the ensemble means of all IMFs obtained as mthensemble time as follows: where k = 1, 2, .., K is kth IMF.

(4)${\overline{imf}}_{k}=\sum _{j=1}^{m}IM{F}_{jk}^{m}\u2215m$
where k = 1, 2, .., K is kth IMF.
CEEMDAN: Although the mode mixing problem is alleviated with EEMD, taking the simple averages of IMF’s without considering the independent addition of white noises could not completely remove noises from IMF’s. To tackle the simple averaging problem of EEMD, the CEEMDAN function is introduced by Torres et al. (2011). Here in our study, CEEMDAN is used to decompose the river flow data which is briefly described as follows:

The CEEMDAN add Gaussian noises like EEMD in signals as follows:

(5)${y}_{3}\left(t\right)=y\left(t\right)+{w}_{0}{n}^{j}\left(t\right)$

where w_{0} is the amplitude of the added white noises and (j = 1,2,…,m). Find the first IMF using simple EEMD defined as:

(6)$I\tilde{M{F}_{1}}=\sum _{j=1}^{m}IM{F}_{j1}^{m}\u2215m$

Compute the deviation of original signals from the first IMF as:

(7)$r}_{1}\left(t\right)=y\left(t\right){\overline{imf}}_{1$

Decompose ${r}_{1}\left(t\right)+{w}_{0}{n}^{j}\left(t\right)$ to get first IMF and find the second IMF as: (8)$I\tilde{M{F}_{2}}=\sum _{j=1}^{m}IM{F}_{j1}^{m}\u2215m$

Repeat the (2–3) until stoppage criteria are met and the residual contains not more than two extremes. Finally, the residual is defined as:

(9)$R\left(t\right)=y\left(t\right)\sum _{k=1}^{K}I\tilde{M{F}_{k}}$
However, the selection of a number of ensemble and amplitude of white noise is still an open challenge but here in our study we used a number of ensemble members as 100 and standard deviation of white noise is settled as 0.2 according to Di, Yang & Wang (2014).
Identification of noisiest IMFs: After deriving the IMF’s, next step is to screen out the noise only IMF’s and noise free IMFs (Wei et al., 2016). Two types of IMF’s are derived from CEEMDAN/EEMD; the IMF’s contains high frequency which is corrupted with sparsity and noises (Wei et al., 2016) and the second part of IMF’s comprised on low frequencies which are free from noises (Wei et al., 2016). To get numerical validation, crosscorrelation is calculated between all extracted IMF’s and original river flow data. The low crosscorrelation implies that the highfrequency IMF’s are overwhelmed with noises. Then, the noise only IMF’s are further denoised through the appropriate threshold to get the important features from them and to get rid of noises.
Denoising stage
After decomposing the nonlinear and nonstationary data into IMF components, an appropriate estimator is chosen to remove noises and sparsities from extracted IMFs. The reason for selecting an appropriate estimator is to find an optimal value of threshold as the highest threshold value would lead to biases, whilst the lowest threshold value would increase the noise variance. The IMFs which are extracted are mostly empirical Bayesian threshold estimator is adapted to denoised the noisiest IMFs. Later, the existing soft and hard threshold and improved threshold functions are used for comparison purposes. Details of all estimators are given below:
CEEMDAN/EEMDbased Empirical Bayesian Threshold: To estimate the sparseness and noises from decomposed IMF’s (Wei et al., 2016), an Empirical Bayes Threshold (EBT) inspired from wavelet denoising (Johnstone & Silverman, 2005; Jansen & Bultheel, 2014) is used. For the successful application of EBT, first, all the data is scaled transformed to efficiently select the prior distribution for noises and sparsities. After the scaled transformation data follows $N\left({\theta}_{i},1\right)$. Then a mixture of priors for θ_{i} are considered as follows: (10)${f}_{prior}\left(\theta \right)=\left(1w\right){\delta}_{0}\left(\theta \right)+w\gamma \left(\theta \right)$
where ${\delta}_{0}\left(\theta \right)$ is a zero part of scaled data and $\gamma \left(\theta \right)$ is a density of nonzero part. The density of prior should be chosen in such a way that it must belong to a family of distributions whose tails decays at polynomial rates. The parameters and weights of a mixture of prior distributions are estimated through maximum likelihood approach. The reason for using this method to estimate unknowns is that it estimates weights and parameters to be proportional to the likelihood function evaluated at the estimators based on data (Hossain, Kozubowski & Podgórski, 2018).
Finally, the posterior median ${\stackrel{\u0303}{\theta}}_{i}\left(imf,w\right)$ is calculated from a mixture of prior distribution is given as follows: (11)$\tilde{{F}_{1}}\left(\mu imf\right)={\int}_{\mu}^{\infty}{f}_{1}\left(\mu imf\right)d\mu$
which is used as a threshold rule for $\stackrel{\u0303}{\mu}$ given data (Johnstone & Silverman, 2005). In general, an estimation rule comprised on η (imf,t) defined for all t > 0, is a thresholding rule if and only if for all t > 0, $\eta \left(imf,t\right)$ is an antisymmetric and increasing function of data and η (imf,t)=0 if and only if $\leftimf\right\le t$ where t is defined as a median value which is calculated through the Eq. (11).
Other traditional threshold estimators: To compare the EBT estimator with other nonlinear estimators to suppress the noises and sparsities from noisiest IMFs, Soft Threshold (ST), Hard Threshold (HT) and Improved Threshold Function (ITF) are used which are most widely used in literature (Jeng et al., 2007; Candes, SingLong & Trzasko, 2012; Jansen & Bultheel, 2014) listed as follows, respectively;
(12)$IM{F}_{t,k}^{\prime}=\left\{\begin{array}{c}IM{F}_{t,k}\phantom{\rule{100.00015pt}{0ex}}\leftIM{F}_{t,k}\right\ge {T}_{k}\phantom{\rule{10.00002pt}{0ex}}\hfill \\ 0\phantom{\rule{120.00018pt}{0ex}}\leftIM{F}_{t,k}\right<{T}_{k}\phantom{\rule{10.00002pt}{0ex}}\hfill \end{array}\right.$
(13)$IM{F}_{t,k}^{\prime}=\left\{\begin{array}{c}sgn\left(IM{F}_{t,k}\right)\left(\leftIM{F}_{t,k}\right{T}_{k}\right)\phantom{\rule{70.0001pt}{0ex}}\leftIM{F}_{t,k}\right\ge {T}_{k}\phantom{\rule{10.00002pt}{0ex}}\hfill \\ 0\phantom{\rule{170.00026pt}{0ex}}\leftIM{F}_{t,k}\right<{T}_{k}\phantom{\rule{10.00002pt}{0ex}}\hfill \end{array}\right.$
and
(14)$IM{F}_{t,k}^{\prime}=\left\{\begin{array}{c}sgn\left(IM{F}_{t,k}\right)\left(\frac{\leftIM{F}_{t,k}\right{T}_{k}}{exp\left\{3\mathrm{\alpha}\left(\frac{IM{F}_{t,k}{T}_{k}}{{T}_{k}}\right)\right\}}\right)\phantom{\rule{50.00008pt}{0ex}}\leftIM{F}_{t,k}\right\ge {T}_{k}\phantom{\rule{10.00002pt}{0ex}}\hfill \\ 0\phantom{\rule{180.00027pt}{0ex}}\leftIM{F}_{t,k}\right<{T}_{k}\phantom{\rule{10.00002pt}{0ex}}\hfill \end{array}\right.$
where, T_{k} is the threshold value calculated as ${T}_{k}=a\sqrt{2{E}_{k}ln\left(N\right)}$, where k = 1,2,…..,K and a is constant takes the values between 0.4 to 1.4 with a step of 0.1 and E_{k} = median(IMF_{t,k}, t = 1, 2, .., N)∕0.6745 is median deviation of kth IMF.
Prediction and aggregation
In the prediction stage, the decomposeddenoised IMFs of river flow data are predicted through some datadriven and statistical models. Specifically, the denoised IMFs are predicted through a datadriven model whenever noisefree IMFs and the residual are predicted through a simple stochastic model. To train the model, 70%, 80%, and 90% data are used, and the remaining 30%, 20% and 10% data are used to test the accuracy of the models. The selected models are briefly described as follows:
The denoisedIMF prediction with the neural network: A datadriven technique has been proved a powerful tool to model complex nonlinear data (CampisiPinto, Adamowski & Oron, 2012). The MultiLayer Perceptron (MLP) which is the most popular submodel of NN (Talaee, 2014; Ali et al., 2017) consists of three layers of nodes used here for the prediction of denoised IMFs of river flow data. The complete layout of MLP is given in Talaee (2014). The structure of MLP comprised of NN structure. Three nodes are included in MLP as an input layer, hidden layer, and an output layer. First, the single output is calculated by using linear combinations of inputs which are further transferred to some nonlinear activation functions mathematically defined as by (15)$y=\phi \left(\sum _{i}^{n}{w}_{i}{x}_{i}+b\right)$
where w_{i} are the weights of inputs, x_{i} are the inputs, b is the biased value for each layer and φ is a nonlinear activation function which supplies output to the next layer. The mostly used activation function i.e., logistic function is used as activation function defined as follows: (16)$\frac{1}{\left(1+{e}^{\sum _{i}^{n}{w}_{i}{x}_{i}+b}\right)}$
For the optimization of neurons, the supervised learning algorithms called backpropagation and forwardpropagation can be used.
The noisefree IMF prediction with ARIMA model: To predict the noisefree IMF’s and the residual, an autoregressive moving average model (ARMA) is selected which is described as follows: (17)$IM{F}_{t}^{k}={\alpha}_{1}IM{F}_{t1}^{k}+\dots +{\alpha}_{p}IM{F}_{tp}^{k}+{\epsilon}_{t}^{k}+{\beta}_{1}{\epsilon}_{t1}^{k}+\dots +{\beta}_{q}{\epsilon}_{tq}^{k}$
where, $IM{F}_{t}^{k}$ is the kth IMF, $IM{F}_{tp}^{k}$ is pth lag value of kth IMF ${\epsilon}_{t}^{k}$ is the residual of kth IMF, p and q are autoregressive and moving average lags. Moreover, in some cases time series data is not stationary. To make such stationary, differences at an appropriate degree are used (Box & Jenkins, 1970). If such a situation occurs, then the model is known as ARIMA (p, d, q) where d is the differenced value used to make the nonstationary data stationary.
Case study and experimental design
Selection of study area
In this research, the largest water system of Pakistan is considered for the application of the proposed strategy. The Indus Basin System (IBS) is the largest river of Pakistan and it plays an important role specifically in power generation and irrigation department. The major tributaries of IBS i.e., River Jhelum, River Chenab, and River Kabul are selected for the present study. Pakistan is facing a large amount of frequent river flooding each year due to monsoon rain and melting snow or glaciers. As in Pakistan, glaciercovered 13,680 km^{2} area which is estimated 13% of the mountainous areas of Upper Indus Basin (UIB). Melted water from these 13% areas adds the significant contribution of water in these rivers. which leads to complex characteristics in river flow data. Therefore, for sustainable economic development and efficient water resources planning, there is a need for analyzing such complex characteristics and predict the behaviors of river flow data at IBS and its tributaries.
Data
To investigate the improved framework, four daily river flow data comprised on (1stJanuary to 19thJune) for the period of 2015–2018 is used in this study. The main river flow of Indus at Tarbela is considered with its three principal tributaries: Jhelum at Mangla, Chenab at Marala and Kabul at Nowshera. Data is measured in 1,000 ft/s. The described river flow data was obtained from the website of Pakistan Water and Power Development Authority (WAPDA).
Evaluation criteria
Evaluation of noise reduction methods: The performances of denoised series needs comprehensive evaluation after using appropriate noise reduction threshold methods. In this research, to check the performances of CEEMDAN/EEMDST CEEMDAN/EEMDHT, CEEMDAN/EEMDITF, and proposed framework i.e., CEEMDAN/EEMDEBT, SignaltoNoise Ratio (SNR), Mean Square Error (MSE) and Mean Absolute Error (MAE) (Nazir et al., 2019) are employed which are given as follows respectively; (18)$SNR=10log10\left(\frac{\sum _{t=1}^{N}{\left({y}_{ot}\right)}^{2}}{\sum _{t=1}^{N}{\left({y}_{pt}{y}_{ot}\right)}^{2}}\right)$ (19)$MSE=\sqrt{\frac{\sum _{t=1}^{N}{\left({y}_{ot}{y}_{pt}\right)}^{2}}{N}}$
and (20)$MAE=\frac{\sum _{t=1}^{N}{y}_{ot}{y}_{pt}}{N}$
where y_{ot} is the tth observed value and y_{pt} is the tth predicted value. Moreover, the performance of proposed model (i.e., CEEMDANEBTMM) and all other models including CEEMDAN/EEMDSTMM, CEEMDAN/EEMDHTMM, CEEMDAN/EEMITFMM and EEMDITFMM are compared using three popular statistical measures: MSE, MAE defined in Eqs. (19)–(20) and Mean Absolute Percentage Error (MAPE) given as follows; (21)$MAPE=\frac{{y}_{mo}{y}_{mp}}{\left{y}_{mo}\right}\ast 100$
where, y_{ot} and y_{pt} is defined above, y_{mo} is the mean value of observed values, y_{mp} is the mean value of predicted values and y_{sp} is the standard deviation of predicted values.
Results
This section provides results of proposed CEEMDANEBTMM model and benchmark models i.e., CEEMDAN/EEMDSTMM, CEEMDAN/EEMDHTMM, CEEMDAN/EEMITFMM, and EEMDITFMM in steps as follows:
Results of decomposition stage: First, to confirm the nonstationarity of river flow data, Augmented Dickey Fuller (ADF) unit root test (Khalili et al., 2013) is used for all selected case studies. The test is applied to data by taking the log in order to confirm nonstationarity. Results of ADF showed that all selected river flow data i.e., Indus river flow, Jhelum river flow, Chenab river flow, and Kabul river flow is significantly nonstationary with pvalue as 0.3353, 0.4135, 0.333 and 0.414 respectively. Then, the nonlinear and nonstationary data decomposed into different time scale oscillation called IMFs to reduce the nonstationarity by extracting the timevarying characteristics of daily river flow data from all selected four stations. The CEEMDAN decomposition technique is used to extract IMF’s of river flow data. All selected river flow data is decomposed into thirteen IMFs and one residual. The starting IMF’s represents the highest frequencies whereas the last half IMF’s showed the low frequencies and the residual represent the overall trend. The decomposed results of Indus River system is depicted in Fig. 2. The amplitude of white noise is set 0.2 as in Di, Yang & Wang (2014) and numbers of ensemble members are settled as 1,000. By inspecting the IMFs, it is noticed that each IMF component represents the oscillation characteristic Further, the crosscorrelation method is employed to find the noisiest IMFs from all thirteen IMFs. To do this, first the decomposed IMFs are further divided into two parts to find the noisiest IMFs by using the crosscorrelation between IMFs and original river flow data. The low correlation implies the more uncertainty present in IMFs. The first ten IMFs showed the least correlation with original river flow data indicated that these IMFs are overwhelmed with noises. The graph of crosscorrelation between Indus river flow data and first IMF and between Indus river flow and eleventh IMF is depicted in left and right corner of Fig. 3, which shows that the first IMF is filled with noises with low correlation at all lags and eleventh IMF is free from noises as it showed 0.75 correlation not only at lag zero but also at other lag values. For all four rivers, first tenth IMFs are characterized as noisiest IMFs and last three IMFs are labeled as noise free IMFs.
Results of denoising stage: the next step is to denoise the noisiest and sparse IMFs.
To eliminate noises from IMFs, EBT estimator is used as a filter which assumes a mixture of prior as defined in Eq. (10), for each IMF separately by considering the nature of IMFs. First, the scaled transformation is applied to get normal distribution so that each IMF follows $N\left({\theta}_{i},1\right)$. According to the nature of IMFs as depicted in Fig. 4, it is known that most of the coefficients in all IMFs are zero and some are nonzero in which fewer coefficients are either very low or high. By looking (Fig. 4) both zero and nonzero part of IMFs, a mixture of an atom of probability at zero and multiple distributions are considered for nonzero part (Johnstone & Silverman, 2005). Among all of them, Laplace distribution is configured out as prior distribution of θ_{i} with maximum SNR. Finally, the important coefficients of IMFs are preserved with posterior median threshold estimator described in Eq. (11) by attaining highest SNR value with minimum MSE and MAE given in Table 1 for all selected case studies. For comparison purpose, the conventional denoising methods as ST, HT and ITF are implemented on all river flow data. We observed that SNR of CEEMDANEBT based decomposed and denoised method is larger than all other CEEMDAN/EEMDST, CEEMDAN/EEMDHT and CEEMDAN/EEMDITF based methods as they don’t consider the sparsity and magnitude of noises separately to remove noises from data. The reconstructed denoised graph of proposed CEEMDAN/EEMDEBT and bench mark models CEEMDAN/EEMDHT,CEEMDAN/EEMDST and CEEMDAN/EEMDITF for Indus and Jhelum are shown in Figs. 5/6 respectively. From Figs. 5 and 6, it is shown that CEEMDAN/EEMDST over estimated noises for Jhelum, Chenab and Kabul river flow and, the performance of CEEMDAN/EEMD ITF is worst for Indus river flow however, the proposed CEEMDANEBT based model shown optimal performance for all case studies.
River Inflow  Method  SNR  MSE  MAE 

Indus Inflow  CEEMDANHT  15.4996  0.9924  0.7905 
CEEMDANST  −18.4131  3.1607  1.3692  
CEEMDANITF  −15.7423  3.1645  1.3713  
CEEMDANEBT  22.0440  1.03563  0.8741  
EEMDHT  −19.32131  49.8554  5.6131  
EEMDST  −36.7393  36.1929  4.4431  
EEMDITF  −36.4253  36.1766  4.4442  
EEMDEBT  −14.9984  49.31019  5.6353  
Jhelum Inflow  CEEMDANHT  4.0185  0.6706  0.7905 
CEEMDANST  −21.0096  2.4233  1.1978  
CEEMDANITF  −20.4669  2.4234  1.1985  
CEEMDANEBT  20.7036  0.0012  0.0142  
EEMDHT  −12.7805  3.8168  1.5496  
EEMDST  −28.2531  5.7458  1.7912  
EEMDITF  −28.0296  5.7463  1.7927  
EEMDEBT  9.7364  3.2357  1.4390  
Chenab Inflow  CEEMDANHT  −4.8656  0.3252  0.4230 
CEEMDANST  −17.3340  3.1607  1.3692  
CEEMDANITF  −16.4194  1.1085  0.7870  
CEEMDANEBT  31.6470  0.0004  0.0165  
EEMDHT  −8.1820  3.8706  1.5152  
EEMDST  −29.2759  4.9703  1.6133  
EEMDITF  −29.0425  4.9693  1.6148  
EEMDEBT  −12.8752  3.5997  1.5085  
Kabul Inflow  CEEMDANHT  10.8756  0.4823  0.4986 
CEEMDANST  −22.289  2.2647  1.1272  
CEEMDANITF  −21.6878  2.2649  1.1280  
CEEMDANEBT  31.9856  0.0003  0.0144  
EEMDHT  −2.4276  5.4761  1.7961  
EEMDST  −32.0408  8.5625  2.0934  
EEMDITF  −31.8501  8.5601  2.0949  
EEMDEBT  −10.2153  5.1512  1.7999 
Results of prediction stage: The decomposed and denoised IMFs of all selected case studies are further predicted through datadriven and statistical model. The denoised IMFs are predicted through MLPneural network model. Training is performed by using forward and backpropagation by setting the learning rate parameter between 0.1 to 1. The backpropagation method with the optimal learning rate is selected to test the model. The second part of the decomposed IMFs comprised of noisefree IMFs and the residual, which are predicted through simple traditional statistical model (i.e., ARIMA (p, d, q)) for all case studies. The river flow data of all four rivers are splitted, 70%, 80% and 90% for the training set and 30%, 20% and 10% for testing set. The results achieved by such splitting criteria are not much deviated from each other so the only values of 80% training errors are given in Table 2. After a successful estimation of each IMF and residual, the accuracy is measured with MAE, MAPE, and MSE. The training results of proposed models with a comparison to all other models for all four river flow i.e., Indus flow, Jhelum flow, Chenab flow, and Kabul flow are presented in Table 2. The results of the proposed model i.e., CEEMDANEBTMM, demonstrate its effectiveness by showing minimum MAD, MAPE and RMSE values for Indus river flow. For Jhelum and Chenab, the proposed CEEMDANEBTMM model shown least MAE, and MAPE values and by attaining lower MAPE value for Kabul river flow comparative to all other methods. The predicted graph of proposed CEEMDANEBTMM model and EEMDEBTMM with their benchmark models for Indus and Jhelum river flow are shown in Figs. 7 and 8 respectively.
Rivers  Method  MAE  MAPE  MSE 

Indus River Inflow  CEEMDANHTMM  9.5938  0.1317  13.7910 
CEEMDANSTMM  8.7531  0.1413  17.2590  
CEEMDANITFMM  8.4614  0.1349  14.7238  
CEEMDANEBTMM  5.6837  0.1071  9.8704  
EEMDHTMM  16.7567  5.0868  18.7316  
EEMDSTMM  12.7899  0.1480  17.2683  
EEMDITFMM  12.7459  0.1648  17.7453  
EEMDEBTMM  17.4940  0.1338  15.4142  
Jhelum River Inflow  CEEMDANHTMM  6.9537  0.3502  0.0547 
CEEMDANSTMM  6.0630  0.0577  0.3912  
CEEMDANITFMM  6.0558  0.0407  0.1949  
CEEMDANEBTMM  6.0185  0.0400  0.2609  
EEMDHTMM  7.1364  0.1157  1.5763  
EEMDSTMM  8.8392  0.1227  1.7755  
EEMDITFMM  6.4300  0.1451  2.4853  
EEMDEBTMM  7.2961  0.0414  0.2015  
Chenab River Inflow  CEEMDANHTMM  6.6962  0.0115  0.0140 
CEEMDANSTMM  6.2191  0.0110  0.0141  
CEEMDANITFMM  6.2198  0.0217  0.0504  
CEEMDANEBTMM  6.0027  0.0025  0.1934  
EEMDHTMM  6.9454  0.0818  0.7159  
EEMDSTMM  5.3718  1.0425  115.923  
EEMDITFMM  6.5058  0.3943  16.678  
EEMDEBTMM  7.0002  0.0196  0.0410  
Kabul River Inflow  CEEMDANHTMM  8.5972  0.2274  7.0561 
CEEMDANSTMM  7.8189  0.1676  3.8311  
CEEMDANITFMM  7.8192  0.2328  3.3163  
CEEMDANEBTMM  8.5971  0.1660  3.7578  
EEMDHTMM  8.7354  0.2093  5.9963  
EEMDSTMM  6.6151  0.2384  0.2028  
EEMDITFMM  6.6042  0.1864  0.2881  
EEMDEBTMM  9.0280  0.1878  4.8166 
Discussion
Decomposition and denoising results: In order to understand the applicability of our proposed CEEMDANEBT model, the other EEMDbased decomposition method and three different denoising methods (i.e., EEMDHT, EEMDST and EEMDITF were employed. The evaluation of denoising of proposed CEEMDANEBT and benchmark models were carried out using SNR, MSE and MAE measures for all river flow data. From Table 1, it is clear that overall decomposition based on CEEMDAN performs well with improved denoising method (i.e., CEEMDANEBT) that efficiently eliminating noises by considering the mixture of priors for IMFs with highest SNR value and lowest MSE and MAE values; however, other existing denoising methods (i.e., CEEMDANHT, CEEMDANST, and CEEMDANITF) performed low with low SNR value and high MSE values. Moreover, comparative to CEEMDAN, the other EEMD based decomposition and denoising methods i.e., EEMDEBT, EEMDST, EEMDHT, and EEMD ITF for all case studies are also performed poor as the EEMD based decomposition exhibits clear mode mixing as shown in Fig. 6, where results for Indus and Jhelum river data are plotted. The MSE and MAE values of EEMDEBT, EEMDST, EEMDHT, and EEMD ITF methods having SNR values are very low and MSE and MAE values high due to poor decomposition and denoising methods for all four rivers data which implies that CEEMDAN has the ability to optimally extract the IMFs which are further processed with optimal denoising methods to get smooth noise free IMFs as shown in Fig. 5, where denoised results of Indus and Jhelum river flow are plotted. From the results shown in Table 1, it is concluded that our proposed strategy (i.e., CEEMDANEBT) performs better than that of all other methods (i.e., CEEMDAN/EEMDST, CEEMDAN/EEMDHT and CEEMDAN/EEMDITF and EEMDEBT based decomposition and denoising methods).
Final prediction model: To verify the superiority of proposed (i.e., CEEMDANEBTMM) strategy to model the complex river flow data, we choose the CEEMDAN/EEMDSTMM, CEEMDAN/EEMDHT, CEEMDAN/EEMDITF AND EEMDEBTMM models to analyze the prediction results of nonlinear and noisecorrupted data. Our proposed prediction framework based on decomposition and novel strategy of denoising performs well as compared to all other decomposition and denoising methods. As shown in Table 2 and Fig. 7, the proposed model (i.e., CEEMDANEBTMM) exhibits a good prediction performance for Indus river flow with least MAE, MSE and MAPE values. The MAE and MAPE values of Jhelum and Chenab river flow are also better than the other models. For Kabul river flow, our proposed model showed its efficiency with the least MAPE value. The other models are not consistent in terms of efficiency, their prediction behaviors for each river vary as CEEMDANHTMM shows effective performance for Indus river flow but was poor in predicting Kabul river flow. However, overall the proposed CEEMDANEBTMM model shows consistent and excellence prediction results which indicate that with appropriate decomposition and the novel improvement of denoising technique, one can overall improve the performance of existing datadriven models to handle the river flow data.
In this study, the river flow of the selected Indus River system exhibited seasonal persistence at several multiscales which manifested through data decomposition method, and was used to enhance the prediction accuracy of river flow. Moreover, instead of overall season, Kalhoro et al. (2017) shown that there is a further subseasonal variation i.e., high flow, called Kharif season, and low flow, called Rabi season, present in Indus River system. The majority of Pakistan’s population lives in rural areas and have the occupation of farming, which is directly influenced by the severity of both subseasonal variations. In the future, there is a need for establishing reliable subseason identification and prediction methods that will be fruitful for efficient water allocation in both seasons, which can ultimately boost the economy of Pakistan.
Conclusion
In this paper, due to the nonlinearity and noises complexity of the river flow data, we proposed a strategy to improve the prediction of datadriven models with appropriate decomposition and a novel denoising method. Our proposed denoising method improves the performance of the CEEMDANbased decomposition method by improving the timescale components which enhance the prediction accuracy of datadriven models. The proposed method comprised of four steps such as decomposition, denoising and prediction, and aggregation. The performance of the proposed CEEMDANEBTMM model is evaluated using four daily river flow data of IBS. The CEEMDANEBTMM having the smallest MAPE values for all four case studies compared to other benchmark models. The improved results suggest that the proposed hybrid model can be used as an efficient tool for providing a reliable prediction of river flow data to policymakers for planning power generation and water resource management.
Supplemental Information
CEEMDANEBTMM strategy to model the complex river flow data
Data for the Indus Basin System
The data can be used to validate the methods utilized in the manuscript. The Rfiles can be executed for utilizing this data set