## Data Tool

This tool has been invaluable in all of our research papers.

## What is Data tool?

Agrimetsoft proudly presents the "Data Tool," a comprehensive Excel add-in designed to enhance data analysis capabilities. With the "Data Tool," users can efficiently sort and reshape their data, and calculate essential efficiency criteria such as RMSE (Root Mean Square Error), NSE (Nash-Sutcliffe Efficiency), and Index of Agreement and etc. Moreover, the tool enables users to draw CDF (Cumulative Distribution Function) and PDF (Probability Density Function) charts, check three methods of trend tests, and check the distribution of data, among other valuable functionalities. This powerful add-in empowers users with advanced data analysis tools, making it an indispensable asset for researchers and professionals seeking to elevate their data analysis and decision-making processes.

#### How to Install & Uninstall Data Tool in Excel?

#### Introduction to Data Tool?

**Introducing the Data Tool:** an all-in-one Excel add-in offering a wide array of powerful data analysis capabilities. The list of abilities includes:

**Menu:**

#### Data Section in Data-Tool

**1-** Reshape Data in Excel

Before utilizing the various efficiency criteria available in the Data Tool, users may need to reshape their data. For this purpose, the Data Tool offers two options: R2D (Reshape 2Dim) and R1D (Reshape 1Dim). With R2D, users can reshape two-dimensional data, while R1D enables reshaping two-dimensional data. For a clearer understanding and a practical example, users can refer to the help video of the Data Tool. The help video provides detailed instructions and showcases how to effectively use the Data Tool for reshaping data, allowing users to leverage its full potential for their data analysis needs.

To reshape your data using the Data Tool:

- Open your Excel file and navigate to the "Data Tool" tab.
- Select your data and click on either R1D (Reshape 1Dim) or R2D (Reshape 2Dim), depending on the format of your data. If your data is in one column, click on R1D (Reshape 1Dim); otherwise, choose R2D (Reshape 2Dim).
- If you click on R1D (Reshape 1 Dim), you will be prompted to enter the number of columns you want in the textbox provided in the window.

**Note:** Ensure that the number of rows in the data you select is divisible by the number of columns you enter in the text box for R1D (Reshape 1 Dim).

By following these steps and utilizing the Data Tool's data reshaping capabilities, users can efficiently prepare their data for further analysis and seamlessly proceed to evaluate model efficiency with confidence.

#### Reshape Data In Excel

**2-** Transform daily data into monthly, seasonal, or yearly

Clearly, the purpose of this section is to explain the data time-conversion process available in the Data Tool. You have six convenient options to transform your data into different time scales:

**CDDM (Column daily data to month):** This option allows converting daily data arranged in columns into monthly format. You should have at least 365 rows for each year.
**RDDM (Row daily data to month):** With this option, daily data organized in rows can be converted to monthly format. You should have at least 365 columns
**RMS (Row monthly to seasonal):** You can select this option to convert monthly data arranged in rows to a seasonal format. Additionally, there are two options first is to customize the season according to your preferences and the second is season 12 which means JFM, FMA, MAM, AMJ, MJJ, JJA, JAS, ASO, SON, OND, NDJ, DJF.
**CMS (Column monthly to seasonal):** In this feature, you can conveniently convert monthly data arranged in columns to a seasonal format. Moreover, you have the flexibility to utilize multiple columns as stations or variables and perform the conversion simultaneously. This allows for a seamless and efficient transformation of your data, catering to various data analysis needs with ease.
**RMAM (Row Monthly to average Monthly):**If you have monthly data spanning n years, resulting in an n*12 array, this feature allows you to calculate the mean or standard deviation of each month across all n years. The outcome will be presented as an n*1 vector, providing valuable insights into the overall trends and variations for each month throughout the entire data period.
**CMY (Column monthly to yearly):**If you possess multiple monthly station data, organized in columns where each column represents a station, this feature enables you to convert the data into average yearly or sum yearly formats. By utilizing this functionality, you can efficiently obtain the annual averages or sums for each station, simplifying the analysis of data across the entire time period.

**Note:** Please note that when utilizing this feature, it is essential to select only the relevant data without headers or NaN (Not a Number) values. Additionally, you have the flexibility to combine the results obtained from various features, allowing for a more comprehensive and robust analysis. By leveraging multiple features, you can enhance the depth and accuracy of your data analysis, leading to more informed and valuable insights.

By converting your data into the desired format, you can seamlessly proceed to select and calculate efficiency criteria using the Data Tool. Follow the steps below to achieve this:

- Open your data in Excel file
- Navigate to Data Tool Tab
- Select the relevant data (exclude headers) that requires conversion.
- Click on either CDDM (Column Daily Data to Month) or RDDM (Row Daily Data to Month) depending on the format of your data.
- Optionally, you can check "Add Date Column" and enter Year/Month details if needed.
- If your data includes a leap year with 366 days, check "Use Leap Year" and specify the First Year accordingly.
- Proceed by clicking either "To Monthly In Rows" or "To Monthly In Column" as per your preference.
- You can easily copy the resulting table by clicking on it and pressing Ctrl+A to select all data, then copying them.
- Alternatively, choose "Send to new sheet" to export the data to a new sheet within your workbook.

#### How to summarize daily data to monthly?

**3-** RMSE (Root Mean Square Error)

The Root Mean Square Error (RMSE) (also called the root mean square deviation, RMSD) is a frequently used measure of the difference between values predicted by a model and the values actually observed from the environment that is being modeled. These individual differences are also called residuals, and the Root Mean Square Error serves to aggregate them into a single measure of predictive power. Root Mean Square Error measures how much error there is between two data sets. In other words, Root Mean Square Error compares a predicted value and an observed or known value.

The Root Mean Square Error is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; Root Mean Square Error is a measure of how to spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results. The Root Mean Square Error of a model prediction with respect to the estimated variable Xmodel is defined as the square root of the mean squared error:

where X-obs has observed values and X-model has modeled values at time/place i

The Root Mean Square Error is always non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower Root Mean Square Error is better than a higher one. However, comparisons across different types of data would be invalid because the measure is dependent on the scale of the numbers used, so the value of Root Mean Square Error is between 0 to Inf, and it depends on the considered variable.

For example, the amount of precipitation starts from 0 to the maximum number which can occur in a region. The best value of Root Mean Square Error is equal to 0 since in this case, the Root Mean Square Error's value shows there is not any difference between the observation data and the simulation or model data. The value of Root Mean Square Error is always positive, representing zero in the ideal case. The effect of each error on Root Mean Square Error is proportional to the size of the squared error; thus larger errors have a disproportionately large effect on Root Mean Square Error. Consequently, Root Mean Square Error is sensitive to outliers.

However, the Root Mean Square Error values can be used to distinguish model performance in a calibration period with that of a validation period as well as to compare the individual model performance to that of other predictive models. The Root Mean Square Error measures overall discrepancies between observed and estimated values and the smaller, the better. The Root Mean Square Error has the same units as the variable under analysis.

To adjust for large rare errors, we calculate the Root Mean Square Error with Data Tool. By squaring the errors before we calculate their mean and then taking the square root of the mean, we arrive at a measure of the size of the error that gives more weight to the large but infrequent errors than the mean. We can also compare Root Mean Square Error and Mean Absolute Error to determine whether the forecast contains large but infrequent errors. The larger the difference between Root Mean Square Error and Mean Absolute Error the more inconsistent the error size.

### 4- NSE (Nash Sutcliffe model Efficiency coefficient)

The Nash-Sutcliffe efficiency (NSE) is a normalized statistic that determines the relative magnitude of the residual variance compared to the measured data variance (Nash and Sutcliffe, 1970). Nash-Sutcliffe efficiency indicates how well the plot of observed versus simulated data fits the 1:1 line. Nash-Sutcliffe efficiency is computed as shown in the following equation:

Where, OBSi refers to observation data for the desirable variable (such as precipitation, discharge, sediment, runoff, and etc.), SIMi represents the value of simulation or output of the used model for the variable. Nash Sutcliffe efficiencies range from -Inf to 1. Essentially, the closer to 1, the more accurate the model is.

NSE = 1, corresponds to a perfect match of the model to the observed data. NSE = 0, indicates that the model predictions are as accurate as the mean of the observed data, Inf < NSE < 0, indicates that the observed mean is a better predictor than the model.

Nash-Sutcliffe efficiency can be used to quantitatively describe the accuracy of model outputs other than discharge. Nash-Sutcliffe efficiency can be used to describe the predictive accuracy of other models as long as there is observed data to compare the model results with Data Tool. For example, Nash Sutcliffe efficiency has been reported in scientific literature for model simulations of discharge; water quality constituents such as sediment, nitrogen, and phosphorus loading. Other applications are the use of Nash-Sutcliffe coefficients to optimize parameter values of geophysical models, such as models to simulate the coupling between isotope behavior and soil evolution (Legates and McCabe, 1999).

The largest disadvantage of the Nash-Sutcliffe efficiency is the fact that the differences between the observed and predicted values are calculated as squared values. As a result, larger values in a time series are strongly overestimated whereas lower values are neglected (Legates and McCabe, 1999). For the quantification of runoff predictions, this leads to an overestimation of the model performance during peak flows and an underestimation during low flow conditions. Similar to R2, the Nash-Sutcliffe efficiency is not very sensitive to systematic model over- or under prediction especially during low flow periods.

### 5- Pearson coefficient

Pearson correlation coefficient is the test statistics that measure the statistical relationship, or association, between two continuous variables. It is known as the best method of measuring the association between variables of interest because it is based on the method of covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship. The Pearson correlation coefficient is a measure of the strength of the linear relationship between two variables.

It is referred to as Pearson's correlation or simply as the correlation coefficient. If the relationship between the variables is not linear, then the correlation coefficient does not adequately represent the strength of the relationship between the variables. Pearson correlation coefficient calculates as follows:

where n is the total sets of model data or station-observed data; O and P are the observed data and data model, respectively; and Pi and Oi are the average values of model data and observed data, respectively.

Correlations (Pearson correlation coefficient) are never lower than -1. A correlation of -1 indicates that the data points in a scatter plot lie exactly on a straight descending line; the two variables are perfectly negatively linearly related.

A correlation (Pearson correlation coefficient) of 0 means that two variables don't have any linear relation whatsoever. However, some nonlinear relation may exist between the two variables.

Correlation coefficients (Pearson correlation coefficient) are never higher than 1. A correlation coefficient of 1 means that two variables are perfectly positively linearly related; the dots in a scatter plot lie exactly on a straight ascending line.

The degree of correlation (Pearson correlation coefficient):

1. Perfect: If the Pearson correlation coefficient's value is near +1 and -1, then it said to be a perfect correlation: as one variable increases, the other variable tends to also increase (if positive) or decrease (if negative).

2. High degree of Pearson correlation coefficient: If the coefficient value (Pearson correlation coefficient) lies between +-0.50 and +-1, then it is said to be a strong correlation.

3. A moderate degree of Pearson correlation coefficient: If the value (Pearson correlation coefficient) lies between +- 0.30 and +- 0.49, then it is said to be a medium correlation.

4. Low degree of Pearson correlation coefficient: When the value (Pearson correlation coefficient) lies below +.29, then it is said to be a small correlation.

5. No correlation: When the value is zero.

One of the important questions which arise is related to the Pearson correlation coefficient is: Do the two variables have to be measured in the same units?

No, the two variables can be measured in entirely different units. For example, you could correlate precipitation amount (mm/day) with the percent of humidity in a day (%). Here, the units are completely different; precipitation is measured in mm/day and humidity measured in %. Indeed, the calculations for the Pearson correlation coefficient were designed such that the units of measurement do not affect the calculation. This allows the correlation coefficient to be comparable and not influenced by the units of the variables used.

The following guidelines have been proposed to check the amount of the Pearson correlation coefficient, in summary:

### 6- Calculate MBE (Mean Bias Error) coefficient

The Mean bias error is usually not used as a measure of the model error as high individual errors in prediction can also produce a low MBE. Mean bias error is primarily used to estimate the average bias in the model and to decide if any steps need to be taken to correct the model bias. Mean bias error (MBE) captures the average bias in the prediction and is calculated as:

Where n is the number of samples, O refers to observation data and P represents the prediction or model data. The long-term performance of a correlation for estimating a value is provided by the Mean bias error. It allows the comparison of actual deviation between the estimated and the measured value for each term. A smaller value of Mean bias error is preferred and ideally, it should be zero. A positive value gives the average amount of overestimation in the calculated value and vice versa. One drawback of this test is that overestimation of an individual observation will cancel underestimation in a separate observation (Dincer et al., 2015).

A positive bias or error in a variable (such as wind speed) represents the data from datasets is overestimated and vice versa, whereas for the variable's direction (such as wind direction) a positive bias represents a clockwise deviation and vice versa. The lower values of errors and considerably higher value of correlation coefficient for the variable and direction are of greater importance.

### 7- MAE (Mean Absolute Error) coefficient

The simplest measure of forecast accuracy is called Mean Absolute Error (MAE). Mean Absolute Error is simply, as the name suggests, the mean of the absolute errors. The absolute error is the absolute value of the difference between the forecasted value and the actual value. Mean Absolute Error measures accuracy for continuous variables. Mean Absolute Error tells us how big of an error we can expect from the forecast on average. The Mean Absolute Error measures the average magnitude of the errors in a set of predictions, without considering their direction. The Mean Absolute Error is the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.

Both Mean Absolute Error and Root Mean Square Error express average model prediction error in units of the variable of interest. The Mean Absolute Error and the Root Mean Square Error can range from 0 to Inf and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better. The value of Mean Absolute Error is calculating by the following equation:

n = the number of errors,

Σ = summation symbol (which means "add them all up"),

|Oi - Pi| = the absolute errors.

### 8- index of agreement (d)

Willmott (1981) proposed an index of agreement (d) as a standardized measure of the degree of model prediction error which varies between 0 and 1. The index of agreement represents the ratio of the mean square error and the potential error. The agreement value of 1 indicates a perfect match, and 0 indicates no agreement at all. The index of agreement can detect additive and proportional differences in the observed and simulated means and variances; however, d is overly sensitive to extreme values due to the squared differences. The index of the agreement is calculated as follows:

Where Oi refers to observation data and Pi shows the prediction or model data. We applied to Mean Absolute Error, Root Mean Square Error, index of agreement, and Nash Sutcliffe efficiency for checking the accuracy of CMIP5 output models i.e. maximum and minimum temperatures and precipitation. For better understanding see the Ullah et al. (2018). Also, for assessing the performance of AgMERRA precipitation data these criteria have been applied by Salehnia et al. (2017).

### 9- Deviation of runoff Volume

The deviation of runoff volumes Dv, also known as the percentage bias, is perhaps the simplest goodness-fit criterion. The deviation of the runoff volumes is given by WMO (1986), and its value is calculated using the following equation:

where Si is the simulated discharge for each time step and Oi is the observed value. N is the total number of values within the period of analysis. For a perfect model, deviation of runoff volumes is equal to zero. The smaller the deviation of runoff volumes value, the better the performance of the model (Gupta et al., 1999). In the case of Dv, there is no need to increase the number of accuracy criteria by using different formulas for a single year or season and for an average of several years or seasons.

### 10- Deviation of Gain from daily means

In contrast to R2, there are no significant problems in computing the average DG for periods comprising several years. The coefficient of gain from the daily mean, DG, compares model results with daily mean discharge values, which vary throughout the year. DG can vary between 0 and 1, with 0 being a perfect model (WMO, 1986). This coefficient is calculated:

where Oi is the measured daily discharge, Pi is the computed daily discharge, and Obar i is the average measured daily discharge.

**11-** Check 12 Distributions

To utilize this feature, simply select a column of data, and upon clicking on it, the tool will present the results of several tests for all 12 distributions to assess the distribution. These tests include the Chi-Square test, Kolmogorov-Smirnov (KS) test, and Anderson-Darling test. By performing these tests, the tool assists in determining the distribution characteristics of the selected data, providing valuable insights for further analysis and decision-making.

**12-** CDF Graph

This tool leverages the Accord Library, utilizing a specific class to implement its functionality. To use the CDF (Cumulative Distribution Function) feature, follow these simple steps:

- Choose your desired distribution.
- Select your data by marking the columns.
- Click on the CDF Button.

Each selected column represents a data series, and the tool will automatically calculate and draw the CDF chart for every column, considering the chosen distribution. Ensure that you only select the numerical data and exclude headers or any non-numeric text from the selection.

By following these instructions, you can effectively visualize the cumulative distribution of your data, enabling you to gain deeper insights into its statistical properties.

**13-** PDF Graph

This tool leverages the Accord Library, utilizing a specific class to implement its functionality. To use the PDF (Probability Density Function) feature, follow these simple steps:

- Choose your desired distribution.
- Select your data by marking the columns.
- Click on the PDF Button.

Each selected column represents a data series, and the tool will automatically calculate and draw the PDF chart for every column, considering the chosen distribution. Ensure that you only select the numerical data and exclude headers or any non-numeric text from the selection.

By following these instructions, you can effectively visualize the Probability Density of your data, enabling you to gain deeper insights into its statistical properties.

#### How to calculate PDF(Probability Density Function) By Gumbel Distribution?

**14-** Multi-Variate Linear Regression

Multiple linear regression and multivariate linear regression are commonly used terms, and while they are related, they do have a subtle difference that sets them apart.

**Multiple Linear Regression:**

Multiple linear regression involves a regression model with multiple independent variables (predictors) and a single dependent variable. The primary objective is to predict a single outcome variable based on several predictors. The equation for multiple linear regression takes the form:

y = b0 + b1 × X1 + b2 × X2 + ... + bn × Xn

In multiple linear regression, the dependent variable y is predicted based on multiple independent variables x1, x2, ..., xn, and the model includes coefficients b0, b1, b2, ..., bn associated with each independent variable. The objective is to estimate these coefficients to achieve the best fit to the data and effectively explain the relationship between the predictors and the outcome variable.

To do Multiple linear regression you can use this free online tool or watch this videos to learn about How you can make Regression Analysis in Excel

**Multivariate Linear Regression:**

Multivariate linear regression, in contrast, focuses on modeling the relationship between multiple dependent variables and multiple independent variables. This technique is applied when there are two or more outcome variables of interest, and the objective is to comprehend how these variables are collectively influenced by the independent variables. It allows for a comprehensive analysis of the interdependencies among multiple dependent and independent variables, offering valuable insights into complex relationships within the data.

The equation for multivariate linear regression is more generalized and takes the matrix form:

Y = X × B

Where Y is the matrix of dependent variables, X is the matrix of independent variables, and B is the matrix of coefficients. Each column of Y represents a different dependent variable, while each column of X represents a different independent variable. The coefficients in B represent the effects of each independent variable on each dependent variable, allowing for a comprehensive analysis of the relationships among multiple variables simultaneously.

In summary, multiple linear regression involves a single outcome variable and multiple predictors, while multivariate linear regression deals with multiple outcome variables and multiple predictors. The primary focus of multiple linear regression is to predict a single outcome, whereas multivariate linear regression analyzes relationships among multiple dependent variables concurrently.

This tool is using the Accord library to fit and use the Multivariate Linear Regression. To do Multivariate linear regression you can use Data-Tool. Watch the below video:

#### Multi Variate Linear Regression VS. Multiple Linear Regression

**15-** Trend Test

This tool offers three methods for conducting Trend Tests: the Mann-Kendall test, the Hypothesis Test for Regression Slope, and White's Test. Each method provides a unique approach to analyzing trends and patterns in the data, allowing users to choose the most suitable method based on their specific research or analysis requirements. In the Mann-Kendall (MK) test, you only need to select the variable of interest. However, in the other two methods, namely the Hypothesis Test for Regression Slope and White's Test, you are required to select both the time and the variable, meaning two columns of data need to be provided for the analysis. This difference in input requirements allows for flexibility in the type of analyses performed, catering to different scenarios and research objectives.

What is the Hypothesis Test for Regression Slope?

What is the WHITE's Trend Test?

The Mann-Kendall Trend Test, also known as the MK test, is a widely used method for analyzing data collected over time to detect consistently increasing or decreasing trends in hydrometeorological time series. This statistical test is non-parametric test, making it suitable for all distributions, but it requires that the data have no serial correlation. If your data follows a normal distribution, you may alternatively opt for a simple linear regression analysis.

The MK-Test in Data-Tool doesn't consider the tied groups. We removed it from the formula. The reference is here https://doi.org/10.1016/j.atmosres.2013.10.012

#### How to do Mann-Kendall Test in Excel by using Data-Tool | Trend Analysis

**16-** BoxPlot

We used BoxPlot in Excel 2016 for this item. It's so simple item and you should select the data series in columns and then click on the BoxPlot button. Then you will the BoxPlot and you can edit the chart. You should not select the headers or any text, you should select just the data.