DATA ANALYZER
A question that often arises in the evaluation of stochastic simulation results is how to measure the error of the stochastic simulation. One may compare the ensemble from the simulation with the ensemble from experimental data, or with the ensemble from an "exact" simulation such as SSA. Either way, the answer requires calculating the distance between two distributions.
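We provide several functions to calculate the distance between two distributions. The first is the total variation distance, which is related to the histogram. For two random variables $X$ and $Y$ it is defined as

$$\|X - Y\|_1 = \int_{-\infty}^{\infty} |p_X(s) - p_Y(s)| \, ds,$$

where $p_X$ and $p_Y$ are the probability density functions of $X$ and $Y$, respectively.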
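The second is the Kolmogorov distance, which is related to the cumulative distribution function. For two random variables $X$ and $Y$ it is defined as

$$\|X - Y\|_K = \max_{-\infty < x < \infty} |F_X(x) - F_Y(x)|,$$

where $F_X$ and $F_Y$ are the cumulative distribution functions of $X$ and $Y$, respectively.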
The details of the provided functions are listed below:
histoplot_int is a histogram plot subroutine which takes a vector as the first argument. It assumes that the vector elements are integers. If a real-valued vector is given, its elements are automatically rounded to integers. The probability that the variable takes a particular integer value is estimated as the number of samples equal to that integer divided by the total number of samples. The plot shows the estimated probabilities for all values that occur. The mean, variance and standard deviation of the variable are also calculated.
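As an illustration of the calculation described above, here is a minimal Python sketch. It is not the toolbox's MATLAB code: the function name `int_histogram` is hypothetical, and two choices are assumptions, namely that the statistics are computed on the rounded values and that the variance is the population variance (division by the number of samples).

```python
from collections import Counter


def int_histogram(samples):
    """Estimate P(X = v) for each integer v, plus mean, variance and std.

    Assumption: real values are rounded first. Python's round() sends exact
    halves to the nearest even integer, which may differ from MATLAB's round.
    """
    rounded = [round(x) for x in samples]
    n = len(rounded)
    counts = Counter(rounded)
    # Probability of a value = (samples equal to it) / (total samples).
    probs = {v: counts[v] / n for v in sorted(counts)}
    mean = sum(rounded) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((x - mean) ** 2 for x in rounded) / n
    return probs, mean, variance, variance ** 0.5
```

For example, `int_histogram([1, 2, 2, 3.2])` treats the last sample as 3, so the value 2 receives probability 0.5.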
histoplot_real is a histogram plot subroutine which takes a vector as the first argument and a positive integer $K$, the number of bins, as the second argument. Since the vector is not assumed to contain only integers, the histogram is built from bins. The bins are constructed by dividing the whole interval into $K$ subintervals. The probability that the random variable falls into a bin is estimated as the number of samples in that bin divided by the total number of samples. The plot shows the estimated probabilities for all the bins. The mean, variance and standard deviation of the variable are also calculated.
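The binning step can be sketched in Python as follows. This is only an illustration, not the toolbox's implementation: the name `binned_histogram` is hypothetical, and it assumes that "the whole interval" means the range from the smallest to the largest sample and that the bins have equal width.

```python
def binned_histogram(samples, k):
    """Estimate the probability of each of k equal-width bins.

    Assumption: the bins span [min(samples), max(samples)].
    """
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / k
    counts = [0] * k
    for x in samples:
        if width > 0:
            # The largest sample sits on the right edge; put it in the last bin.
            idx = min(int((x - lo) / width), k - 1)
        else:
            # All samples are identical, so everything lands in the first bin.
            idx = 0
        counts[idx] += 1
    n = len(samples)
    return [c / n for c in counts]
```

With `k = 2` and samples `[0.0, 0.1, 0.5, 0.9, 1.0]`, the two bins are [0, 0.5) and [0.5, 1], holding two and three samples respectively.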
Cdfplot is a cumulative distribution function (cdf) plot subroutine which takes a vector as the first argument. The cdf value is calculated as the number of samples that are smaller than a particular value, divided by the total number of samples. Here we do not distinguish whether the elements of the input vector are integers. The mean, variance and standard deviation of the variable are also calculated.
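A minimal Python sketch of an empirical cdf is given below for illustration; it is not the toolbox's code and the name `empirical_cdf` is hypothetical. The sketch uses the conventional definition, counting samples less than or equal to the query value. Whether the MATLAB subroutine counts ties this way is an assumption; replacing `bisect_right` with `bisect_left` gives the strict "smaller than" reading.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return a function F with F(x) = (number of samples <= x) / n."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        # bisect_right returns how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return cdf
```

Sorting once and answering each query by binary search keeps repeated evaluations cheap, which matters when the cdf is evaluated on a fine grid for plotting.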
histodistance_int calculates the histogram distance between two random variables. The first two arguments are the vectors of samples of the two random variables. The numbers of samples (the sizes of the two vectors) do not have to be equal. The function assumes that the two vectors contain integers. If not, they are automatically rounded to integers. The distance is the sum of the absolute differences between the estimated probabilities of the two ensembles. Since all data are integers, the probabilities are calculated for integer values rather than bins.
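The sum described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the toolbox's MATLAB code; the name `histogram_distance_int` is hypothetical, and the sum is taken over every integer that occurs in either sample.

```python
from collections import Counter


def histogram_distance_int(xs, ys):
    """Sum over integer values of |P_x(v) - P_y(v)| for two sample vectors."""
    # Assumption: values are rounded with Python's round(), which sends exact
    # halves to the nearest even integer (MATLAB's round may differ there).
    count_x = Counter(round(x) for x in xs)
    count_y = Counter(round(y) for y in ys)
    nx, ny = len(xs), len(ys)
    # A Counter returns 0 for values that never occurred in a sample.
    values = set(count_x) | set(count_y)
    return sum(abs(count_x[v] / nx - count_y[v] / ny) for v in values)
```

With this definition the distance is 0 for identical ensembles and 2 for ensembles that share no values, since each set of probabilities sums to 1.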
histodistance_real calculates the histogram distance between two random variables. The first two arguments are the vectors of samples of the two random variables. The numbers of samples (the sizes of the two vectors) do not have to be equal. The third argument is a positive integer giving the number of bins. Here the data are not assumed to be integers, so the probabilities are calculated for bins. Otherwise the function behaves like histodistance_int.
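For the binned variant, one point worth illustrating is that both samples need to be counted against the same bin edges, otherwise the per-bin probabilities are not comparable. The Python sketch below assumes equal-width bins spanning the combined range of both samples; this choice, like the name `histogram_distance_real`, is an assumption and may not match the toolbox's implementation.

```python
def histogram_distance_real(xs, ys, k):
    """Sum over k shared bins of |P_x(bin) - P_y(bin)|."""
    # Assumption: the bins cover the combined range of both samples.
    lo = min(min(xs), min(ys))
    hi = max(max(xs), max(ys))
    width = (hi - lo) / k

    def bin_probs(samples):
        counts = [0] * k
        for x in samples:
            # Values on the right edge go into the last bin; if all values
            # coincide (width == 0), everything lands in the first bin.
            idx = min(int((x - lo) / width), k - 1) if width > 0 else 0
            counts[idx] += 1
        return [c / len(samples) for c in counts]

    return sum(abs(p - q) for p, q in zip(bin_probs(xs), bin_probs(ys)))
```

The result depends on the number of bins: too few bins hide differences between the ensembles, while too many bins leave few samples per bin and make the estimate noisy.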
kolmogorovdistance calculates the Kolmogorov distance between two random variables. The first two arguments are the vectors of samples of the two random variables. The numbers of samples (the sizes of the two vectors) do not have to be equal. The Kolmogorov distance is the maximum difference between the empirical cdfs of the two random variables.
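A minimal Python sketch of this calculation is shown below for illustration. It is not the toolbox's MATLAB code and the name `kolmogorov_distance` is hypothetical. It relies on the fact that empirical cdfs are step functions that change only at sample values, so the maximum difference can be found by checking the sample points of both vectors. The cdf here counts samples less than or equal to the query point, which is an assumption about how ties are handled.

```python
from bisect import bisect_right


def kolmogorov_distance(xs, ys):
    """Maximum absolute difference between the two empirical cdfs."""
    sx, sy = sorted(xs), sorted(ys)
    nx, ny = len(sx), len(sy)
    # Both cdfs are constant between consecutive sample values, so it is
    # enough to evaluate the difference at every sample point.
    return max(
        abs(bisect_right(sx, v) / nx - bisect_right(sy, v) / ny)
        for v in sx + sy
    )
```

Unlike the histogram distances, this value does not depend on a choice of bins, and it always lies between 0 and 1.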
Note: To use our MPI toolbox, users need MATLAB to run these MATLAB subroutines.
For details, please read the UserGuide.