The main factor in choosing a method is often a trade-off between the bias and the variance of the estimate,[8] although the nature of the (suspected) distribution of the data may also be a factor,[7] as well as the sample size and the size of the alphabet of the probability distribution.[9]
The histogram approach uses the idea that the differential entropy of a probability distribution f(x) for a continuous random variable X,

H(X) = -\int f(x) \log f(x) \, dx ,

can be approximated by first approximating f(x) with a histogram of the observations, and then finding the discrete entropy of that quantization,

\hat{H}(X) = -\sum_{i=1}^{k} \hat{f}(x_i) \log \frac{\hat{f}(x_i)}{w_i} ,

with the bin probabilities \hat{f}(x_i) given by the histogram and k the number of bins. The histogram is itself a maximum-likelihood (ML) estimate of the discretized frequency distribution,[citation needed] where w_i is the width of the i-th bin.
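As an illustration, here is a minimal sketch of such a histogram estimator with equal-width bins; the function name, bin count, and test distribution are illustrative assumptions, not from the source:

```python
import numpy as np

def histogram_entropy(samples, k=32):
    """Estimate differential entropy from an equal-width histogram.

    Implements H ~ -sum_i p_i * log(p_i / w_i), where p_i is the
    empirical probability of bin i and w_i its width.
    """
    counts, edges = np.histogram(samples, bins=k)
    widths = np.diff(edges)
    p = counts / counts.sum()
    nonzero = p > 0  # empty bins contribute nothing to the sum
    return -np.sum(p[nonzero] * np.log(p[nonzero] / widths[nonzero]))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
# True differential entropy of N(0, 1) is 0.5 * log(2 * pi * e) ~ 1.4189 nats.
print(histogram_entropy(x))
```

The bin count k controls the bias-variance trade-off discussed above: few wide bins smooth away detail, while many narrow bins leave mostly empty cells whose counts are noisy.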
If the data is one-dimensional, we can imagine taking all the observations and putting them in order of their value.
The spacing between one value and the next then gives a rough idea of the reciprocal of the probability density in that region: the closer together the values are, the higher the probability density. This is a very rough estimate with high variance, but it can be improved, for example by considering the spacing between a given value and the one m places away from it, where m is some fixed number.
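A minimal sketch of an m-spacing estimator of this kind (a Vasicek-style variant; the function name, the edge-clamping details, and the heuristic m ~ sqrt(n) are assumptions, not from the source):

```python
import numpy as np

def m_spacing_entropy(samples, m=None):
    """Vasicek-style m-spacing estimate of differential entropy.

    Sorts the sample and averages the log of the scaled spacing between
    each value and the one m positions away, clamping at the edges.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    if m is None:
        m = max(1, int(round(np.sqrt(n))))  # heuristic choice of m
    idx = np.arange(n)
    upper = np.minimum(idx + m, n - 1)  # order statistic x_(i+m), clamped
    lower = np.maximum(idx - m, 0)      # order statistic x_(i-m), clamped
    spacings = x[upper] - x[lower]
    return np.mean(np.log(n / (2 * m) * spacings))

rng = np.random.default_rng(1)
# Should be close to 1.4189 nats for a standard normal sample.
print(m_spacing_entropy(rng.normal(size=5_000)))
```

Larger m averages over wider spacings, reducing variance at the cost of bias; a value of m that grows with the sample size but slowly relative to n is a common compromise.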
The method gives very accurate results, but it is limited to random sequences that can be modeled as first-order Markov chains with small bias and weak correlations. It is the first known method that takes into account the size of the sample sequence and its impact on the accuracy of the entropy estimate.
In the neural joint entropy estimator (NJEE), a deep neural network (DNN) is trained as a classifier that maps an input vector or matrix X to a probability distribution over the possible classes of the random variable Y, given the input X.[17]
In practice, the probability distribution of Y is obtained from a softmax output layer whose number of nodes equals the alphabet size of Y. NJEE uses continuously differentiable activation functions, so that the conditions of the universal approximation theorem hold.
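The following sketch illustrates the underlying idea in PyTorch: the converged cross-entropy loss of such a softmax classifier approximates the conditional entropy H(Y|X). The architecture, layer sizes, and toy data are illustrative assumptions, not the reference NJEE implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ALPHABET_SIZE = 16  # assumed alphabet size of Y

# Classifier mapping X to a distribution over the classes of Y.
# Smooth (ELU) activations keep the network continuously differentiable;
# the softmax over ALPHABET_SIZE output nodes is folded into the loss.
model = nn.Sequential(
    nn.Linear(8, 64), nn.ELU(),
    nn.Linear(64, 64), nn.ELU(),
    nn.Linear(64, ALPHABET_SIZE),
)
loss_fn = nn.CrossEntropyLoss()  # mean of -log q(y|x), in nats
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy data in which Y depends on X, so that H(Y|X) = log(8) ~ 2.079 nats.
X = torch.randn(4096, 8)
Y = (X[:, 0] > 0).long() * 8 + torch.randint(0, 8, (4096,))

for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()

print(f"estimated H(Y|X) ~ {loss.item():.3f} nats")
```

Because the population cross-entropy upper-bounds the true conditional entropy, a well-trained classifier's loss approaches H(Y|X) from above; the full NJEE construction combines conditional terms of this kind into a joint-entropy estimate, while the sketch above shows only the classifier-plus-cross-entropy step.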