Student's t-distribution

The Student's t distribution plays a role in a number of widely used statistical analyses, including Student's t test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and linear regression analysis.

[3] The following images show the density of the t distribution for increasing values of ν, the number of degrees of freedom.

Suppose X₁, ..., Xₙ are independent realizations of the normally distributed random variable X, which has an expected value μ and variance σ².

Moreover, it is possible to show that these two random variables (the normally distributed one Z and the chi-squared-distributed one V) are independent.

Notice that the unknown population variance σ² does not appear in T, since it was in both the numerator and the denominator, so it canceled.
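The cancellation can be written out explicitly. The following is a sketch of the standard construction; the symbols for the sample mean and sample variance (written here as X̄ and S²) are notation assumed for illustration, while Z, V and T are the quantities named in the text.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\qquad
S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 ,

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1),
\qquad
V = \frac{(n-1)\,S^2}{\sigma^2} \sim \chi^2_{n-1},

T = \frac{Z}{\sqrt{V/(n-1)}}
  = \frac{(\bar{X} - \mu)\,\sqrt{n}/\sigma}{S/\sigma}
  = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.
```

The factor σ appears once in Z and once in the square root of V/(n − 1), which is why it drops out of the ratio.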

Gosset intuitively obtained the probability density function stated above, with

Thus, for inference purposes, the t statistic is a useful "pivotal quantity" in the case when the mean μ and variance σ² are unknown population parameters, in the sense that the t statistic then has a probability distribution that depends on neither μ nor σ².

As a result, the location-scale t distribution arises naturally in many Bayesian inference problems.

It thus gives the probability that a value of t less than that calculated from observed data would occur by chance.

Therefore, the function A(t | ν) can be used when testing whether the difference between the means of two sets of data is statistically significant, by calculating the corresponding value of t and the probability of its occurrence if the two sets of data were drawn from the same population.

For statistical hypothesis testing this function is used to construct the p-value.
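As a rough illustration of how A(t | ν) and the associated p-value could be evaluated, the sketch below integrates the t density numerically from −t to t with Simpson's rule, using only the Python standard library. The function names (`t_pdf`, `student_A`, `two_sided_p_value`) and the choice of integration rule are assumptions made for this example, not part of any particular library.

```python
import math


def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def student_A(t, nu, intervals=2000):
    """A(t | nu): integral of the density from -t to t, for t >= 0.

    Composite Simpson's rule on [0, t], doubled because the density is
    symmetric about zero. `intervals` must be even.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    h = t / intervals
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 2.0 * total * h / 3.0


def two_sided_p_value(t_observed, nu):
    """Probability that |T| exceeds |t_observed| when the null hypothesis holds."""
    return 1.0 - student_A(abs(t_observed), nu)
```

For ν = 1 the distribution is the Cauchy distribution, for which A(1 | 1) = 1/2 exactly, which gives a convenient check of the numerical integration.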

Other properties of this version of the distribution are:[13]

Student's t distribution arises in a variety of statistical estimation problems where the goal is to estimate an unknown parameter, such as a mean value, in a setting where the data are observed with additive errors.

In any situation where this statistic is a linear function of the data, divided by the usual estimate of the standard deviation, the resulting quantity can be rescaled and centered to follow Student's t distribution.

Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the need to use the Student's t distribution.

These problems are generally of two kinds: (1) those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored because that is not the point that the author or instructor is then explaining.

Suppose the number A is so chosen that when T has a t distribution with n − 1 degrees of freedom.

Therefore, if we find the mean of a set of observations that we can reasonably expect to have a normal distribution, we can use the t distribution to examine whether the confidence limits on that mean include some theoretically predicted value – such as the value predicted on a null hypothesis.
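The step from the number A to confidence limits on the mean can be sketched as follows. The derivation assumes that A is chosen so that T falls between −A and A with probability 1 − α; the generic level 1 − α, and the symbols X̄ and S for the sample mean and sample standard deviation, are assumptions made for illustration.

```latex
\Pr\left(-A < T < A\right) = 1 - \alpha,
\qquad
T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1},

\Pr\left(-A < \frac{\bar{X} - \mu}{S/\sqrt{n}} < A\right) = 1 - \alpha
\;\Longleftrightarrow\;
\Pr\left(\bar{X} - A\,\frac{S}{\sqrt{n}} < \mu < \bar{X} + A\,\frac{S}{\sqrt{n}}\right) = 1 - \alpha .
```

A theoretically predicted value of μ can then be compared with the two limits X̄ ± A·S/√n.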

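If the data are normally distributed, the one-sided (1 − α) upper confidence limit (UCL) of the mean can be calculated using the following equation: UCL(1 − α) = X̄ + t(α, n − 1) · S/√n, where X̄ is the sample mean, S the sample standard deviation, n the sample size, and t(α, n − 1) the upper-α critical value of the t distribution with n − 1 degrees of freedom. The resulting UCL will be the greatest average value that will occur for a given confidence interval and population size.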
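As a minimal numerical sketch of the upper confidence limit (sample mean plus critical value times standard error), the function below takes the critical value as an input, as it would be read from a t table. The function name and the sample figures (n = 11, mean 10, variance 2) are hypothetical; 1.372 is the tabulated one-sided 90% value for 10 degrees of freedom.

```python
import math


def upper_confidence_limit(sample_mean, sample_sd, n, t_crit):
    """One-sided upper confidence limit: mean + t_crit * sd / sqrt(n)."""
    return sample_mean + t_crit * sample_sd / math.sqrt(n)


# Hypothetical sample: n = 11 observations, mean 10, variance 2.
ucl = upper_confidence_limit(10.0, math.sqrt(2.0), 11, 1.372)
print(round(ucl, 3))  # approximately 10.585
```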

If an improper prior proportional to 1/σ² is placed over the variance, the t distribution also arises.

However, it is not always easy to identify outliers (especially in high dimensions), and the t distribution is a natural choice of model for such data and provides a parametric approach to robust statistics.

The likelihood can have multiple local maxima and, as such, it is often necessary to fix the degrees of freedom at a fairly low value and estimate the other parameters taking this as given.

Venables and Ripley[citation needed] suggest that a value of 5 is often a good choice.
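To illustrate estimation with the degrees of freedom held fixed, the sketch below fits the location and scale of a t model by an EM-style iteratively reweighted scheme with ν = 5. This is one common way to carry out such a fit, not necessarily the procedure the cited authors use; the function name, the starting values (median and a scaled median absolute deviation) and the sample data are assumptions made for this example.

```python
import math
import statistics


def fit_location_scale_t(data, nu=5.0, iterations=200):
    """Estimate location mu and scale sigma of a t model with fixed nu.

    Each pass computes weights (nu + 1) / (nu + squared standardized
    residual), so that distant points are down-weighted, then updates mu
    as a weighted mean and sigma^2 as a weighted mean squared residual.
    """
    n = len(data)
    mu = statistics.median(data)
    mad = statistics.median([abs(x - mu) for x in data])
    sigma2 = (1.4826 * mad) ** 2 if mad > 0 else 1.0
    for _ in range(iterations):
        weights = [(nu + 1.0) / (nu + (x - mu) ** 2 / sigma2) for x in data]
        mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        sigma2 = sum(w * (x - mu) ** 2 for w, x in zip(weights, data)) / n
        if sigma2 == 0.0:  # degenerate data: all points equal to mu
            break
    return mu, math.sqrt(sigma2)


# Hypothetical data: six readings near 10 and one gross outlier.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 30.0]
mu_hat, sigma_hat = fit_location_scale_t(readings)
```

With these readings the sample mean is pulled toward the outlier, while the fitted location stays close to the bulk of the data, which is the robustness property described above.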

[16] These processes are used for regression, prediction, Bayesian optimization and related problems.

[17] The following table lists values for t distributions with ν degrees of freedom for a range of one-sided or two-sided critical regions.
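Entries of such a table can be approximated numerically. The sketch below, which uses only the Python standard library, integrates the density with Simpson's rule to obtain the cumulative distribution function and then inverts it by bisection. The function names, the step size and the tolerance are assumptions made for illustration; statistical libraries provide quantile functions that would normally be used instead.

```python
import math


def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, steps_per_unit=200):
    """CDF via composite Simpson integration of the density over [0, |x|]."""
    a = abs(x)
    n = max(2, 2 * math.ceil(a * steps_per_unit / 2))  # even interval count
    h = a / n
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    half = total * h / 3.0
    return 0.5 + half if x >= 0 else 0.5 - half


def t_quantile(p, nu, tol=1e-8):
    """One-sided critical value t with P(T <= t) = p, for 0.5 <= p < 1."""
    if not 0.5 <= p < 1.0:
        raise ValueError("p must satisfy 0.5 <= p < 1")
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < p:      # expand the bracket until it contains t
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:          # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# A few rows: one-sided levels 90%, 95% and 97.5%.
for nu in (2, 5, 10, 30):
    row = [round(t_quantile(p, nu), 3) for p in (0.90, 0.95, 0.975)]
    print(nu, row)
```

A two-sided level of 1 − α corresponds to the one-sided level 1 − α/2, so the two-sided 95% column uses p = 0.975.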

Then, with the confidence interval calculated from the sample mean plus or minus the tabulated t value times the standard error of the mean, we determine that with 90% confidence the true mean lies below the upper limit of that interval. In other words, 90% of the times that an upper threshold is calculated by this method from particular samples, this upper threshold exceeds the true mean.

And with 90% confidence the true mean lies above the corresponding lower limit. In other words, 90% of the times that a lower threshold is calculated by this method from particular samples, this lower threshold lies below the true mean.
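The frequency interpretation of the one-sided threshold can be checked by simulation. The sketch below repeatedly draws normal samples, computes the upper threshold from each, and counts how often it exceeds the true mean. The population values (mean 10, variance 2), the sample size of 11, the number of trials and the function name are hypothetical choices for this example; 1.372 is the tabulated one-sided 90% value for 10 degrees of freedom.

```python
import math
import random


def upper_bound_coverage(mu, sigma, n, t_crit, trials, seed=0):
    """Fraction of simulated samples whose upper threshold exceeds mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased variance
        upper = mean + t_crit * math.sqrt(var / n)
        if upper > mu:
            hits += 1
    return hits / trials


coverage = upper_bound_coverage(mu=10.0, sigma=math.sqrt(2.0), n=11,
                                t_crit=1.372, trials=10000)
print(coverage)  # expected to be close to 0.90
```

By symmetry, the same experiment with the lower threshold (mean minus the same margin) should give a similar fraction of samples whose threshold lies below the true mean.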

In the case of stand-alone sampling, an extension of the Box–Muller method and its polar form is easily deployed.
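A polar-form generator of this kind can be sketched in a few lines. The version below draws a point uniformly in the unit disc, as in the polar form of the Box–Muller method, and rescales its first coordinate so that the result follows a t distribution with ν degrees of freedom; as ν grows, the scaling factor tends to the one used for normal variates. The function name is an assumption for this example, and the code is offered as an illustration of the idea rather than as the exact algorithm the text cites.

```python
import math
import random


def sample_t_polar(nu, rng=random):
    """Draw one Student's t variate with nu degrees of freedom.

    (u, v) is uniform on the unit disc, so w = u^2 + v^2 is uniform on
    (0, 1) and independent of the direction u / sqrt(w). The radius
    sqrt(nu * (w**(-2/nu) - 1)) is that of a bivariate t vector, whose
    first coordinate is a univariate t variate.
    """
    while True:
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        w = u * u + v * v
        if 0.0 < w < 1.0:  # reject points outside the disc (and the origin)
            return u * math.sqrt(nu * (w ** (-2.0 / nu) - 1.0) / w)


rng = random.Random(42)
draws = [sample_t_polar(10, rng) for _ in range(20000)]
# Share of draws within +/- 1.372, the one-sided 90% value for nu = 10;
# this should be close to 0.80.
share_within = sum(abs(x) <= 1.372 for x in draws) / len(draws)
```

Unlike the basic Box–Muller method, this sketch returns one variate per accepted point, because the two coordinates of the rescaled point are uncorrelated but not independent.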

[25] In the English-language literature, the distribution takes its name from William Sealy Gosset's 1908 paper in Biometrika under the pseudonym "Student".

Another version is that Guinness did not want their competitors to know that they were using the t test to determine the quality of raw material.

Statistician William Sealy Gosset, known as "Student"