Fitting Data
To parameterise your data as a normal distribution, you need only calculate the mean and standard deviation of your values. However, determining whether your data should be considered normal in the first place requires a test.
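As a minimal sketch of the parameterisation step, the example below estimates the two parameters from a simulated sample and overlays the fitted density on a histogram. The sample itself, the seed, and the names `x`, `mu_hat` and `sigma_hat` are illustrative assumptions, not part of the original material.

```r
# Simulated measurements; the true mean (10) and sd (2) are arbitrary
# values chosen for illustration only.
set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)

# The two parameters of a normal distribution, estimated from the sample.
mu_hat <- mean(x)
sigma_hat <- sd(x)  # sample standard deviation (n - 1 denominator)

# Overlay the fitted normal density on a histogram to eyeball the fit.
hist(x, freq = FALSE, main = "Sample with fitted normal density")
curve(dnorm(x, mean = mu_hat, sd = sigma_hat), add = TRUE)
```

Note that this only describes the data as if it were normal; it says nothing about whether that description is appropriate.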
For example, the Shapiro-Wilk test measures deviation from normality, provided you have enough data points (R's shapiro.test accepts between 3 and 5000 values). With a large number of data points it can become too sensitive, flagging deviations that are too small to matter, in which case a QQ plot can help you judge visually how far your data is from normality.
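The sketch below shows how the test and the QQ plot might be used together, on simulated data rather than on the exercise data sets. The two samples and their names are assumptions made for illustration.

```r
set.seed(1)
normal_sample <- rnorm(100)           # drawn from a normal distribution
skewed_sample <- rexp(100, rate = 1)  # right-skewed, clearly non-normal

# Shapiro-Wilk test: the null hypothesis is that the sample is normal,
# so a small p-value is evidence against normality.
sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)
sw_normal$p.value
sw_skewed$p.value

# QQ plot: points that bend away from the reference line indicate
# where the sample departs from a normal distribution.
qqnorm(skewed_sample)
qqline(skewed_sample)
```

For the skewed sample, the points in the QQ plot should curve away from the line in the upper tail, which is the visual counterpart of the small p-value.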
The mean and variance of a Poisson distribution are equal, so comparing the sample mean and variance is a quick first check. You can also test goodness of fit with the chi-squared test.
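One way to carry out both checks is sketched below on simulated counts. The binning choice (`k_max`), the rate of 3, and all variable names are assumptions for illustration. Because the rate is estimated from the same data, the degrees of freedom reported by `chisq.test` are one too many, so an approximate adjusted p-value is computed as well.

```r
set.seed(7)
counts <- rpois(500, lambda = 3)  # simulated counts; lambda = 3 is arbitrary

# Quick check: for Poisson data these should be close to each other.
mean(counts)
var(counts)

# Estimate the rate, then compare observed and expected frequencies.
lambda_hat <- mean(counts)

# Bin as 0, 1, ..., 6 and "7 or more" so no expected count is too small.
k_max <- 7
observed <- table(factor(pmin(counts, k_max), levels = 0:k_max))
expected_p <- c(dpois(0:(k_max - 1), lambda_hat),
                ppois(k_max - 1, lambda_hat, lower.tail = FALSE))

gof <- chisq.test(x = as.vector(observed), p = expected_p)
gof$p.value  # uses (number of bins - 1) degrees of freedom

# Approximate adjustment for having estimated lambda from the data:
# one further degree of freedom is removed.
p_adjusted <- pchisq(gof$statistic, df = length(observed) - 2,
                     lower.tail = FALSE)
p_adjusted
```

A large p-value here means the test found no evidence against a Poisson fit, not that the data is proven to be Poisson.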
Exercises
Test whether the LakeHuron and treering data conform to normal distributions
Use the Shapiro-Wilk test and look at the result of qqnorm and qqline
Multimodal Data
Sometimes when you look at the distribution of your data, you may see that it has multiple clear peaks. A bimodal distribution, for example, has two peaks because the underlying data is generated by two distinct distributions. This can be a strong indicator that something is wrong with your data (if you expect the only variation to be measurement noise, for instance), or that you have a strong batch effect if you are looking at multiple different experiments. If the underlying distributions are of the same kind, with similar coefficients of variation, then you can do a good job of accounting for the differences between the two and still analyse your data effectively. More on that later.
If you need to resolve a finite number of distributions, a technique called Expectation Maximisation (EM) is typical, and is available in the mixtools R package. The normalmixEM function, for instance, allows you to resolve data that is a mixture of two normal distributions.
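A minimal sketch of such a fit follows, using the built-in `faithful` data set rather than the exercise data. It assumes the mixtools package is installed. The choice of data set, the seed, and the variable names are illustrative assumptions.

```r
library(mixtools)  # assumes the mixtools package is installed

# EM starts from random initial values, so fix the seed for repeatability.
set.seed(123)

# Waiting times between eruptions: a built-in, visibly bimodal variable.
waiting <- faithful$waiting

# Fit a mixture of k = 2 normal distributions by EM.
fit <- normalmixEM(waiting, k = 2)

fit$lambda  # estimated mixing proportions (sum to 1)
fit$mu      # estimated component means
fit$sigma   # estimated component standard deviations

# Histogram of the data with the fitted component densities overlaid.
plot(fit, whichplots = 2)
```

Since EM can converge to different local optima depending on its starting values, it is worth re-running the fit with a few different seeds and checking that the estimates agree.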
Exercises
Load the iris data set
Plot histograms for the Petal.Length and Petal.Width columns. What do you observe?
Using mixtools, resolve the two distributions in each case