Fitting Data
To parameterise your data as a normal distribution, you need only calculate the mean and standard deviation of your values. However, determining whether your data can reasonably be considered normal requires a test.
For example, the Shapiro-Wilk test measures deviation from normality, provided you have enough data points for it to detect a difference. With a large number of data points it can instead be too sensitive, rejecting normality for deviations too small to matter in practice, in which case a QQ plot can help you judge visually how far your data is from normality.
Data that follows a Poisson distribution should have approximately equal mean and variance, which makes for a quick first check. You can also test goodness of fit more formally with the Chi-Squared test.
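As a rough illustration of the steps above, the following sketch uses simulated values rather than real measurements (the sample sizes, seed and variable names are assumptions made for the example). It estimates the two normal parameters, runs the Shapiro-Wilk test on a normal and a skewed sample, and draws a QQ plot for the skewed one.

```r
# Simulated data; replace these with your own numeric vectors
set.seed(2)
normal_sample <- rnorm(50, mean = 10, sd = 2)
skewed_sample <- rexp(50, rate = 1)

# Parameterise the first sample as a normal distribution
mu_hat <- mean(normal_sample)
sigma_hat <- sd(normal_sample)  # sample standard deviation (n - 1 denominator)

# Shapiro-Wilk test: the null hypothesis is that the data are normal,
# so a small p-value is evidence of a deviation from normality
sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)
sw_normal$p.value
sw_skewed$p.value

# QQ plot: points far from the reference line indicate non-normality
qqnorm(skewed_sample)
qqline(skewed_sample)
```

Note that R's shapiro.test only accepts between 3 and 5000 values, so for very large data sets the QQ plot is often the more practical tool.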
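The Poisson checks described above might look something like the sketch below. The data are simulated, and the choice to pool all counts of 7 or more into one bin is an assumption made so that every bin has a reasonable expected count. Because the rate is estimated from the data, the sketch recomputes the p-value with one fewer degree of freedom than chisq.test uses by default.

```r
# Simulated count data; replace with your own non-negative integer counts
set.seed(42)
counts <- rpois(200, lambda = 3)

# Quick check: mean and variance should be similar for Poisson data
mean(counts)
var(counts)

# Estimate the rate, then bin counts as 0, 1, ..., 6 and "7 or more"
lambda_hat <- mean(counts)
upper <- 6
observed <- table(factor(pmin(counts, upper + 1), levels = 0:(upper + 1)))

# Expected bin probabilities under Poisson(lambda_hat); the last bin is the upper tail
expected_p <- c(dpois(0:upper, lambda_hat),
                ppois(upper, lambda_hat, lower.tail = FALSE))

# Chi-squared goodness-of-fit test against those probabilities
fit <- chisq.test(as.vector(observed), p = expected_p)

# chisq.test assumes df = bins - 1; subtract one more for the estimated rate
p_adj <- pchisq(unname(fit$statistic),
                df = length(observed) - 2,
                lower.tail = FALSE)
p_adj
```

A large p-value here means the test found no evidence against a Poisson fit; it does not prove that the data are Poisson.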
Exercises
Test whether the LakeHuron and treering data conform to normal distributions
Use the Shapiro-Wilk test and look at the result of qqnorm and qqline
Multimodal Data
Sometimes when you look at the distribution of your data, you may see that it has multiple clear peaks. A bimodal distribution, for example, has two peaks, typically because the underlying data is generated by two distinct distributions. This can be a strong indicator that something is wrong with your data (if you expect the only variation to be measurement noise, for instance) or that you have a strong batch effect if you are looking at multiple different experiments. If the underlying distributions are of the same type, with similar coefficients of variation, then you can do a good job of accounting for the differences between the two and still analyse your data effectively - more on that later.
If you need to resolve a finite number of distributions, a technique called Expectation Maximisation, or EM, is typical, and is available in the mixtools R package. The normalmixEM function, for instance, allows you to resolve data that is a mixture of normal distributions (two components by default).
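A minimal sketch of this, assuming the mixtools package is installed and using simulated data with two well-separated components (the means, sizes and variable names are invented for the example), could look like this:

```r
library(mixtools)

# Simulated bimodal data: two normal components with different means
set.seed(3)
obs <- c(rnorm(300, mean = 0, sd = 1),
         rnorm(200, mean = 5, sd = 1))

# Fit a two-component normal mixture by Expectation Maximisation
fit <- normalmixEM(obs, k = 2)

# Estimated mixing proportions, means and standard deviations
fit$lambda
fit$mu
fit$sigma

# Overlay each weighted component density on a histogram of the data
hist(obs, freq = FALSE, breaks = 30, main = "Two-component normal mixture")
for (j in 1:2) {
  curve(fit$lambda[j] * dnorm(x, mean = fit$mu[j], sd = fit$sigma[j]),
        add = TRUE)
}
```

EM starts from random initial values unless you supply them, so results can vary between runs and the order of the components is arbitrary. Setting a seed, or passing starting values through the mu and sigma arguments, makes the fit reproducible.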
Exercises
Load the iris data set
Plot histograms for the Petal.Length and Petal.Width columns. What do you observe?
Using mixtools, resolve the two distributions in each case