Fitting Data

To parameterise your data as a normal distribution, you need only calculate the mean and standard deviation of your values. Determining whether your data should be considered normal in the first place, however, requires a test.

For example, the Shapiro-Wilk test assesses deviation from normality, provided you have enough data points. With a large number of data points it can become too sensitive, flagging even trivial deviations as significant, in which case a QQ plot can help you judge visually how far your data is from normality.
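As a rough illustration of the workflow described above, the sketch below estimates the normal parameters, runs the Shapiro-Wilk test and draws a QQ plot. The data is simulated purely for demonstration (the variable names and the choice of a normal and an exponential sample are assumptions, not part of the exercises), and only base R functions are used.

```r
# Simulated example data (an assumption for illustration only)
set.seed(42)
x_normal <- rnorm(200, mean = 10, sd = 2)  # drawn from a normal distribution
x_skewed <- rexp(200, rate = 1)            # strongly right-skewed

# Parameterise the first sample as a normal distribution
mu    <- mean(x_normal)
sigma <- sd(x_normal)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw_normal <- shapiro.test(x_normal)
sw_skewed <- shapiro.test(x_skewed)
print(sw_normal$p.value)
print(sw_skewed$p.value)

# QQ plot: points that stray from the reference line indicate non-normality
qqnorm(x_skewed)
qqline(x_skewed)
```

Comparing the two p-values alongside the QQ plot gives a feel for how the numerical test and the visual check relate to each other.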

A Poisson distribution has equal mean and variance, so comparing the two is a quick first check, but you can also test goodness-of-fit with the Chi-Squared test.
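One possible way to carry out both checks in base R is sketched below. The counts are simulated, and the binning choice (pooling everything from 7 upwards into one bin) is an assumption made so that no expected count is too small. Because the rate is estimated from the data, the sketch also recomputes the p-value with one fewer degree of freedom than chisq.test assumes.

```r
# Simulated count data (an assumption for illustration only)
set.seed(1)
counts <- rpois(500, lambda = 3)

# Quick check: mean and variance should be similar for Poisson data
lambda_hat <- mean(counts)
print(c(mean = lambda_hat, variance = var(counts)))

# Bin the counts as 0, 1, ..., 6 and "7 or more"
k_max    <- 7
observed <- as.vector(table(factor(pmin(counts, k_max), levels = 0:k_max)))

# Expected probabilities under a Poisson with the estimated rate;
# the final bin takes the whole upper tail so the probabilities sum to 1
expected_p <- c(dpois(0:(k_max - 1), lambda_hat),
                ppois(k_max - 1, lambda_hat, lower.tail = FALSE))

# Chi-squared goodness-of-fit test
gof <- chisq.test(observed, p = expected_p)

# One parameter (lambda) was estimated, so use bins - 1 - 1 degrees of freedom
df_adj <- length(observed) - 2
p_adj  <- pchisq(unname(gof$statistic), df = df_adj, lower.tail = FALSE)
print(p_adj)
```

A large p-value here means the test found no evidence against a Poisson fit; it does not prove that the data is Poisson.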

Exercises

  • Test whether the LakeHuron and treering data conform to normal distributions

  • Use the Shapiro-Wilk test and look at the result of qqnorm and qqline


Multimodal Data

Sometimes when you look at the distribution of your data, you may see that it has multiple clear peaks. A bimodal distribution, for example, has two peaks, typically because the underlying data is generated by two distinct distributions. This can be a strong indicator that something is wrong with your data (if you expect the only variation to be measurement noise, for instance), or that you have a strong batch effect if you are looking at multiple different experiments. If the underlying distributions are of the same type, with similar coefficients of variation, then you can do a good job of accounting for the differences between the two and still analyse your data effectively - more on that later.

If you need to resolve a finite number of distributions, a technique called Expectation Maximisation (EM) is typical, and is available in the mixtools R package. The normalmixEM function, for instance, allows you to resolve data that is a mixture of two normal distributions.
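A minimal sketch of how normalmixEM might be used is shown below. It assumes the mixtools package is already installed, and it uses simulated data with made-up component means and starting values rather than a real data set.

```r
library(mixtools)  # assumed to be installed, e.g. via install.packages("mixtools")

# Simulated bimodal data (an assumption for illustration only):
# 150 values centred near 2 and 100 values centred near 6
set.seed(123)
x <- c(rnorm(150, mean = 2, sd = 0.5),
       rnorm(100, mean = 6, sd = 1))

# The histogram should show two separate peaks
hist(x, breaks = 30)

# Fit a two-component normal mixture by EM;
# mu gives rough starting guesses for the component means
fit <- normalmixEM(x, k = 2, mu = c(1, 5))

# Estimated mixing proportions, means and standard deviations
print(fit$lambda)
print(fit$mu)
print(fit$sigma)

# Overlay the fitted component densities on a histogram of the data
plot(fit, whichplots = 2)
```

EM can converge to different solutions depending on where it starts, so supplying approximate starting means read off the histogram is a reasonable precaution.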

Exercises

  • Load the iris data set

  • Plot histograms for the Petal.Length and Petal.Width columns. What do you observe?

  • Using mixtools, resolve the two distributions in each case
