Functions#
R does more than simple calculations and data import. Much of its power comes from functions. R offers a wide selection of functions: some are built in, and others become available when you install and load packages.
Basic functions in R#
A function requires input arguments, some necessary, such as the data you want to run the function on, and some optional, such as the choice of method or additional parameters. As most optional arguments already have a pre-set default value it can be tricky to grasp how many arguments the function has. We will now look at a very simple first function mean in R.
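To make the difference between necessary and optional arguments concrete, here is a minimal sketch using a small made-up vector (values is a hypothetical name chosen for illustration). The optional argument na.rm has the default FALSE, so it only changes the result when you set it yourself.

```r
# Show the arguments of the default mean method and their preset values
args(mean.default)

# Hypothetical example vector containing one missing value
values <- c(1, 2, 3, NA, 100)

# Only the necessary argument x is supplied; na.rm keeps its default (FALSE)
default_result <- mean(values)

# The optional argument na.rm is set explicitly, so the NA is dropped first
narm_result <- mean(values, na.rm = TRUE)

print(default_result)  # NA
print(narm_result)     # 26.5
```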
First, if we want to understand a function, we read its help file.
# Get help
?mean
This opens the documentation for that function. The Description section explains what the function does. The Usage section shows how to call the function in your script or the console, including any default values set for its arguments. The Arguments section goes through each argument and explains it. In our example, the only necessary argument, x, is the object we want to apply the function to. The Value section explains what the function returns. At the very bottom of the documentation you can also find some examples of how to use the function. If we don’t even know whether a function exists, we can use the double question mark to search for keywords:
# Search keywords
??substring
This opens a page called Search Results in the Help tab of the bottom-right panel. It suggests functions that are most likely to answer your request. For each function you can see which package it comes from and a short description of what it does.
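If you roughly remember a function's name, the base R functions apropos and exists can complement the keyword search. The sketch below is only an illustration; mean_like and no_such_function are hypothetical names.

```r
# List every attached object whose name starts with "mean"
mean_like <- apropos("^mean")
print(mean_like)

# Check whether an object with a given name exists before calling it
has_median <- exists("median")            # TRUE, median ships with the stats package
has_missing <- exists("no_such_function") # FALSE, assuming nothing by that name is defined
```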
Now let’s start using the mean function with a vector that contains all numbers from 1 to 10. Arguments for a function can be passed either by position or by name. A function expects its arguments in a specific order, so the first unnamed argument is matched to the first argument of the function. As already discussed, the mean function only needs one input argument, x.
# Find the mean of a vector
c <- 1:10
#method 1: using the predefined positions
mean(c)
#method 2: declare input by name
mean(x = c)
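Because mean has only one necessary argument, the effect of argument order is easier to see with a function that takes several. The following sketch uses the base R function seq, whose first three arguments are from, to and by; the variable names are chosen for illustration only.

```r
# By position: the values are matched to from, to, by in that order
by_position <- seq(1, 10, 2)

# By name: the same call with every argument spelled out
by_name <- seq(from = 1, to = 10, by = 2)

# Named arguments can be given in any order
reordered <- seq(by = 2, to = 10, from = 1)

print(by_position)                 # 1 3 5 7 9
identical(by_position, by_name)    # TRUE
identical(by_position, reordered)  # TRUE
```

Naming arguments costs a few extra characters but makes a call easier to read, especially for optional arguments.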
Exercise 0.6
Try calculating the sum of the same vector using the sum function
#defining vector c
c <- 1:10
#calculate sum
sum <- sum(c)
Extract the length of the vector using the length function
len <- length(c)
Calculate the mean using the results from the first two exercises and compare it to the result using mean. Can you see how using functions reduces the length of your code?
mean_calc <- sum/len
show(mean_calc)
#we can see that the result of mean_calc and mean are the same. However, we used up 3 lines to code the mean by hand. The function mean only uses one line and is much more efficient.
Calculate the median of the vector using the median function
#calculate median of c
med <- median(c)
show(med)
We will use the swiss data set to test the mean function again. First, we will have a look at what this data set contains.
#loading swiss data set
data(swiss)
#view swiss data set
View(swiss)
#calculating mean for fertility
#method 1: using the predefined positions
mean(swiss$Fertility)
#method 2: declare input by name
mean(x = swiss$Fertility)
Let’s look at another function called sd, which calculates the standard deviation.
#calculating standard deviation for fertility
sd(swiss$Fertility)
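As with the mean, the standard deviation can be rebuilt from simpler functions. The sketch below uses the sample formula with n - 1 in the denominator, which is what sd computes; the names fertility, sd_manual and sd_builtin are chosen for illustration.

```r
data(swiss)
fertility <- swiss$Fertility
n <- length(fertility)

# Sample standard deviation by hand: root of the summed squared
# deviations from the mean, divided by n - 1
sd_manual <- sqrt(sum((fertility - mean(fertility))^2) / (n - 1))

# The built-in function does the same in one call
sd_builtin <- sd(fertility)

all.equal(sd_manual, sd_builtin)  # TRUE
```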
You can also find the largest or smallest value in a vector using the max or min function.
#finding the largest/smallest value in vector c
c <- 2:30
max(c) #= 30
min(c) #= 2
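max and min return the value itself. If you also want to know where that value sits in the vector, base R offers which.max and which.min. Here is a small sketch with a made-up named vector (scores and its names are hypothetical).

```r
# Hypothetical named vector
scores <- c(anna = 12, ben = 30, cara = 7)

largest <- max(scores)               # 30
largest_position <- which.max(scores) # position 2, labelled "ben"

# Use the position to look up the matching name
smallest_name <- names(scores)[which.min(scores)]  # "cara"

# range returns the minimum and maximum together
score_range <- range(scores)         # 7 30
```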
Exercise 0.7
Explore the swiss data set. The following questions can guide you:
How catholic is the region with the highest fertility?
#how catholic is the region with the highest fertility
#get all columns for max fertility
swiss[swiss$Fertility == max(swiss$Fertility),]
#only get Catholic column
swiss[swiss$Fertility == max(swiss$Fertility), "Catholic"] #93.4
Is there a difference in infant mortality between low-education and high-education areas? (hint: define high as > 10 and low as <= 10)
#difference in mean between high and low education areas
#slicing data frame
low_education <- swiss[swiss$Education <= 10,]
high_education <- swiss[swiss$Education > 10,]
#calculating means
mean(low_education$Infant.Mortality) #20.2
mean(high_education$Infant.Mortality) #19.48824
Is education higher in regions with lower agriculture? (hint: use min, max and mean)
#how does education affect agriculture?
#slicing data frame
low_agriculture <- swiss[swiss$Agriculture <= 50,]
high_agriculture <- swiss[swiss$Agriculture > 50,]
#calculating means
mean(low_agriculture$Education) #16.38095
mean(high_agriculture$Education) #6.615385
#calculating maxima and minima
swiss[swiss$Agriculture == max(swiss$Agriculture),]
swiss[swiss$Agriculture == min(swiss$Agriculture),]
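Slicing the data frame once per group works well for two groups. As an alternative sketch, not part of the exercise solution, the base R function tapply computes a statistic for every group in one call. The grouping variable agri_group is a hypothetical helper built with the same threshold of 50.

```r
data(swiss)

# Label each region as "high" or "low" agriculture (threshold 50, as in the exercise)
agri_group <- ifelse(swiss$Agriculture > 50, "high", "low")

# Mean education per group in a single call
group_means <- tapply(swiss$Education, agri_group, mean)
print(group_means)

# Cross-check against the slicing approach
sliced_mean <- mean(swiss[swiss$Agriculture > 50, "Education"])
all.equal(group_means[["high"]], sliced_mean)  # TRUE
```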
Functions and class#
Many R functions are written so that they behave differently depending on what class of variable they are given. For instance, the summary function gives additional information about a variable, and what it shows depends on the variable’s class.
# Class discrimination
x <- 1:10
summary(x)
data(swiss)
summary(swiss)
data(Titanic)
summary(Titanic)
So when a function does something unexpected, consider what mode or class the variables you gave it have.
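The class function tells you what a function will see. A minimal sketch with two made-up vectors (numbers and labels are hypothetical names) shows how summary changes its output with the class:

```r
# The same kind of information stored under two different classes
numbers <- c(3, 1, 2, 3)
labels <- factor(c("c", "a", "b", "c"))

class(numbers)  # "numeric"
class(labels)   # "factor"

# For a numeric vector, summary reports minimum, quartiles, mean and maximum
numeric_summary <- summary(numbers)

# For a factor, summary counts how often each level occurs
factor_summary <- summary(labels)

print(numeric_summary)
print(factor_summary)
```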
Introduction to statistical functions in R#
R also provides a large range of statistical functions. A commonly used one is the correlation function cor. Again, have a look at the documentation to learn what the input arguments for this function need to be.
#look at documentation
?cor
The documentation tells us that we need at least one argument x. The default correlation method is set to pearson. Let’s say we want to investigate whether fertility is correlated with the percentage of Catholics.
cor(swiss$Fertility, swiss$Catholic)
For two vectors, the function returns a single correlation coefficient. Your inputs do not necessarily have to be vectors; you can also pass an entire matrix or data frame, in which case the result is a correlation matrix.
#correlation between the entire swiss data frame and fertility
cor(swiss, swiss$Fertility)
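When cor receives a data frame as its only input, it correlates every column with every other column. The following sketch picks three columns of swiss for readability; the names single_cor and cor_matrix are chosen for illustration.

```r
data(swiss)

# Correlation between two vectors
single_cor <- cor(swiss$Fertility, swiss$Catholic)

# Correlation matrix for a subset of columns
cor_matrix <- cor(swiss[, c("Fertility", "Catholic", "Education")])
print(cor_matrix)

is.matrix(cor_matrix)  # TRUE
dim(cor_matrix)        # 3 3

# The matrix entry matches the pairwise result
all.equal(cor_matrix["Fertility", "Catholic"], single_cor)  # TRUE
```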
Next, we will change the correlation method (check out the documentation again to see which ones you can pick from).
#change method
cor(swiss$Fertility, swiss$Catholic, method = "spearman")
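The Spearman method is the Pearson correlation applied to the ranks of the data, which the sketch below checks directly. The variable names are chosen for illustration, and the Kendall call is included only to show the third available method.

```r
data(swiss)

# Spearman correlation through the method argument
spearman <- cor(swiss$Fertility, swiss$Catholic, method = "spearman")

# The same quantity by hand: Pearson correlation of the ranks
pearson_on_ranks <- cor(rank(swiss$Fertility), rank(swiss$Catholic))

all.equal(spearman, pearson_on_ranks)  # TRUE

# Kendall's tau is the third option offered by cor
kendall <- cor(swiss$Fertility, swiss$Catholic, method = "kendall")
print(kendall)
```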
You can also use R for significance testing. A large number of statistical tests are available. We will only have a look at the t.test function at this point. Have a look at the iris data set.
#load iris data set
data(iris)
#iris data set
View(iris)
We now want to see whether petal length differs significantly between the two species setosa and versicolor. By default, the t.test function calculates a “Welch Two Sample t-test”, which does not assume equal variances.
#calculate t test
t.test(iris[iris$Species == "setosa",]$Petal.Length, iris[iris$Species == "versicolor",]$Petal.Length)
This will print out the summary of the t-test in your console. If you are planning to use the output for further calculations or simulations, it makes sense to store the result in a variable.
#calculate t test and save in variable t_test
t_test <- t.test(iris[iris$Species == "setosa",]$Petal.Length, iris[iris$Species == "versicolor",]$Petal.Length)
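t.test also accepts a formula, which can be easier to read than two subsetting expressions. The sketch below is an alternative way of writing the same test; two_species, res_formula and res_vectors are hypothetical names, and droplevels is used so that the Species factor has exactly two levels.

```r
data(iris)

# Keep only the two species of interest and drop the unused factor level
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))

# Formula interface: response ~ grouping variable
res_formula <- t.test(Petal.Length ~ Species, data = two_species)

# Vector interface, as used above
res_vectors <- t.test(
  two_species$Petal.Length[two_species$Species == "setosa"],
  two_species$Petal.Length[two_species$Species == "versicolor"]
)

# Both calls give the same t-statistic
all.equal(unname(res_formula$statistic), unname(res_vectors$statistic))  # TRUE
```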
The output is now stored as a list called t_test. You can easily access the different quantities using the dollar sign or double square brackets. For example, we can extract the t-statistic from our calculation:
#get t-statistic
t_test$statistic
t_test[["statistic"]]
To get an overview of all quantities provided by the function you can use the names function.
#overview over all quantities
names(t_test)
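Once you know the names, each quantity can be pulled out and reused. Here is a short sketch, assuming the same setosa and versicolor comparison as above; the 0.05 threshold and the name is_significant are illustrative choices.

```r
data(iris)
setosa <- iris$Petal.Length[iris$Species == "setosa"]
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]
t_test <- t.test(setosa, versicolor)

# p-value of the test, via the dollar sign
p_value <- t_test$p.value

# Confidence interval for the difference in means, via double square brackets
conf_int <- t_test[["conf.int"]]

# Group means that were compared
group_means <- t_test$estimate
print(group_means)

# Reuse the p-value in a further calculation
is_significant <- p_value < 0.05
print(is_significant)  # TRUE
```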
Exercise 0.8
Go back to your results in exercise 0.7. Are the results statistically significant?
Show how strongly infant mortality and education level are correlated, then do the same for agricultural activity and education level
cor(swiss$Infant.Mortality, swiss$Education) # -0.09932185 low correlation
cor(swiss$Agriculture, swiss$Education) # -0.639522, moderate negative correlation -> the higher the education score, the lower the agricultural activity
Using the same thresholds as Exercise 0.7 (10 for education, 50 for agriculture), test whether there are statistically significant differences between:
The low and highly educated regions in infant mortality
#difference in mean between high and low education areas
low_education <- swiss[swiss$Education <= 10,]
high_education <- swiss[swiss$Education > 10,]
res_edu <- t.test(low_education$Infant.Mortality, high_education$Infant.Mortality)
res_edu$p.value # = 0.44, not significant
The regions with low and high agricultural activity and the education score
low_agriculture <- swiss[swiss$Agriculture <= 50,]
high_agriculture <- swiss[swiss$Agriculture > 50,]
res_agri <- t.test(low_agriculture$Education, high_agriculture$Education)
res_agri$p.value # = 0.001394277, significant difference