Functions¶
R does more than simple calculations and importing or viewing data. Much of its power comes from functions. R offers a wide selection of functions: some are built in, and others become available when you install and load packages.
Basic functions in R¶
A function takes input arguments. Some are required, such as the data you want to run the function on, and some are optional, such as the choice of method or additional parameters. Because most optional arguments already have a default value, it can be hard to tell at a glance how many arguments a function has. We will start with a very simple function, mean.
First, if we want to understand a function, we read its help file.
# Get help
?mean
This opens the documentation for the function. The Description section explains what the function does. The Usage section shows how to call the function in your script or the console, including any default values set for its arguments. The Arguments section goes through the arguments and explains each of them. In our example, the only required argument, x, is the object we want to apply the function to. The Value section explains what the function returns. At the very bottom of the documentation you can find examples of how to use the function. If we don’t know whether a function exists, we can use the double question mark to search the documentation for keywords.
# Search keywords
??substring
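Alongside the help page, a function's arguments and their defaults can be checked directly from the console. The following is a minimal sketch using the base R functions args and formals; sd (the standard deviation function) is chosen here only as an illustration.

```r
# Print the argument list of sd, including default values
args(sd)

# formals() returns the arguments as a named list
sd_args <- formals(sd)

# Names of all arguments
names(sd_args)

# Default value of the optional argument na.rm
sd_args$na.rm
```

An argument that shows a value after the equals sign in this output is optional; an argument without one usually has to be supplied.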
Now let’s start using the mean function with a vector that contains the numbers from 1 to 10. Arguments can be passed to a function either by position or by name. A function expects its arguments in a specific order, so the first unnamed argument is matched to the first argument in the function's definition. As already discussed, the mean function only needs one input argument, x.
# Find the mean of a vector
# (the vector is called v rather than c, because c is also the name of R's combine function)
v <- 1:10
# Method 1: pass the input by position
mean(v)
# Method 2: pass the input by name
mean(x = v)
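Optional arguments are passed in the same way. As a small sketch (the vector values is made up for illustration), the optional na.rm argument of mean defaults to FALSE, so a missing value carries through to the result unless na.rm is set by name:

```r
# A made-up vector containing one missing value
values <- c(2, 4, NA, 8)

# na.rm keeps its default (FALSE), so the result is NA
default_result <- mean(values)

# Setting the optional argument by name drops the NA before averaging
clean_result <- mean(values, na.rm = TRUE)

# Named arguments can be given in any order
same_result <- mean(na.rm = TRUE, x = values)

print(default_result)
print(clean_result)
print(same_result)
```

Naming optional arguments explicitly tends to make a call easier to read than relying on their position.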
Exercises 1¶
Try calculating the sum of the same vector using the sum function.
Extract the length of the vector using the length function.
Calculate the mean using the results from the first two exercises and compare it to the result from mean. Can you see how using functions reduces the length of your code?
Calculate the median of the vector using the median function.
Functions 2¶
We will use the swiss data set to test the mean function again. First, we will have a look at what this data set contains.
#loading swiss data set
data(swiss)
#view swiss data set
View(swiss)
# Calculate the mean of Fertility
# Method 1: pass the input by position
mean(swiss$Fertility)
# Method 2: pass the input by name
mean(x = swiss$Fertility)
Let’s look at another function called sd, which calculates the standard deviation.
#calculating standard deviation for fertility
sd(swiss$Fertility)
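To see what sd computes, the sample standard deviation can be rebuilt from functions already introduced. This is only an illustrative sketch; the variable names fert, n, manual_sd and builtin_sd are chosen here and are not part of R.

```r
# Fertility values from the built-in swiss data set
fert <- swiss$Fertility
n <- length(fert)

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((fert - mean(fert))^2) / (n - 1))

# The built-in function should give the same value
builtin_sd <- sd(fert)

all.equal(manual_sd, builtin_sd)
```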
You can also find the largest or smallest value in a vector using the max or min function.
# Find the largest/smallest value in a vector
v <- 2:30
max(v) # 30
min(v) # 2
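max and min return the value itself. To find where that value sits, base R also provides which.max and which.min, which return the position of the largest or smallest element. A short sketch on the swiss data set, using the Education column as an arbitrary example:

```r
# Position (row index) of the largest Education value
top_row <- which.max(swiss$Education)

# Row names of swiss hold the province names
rownames(swiss)[top_row]

# Show the complete row for that province
swiss[top_row, ]

# The same idea for the smallest value
bottom_row <- which.min(swiss$Education)
rownames(swiss)[bottom_row]
```

If several elements share the extreme value, both functions return only the first position.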
Exercises 2¶
Explore the swiss data set. The following questions can guide you:
How Catholic is the region with the highest fertility?
Is there a difference in infant mortality between low-education and high-education areas? (hint: define high as > 10 and low as <= 10)
Is education higher in regions with lower agriculture? (hint: use min, max and mean)
Functions and class¶
Many R functions are written so that they behave differently depending on what class of variable they are given. For instance, the summary function gives additional information about a variable, and what it shows depends on the variable’s class.
# summary() adapts its output to the class of its input
x <- 1:10
summary(x)
data(swiss)
summary(swiss)
data(Titanic)
summary(Titanic)
So when a function does something unexpected, check the mode or class of the variables you gave it.
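The class function reports what class an object has. The sketch below checks the three objects used above and then shows a common source of surprises, numbers stored as text; the vector as_text is made up for illustration.

```r
x <- 1:10
class(x)        # class of an integer vector
class(swiss)    # class of the swiss data set
class(Titanic)  # class of the Titanic data set

# Numbers stored as text (a made-up example)
as_text <- c("1", "2", "3")
class(as_text)

# summary() treats text differently from numbers
summary(as_text)
summary(as.numeric(as_text))
```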
Introduction to statistical functions in R¶
R also provides a large range of statistical functions. A commonly used one is the correlation function cor. Again, have a look at the documentation to learn what the input arguments for this function need to be.
#look at documentation
?cor
The documentation tells us that we need at least one argument, x (if x is a single vector, a second argument y is needed as well). The default correlation method is "pearson". Let’s say we want to investigate whether Fertility and Catholic are correlated.
cor(swiss$Fertility, swiss$Catholic)
For two vectors, the function returns a single correlation coefficient. Your inputs do not have to be vectors; you can also pass an entire matrix or data frame, in which case the result is a correlation matrix.
#correlation between the entire swiss data frame and fertility
cor(swiss, swiss$Fertility)
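When only a data frame is supplied, cor returns the correlation between every pair of columns. A brief sketch; the names one_column and all_pairs are chosen here for illustration.

```r
# Every column of swiss against Fertility: one column of coefficients
one_column <- cor(swiss, swiss$Fertility)
dim(one_column)

# Every column of swiss against every other column
all_pairs <- cor(swiss)
dim(all_pairs)

# Rounded for easier reading
round(all_pairs, 2)

# A single pair can be picked out by row and column name
all_pairs["Fertility", "Catholic"]
```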
Next, we will change the correlation method (check out the documentation again to see which ones you can pick from).
#change method
cor(swiss$Fertility, swiss$Catholic, method = "spearman")
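To compare the available methods side by side, the same call can be repeated for each method name. This is a sketch rather than a recommended workflow; cor_methods and coefs are names chosen here.

```r
# The three methods listed in the cor documentation
cor_methods <- c("pearson", "spearman", "kendall")

# Run cor once per method and collect the results in a named vector
coefs <- sapply(cor_methods, function(m) {
  cor(swiss$Fertility, swiss$Catholic, method = m)
})

coefs
```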
You can also use R for significance testing. A large number of statistical tests are available. We will only look at the t.test function at this point. Have a look at the iris data set.
#load iris data set
data(iris)
#iris data set
View(iris)
We now want to see if there is a significant difference in petal length between the two species setosa and versicolor. By default, the t.test function performs a “Welch Two Sample t-test”, which does not assume equal variances in the two groups.
#calculate t test
t.test(iris[iris$Species == "setosa",]$Petal.Length, iris[iris$Species == "versicolor",]$Petal.Length)
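The Welch test is only the default. As a hedged sketch, setting the optional var.equal argument to TRUE switches t.test to the classical two-sample t-test with a pooled variance; the variable names setosa, versicolor, welch and student are chosen here for illustration.

```r
# Petal lengths of the two species
setosa <- iris$Petal.Length[iris$Species == "setosa"]
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]

# Default: Welch test (unequal variances allowed)
welch <- t.test(setosa, versicolor)

# Optional argument var.equal = TRUE: pooled-variance t-test
student <- t.test(setosa, versicolor, var.equal = TRUE)

# Compare the method names and the degrees of freedom
welch$method
student$method
welch$parameter
student$parameter
```

Which variant is appropriate depends on whether equal variances are a reasonable assumption for the data.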
Calling t.test without assigning the result prints a summary of the test in your console. If you are planning to use the output for further calculations or simulations, it makes sense to store the result in a variable.
#calculate t test and save in variable t_test
t_test <- t.test(iris[iris$Species == "setosa",]$Petal.Length, iris[iris$Species == "versicolor",]$Petal.Length)
The output is now stored in a list called t_test. You can access the different quantities using the dollar sign or double square brackets. For example, we can extract the t-statistic from our calculation.
# Get the t-statistic
t_test$statistic
t_test[["statistic"]]
To get an overview of all quantities provided by the function you can use the names function.
#overview over all quantities
names(t_test)
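Other quantities can be pulled out in the same way. The sketch below recomputes the test so that it runs on its own, then extracts the p-value, the confidence interval and the group means. The 0.05 threshold is a common convention assumed here, not something fixed by t.test.

```r
# Recompute the test so this example is self-contained
setosa <- iris$Petal.Length[iris$Species == "setosa"]
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]
t_test <- t.test(setosa, versicolor)

# p-value of the test
t_test$p.value

# Confidence interval for the difference in means
t_test$conf.int

# Estimated mean of each group
t_test$estimate

# Compare the p-value with an assumed significance level of 0.05
is_significant <- t_test$p.value < 0.05
is_significant
```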
Exercise 3¶
Go back to your results in exercise 2. Are the results statistically significant?