Session 1.6. Introduction to R with RStudio
Index:
Working directory
Basic functions for vectors
Combine vectors and create matrices
Types of data within R
Types of objects within R
Select data
Factors
Working directory
Where are we? The getwd() function gives us our current working directory:
getwd()
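We can change the current working directory through ‘File/Change working directory’ or with the setwd() function. Within R strings the backslash is an escape character, so Windows paths are written with forward slashes (or doubled backslashes):
setwd("S:/home/biolcourse-XX")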
With the list.files() or dir() function we list the files present in the working directory:
list.files()
dir()
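When we change the working directory inside a script, it is good practice to remember the previous one and go back to it afterwards. A minimal sketch, assuming we move to R's temporary directory (tempdir()) only for illustration; the name old.wd is ours:

```r
old.wd <- getwd()     # remember the current working directory
setwd(tempdir())      # move to the session's temporary directory (illustration only)
getwd()               # we are now somewhere else
list.files()          # files present in the new working directory
setwd(old.wd)         # go back to where we started
getwd()
```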
Basic functions for vectors
Probably the most basic function within R is c(). It combines values into a vector. Let’s create a vector by combining 4 numeric values:
c(1,3,7,24)
We do the same but store the result in an object called x:
x<-c(1,3,7,24)
x
We can also combine text. Text must always be quoted:
y<-c("a","b","c","d")
y
Both vectors, x and y, are stored in memory (RAM). We can list all the objects in memory with the ls() function:
ls()
We can remove objects from the memory. Let’s remove the y object:
rm(y)
ls()
We can also enter numeric sequences. We can do it in three ways:
1:5
c(1,2,3,4,5)
seq(1,5)
How can we know how seq() works? The help page for a function can be accessed through ?function_name:
?seq
Let’s exploit the by and length.out parameters:
seq(1,20)
seq(1,20,by=2)
seq(1,20,length.out=40)
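The difference between the two parameters is easier to see if we check the length of each result: by fixes the step and lets R decide how many values fit, while length.out fixes the number of values and lets R compute the step. A short sketch (the object names by.two and forty are only illustrative):

```r
by.two <- seq(1, 20, by = 2)          # fixed step of 2
by.two                                # stops at 19, the last value not exceeding 20
length(by.two)                        # 10 values

forty <- seq(1, 20, length.out = 40)  # fixed number of values
length(forty)                         # 40 values
range(forty)                          # still goes from 1 to 20

seq(10, 1, by = -3)                   # a negative step gives a decreasing sequence
```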
We can combine functions, for example c() and seq():
c(1,seq(4,10),12)
Something less abstract: let’s enter the ages of some of us:
ages<-c(18,25,22,23,21,19)
ages
Vectors can have a name for each value they contain. Let’s set the vector’s names:
names(ages)
names(ages)<-c("Albert","Andrea","Ines","Rolf","Sylvia","Student X")
ages
We can get the mean, standard deviation and other metrics:
sum(ages) # sum
mean(ages) # mean
sd(ages) # standard deviation
min(ages) # minimum
max(ages) # maximum
range(ages) # range
length(ages) # length (number of values)
sort(ages) # sort values
sort(ages,decreasing=TRUE) # sort in decreasing order
We can store these metrics as objects and concatenate them with c():
mean.ages<-mean(ages)
mean.ages
ls()
summary.ages<-c(mean(ages),sd(ages),min(ages),max(ages),length(ages))
names(summary.ages)<-c("Mean","sd","minimum","maximum","Nº obs.")
summary.ages
What if there are unknown values?
ages<-c(18,25,22,23,21,19,NA)
ages
names(ages)
names(ages)<-c("Albert","Andrea","Ines","Rolf","Sylvia","Student X")
We compute the mean again:
mean(ages)
The result is NA, so something is wrong. Let’s ask for help:
?mean
The mean() function has a parameter, na.rm, that controls the treatment of NAs. Its default value is FALSE:
mean(ages)
mean(ages,na.rm=FALSE)
mean(ages,na.rm=TRUE)
Generally, it is recommended to use NAs when we have unknown values. However, we can get rid of them, which may be useful for some analyses:
ages<-na.exclude(ages)
ages
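Before removing NAs it is usually worth checking how many there are and where. A minimal sketch using is.na(); the object names ages.na and ages.clean are ours, chosen so that the ages object above is not overwritten:

```r
ages.na <- c(18, 25, 22, 23, 21, 19, NA)

is.na(ages.na)                  # TRUE marks the unknown value
sum(is.na(ages.na))             # how many NAs there are
which(is.na(ages.na))           # in which position

mean(ages.na, na.rm = TRUE)     # mean of the 6 known values
sd(ages.na, na.rm = TRUE)       # many other functions accept na.rm too

ages.clean <- ages.na[!is.na(ages.na)]  # keep only the known values
length(ages.clean)
```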
Combine vectors and create matrices
Now we create an object with some height values:
heigths<-c(1.75,1.80,1.63,1.82,2.00,1.72)
names(heigths)<-names(ages)
heigths
Combine ages and heights with the rbind() (by rows) or cbind() (by columns) functions:
res<-rbind(ages,heigths)
res
res<-cbind(ages,heigths)
res
How can we flip (transpose) it? With the t() function:
t(res)
Now we are working with a matrix (not a vector). Thus, new possibilities appear:
dim(res) # dimensions
ncol(res) # number of columns
nrow(res) # number of rows
dimnames(res) # names of rows and columns
rownames(res) # names of rows
colnames(res) # names of columns
We can add the values by rows or columns:
colSums(res)
rowSums(res)
Or we can apply any function by rows or by columns:
apply(res,2,sum) # identical to colSums()
apply(res,1,sum) # identical to rowSums()
apply(res,2,mean)
apply(res,2,sd)
apply(res,2,min)
apply(res,2,max)
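The function passed to apply() does not have to be a built-in one. A small sketch, assuming the same ages and heights as above (re-entered here so the example runs on its own), that applies a function written on the spot and a function that returns more than one value:

```r
ages <- c(18, 25, 22, 23, 21, 19)
heigths <- c(1.75, 1.80, 1.63, 1.82, 2.00, 1.72)
res <- cbind(ages, heigths)

# our own function: difference between the maximum and the minimum of each column
spread <- apply(res, 2, function(x) max(x) - min(x))
spread

# range() returns two values, so the result is a matrix:
# first row the minimum, second row the maximum of each column
limits <- apply(res, 2, range)
limits
```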
How can we create a matrix? With the matrix() function. A matrix, within R, is in fact a vector with a dimensions attribute. So to create a matrix we need to enter a vector and set the desired number of rows or columns. By default, matrices are filled by columns.
We enter the same values as before:
res2<-matrix(c(18,25,22,23,21,19,1.75,1.80,1.63,1.82,2.00,1.72),ncol=2)
Now we set the names of both rows and columns:
colnames(res2)<-c("ages","heigths")
rownames(res2)<-c("Albert","Andrea","Ines","Rolf","Sylvia","Student X")
res
res2
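Two points from the explanation above can be checked directly: the default filling by columns (and how byrow changes it), and the fact that a matrix is just a vector with a dimensions attribute. A short sketch with illustrative object names:

```r
m.col <- matrix(1:6, ncol = 2)                # default: filled by columns
m.col
m.row <- matrix(1:6, ncol = 2, byrow = TRUE)  # filled by rows instead
m.row

m.col[1, ]   # first row when filled by columns
m.row[1, ]   # first row when filled by rows

# a plain vector becomes a matrix as soon as we give it dimensions
v <- 1:6
dim(v) <- c(3, 2)   # 3 rows, 2 columns
v
class(v)
```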
Types of data within R
There are 5 main data types (atomic classes) in R:
Logical: TRUE, FALSE
Numeric: 1, 4.5, 122, etc.
Integer: 1L, 5L, 122L, etc. (the L suffix marks an integer literal; a plain 1 is numeric)
Complex: 1+0i, 2+4i, etc.
Character: "a", "b", "hello", etc.
Let’s create some different data types and explore how to detect the data type.
An integer vector:
v1<-seq(1,20)
v1
class(v1)
A numeric vector:
v2<-seq(1,20,by=0.5)
v2
class(v2)
A character vector:
v3<-c("a","b","c")
v3
class(v3)
A logical vector:
v4<-v1==5
v4
class(v4)
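The same check works on single values, which makes the difference between numeric and integer easier to see. A quick sketch:

```r
class(1)         # "numeric": a number typed without suffix
class(1L)        # "integer": the L suffix asks for an integer
typeof(1)        # "double": how numeric values are stored internally
class(1:3)       # "integer": the colon operator produces integers here
class(2 + 4i)    # "complex"
class("hello")   # "character"
class(TRUE)      # "logical"

is.numeric(1L)   # TRUE: integers also count as numbers
```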
Types of objects within R
The most common objects in R are of 5 types:
Vectors: a one-dimensional sequence of values of the same class.
They are the basic unit from which the rest of the objects are built.
All the elements within a vector have to be of the same type.
Factors: used for representing categorical data. They may be understood as integer vectors in which each integer has an associated label.
They may be ordered or not.
They are key for most statistical tests.
Matrices: formally, a vector with two dimensions. In practice, a 2D array of data of the same type.
Lists: formally, a vector whose elements may be of different classes. That is, an object composed of other objects.
The results from statistical tests are usually lists.
Data frames: formally, lists composed of objects of the same length.
They look like matrices, but their columns can be of different classes (numeric, character, factor, etc.).
They are the main object used for statistics within R: each row corresponds to an observational unit (sample, individual, etc.) and each column is a measured variable of these units.
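A few lines are enough to see these definitions at work. A minimal sketch with made-up objects (mixed, person and people are illustrative names, and the likes.R element is invented for the example):

```r
# a vector holds a single type: mixing types converts everything to character
mixed <- c(1, "a", TRUE)
mixed
class(mixed)

# a list can hold elements of different classes
person <- list(name = "Ines", age = 22, likes.R = TRUE)
class(person)
person$age          # elements are extracted by name with $

# a data frame is a list of columns of the same length
people <- data.frame(name = c("Ines", "Rolf"), age = c(22, 23))
is.list(people)     # TRUE
dim(people)         # 2 rows, 2 columns
```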
Let’s see some examples with a more realistic dataset. We are going to use the BP dataset. This dataset contains the records of 100 adults from a small cross-sectional survey in 2001 investigating blood pressure and its determinants in a community. It is a data frame with 6 variables:
id: identifier of each individual
sex: male/female
sbp: systolic blood pressure
dbp: diastolic blood pressure
saltadd: whether salt was added to diet
birthdate: date of birth
load("S:/masterdata/BP.RData") # forward slashes: a backslash would start an escape sequence
Let’s explore it:
head(BP)
dim(BP)
BP[1:10,]
What kind of object is the BP dataset?
class(BP)
What kind of data are the first two variables of the dataset?
class(BP[,1])
class(BP[,2])
We perform a regression between the sbp and dbp variables:
reg1<-lm(BP$sbp~BP$dbp)
summary(reg1)
We save the result:
result<-summary(reg1)
class(result)
str(result)
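Since the summary of a regression is a list, its elements can be extracted by name. A minimal sketch, assuming the built-in cars dataset (speed and stopping distance) as a stand-in so that the example runs without the BP file; reg.cars and res.cars are illustrative names:

```r
reg.cars <- lm(dist ~ speed, data = cars)  # same kind of model, different data
res.cars <- summary(reg.cars)

is.list(res.cars)        # TRUE: the summary is stored as a list
names(res.cars)          # the elements it contains
res.cars$r.squared       # extract a single element with $
res.cars$coefficients    # table with estimates, standard errors, t and p values
coef(reg.cars)           # intercept and slope of the fitted line
```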
Select data
Vectors
Select by position:
heigths[1] # first value
heigths[3] # third value
heigths[1:3] # first to third value
heigths[c(1,3)] # first and third value
heigths[-3] # all values except the third
heigths[-c(4,5)] # all except 4th and 5th
Selection by criteria:
good.values<-which(heigths>1.75)
heigths[good.values]
good.values<-which(heigths<1.90)
heigths[good.values]
good.values<-which(heigths==1.80)
heigths[good.values]
heigths[which(heigths==1.80)]
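The condition inside which() is itself a logical vector, and it can be used directly between the brackets. Conditions can also be combined, and named vectors can be indexed by name. A short sketch with the same heights, re-entered here so that it runs on its own:

```r
heigths <- c(1.75, 1.80, 1.63, 1.82, 2.00, 1.72)
names(heigths) <- c("Albert", "Andrea", "Ines", "Rolf", "Sylvia", "Student X")

heigths > 1.75                            # one TRUE/FALSE per value
heigths[heigths > 1.75]                   # same selection as with which()
heigths[heigths > 1.75 & heigths < 1.90]  # both conditions must hold
heigths["Ines"]                           # select by name
heigths[c("Ines", "Rolf")]                # several names at once
```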
Matrices
Select by position:
res[1,1] # first row, first column
res[1,2] # first row, second column
res[1,] # first row, all columns
res[,2] # all rows, second column
res[1:3,] # rows first to third, all columns
res[c(1,2,4),] # rows 1, 2 and 4, all columns
res[-1,] # all except the first row
Selection by criteria:
good.values<-which(res[,1]>20)
res[good.values,]
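Matrices with row and column names can also be indexed by name. Note as well that selecting a single row or column returns a plain vector unless we ask otherwise with drop=FALSE. A minimal sketch that rebuilds the matrix so it runs on its own:

```r
ages <- c(18, 25, 22, 23, 21, 19)
heigths <- c(1.75, 1.80, 1.63, 1.82, 2.00, 1.72)
res <- cbind(ages, heigths)
rownames(res) <- c("Albert", "Andrea", "Ines", "Rolf", "Sylvia", "Student X")

res["Ines", "heigths"]       # one cell, selected by names
res[res[, "ages"] > 20, ]    # rows that meet a condition, all columns

res[, 2]                     # a single column becomes a vector
res[, 2, drop = FALSE]       # drop=FALSE keeps it as a one-column matrix
```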
Data frames
The selection of data from data frames works as for matrices. In addition, we can select columns by name through data_frame$column_name:
BP$sbp
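The same ideas can be tried without the BP file. A small sketch, assuming a hypothetical data frame called people built from the ages and heights used earlier:

```r
people <- data.frame(ages = c(18, 25, 22, 23, 21, 19),
                     heigths = c(1.75, 1.80, 1.63, 1.82, 2.00, 1.72))

people$ages             # a column by name
people[["ages"]]        # the same column, list style
people[, "ages"]        # the same column, matrix style
people[1:3, ]           # rows by position, all columns

older <- people[people$ages > 20, ]   # rows by criteria
older
people[people$ages > 20, "heigths"]   # criteria on one column, values from another
```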
Factors
Let’s create a factor from a character vector:
f1<-factor(c("T1","T1","T2","T2","T2"))
f1
Let’s create a factor from a numeric vector:
f2<-factor(c(1,1,2,2,2),labels=c("T1","T2"))
f2
Some factors have a clear ordering, but by default the levels are sorted alphabetically:
f3<-factor(c("high","high", "medium","low", "low"))
f3
How can we order it? By setting the levels explicitly, from lowest to highest, together with ordered=TRUE:
f3.ord<-factor(f3, levels=c("low","medium","high"), ordered=TRUE)
f3.ord
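Once a factor is ordered, its values can be compared with one another, which is not possible with an unordered factor. A brief sketch that rebuilds the factor so it runs on its own:

```r
f3 <- factor(c("high", "high", "medium", "low", "low"))
levels(f3)                  # alphabetical: "high" "low" "medium"

f3.ord <- factor(f3, levels = c("low", "medium", "high"), ordered = TRUE)
levels(f3.ord)              # now "low" "medium" "high"

f3.ord >= "medium"          # comparisons respect the order of the levels
as.integer(f3.ord)          # the integer codes behind the labels
table(f3.ord)               # counts are reported in the order of the levels
```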
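Change the names of the levels. The new names are assigned by position, so they must follow the current order of the levels (low, medium, high):
levels(f3.ord)
levels(f3.ord)<-c("l","m","h")
f3.ord
Levels may be merged by giving them the same name. Here we merge medium into low:
levels(f3.ord)<-c("l","l","h")
f3.ord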
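Renaming levels by position is easy to get wrong, because the new names silently follow the current order of the levels. A possible alternative is to assign a named list, where each new name is paired with the old level (or levels) it replaces. A minimal sketch with illustrative objects f.short and f.merged:

```r
f.short <- factor(c("high", "high", "medium", "low", "low"),
                  levels = c("low", "medium", "high"))
# new name = old level
levels(f.short) <- list(l = "low", m = "medium", h = "high")
f.short

f.merged <- factor(c("high", "high", "medium", "low", "low"),
                   levels = c("low", "medium", "high"))
# several old levels under one new name are merged
levels(f.merged) <- list(low.or.medium = c("low", "medium"), high = "high")
f.merged
table(f.merged)
```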
Create a data frame with two factors:
gender<-factor(c("M","M","M","M","F","F","F","F"))
employment<-factor(c("yes","yes","no","no","yes","yes","no","no"))
gender
employment
some.data<-data.frame(gender,employment)
some.data
Combine factors with interaction():
interaction(some.data$gender,some.data$employment)
A nicer separator:
interaction(some.data$gender,some.data$employment,sep=" - ")
Add it to the data frame:
some.data$merged<-interaction(some.data$gender,some.data$employment,sep=" - ")
some.data
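A quick way to check the combined factor is to count how many observations fall in each combination. A short sketch using table(), with the two factors re-entered so that it runs on its own (merged is an illustrative name):

```r
gender <- factor(c("M", "M", "M", "M", "F", "F", "F", "F"))
employment <- factor(c("yes", "yes", "no", "no", "yes", "yes", "no", "no"))

merged <- interaction(gender, employment, sep = " - ")
levels(merged)              # one level per combination of the two factors
nlevels(merged)             # 2 x 2 = 4 combinations

table(merged)               # observations per combination
table(gender, employment)   # the same counts as a two-way table
```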
Further resources
Manuals:
Cheat sheets: