Session 1.6. Introduction to R with RStudio#

Index:#

  • Working directory

  • Basic functions for vectors

  • Combine vectors and create matrices

  • Types of data within R

  • Types of objects within R

  • Select data

  • Factors

Working directory#

Where are we? The getwd() function gives us our current working directory:

getwd()

We can change the current working directory by ‘File/Change working directory’ or with the setwd():

setwd("S:\home\biolcourse-XX")

With list.files() or dir() function we list the files present in the working directory:

list.files()
dir()

Basic functions for vectors#

The most basic function within R probably is the c() function. It combines values into a vector.Let’s create a vector by combining 4 numeric values:

c(1,3,7,24)

We do the same but store the result in an object called x:

x<-c(1,3,7,24)
x

We can also combine text. Text always goes quoted:

y<-c("a","b","c","d")
y

Both x and y vectors are stored in the RAM memory. We can list all the objects in the memory with the ls() function:

ls()

We can remove objects from the memory. Let’s remove the y object:

rm(y)
ls()

We can also enter numeric sequences. We can do it in three ways:

1:5
c(1,2,3,4,5)
seq(1,5)

How can we know how the seq() works? The help page for a function can be accessed throuh ?funcion_name:

?seq

Let’s exploit the by and length.out parameters:

seq(1,20)
seq(1,20,by=2)
seq(1,20,length.out=40)

We can combine functions. For example c() and seq():

c(1,seq(4,10),12)

Something less abstract: lets introduce the age of some of us:

ages<-c(18,25,22,23,21,19)
ages

Vector can have names for each value they contain. Lets set the vector’s names:

names(ages)
names(ages)<-c("Albert","Andrea","Ines","Rolf","Sylvia","Student X")
ages

We can get the mean, standard deviation and other metrics:

sum(ages) # sum
mean(ages) # mean
sd(ages) # standard deviation
min(ages) # minimum
max(ages) # maximum
range(ages) # range
length(ages) # longitude
sort(ages) # sort values
sort(ages,decreasing=T) # sort decreasing

We can store these metrics as objects and concatenate them with c():

mean.ages<-mean(ages)
mean.ages
ls()

summary.ages<-c(mean(ages),sd(ages),min(ages),max(ages),length(ages))
names(summary.ages)<-c("Mean","sd","minimum","maximum","Nº obs.")
summary.ages

What if there are unknown values?

ages<-c(18,25,22,23,21,19,NA)
ages
names(ages)
names(ages)<-c("Albert","Andrea","Ines","Rolf","Sylvia","Student X")

We compute the mean again:

mean(ages)

Well, something is wrong. Let’s ask for help:

?mean

The mean() has a parameter that controls the treatment of NAs.

mean(ages)
mean(ages,na.rm=F)
mean(ages,na.rm=T)

Generally, it is recommended to use NAs when we have unknown values. However we can get rid of them, what may be useful for some analyses:

ages<-na.exclude(ages)
ages

Combine vectors and create matrices#

Now we create an object with some height values:

heigths<-c(1.75,1.80,1.63,1.82,2.00,1.72)
names(heigths)<-names(ages)
heigths

Combine ages and heights with cbind() or rbind() functions:

res<-rbind(ages,heigths)
res
res<-cbind(ages,heigths)
res

How can we flip it?

t(res)

Now we are workin with a matrix (no a vector). Thus, new possibilities appear:

dim(res) # dimensions
ncol(res) # no of columns
nrow(res) #no of rows

dimnames(res) # names
rownames(res) # name of rows
colnames(res) # name of columns

We can add the values by rows or columns:

colSums(res)
rowSums(res)

Or we can apply any function by rows or by columns:

apply(res,2,sum) # identical to colSums()
apply(res,1,sum) # identical rowSums()
apply(res,2,mean)
apply(res,2,sd)
apply(res,2,min)
apply(res,2,max)

Have can we create a matrix? With the matrix() function. A matrix, within R, is in fact a vector with a dimensions attribute. So to create a matrix we need to enter a vector and set the desired number of rows or columns. By default, the matrices are filled up by columns

We enter the same values:

res2<-matrix(c(18,25,22,23,21,19,1.75,1.80,1.63,1.82,2.00,1.72),ncol=2)

Now we set the names of both rows and columns:

colnames(res2)<-c("ages","heigths")
rownames(res2)<-c("Albert","Andrea","Ines","Rolf","Sylvia","Student X")

res
res2

Types of data within R#

There exist 5 main data types in R (or atomic classes):

  • Logical: TRUE, FALSE

  • Numerical: 1, 4.5, 122, etc.

  • Integer: 1, 5, 122, etc.

  • Complex: 1+0i, 2+4i, etc.

  • Character: “a”, “b”, “hello”, etc.

Let’s create some different data types and explore how to detect the data type.

An integer vector:

v1<-seq(1,20)
v1
class(v1)

A numeric vector:

v2<-seq(1,20,by=0.5)
v2
class(v2)

A character vector:

v3<-c("a","b","c")
v3
class(v3)

A logical vector:

v4<-v1==5
v4
class(v4)

Types of objects within R#

The most usual objects in R are of 5 types:

Vectors: the concatenation of one dimension of data of the same class.

  • They are the minimal unit with which compose the rest of the objects.

  • All the elements within it have to be of the same type.

Factors: used for representing categorical data. They may be understood as integer vectors for which each integer has a label associated.

  • They may be ordered or not.

  • They are key for most statistical tests.

Matrices: formally, a vector with two dimensions. In practice, a 2D array of data of the same type.

Lists: formally is a vector of elements of different class. That is, an object which is composed of objects.

  • The results from statistical tests are usually lists.

Data frames: formally lists composed of objects with the same length.

  • Apparently they are matrices, but they can contain objects of different class (numeric, characters, factors, etc.)

  • They are the main object used for statistics within R: each file corresponds to an observational unit (sample, individual, etc.) and each column is a measured variable of these units.

Let’s se some examples with a more realistic dataset. We are going to use the BP dataset. This dataset contains information on the records of 100 adults from a small cross-sectional survey in 2001 investigating blood pressure and its determinants in a community. It is a data frame with 6 variables:

  • id: identifier of each individual

  • sex: male/female

  • sbp: systolic blood pressure

  • dbp: diastolic blood presuer

  • saltadd: whether salt was added to diet

  • birthdate: date of birth

load("S:\masterdata\BP.RData")

Let’s explore it:

head(BP)
dim(BP)
BP[1:10,]

What kind of object is the BP dataset?

class(BP)

What kind of data are the first two variables of the dataset?

class(BP[,1])
class(BP[,2])

We perform a regression between the sbp and dbp variables:

reg1<-lm(BP$sbp~BP$dbp)
summary(reg1)

We save the result:

result<-summary(reg1)
class(result)
str(result)

Select Data#

Vectors#

Select by position:

heigths[1] # first value
heigths[3] # third value
heigths[1:3] # first to third value
heigths[c(1,3)] # first and third value
heigths[-3] # all values excepth the third
heigths[-c(4,5)] # all except 4th and 5th

Selection by criteria:

good.values<-which(heigths>1.75)
heigths[good.values]

good.values<-which(heigths<1.90)
heigths[good.values]

good.values<-which(heigths==1.80)
heigths[good.values]

heigths[which(heigths==1.80)]

Matrices#

Select by position:

res[1,1] # first row, first column
res[1,2] # first row, second column
res[1,] # first row, all columns
res[,2] # all raws, second column
res[1:3,] # rows first to third, all columns
res[c(1,2,4),] # rows 1, 2 and 4, all comuns
res[-1,] # all except the first row

Selection by criteria:

good.values<-which(res[,1]>20)
res[good.values,]

Data frames#

The selection of data from data frames works as the matrices. However we can select the columns by name through data_frame$column_name:

BP$sbp

Factors#

Let’s create a factor from a character vector:

f1<-factor(c("T1","T1","T2","T2","T2"))
f1

Let’s create a factor from a numeric vector:

f2<-factor(c(1,1,2,2,2),label=c("T1","T2"))
f2

There are factors that has a clear ordering:

f3<-factor(c("high","high", "medium","low", "low"))
f3

How can we order it?

f3.ord<-factor(f3, levels=c(c("low","medium","high")), ordered=T)
f3.ord

Change the name of levels:

levels(f3.ord)
levels(f3.ord)<-c("h","m","l")
f3.ord

Levels may be merged by changing properly their names:

levels(f3.ord)<-c("h","l","l")
f3.ord

Create a dta frame with two factors:

gender<-factor(c("M","M","M","M","F","F","F","F"))
employment<-factor(c("yes","yes","no","no","yes","yes","no","no"))

gender
employment

some.data<-data.frame(gender,employment)
some.data

Combine factors with interaction():

interaction(some.data$gender,some.data$employment)

A nicer separator:

interaction(some.data$gender,some.data$employment,sep=" - ")

Add it to the data frame:

some.data$merged<-interaction(some.data$gender,some.data$employment,sep=" - ")
some.data

Further resources#

Manuals:

Cheat sheets: