In this introduction to R, the reader will master the basics of this beautiful open source language with hands-on experience. With over 2 million users worldwide, R is rapidly becoming the leading programming language in statistics and data science. Every year, the number of R users grows by 40%, and an increasing number of organizations are using it in their day-to-day activities.
Download and install R at this link
Download and install RStudio (free version) at this link
Values can be assigned to variables with the operators <-, = or ->.
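A minimal sketch of the three assignment forms; the variable names a, b and d are illustrative and do not come from the original text.

```r
a <- 1      # leftward assignment (the most common style)
b = 2       # equals-sign assignment
3 -> d      # rightward assignment
a + b + d   # sum of the three values
```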
R functions are invoked by their name, followed by parentheses containing zero or more arguments.
Functionality beyond that offered by the core R library is available through R packages. To install an additional package, call the install.packages function.
There are two ways to invoke functions from add-on packages: using the package namespace or loading the package.
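As a hedged illustration of the two styles, the sketch below uses the stats package, which ships with R, so that nothing needs to be installed. The vector x and the result names m1 and m2 are assumptions made for the example; with an add-on package the pattern would be the same once it has been installed.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# 1) package namespace: prefix the function with the package name and ::
m1 <- stats::median(x)

# 2) load the package, then call the function by its bare name
library(stats)
m2 <- median(x)

identical(m1, m2)   # both styles call the same function
```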
Several basic R data types occur frequently in routine R calculations.
Decimal values are called numerics in R. Numeric is the default computational data type. If a decimal value is assigned to a variable x as follows, x will be of numeric type.
## [1] "numeric"
Furthermore, even if an integer is assigned to a variable x, it is still saved as a numeric value.
## [1] FALSE
In order to create an integer variable in R, the as.integer function can be invoked.
## [1] TRUE
Integers can also be declared by appending an L suffix.
## [1] TRUE
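The code that produced the outputs above is not preserved in this text. A minimal sketch that is consistent with them, using assumed example values, could look like this:

```r
x <- 10.5            # a decimal value (assumed example)
class(x)             # "numeric"

x <- 1               # looks like an integer, but is stored as numeric
is.integer(x)        # FALSE

y <- as.integer(1)   # explicit conversion to integer
is.integer(y)        # TRUE

z <- 1L              # the L suffix declares an integer literal
is.integer(z)        # TRUE
```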
Complex numbers are of complex type.
## [1] "complex"
Basic functions which support complex arithmetic are:
## [1] 3
## [1] 4
## [1] 5
## [1] 0.9272952
## [1] 3-4i
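The outputs above are consistent with the complex number 3+4i. Assuming that value, the basic functions for complex arithmetic can be sketched as follows:

```r
z <- 3 + 4i   # assumed example value
class(z)      # "complex"
Re(z)         # real part: 3
Im(z)         # imaginary part: 4
Mod(z)        # modulus: 5
Arg(z)        # argument in radians
Conj(z)       # complex conjugate: 3-4i
```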
A logical value is often created via comparison between variables.
## [1] TRUE
Standard logical operations are & (and), | (or), and ! (not).
## [1] FALSE
## [1] TRUE
## [1] FALSE
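A short sketch of a comparison and the three logical operations, with assumed values chosen to match the outputs shown above:

```r
x <- 1; y <- 2
z <- x < y    # a comparison yields a logical value
z             # TRUE

u <- TRUE; v <- FALSE
u & v         # and: FALSE
u | v         # or:  TRUE
!u            # not: FALSE
```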
A character object is used to represent string values in R. Two character values can be concatenated with the paste function.
## [1] "[email protected]"
However, it is often more convenient to create a readable string with the sprintf function, which has a C language syntax.
## [1] "Sam has 100 dollars"
To replace the first occurrence of the word “little” with the word “big” in a string, the sub function can be applied.
## [1] "Mary has a big lamb."
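The three string functions above can be sketched as follows. The names passed to paste are hypothetical values chosen for illustration; the sprintf and sub calls are written to reproduce the outputs shown above.

```r
first <- "Ada"; last <- "Lovelace"        # hypothetical values
full  <- paste(first, last)               # default separator is a space
glued <- paste(first, last, sep = "")     # no separator

msg <- sprintf("%s has %d dollars", "Sam", 100)       # C-style format string
big <- sub("little", "big", "Mary has a little lamb.")  # first match only

full; glued; msg; big
```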
More functions for string manipulation can be found in the R documentation.
The basic data structure in R is the vector. Vectors are usually created with the c() function, short for combine:
## [1] 1 2 3
Vectors can contain only a single data type. If elements of different types are combined, they are converted to a common type.
## [1] "FALSE" "1" "2"
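A minimal sketch of vector creation and of the conversion that takes place when types are mixed; the input values are assumptions chosen to match the outputs above.

```r
c(1, 2, 3)              # a numeric vector

v <- c(FALSE, 1, "2")   # logical, numeric and character mixed together
v                       # every element has been converted to character
class(v)                # "character"
```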
A matrix is a collection of elements of the same data type arranged in a two-dimensional rectangular layout. Matrices are usually created with the matrix() function:
matrix(data = c(1,2,3,4,5,6), # the data elements
ncol = 3, # number of columns
nrow = 2, # number of rows
byrow = TRUE) # fill matrix by rows
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
# Declaring a named matrix
matrix(data = c(1,2,3,4,5,6), # the data elements
ncol = 3, # number of columns
nrow = 2, # number of rows
byrow = TRUE, # fill matrix by rows
dimnames = list( # list containing names
c('r1','r2'), # rownames
c('c1','c2','c3') # colnames
))
## c1 c2 c3
## r1 1 2 3
## r2 4 5 6
# Generating a named matrix
M <- matrix(data = c(1,2,3,4,5,6), # the data elements
ncol = 3, # number of columns
nrow = 2, # number of rows
byrow = TRUE) # fill matrix by rows
rn <- c('r1','r2') # vector of rownames
cn <- c('c1','c2','c3') # vector of colnames
rownames(M) <- rn # assign rownames
colnames(M) <- cn # assign colnames
M
## c1 c2 c3
## r1 1 2 3
## r2 4 5 6
A data frame is used for storing data tables. It is similar to a matrix, but a data.frame can hold columns of different data types, whereas a matrix can store only a single data type. Data frames are usually created with the data.frame() function. Beware that in R versions before 4.0.0, data.frame() turns strings into factors by default (a factor is a vector that can contain only predefined values, and is used to store categorical data). Use stringsAsFactors = FALSE to suppress this behaviour; since R 4.0.0 this is the default:
v1 <- c(10,20,30) # numeric vector
v2 <- c('a','b','c') # character vector
v3 <- c(TRUE,TRUE,FALSE) # logical vector
data.frame(v1, v2, v3, stringsAsFactors = FALSE) # data.frame
## v1 v2 v3
## 1 10 a TRUE
## 2 20 b TRUE
## 3 30 c FALSE
# Declaring a named data.frame
v1 <- c(10,20,30) # numeric vector
v2 <- c('a','b','c') # character vector
v3 <- c(TRUE,TRUE,FALSE) # logical vector
data.frame('c1' = v1, # column named 'c1'
'c2' = v2, # column named 'c2'
'c3' = v3, # column named 'c3'
row.names = c('r1', 'r2', 'r3'), # vector of rownames
stringsAsFactors = FALSE) # suppress character conversion
## c1 c2 c3
## r1 10 a TRUE
## r2 20 b TRUE
## r3 30 c FALSE
# Generating a named data.frame
v1 <- c(10,20,30) # numeric vector
v2 <- c('a','b','c') # character vector
v3 <- c(TRUE,TRUE,FALSE) # logical vector
rn <- c('r1','r2','r3') # vector of rownames
cn <- c('c1','c2','c3') # vector of colnames
df <- data.frame(v1, v2, v3,stringsAsFactors = FALSE) # data.frame
rownames(df) <- rn # assign rownames
colnames(df) <- cn # assign colnames
df
## c1 c2 c3
## r1 10 a TRUE
## r2 20 b TRUE
## r3 30 c FALSE
A list is a generic structure which can be thought of as an ordered set of objects. Lists are usually created with the list() function:
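A minimal sketch of an unnamed list holding three different kinds of object, using the same elements as the named examples further down:

```r
l <- list(matrix(100),             # a 1x1 matrix
          data.frame(1, 2, 3),     # a data.frame with one row
          c('a', 'b', 'c', 'd'))   # a character vector
l
```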
## [[1]]
## [,1]
## [1,] 100
##
## [[2]]
## X1 X2 X3
## 1 1 2 3
##
## [[3]]
## [1] "a" "b" "c" "d"
# Declaring a named list
list('matrix' = matrix(100), # matrix
'data.frame' = data.frame(1,2,3), # data.frame
'vector' = c('a','b','c','d')) # vector
## $matrix
## [,1]
## [1,] 100
##
## $data.frame
## X1 X2 X3
## 1 1 2 3
##
## $vector
## [1] "a" "b" "c" "d"
# Generating a named list
M <- matrix(100) # matrix
df <- data.frame(1,2,3) # data.frame
v <- c('a','b','c','d') # vector
n <- c('matrix','data.frame','vector') # vector of names
l <- list(M, df, v) # list
names(l) <- n # assign names
l
## $matrix
## [,1]
## [1,] 100
##
## $data.frame
## X1 X2 X3
## 1 1 2 3
##
## $vector
## [1] "a" "b" "c" "d"
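Generally, an environment is similar to a list, with four important exceptions: every name in an environment is unique; the names in an environment are not ordered; an environment has a parent; and environments have reference semantics (they are not copied when modified).
To create an environment manually, use new.env().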
## <environment: 0x0000000012095a70>
Values in a vector are retrieved by using the single square bracket [] operator.
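The original calls are not preserved in this text. The sketch below is one plausible set of subsetting calls on a named vector; the definition of s is an assumption made to match the names and values printed in the outputs.

```r
s <- c(aaa = 'a', bbb = 'b', ccc = 'c', ddd = 'd', eee = 'e')  # named character vector

s[3]                    # by position: the third element
s[-3]                   # negative index: everything except the third element
s[10]                   # out-of-range position: NA
s[c(2, 3, 5, 5)]        # several positions, repeats allowed
s[c('ddd', 'bbb')]      # by name, in the requested order
s[c(FALSE, TRUE, FALSE, TRUE, TRUE)]   # by logical vector
s[s > 'b']              # by a condition on the values
```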
## aaa bbb ccc ddd eee
## "a" "b" "c" "d" "e"
## ccc
## "c"
## aaa bbb ddd eee
## "a" "b" "d" "e"
## <NA>
## NA
## bbb ccc eee eee
## "b" "c" "e" "e"
## bbb ddd eee
## "b" "d" "e"
## ddd bbb
## "d" "b"
## ccc
## "c"
# the logical vector will be recycled if it is shorter than the vector to subset
i <- c(FALSE,TRUE) # -> c(FALSE,TRUE,FALSE,TRUE,FALSE)
s[i]
## bbb ddd
## "b" "d"
## ccc ddd eee
## "c" "d" "e"
Values in a matrix are retrieved by using the single square bracket [] operator.
M <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
rownames(M) <- c('r1','r2','r3')
colnames(M) <- c('c1','c2','c3','c4')
M # print the full matrix
## c1 c2 c3 c4
## r1 1 2 3 4
## r2 5 6 7 8
## r3 9 10 11 12
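The calls behind the outputs that follow are not preserved. As a hedged sketch, these are typical ways to subset the matrix defined above by position, by name and by exclusion:

```r
M <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
rownames(M) <- c('r1', 'r2', 'r3')
colnames(M) <- c('c1', 'c2', 'c3', 'c4')

M[2, 3]                           # single element: row 2, column 3
M[1, ]                            # first row, returned as a named vector
M[, 1]                            # first column, returned as a named vector
M[2:3, ]                          # rows 2 and 3, all columns
M[, c(2, 4)]                      # columns 2 and 4, all rows
M[c('r1', 'r3'), c('c2', 'c4')]   # by row and column names
M[-2, ]                           # drop the second row
```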
## [1] 7
## c1 c2 c3 c4
## 1 2 3 4
## r1 r2 r3
## 1 5 9
## c1 c2 c3 c4
## r2 5 6 7 8
## r3 9 10 11 12
## c2 c4
## r1 2 4
## r2 6 8
## r3 10 12
## c2 c4
## r1 2 4
## r3 10 12
## c1 c2 c3 c4
## r1 1 2 3 4
## r3 9 10 11 12
## c2 c4
## r1 2 4
## r2 6 8
## r3 10 12
## c2 c4
## 10 12
## c1 c2 c3 c4
## 1 2 3 4
# the logical vector will be recycled if it is shorter than the number of rows/columns to subset
i <- c(TRUE,FALSE) # -> c(TRUE,FALSE,TRUE)
M[i,]
## c1 c2 c3 c4
## r1 1 2 3 4
## r3 9 10 11 12
# select the column named 'c4' where 'c3' is less than twice 'c1'
i <- M[,'c3'] < 2*M[,'c1']
M[i,'c4']
## r2 r3
## 8 12
Elements of a data.frame are retrieved by using the single square bracket [] operator, as seen with matrix. In addition, the $ or [[]] operators can be used to retrieve columns.
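A minimal sketch, assuming a data.frame with the age and sex columns that appear in the outputs below:

```r
df <- data.frame(age = c(48, 18, 51), sex = c('M', 'F', 'M'))
df              # print the full data.frame

df$age          # a column via $
df[['age']]     # the same column via [[]]
df[, 'age']     # the same column via matrix-style []
```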
## age sex
## 1 48 M
## 2 18 F
## 3 51 M
## [1] 48 18 51
# retrieve the age of males ("M")
i <- df$sex == "M" # equivalent to df[["sex"]]=="M" or df[,"sex"]=="M"
df$age[i] # equivalent to df[["age"]][i] or df[i,"age"]
## [1] 48 51
A list is subset using the single square bracket [] operator.
l <- list(
'data' = data.frame('age' = c(48,18,51), 'sex' = c('M','F','M')),
'letters' = c('a','b','c'),
'extra' = c(1:5)
)
l # print full list
## $data
## age sex
## 1 48 M
## 2 18 F
## 3 51 M
##
## $letters
## [1] "a" "b" "c"
##
## $extra
## [1] 1 2 3 4 5
## $data
## age sex
## 1 48 M
## 2 18 F
## 3 51 M
##
## $extra
## [1] 1 2 3 4 5
## $extra
## [1] 1 2 3 4 5
##
## $letters
## [1] "a" "b" "c"
## $data
## age sex
## 1 48 M
## 2 18 F
## 3 51 M
##
## $letters
## [1] "a" "b" "c"
Objects in a list are retrieved by using the operator [[]] or $.
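The difference between subsetting a list with [] and retrieving an object with [[]] or $ can be sketched as follows. The list mirrors the one defined earlier; the name part is an assumption used only in this example.

```r
l <- list(
  data    = data.frame(age = c(48, 18, 51), sex = c('M', 'F', 'M')),
  letters = c('a', 'b', 'c'),
  extra   = 1:5
)

part <- l[c(1, 3)]   # [] returns a list containing the selected elements
class(part)          # "list"

l[['letters']]       # [[]] returns the object itself
l$data               # $ is shorthand for [[]] with a name
```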
## [1] "a" "b" "c"
## age sex
## 1 48 M
## 2 18 F
## 3 51 M
An environment is not subsettable, i.e. the [] operator cannot be used. Objects in an environment are retrieved by using the operator [[]], $ or the function get().
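A minimal sketch of the three retrieval methods, assuming an environment e that holds a single object named a:

```r
e <- new.env()         # create an empty environment
e$a <- 1               # assign 1 to the name "a" inside e

e[['a']]               # retrieve with [[]]
e$a                    # retrieve with $
get('a', envir = e)    # retrieve with get()
```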
## [1] 1
## [1] 1
## [1] 1
Remember that an environment is similar to a list, but has reference semantics.
x <- list() # using a list
x$a <- 1 # assign 1 to the element "a" in x
y <- x # COPY x to y
x$a <- 2 # assign 2 to the element "a" in x
y$a # what happens to the element "a" in y?
## [1] 1
x <- new.env() # using an environment
x$a <- 1 # assign 1 to the element "a" in x
y <- x # REFERENCE x to y
x$a <- 2 # assign 2 to the element "a" in x
y$a # what happens to the element "a" in y?
## [1] 2
Arithmetic operations on vectors and matrices are performed element by element; data.frames are treated as matrices when they contain one data type only. If two vectors are of unequal length, the shorter one is recycled in order to match the longer vector. For example, the following vectors u and v have different lengths, and their sum is computed by recycling the values of the shorter vector u.
u <- c(10, 20, 30)
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
M <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), ncol = 3, nrow = 3, byrow = TRUE)
# vector + vector
u + v
## [1] 11 22 33 14 25 36 17 28 39
## [1] 11 21 31
## [1] 20 40 60
## [,1] [,2] [,3]
## [1,] 2 3 4
## [2,] 5 6 7
## [3,] 8 9 10
## [,1] [,2] [,3]
## [1,] 11 12 13
## [2,] 24 25 26
## [3,] 37 38 39
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 8 10 12
## [3,] 14 16 18
## [,1] [,2] [,3]
## [1,] 10 20 30
## [2,] 80 100 120
## [3,] 210 240 270
## [,1]
## [1,] 140
## [2,] 320
## [3,] 500
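Only the first call of this section is preserved. The remaining outputs above are consistent with the following operations on u and M, offered here as a reconstruction rather than the original code:

```r
u <- c(10, 20, 30)
M <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), ncol = 3, nrow = 3, byrow = TRUE)

u + 1      # vector + scalar
2 * u      # scalar * vector
M + 1      # matrix + scalar
M + u      # matrix + vector: u is recycled down the columns
2 * M      # scalar * matrix
M * u      # element-wise product, u recycled down the columns
M %*% u    # matrix product (not element-wise)
```

Note that * is always element-wise; the matrix product requires the %*% operator.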
A time series is a series of data points indexed in time order. In R, all data types for which an order is defined can be used to index a time series: if the operator < is defined for a data type, then that data type can be used as an index.
Date
today <- Sys.Date() # current Date
yesterday <- today - 1 # subtract 1 day
yesterday < today # the order is defined for Date
## [1] TRUE
POSIXct
now <- Sys.time() # current time
ago <- now - 3600 # subtract 3600 seconds
ago < now # the order is defined for POSIXct
## [1] TRUE
Character
## [1] TRUE
Numeric
## [1] TRUE
Complex
## Error in 2 + (0+0i) < 1 + (0+3i): invalid comparison with complex values
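The checks for character, numeric and complex values are not preserved above. A sketch with assumed example values follows; the complex comparison is wrapped in tryCatch so that the script keeps running after the error.

```r
'a' < 'b'      # the order is defined for character
1 < 2          # the order is defined for numeric

# the order is NOT defined for complex: the comparison raises an error
res <- tryCatch(2 + 0i < 1 + 3i,
                error = function(e) conditionMessage(e))
res            # the error message, as a character string
```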
The zoo package provides methods for totally ordered indexed observations. All indexes discussed above can be used. The package aims at performing calculations on irregular time series of numeric vectors, matrices and factors. It is an infrastructure that tries to do all basic things well, but it doesn’t provide modeling functionality.
The examples below illustrate some zoo concepts.
Declaration
## 1 2 3 4 5
## 100 123 43 343 22
# create a unidimensional zoo object indexed by numeric
x <- c(100, 123, 43, 343, 22)
i <- c(0, 0.2, 0.4, 0.5, 1)
zoo(x = x, order.by = i)
## 0 0.2 0.4 0.5 1
## 100 123 43 343 22
# create a unidimensional zoo object indexed by character
x <- c(100, 123, 43, 343, 22)
i <- c('z', 'b', 'd', 'c', 'a')
zoo(x = x, order.by = i)
## a b c d z
## 22 123 343 43 100
# create a multidimensional zoo object indexed by Date
x <- data.frame('price' = c(100,99.3,100.2), 'volume' = c(9.9,1.3,3.6))
i <- as.Date(c('2018/01/01', '2018/02/23', '2018/05/01'), format = "%Y/%m/%d")
zoo(x = x, order.by = i)
## price volume
## 2018-01-01 100.0 9.9
## 2018-02-23 99.3 1.3
## 2018-05-01 100.2 3.6
# create a multidimensional zoo object indexed by POSIXct
x <- data.frame('price' = c(100,99.3,100.2), 'volume' = c(9.9,1.3,3.6))
i <- as.POSIXct(c('20180101 120631', '20180223 085145', '20180501 182309'), format = "%Y%m%d %H%M%S")
zoo(x = x, order.by = i)
## price volume
## 2018-01-01 12:06:31 100.0 9.9
## 2018-02-23 08:51:45 99.3 1.3
## 2018-05-01 18:23:09 100.2 3.6
Manipulation
# assign colnames
x <- data.frame(c(100,99.3,100.2), c(9.9,1.3,3.6))
z <- zoo(x = x)
colnames(z) <- c('p','v')
z
## p v
## 1 100.0 9.9
## 2 99.3 1.3
## 3 100.2 3.6
# assign indexes
index(z) <- as.Date(c('2018/01/01', '2018/02/23', '2018/05/01'), format = "%Y/%m/%d")
z
## p v
## 2018-01-01 100.0 9.9
## 2018-02-23 99.3 1.3
## 2018-05-01 100.2 3.6
## [1] "2018-01-01"
## [1] "2018-05-01"
## p v
## 2018-01-01 100.0 9.9
## 2018-05-01 100.2 3.6
## 2018-01-01 2018-02-23 2018-05-01
## 100.0 99.3 100.2
## p v
## 2018-01-01 100.0 9.9
## 2018-02-23 105.0 1.3
## 2018-05-01 100.2 3.6
## p v
## 2018-01-01 100 9.9
## 2018-02-23 105 1.3
## p v
## 2018-02-23 5.0 -8.6
## 2018-05-01 -4.8 2.3
## p v
## 2018-01-01 105.0 1.3
## 2018-02-23 100.2 3.6
## p v
## 2018-02-23 100 9.9
## 2018-05-01 105 1.3
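The calls behind the outputs above are not preserved. The sketch below shows zoo functions that would yield results of this kind; it is a reconstruction under the assumption that the zoo package is installed, and the guard simply skips the example when it is not.

```r
# assumes the zoo package is installed: install.packages("zoo")
if (requireNamespace("zoo", quietly = TRUE)) {
  library(zoo)
  z <- zoo(cbind(p = c(100, 99.3, 100.2), v = c(9.9, 1.3, 3.6)),
           order.by = as.Date(c("2018-01-01", "2018-02-23", "2018-05-01")))

  print(start(z))              # first index
  print(end(z))                # last index
  print(z[c(1, 3), ])          # subset rows by position
  print(z$p)                   # extract one column
  coredata(z)[2, "p"] <- 105   # replace a single value
  print(diff(z))               # differences between consecutive observations
  print(lag(z, k = 1))         # shift so that each date carries the next value
  print(lag(z, k = -1))        # shift so that each date carries the previous value
}
```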
# merge series
z.next <- lag(z, k = 1)
z.prev <- lag(z, k = -1)
z.merged <- merge(z, z.next, z.prev)
z.merged
## p.z v.z p.z.next v.z.next p.z.prev v.z.prev
## 2018-01-01 100.0 9.9 105.0 1.3 NA NA
## 2018-02-23 105.0 1.3 100.2 3.6 100 9.9
## 2018-05-01 100.2 3.6 NA NA 105 1.3
## p.z v.z p.z.next v.z.next p.z.prev v.z.prev
## 2018-01-01 100.0 9.9 105.0 1.3 NA NA
## 2018-02-23 105.0 1.3 100.2 3.6 100 9.9
## 2018-05-01 100.2 3.6 100.2 3.6 105 1.3
## p.z v.z p.z.next v.z.next p.z.prev v.z.prev
## 2018-01-01 100.0 9.9 105.0 1.3 100 9.9
## 2018-02-23 105.0 1.3 100.2 3.6 100 9.9
## 2018-05-01 100.2 3.6 NA NA 105 1.3
## p.z v.z p.z.next v.z.next p.z.prev v.z.prev
## 2018-02-23 105 1.3 100.2 3.6 100 9.9
Arithmetic operations are performed element by element on the matching indexes of the two zoo objects. If the operation involves a zoo object and a vector, then the operation is performed on the whole zoo object.
x <- matrix(101:112, nrow = 3, ncol = 4, byrow = TRUE)
z <- zoo(x)
# add 1 to the whole series
z + 1
##
## 1 102 103 104 105
## 2 106 107 108 109
## 3 110 111 112 113
##
## 1 0 0 0 0
## 2 105 106 107 108
## 3 218 220 222 224
##
## 2 4 4 4 4
## 3 4 4 4 4
##
## 2 0.03960396 0.03921569 0.03883495 0.03846154
## 3 0.03809524 0.03773585 0.03738318 0.03703704
##
## 1 103 104 105 106
## 2 107 108 109 110
The xts package provides an extensible time series class, enabling uniform handling of many R time series classes by extending zoo. An xts object can be indexed by the Date, POSIXct, chron, yearmon, yearqtr and DateTime data types, but not by numeric or character.
The methods seen for zoo objects can be applied to xts. The examples below illustrate some additional xts-specific concepts.
# create an xts object
dates <- seq(as.Date("2017-05-01"), length=1000, by="day") # generate a sequence of dates
data <- c(price = cumprod(1+rnorm(1000, mean = 0.001, sd = 0.01))) # generate some random data
x <- xts(x = data, order.by = dates) # create the xts object
colnames(x) <- 'price' # assign colnames
head(x) # print the first observations
## price
## 2017-05-01 0.9953952
## 2017-05-02 0.9940995
## 2017-05-03 1.0105887
## 2017-05-04 1.0123118
## 2017-05-05 1.0146329
## 2017-05-06 1.0330492
## price
## 2017/05/01 0.9953952
## 2017/05/02 0.9940995
## 2017/05/03 1.0105887
## 2017/05/04 1.0123118
## 2017/05/05 1.0146329
## 2017/05/06 1.0330492
## Daily periodicity from 2017-05-01 to 2020-01-25
## price
## 2017/05/01 0.9953952
## price
## 2020/01/25 3.039245
## price
## 2020/01/20 3.059311
## 2020/01/21 3.059618
## 2020/01/22 3.095431
# convert to OHLC
# valid periods are "seconds", "minutes", "hours", "days", "weeks", "months", "quarters","years"
x.ohlc <- to.period(x, period = 'quarters')
head(x.ohlc)
## x.Open x.High x.Low x.Close
## 2017/06/30 0.9953952 1.106982 0.9940995 1.106982
## 2017/09/30 1.1025282 1.217479 1.0792620 1.137135
## 2017/12/31 1.1268060 1.268922 1.1245544 1.233901
## 2018/03/31 1.2586082 1.574023 1.2307955 1.574023
## 2018/06/30 1.5432948 1.632296 1.5026029 1.574151
## 2018/09/30 1.6134651 1.940108 1.5884681 1.865520
# calculate the yearly mean
ep <- endpoints(x.ohlc, on = "years")
period.apply(x.ohlc , INDEX = ep, FUN = mean)
## x.Open x.High x.Low x.Close
## 2017-12-31 1.074910 1.197794 1.065972 1.159339
## 2018-12-31 1.565839 1.813814 1.542075 1.747158
## 2019-12-31 2.227582 2.608671 2.177939 2.480005
## 2020-01-25 2.954804 3.095431 2.932865 3.039245
Control structures determine the flow of execution of expressions in R.
With an if statement, the task is carried out only if the condition evaluates to TRUE.
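A minimal sketch of an if statement; the condition is an assumed example chosen so that the body runs and prints the output shown below.

```r
if (1 < 2) {
  print('executing if')   # runs because the condition is TRUE
}
```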
## [1] "executing if"
The if-else combination is probably the most commonly used control structure in R (or perhaps any language). This structure allows you to test a condition and act on it depending on whether it’s true or false.
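A minimal sketch of the if-else structure, with an assumed condition that is FALSE so that the else branch runs:

```r
if (1 > 2) {
  print('executing if')
} else {
  print('executing else')   # runs because the condition is FALSE
}
```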
## [1] "executing else"
You can have a series of tests by following the initial if with any number of else ifs.
if(1>2){
print('executing if')
} else if(1<2) {
print('executing else-if')
} else {
print('executing else')
}
## [1] "executing else-if"
In R, for loops take an iterator variable and assign it successive values from a sequence or vector. For loops are most commonly used for iterating over the elements of an object (list, vector, etc.).
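A minimal sketch of a for loop over the sequence 1:5, which is an assumed example consistent with the output below:

```r
for (i in 1:5) {
  print(i)   # i takes the values 1, 2, 3, 4, 5 in turn
}
```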
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
While loops begin by testing a condition. If it is true, then they execute the loop body. Once the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits. While loops can potentially result in infinite loops if not written properly. Use with care!
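A minimal sketch of a while loop. The starting value and the condition are assumptions chosen to match the output below.

```r
i <- 1
while (i < 5) {   # the condition is tested before every iteration
  i <- i + 1      # without this update the loop would never end
  print(i)        # prints 2, 3, 4, 5
}
```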
## [1] 2
## [1] 3
## [1] 4
## [1] 5
repeat initiates an infinite loop right from the start. These are not commonly used in statistical or data analysis applications but they do have their uses. The only way to exit a repeat loop is to call break.
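A minimal sketch of a repeat loop with an explicit exit. The starting value and the stopping rule are assumptions chosen to match the output below.

```r
i <- 5
repeat {
  print(i)            # prints 5, 6, 7, 8, 9
  i <- i + 1
  if (i > 9) break    # the only way out of the loop
}
```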
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
The break statement is used inside a loop (repeat, for, while) to stop the iterations and transfer control to the code after the loop. In nested loops, where there is a loop inside another loop, break exits only from the innermost loop that is being evaluated.
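A minimal sketch of break inside a for loop, with an assumed condition that stops the loop at the second iteration:

```r
for (i in 1:5) {
  if (i == 2) break   # leave the loop as soon as i reaches 2
  print(i)            # prints only 1
}
```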
## [1] 1
next skips the remainder of the current iteration and jumps to the next cycle: control returns to the evaluation of the condition of the current loop, without terminating the loop.
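A minimal sketch of next inside a for loop, with an assumed condition that skips the second iteration:

```r
for (i in 1:4) {
  if (i == 2) next    # skip the rest of this iteration
  print(i)            # prints 1, 3, 4
}
```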
## [1] 1
## [1] 3
## [1] 4
https://bookdown.org/rdpeng/rprogdatascience/loop-functions.html
R has some functions which implement looping in a compact form to make your life easier.
lapply(): Loop over a list and evaluate a function on each element
sapply(): Same as lapply but try to simplify the result
apply(): Apply a function over the margins of an array
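The lapply() function does the following simple series of operations: it loops over a list, iterating over each element in that list; it applies a function to each element of the list (a function that you specify); and it returns a list (the l is for “list”).
Here’s an example of applying the mean() function to all elements of a list. If the original list has names, the names will be preserved in the output.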
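A minimal sketch, assuming a list with the elements a = 1:10 and b = 1:100, whose means match the output below:

```r
x <- list(a = 1:10, b = 1:100)   # assumed input list
res <- lapply(x, mean)           # apply mean() to each element
res                              # a list with the same names as x
```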
## $a
## [1] 5.5
##
## $b
## [1] 50.5
You can use lapply() to evaluate a function multiple times, each with a different argument. Below is an example where I call the runif() function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.
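A minimal sketch of that call. The seed is an assumption added for reproducibility, so the numbers will differ from the output below.

```r
set.seed(1)               # illustrative seed, not from the original
r <- lapply(1:4, runif)   # calls runif(1), runif(2), runif(3), runif(4)
r
```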
## [[1]]
## [1] 0.159674
##
## [[2]]
## [1] 0.1445159 0.1491804
##
## [[3]]
## [1] 0.5144343 0.4928273 0.6163428
##
## [[4]]
## [1] 0.44742289 0.05567672 0.00539631 0.22183420
When you pass a function to lapply(), lapply() takes elements of the list and passes them as the first argument of the function you are applying. In the above example, the first argument of runif() is n, and so the elements of the sequence 1:4 all got passed to the n argument of runif().
Functions that you pass to lapply() may have other arguments. For example, the runif() function has a min and max argument too. Here is where the ... argument to lapply() comes into play. Any arguments that you place in the ... argument will get passed down to the function being applied to the elements of the list.
Here, the min = 0 and max = 10 arguments are passed down to runif() every time it gets called.
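A minimal sketch of passing extra arguments through lapply(); the values are random, so they will differ from the output below.

```r
# min and max are not arguments of lapply(); they travel through ... to runif()
r <- lapply(1:4, runif, min = 0, max = 10)
r
```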
## [[1]]
## [1] 8.509632
##
## [[2]]
## [1] 2.673462 5.986003
##
## [[3]]
## [1] 6.085997 9.921584 1.911900
##
## [[4]]
## [1] 7.53390585 2.42387337 3.27452220 0.03535495
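The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm: if the result is a list where every element has length 1, a vector is returned; if the result is a list where every element is a vector of the same length (> 1), a matrix is returned; otherwise, a list is returned.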
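A minimal sketch, assuming the same list as in the lapply() example; because every result has length 1, sapply() returns a named vector instead of a list.

```r
x <- list(a = 1:10, b = 1:100)   # assumed input list
res <- sapply(x, mean)
res                              # a named numeric vector
```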
## a b
## 5.5 50.5
The apply() function is used to evaluate a function over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix or data.frame.
Here we create a 20 by 10 matrix of Normal random numbers.
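A minimal sketch of the matrix and of both margin computations discussed in this section. The names col_means and row_means are assumptions, and the values are random, so they will differ from the outputs shown.

```r
x <- matrix(rnorm(200), nrow = 20, ncol = 10)   # 20 by 10 matrix of Normal draws

col_means <- apply(x, 2, mean)   # MARGIN = 2: one mean per column (length 10)
row_means <- apply(x, 1, mean)   # MARGIN = 1: one mean per row (length 20)
```

For these two particular cases, the dedicated functions colMeans() and rowMeans() give the same result.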
Compute the mean of each column: MARGIN = 2.
## [1] 0.53081152 -0.06304540 0.24425092 -0.19853115 -0.01399796
## [6] 0.13955840 -0.03783351 -0.10886087 -0.13717015 0.26902905
Compute the mean of each row: MARGIN = 1.
## [1] 0.41106532 0.35026554 0.25650970 -0.19076865 0.20650285
## [6] 0.08113752 -0.17569949 0.24936880 0.08190925 -0.02491800
## [11] 0.33695582 -0.25993749 0.39263868 -0.28434147 -0.01747967
## [16] -0.03815295 -0.49851526 0.46967806 -0.02593651 -0.07186038
Abstracting code into many small functions is key for writing nice R code. Functions are defined by code with a specific format:
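The format itself is not preserved in this text. The sketch below is a reconstruction based on the element names used in the explanation; the body is a placeholder added only so that the template runs.

```r
functionName <- function(arg1, arg2, arg3 = NULL, ...) {
  # function body: placeholder computation for illustration
  result <- arg1
  return(result)   # the output value
}

functionName(1, 2)   # arg3 is not provided, so NULL is used
```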
where
functionName: the name of the function (case sensitive)
arg1, arg2, arg3, ...: input values
arg3=NULL: default value. If arg3 is not provided when calling the function, NULL will be used instead
return(): the output value
Define a function to compute the sum of the first n integer numbers.
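One possible definition is sketched below; the function name sum_first_n is an assumption, since the original definition is not preserved.

```r
# sum of the first n positive integers: 1 + 2 + ... + n
sum_first_n <- function(n) {
  return(sum(seq_len(n)))
}

sum_first_n(10)
```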
Compute the sum of the first 10 integers
## [1] 55
Define a function to compute the p norm of a vector x. By default, compute the Euclidean norm (p = 2).
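One possible definition is sketched below. The function name p_norm is an assumption, and the infinite case is handled separately as the maximum absolute value, which is the limit of the p norm as p grows.

```r
p_norm <- function(x, p = 2) {
  if (is.infinite(p)) {
    return(max(abs(x)))            # infinity norm: largest absolute value
  }
  return(sum(abs(x)^p)^(1 / p))    # general p norm
}

p_norm(c(1, 1))            # Euclidean norm (default p = 2)
p_norm(c(1, 1), p = 3)     # 3-norm
p_norm(c(1, 1), p = Inf)   # infinity norm
```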
Compute the Euclidean norm of the vector c(1,1)
## [1] 1.414214
Compute the 3-norm of the vector c(1,1)
## [1] 1.259921
Compute the \(\infty\)-norm of the vector c(1,1)
## [1] 1
If you use an R function, the function first creates a temporary local environment. This local environment is nested within the global environment, which means that, from that local environment, you also can access any object from the global environment (not considered a good practice). As soon as the function ends, the local environment is destroyed along with all the objects in it.
# define function
test1 <- function(){
teststring <- 'This object is destroyed as soon as the function ends!'
return(invisible())
}
# run function
test1()
# try to access teststring
teststring
## Error in eval(expr, envir, enclos): object 'teststring' not found
When R encounters an object name, it first searches the local environment. If it finds the object there, it uses that one; otherwise, it searches the global environment for that object.
# global i
i <- 1
# define function
test2 <- function(){
# there is no i in the local environment -> search in parent environment
i <- i*10
# return
return(i)
}
# run function
test2()
## [1] 10
## [1] 1
https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
Many computations in R can be made faster by the use of parallel computation. Generally, parallel computation is the simultaneous execution of different pieces of a larger computation across multiple computing processors or cores.
The parallel package can be used to send tasks (encoded as function calls) to each of the processing cores on your machine in parallel.
The mclapply() function essentially parallelizes calls to lapply(). The first two arguments to mclapply() are exactly the same as they are for lapply(). However, mclapply() has further arguments (that must be named), the most important of which is the mc.cores argument, which you can use to specify the number of processors/cores you want to split the computation across. For example, if your machine has 4 cores, you might specify mc.cores = 4 to parallelize your operation across 4 cores (although this may not be the best idea if you are running other operations in the background besides R).
The first thing you might want to check with the parallel package is whether your computer in fact has multiple cores that you can take advantage of.
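A minimal sketch of that check using detectCores() from the parallel package; the result depends on the machine.

```r
library(parallel)

cores <- detectCores()   # number of cores reported by the operating system
cores
```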
## [1] 8
The mclapply() function (and related mc* functions) works via the fork mechanism on Unix-style operating systems. Because of the use of the fork mechanism, the mc* functions are generally not available to users of the Windows operating system.
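A hedged sketch of an mclapply() call that runs on any platform. The fallback to a single core on Windows and the name n_cores are assumptions added for portability; they are not part of the original example, whose error output is shown below.

```r
library(parallel)

# forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- mclapply(1:7, FUN = function(x) return(x), mc.cores = n_cores)
unlist(res)
```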
## Error in mclapply(1:7, FUN = function(x) return(x), mc.cores = cores - : 'mc.cores' > 1 is not supported on Windows
Using the forking mechanism on your computer is one way to execute parallel computation but it’s not the only way that the parallel package offers. Another way to build a “cluster” using the multiple cores on your computer is via sockets.
Building a socket cluster is simple to do in R with the makeCluster() function.
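A self-contained sketch of the life cycle of a socket cluster: create it, use it, stop it. The number of workers (2) and the squared-value task are assumptions for illustration; the examples that follow in the text use their own cl object.

```r
library(parallel)

cl <- makeCluster(2)                            # start 2 worker processes
res <- parLapply(cl, 1:4, function(x) x^2)      # run the function on the workers
stopCluster(cl)                                 # always stop the workers when done

unlist(res)
```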
The cl object is an abstraction of the entire cluster and is what we’ll use to indicate to the various cluster functions that we want to do parallel computation.
To do an lapply() operation over a socket cluster we can use the parLapply() function.
# sample function
test <- function(){
Sys.sleep(2)
return(TRUE)
}
# call "test" in parallel apply
parLapply(cl = cl, 1:7, fun = function(x) {
test()
})
## Error in checkForRemoteErrors(val): 7 nodes produced errors; first error: could not find function "test"
You’ll notice, unfortunately, that there’s an error in running this code. The reason is that while the test function is defined in our R session, it is not available to the independent child processes that have been spawned by the makeCluster() function. The function, and any other information that the child processes will need to execute your code, needs to be exported to the child processes from the parent process via the clusterExport() function. The need to export data is a key difference in behavior between the “multicore” approach and the “socket” approach.
# export "test" to the cluster nodes
clusterExport(cl, "test")
# call "test" in parallel apply
parLapply(cl = cl, 1:7, fun = function(x) {
test()
})
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
How long does it take?
# parallel
t0 <- proc.time()
xx <- parLapply(cl = cl, 1:7, fun = function(x) {
test()
})
t1 <- proc.time()
t1-t0
## user system elapsed
## 0.01 0.00 2.17
## user system elapsed
## 0.03 0.00 14.08
clusterEvalQ() evaluates a literal expression on each cluster node. It can be used to load packages into each node.
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
Once you’ve finished working with your cluster, it’s good to clean up and stop the cluster child processes (quitting R will also stop all of the child processes).
http://heather.cs.ucdavis.edu/~matloff/158/RcppTutorial.pdf
The Rcpp package provides C++ classes that greatly facilitate interfacing C or C++ code in R packages using the .Call() interface provided by R. It provides a powerful API on top of R, permitting direct interchange of rich R objects (including S3, S4 or Reference Class objects) between R and C++.
Maintaining C++ code in its own source file provides several benefits and is the recommended approach. However, it’s also possible to declare and execute C++ code inline, which is what the following example does.
Let’s implement the Fibonacci sequence both in R and C++:
\[F_n = F_{n-1}+F_{n-2}\] with \(F_0 = 0\) and \(F_1=1\).
Rcpp::cppFunction("
int fibC(const int n){
if(n==0) return(0);
if(n==1) return(1);
return(fibC(n-1) + fibC(n-2));
}")
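The R counterpart is not preserved in this text. A plain recursive implementation that mirrors the C++ function above could look like the sketch below; the name fibR is taken from the benchmark output that follows.

```r
# naive recursive Fibonacci, deliberately mirroring the C++ version
fibR <- function(n) {
  if (n == 0) return(0)
  if (n == 1) return(1)
  return(fibR(n - 1) + fibR(n - 2))
}

fibR(20)
```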
Compare the performance:
## Unit: microseconds
## expr min lq mean median uq max neval
## fibR(20) 7060.3 7670.40 8242.514 8020.75 8605.1 11757.5 100
## fibC(20) 29.4 30.25 47.658 33.90 39.2 1116.4 100
Download the full code to generate this document and reproduce the examples. The file is in R Markdown, a format for making dynamic documents with R. An R Markdown document is written in markdown, an easy-to-write plain text format, and contains chunks of embedded R code.
Exercise: create the pdf version of this web page
Hint: download the file above and have a look at the introductory 1-min video of the official Rmarkdown guide
Comments
All text after the sign # within the same line is considered a comment.