---
title: "Basic statistical concepts for Finance"
author: "Emanuele Guidotti"
output:
  html_document:
    highlight: zenburn
    self_contained: no
    theme: yeti
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---
https://guidotti.dev
```{css, echo = FALSE}
@media (max-width: 768px) {
pre code, pre, code {
white-space: pre !important;
overflow-x: scroll !important;
word-break: keep-all !important;
word-wrap: initial !important;
}
}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = TRUE)
options(knitr.kable.NA = '', width = 100)
Sys.setenv(LANG = "en")
set.seed(123)
```
# Returns VS Log-Returns
Load the quantmod package to retrieve financial data (see tutorial on [Data Acquisition in R](https://storage.guidotti.dev/tutorial/data-acquisition-in-r.html)).
```{r}
# load the "quantmod" package
require('quantmod')
```
Get Apple stock quotes (daily).
```{r}
# get quotes
AAPL <- getSymbols('AAPL', auto.assign = FALSE)
```
## Returns
$$R_t = \frac{P_t-P_{t-1}}{P_{t-1}} = \frac{P_t}{P_{t-1}}-1$$
Note the use of the `lag` function in the following code snippet. Applied to an `xts` object, the function shifts the series by a given number of observations, so that each date is paired with the price observed `k` periods earlier. __Warning__: the function behaves differently for `zoo`, `numeric` or other object classes.
```{r}
# shift the series back by one period and compute returns
r <- AAPL$AAPL.Close/lag(AAPL$AAPL.Close, k = 1) - 1
```
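The warning about object classes can be made concrete with a minimal base-R sketch. The three prices are made-up numbers, not market data, and the object names (`p.num`, `r.naive`, `r.manual`) are illustrative. On a plain `numeric` vector, `stats::lag` only changes the time-series attribute and leaves the values in place, so the return formula silently yields zeros.

```{r}
# made-up prices (illustrative only)
p.num <- c(100, 110, 121)
# on a numeric vector, lag() changes the 'tsp' attribute but not the values,
# so the ratio is 1 everywhere and the "returns" are all zero
r.naive <- as.numeric(p.num / stats::lag(p.num, k = 1) - 1)
r.naive
# explicit shift: prepend NA and drop the last price
p.prev <- c(NA, head(p.num, -1))
r.manual <- p.num / p.prev - 1
r.manual
```

The explicit shift reproduces what `lag` does for an `xts` object: the first return is `NA` and the remaining ones compare each price with the previous one.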
### Log-Returns
$$r_t = \ln(P_t)-\ln(P_{t-1})$$
```{r}
# shift the series back by one period and compute log-returns
r.log <- log(AAPL$AAPL.Close) - log(lag(AAPL$AAPL.Close, k = 1))
```
Returns cannot be computed for the first observation, as there is no previous observation to use. R returns `NA` in this case. Clean the data.
```{r}
# drop NA from data
r <- na.omit(r)
r.log <- na.omit(r.log)
```
Log-returns are approximately equal to standard returns when they are small. Note: __approximately__ does not mean __equal__.
$$r_t=\ln(P_t)-\ln(P_{t-1})=\ln\Bigl(\frac{P_t}{P_{t-1}}\Bigr)=\ln\Bigl(\frac{P_t-P_{t-1}+P_{t-1}}{P_{t-1}}\Bigr)=\ln(1+R_t) \approx R_t$$
where the last step follows from a first-order [Taylor expansion](https://en.wikipedia.org/wiki/Taylor_series) of $\ln(1+R_t)$ around $R_t=0$. Compare the returns.
```{r}
# build a data.frame containing ret and log-ret
x <- data.frame(r, r.log)
colnames(x) <- c('ret', 'log.ret')
head(x)
```
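The quality of the approximation can also be checked on a grid of illustrative returns, independently of the AAPL data. The sketch below assumes nothing beyond the Taylor expansion $\ln(1+R)=R-R^2/2+\dots$, so the error should be close to $R^2/2$ for small $R$; the names `R.grid` and `approx.check` are made up for this example.

```{r}
# a grid of illustrative simple returns (not market data)
R.grid <- c(0.001, 0.01, 0.05, 0.10, 0.20)
# exact log-returns, approximation error and its second-order estimate
approx.check <- data.frame(
  ret = R.grid,
  log.ret = log(1 + R.grid),
  error = R.grid - log(1 + R.grid),
  second.order = R.grid^2 / 2
)
approx.check
```

The error is negligible for returns of a few basis points and becomes visible for returns of 10% or more.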
Plot the returns. For high quality charts, refer to the `ggplot2` package.
```{r}
# base R plot function
plot(x$log.ret ~ x$ret, main = 'Returns VS Log-Returns', ylab = 'Log-Returns', xlab = 'Returns', xlim = c(-0.21, 0.21), ylim = c(-0.21, 0.21))
# add the line y = x
abline(b = 1, a = 0, col = 'green')
```
### Multi-Period Returns
$$\begin{aligned}
R_{t,s}&=\frac{P_t}{P_s}-1=\frac{P_t}{P_{t-1}}\frac{P_{t-1}}{P_s}-1=\frac{P_t}{P_{t-1}}\frac{P_{t-1}}{P_{t-2}}\cdots\frac{P_{s+1}}{P_s}-1\\
&=(R_{t}+1)(R_{t-1}+1)\cdots(R_{s+1}+1) - 1\\
&=\Bigl(\prod_{i=s+1}^{t}(R_i+1)\Bigr)-1
\end{aligned}$$
```{r}
# direct calculation
AAPL$AAPL.Close['2018-12-31'][[1]]/AAPL$AAPL.Close['2015-12-31'][[1]] - 1
# compound calculation
r.i <- window(r, start = '2016-01-01', end = '2018-12-31') # extract returns
prod(r.i+1) - 1
```
### Multi-Period Log-Returns
$$\begin{aligned}
r_{t,s}&=\ln(P_t)-\ln(P_s)=\ln(P_t)-\ln(P_{t-1})+\ln(P_{t-1})-\ln(P_s)\\
&=\ln(P_t)-\ln(P_{t-1})+\ln(P_{t-1})-\ln(P_{t-2})+\cdots+\ln(P_{s+1})-\ln(P_s)\\
&=r_t+r_{t-1}+\cdots+r_{s+1}=\sum_{i=s+1}^{t}r_i
\end{aligned}$$
```{r}
# direct calculation
log(AAPL$AAPL.Close['2018-12-31'][[1]]) - log(AAPL$AAPL.Close['2015-12-31'][[1]])
# compound calculation
r.i <- window(r.log, start = '2016-01-01', end = '2018-12-31') # extract returns
sum(r.i)
```
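Both aggregation rules can be verified on a small made-up price path, without downloading any data. This is only a sketch: the prices and the names `p.toy`, `R.toy` and `r.toy` are illustrative.

```{r}
# made-up price path (illustrative only)
p.toy <- c(100, 104, 101, 107, 110)
# one-period simple returns and log-returns
R.toy <- p.toy[-1] / p.toy[-length(p.toy)] - 1
r.toy <- diff(log(p.toy))
# simple returns compound by multiplication
c(direct = p.toy[5] / p.toy[1] - 1, compound = prod(1 + R.toy) - 1)
# log-returns aggregate by summation
c(direct = log(p.toy[5]) - log(p.toy[1]), compound = sum(r.toy))
```

In both cases the direct and the compound calculation agree up to floating point precision.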
### Cumulative Returns
Compute the standard returns with respect to the first observation at each point in time ($R_{t,0}$ for every $t$). Use the following approaches:
- (RIGHT) cumulate standard daily returns
- (RIGHT) cumulate log-returns and convert to standard returns using the following property: $r=\ln(1+R)\Rightarrow R=\exp(r)-1$
- (WRONG) cumulate log-returns as if they were standard returns. This holds only approximately, and the error grows as more returns are cumulated
```{r}
# cumulate standard returns
r.cum <- cumprod(1+r) - 1
# cumulate log-returns and convert to standard returns
r.log.cum.right <- exp(cumsum(r.log)) - 1
# cumulate log-returns as if they were standard returns
r.log.cum.wrong <- cumprod(1+r.log) -1
# create a unique xts object
x <- merge(r.cum, r.log.cum.wrong, r.log.cum.right)
colnames(x) <- c('R', 'r.wrong', 'r.right')
# plot and compare
plot(x, legend.loc = 'topleft', main = 'Cumulative return computation')
```
The green line coincides with the black one. This can be checked by printing the values.
```{r}
# print values
head(x)
```
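The growth of the error in the wrong approach is easiest to see in a stylized case. The sketch below assumes a hypothetical constant daily return of 1% over 250 days; the numbers and object names are illustrative and unrelated to AAPL.

```{r}
# hypothetical constant daily return of 1% over 250 days
R.const <- rep(0.01, 250)
r.const <- log(1 + R.const)
# the two correct approaches
cum.simple <- cumprod(1 + R.const) - 1
cum.right <- exp(cumsum(r.const)) - 1
# the wrong approach: treat log-returns as simple returns
cum.wrong <- cumprod(1 + r.const) - 1
# gap between correct and wrong cumulative return after 1, 50 and 250 days
(cum.simple - cum.wrong)[c(1, 50, 250)]
```

The two correct approaches coincide, while the gap to the wrong one widens with every additional day.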
## Adjusted Price
The closing price is the 'raw' price, i.e. the cash value of the last transacted price before the market closes. The adjusted closing price amends a stock's closing price to reflect that stock's value after accounting for corporate actions (e.g. dividends, splits). It is often regarded as the true price of the stock and is typically used when analysing historical returns. Compute returns using the adjusted price.
```{r}
# compute returns using adjusted prices
r.adj <- AAPL$AAPL.Adjusted/lag(AAPL$AAPL.Adjusted, k = 1) - 1
# drop NA
r.adj <- na.omit(r.adj)
# cumulate returns
r.adj.cum <- cumprod(1+r.adj) - 1
# add adjusted returns to the ones computed in the previous section
x <- merge(x, r.adj.cum)
colnames(x)[4] <- 'R.adj'
# plot and compare
plot(x, legend.loc = 'topleft', main = 'Cumulative return computation')
```
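Why the adjustment matters can be illustrated with a made-up example. The sketch below assumes a hypothetical 2-for-1 stock split and nothing else; real adjusted series also account for dividends, and the exact adjustment method depends on the data provider. All prices and names are illustrative.

```{r}
# made-up raw closing prices with a hypothetical 2-for-1 split before day 3
close.raw <- c(100, 102, 51, 52)
# adjustment factor for each day: pre-split prices are halved
split.factor <- c(0.5, 0.5, 1, 1)
close.adj <- close.raw * split.factor
# returns from raw prices show a spurious -50% on the split day
ret.raw <- close.raw[-1] / close.raw[-4] - 1
# returns from adjusted prices are unaffected by the split
ret.adj <- close.adj[-1] / close.adj[-4] - 1
rbind(ret.raw, ret.adj)
```

Only the return on the split day differs: the raw series reports a loss that no investor actually experienced.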
# Central Limit Theorem
The central limit theorem establishes that when __independent and identically distributed__ (i.i.d.) random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
$$\frac{S_n-\mu_n}{\sigma_n}\rightarrow N(0,1)$$
Define the following function to test the Central Limit Theorem.
```{r}
# n: number of iid random variables to sum up
# plot: plotting results? Default FALSE
CLT <- function(n, plot = FALSE){
# number of trials
N <- 100000
# build a matrix with
# - 'N' rows (number of trials)
# - 'n' columns (number of iid random variables to sum up)
# - 'n*N' uniform random variables: 'runif' function
u <- matrix(data = runif(n = n * N), ncol = n, nrow = N)
# for each row, sum up all columns -> generate Sn
z <- rowSums(u)
# normalize according to the theorem
z <- (z - mean(z))/sd(z)
# if plot
if(plot){
# generate histogram of Sn
hist(z, freq = FALSE, breaks = 100)
# add normal distribution
x <- seq(-3,3,0.01)
lines(x = x, y = dnorm(x = x), col = 'blue')
}
# else return Sn values
else{
return(z)
}
}
```
Run the function with several values of `n`. The distribution of the normalized $S_n$ converges to a Normal distribution as $n$ increases.
```{r}
CLT(n = 1, plot = TRUE)
CLT(n = 2, plot = TRUE)
CLT(n = 10, plot = TRUE)
```
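The convergence can also be quantified rather than judged by eye. A minimal sketch, assuming the standard result that a sum of $n$ iid $U(0,1)$ variables has excess kurtosis $-1.2/n$ (zero for a Normal distribution); the helper `excess.kurt` and the other names are defined here for illustration only.

```{r}
# sample excess kurtosis (helper defined here for illustration)
excess.kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2 - 3
# number of simulated sums for each n
N.sim <- 100000
set.seed(1)
n.values <- c(1, 2, 10)
k.sample <- sapply(n.values, function(n) {
  # N.sim sums of n uniform random variables
  s <- rowSums(matrix(runif(n * N.sim), nrow = N.sim, ncol = n))
  excess.kurt(s)
})
# compare with the theoretical excess kurtosis -1.2 / n
data.frame(n = n.values, sample = k.sample, theory = -1.2 / n.values)
```

The excess kurtosis shrinks towards zero at rate $1/n$, in line with the histograms above.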
# Normality of financial returns
Plot AAPL log-returns. Are they Normal?
```{r}
# histogram of AAPL log-returns
hist(r.log, breaks = 50, main = "Density of Log-Returns", xlab = 'Log-Returns', freq = FALSE)
# mean
r.log.mean <- mean(r.log)
# standard deviation
r.log.sd <- sd(r.log)
# density of N(mu, sigma)
x <- seq(min(r.log), max(r.log), by = 0.001)
y <- dnorm(x = x, mean = r.log.mean, sd = r.log.sd)
# add fitted normal distribution
lines(x = x, y = y, col = 'blue')
```
AAPL daily log-returns, and stock returns in general, are not Normal ([Fama 1965, Journal of Business](http://static.stevereads.com/papers_to_read/the_behavior_of_stock_market_prices.pdf), Nobel Prize 2013). On the other hand, daily log-returns can be thought of as the cumulative sum of intraday log-returns and, if these were iid with finite variance, the central limit theorem would imply that daily log-returns are approximately normally distributed. The observed non-normality therefore suggests that intraday stock returns are __not__ iid, so the central limit theorem does not apply in this case.
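The argument can be illustrated with simulated data. The sketch below is a thought experiment, not an analysis of real intraday returns: it assumes 78 iid intraday log-returns per day drawn from a fat-tailed Laplace distribution (excess kurtosis 3), and all names and parameters are hypothetical.

```{r}
# sample excess kurtosis (helper defined here for illustration)
excess.kurt.iid <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2 - 3
set.seed(2)
# hypothetical setup: 5000 days, 78 intraday intervals per day
n.days <- 5000
n.intra <- 78
# iid fat-tailed intraday returns: Laplace draws (difference of two exponentials)
intra <- matrix(rexp(n.days * n.intra) - rexp(n.days * n.intra),
                nrow = n.days, ncol = n.intra)
# daily log-return = sum of the intraday log-returns
daily <- rowSums(intra)
# excess kurtosis before and after aggregation
c(intraday = excess.kurt.iid(as.vector(intra)), daily = excess.kurt.iid(daily))
```

Under the iid assumption the fat tails almost disappear after aggregation. Observed daily returns remain fat-tailed, which is why the iid assumption is rejected.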
# Law of Large Numbers
The law of large numbers is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. In other words, the mean is a good estimator of the expected value.
$$\bar{X}_n \rightarrow E[X]$$
Estimate the expected value of a uniform random variable that takes values in [0,1].
```{r}
# generate 100 uniform random variables and compute the mean
mean(runif(n = 100, min = 0, max = 1))
# generate 100000 uniform random variables to increase precision
mean(runif(n = 100000, min = 0, max = 1))
```
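The convergence can be followed draw by draw with a running mean. A minimal sketch, with illustrative object names (`u.draws`, `run.mean`):

```{r}
set.seed(3)
# 100000 draws from U(0,1), whose expected value is 0.5
u.draws <- runif(n = 100000, min = 0, max = 1)
# running mean after each additional draw
run.mean <- cumsum(u.draws) / seq_along(u.draws)
# absolute estimation error at selected sample sizes
abs(run.mean[c(10, 100, 1000, 10000, 100000)] - 0.5)
# plot the convergence on a log-scaled x axis
plot(run.mean, type = 'l', log = 'x', xlab = 'Number of draws', ylab = 'Running mean')
abline(h = 0.5, col = 'blue')
```

The running mean fluctuates widely for small samples and settles around 0.5 as the number of draws increases.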
For simulated data it is possible to increase the precision of the estimate by increasing the number of trials. For real data, where the sample size is finite, the precision cannot be increased arbitrarily and we need to understand how uncertain the estimate is. In other words, we need an estimate of the precision together with the estimate of the mean. This is usually called the _standard error_ or _standard deviation of the mean_. When realized returns are iid, the standard error of the mean is simply $\sigma_n/\sqrt{n}$, where $\sigma_n$ is the standard deviation of the realized returns and $n$ is the sample size. When realized returns are not iid, it is still possible to compute the standard error under certain conditions ("stationarity of returns"), but the required formula is slightly more complicated. Moreover, for large $n$, the estimate of the mean is normally distributed according to the central limit theorem.
$$\bar{X}_n \rightarrow N\Bigl(\bar{X}_n, \frac{\sigma_n}{\sqrt{n}}\Bigr)$$
Estimate the expected daily return for AAPL and its distribution.
```{r}
# estimate of the expected daily return
mu <- mean(r)
# estimate of the standard deviation of the sample
sigma <- sd(r)
# sample size
n <- length(r)
# standard deviation of the mean
st.err <- sigma/sqrt(n)
# plot the estimated distribution of the expected return
x <- seq(mu-5*st.err, mu+5*st.err, length.out = 1000)
plot(x = x, y = dnorm(x = x, mean = mu, sd = st.err), type = 'l', ylab = 'Density', xlab = 'Expected Daily Return')
# which is the most likely expected return?
mu
# how likely is the expected return to be less than 0.05%?
pnorm(q = 0.0005, mean = mu, sd = st.err, lower.tail = TRUE)
```
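The formula $\sigma_n/\sqrt{n}$ can be checked by simulation in the iid case. The sketch below assumes a hypothetical experiment with 2000 independent samples of size 100 drawn from $U(0,1)$, whose standard deviation is $\sqrt{1/12}$; the object names are illustrative.

```{r}
set.seed(4)
# hypothetical experiment: 2000 independent samples, each of size 100, from U(0,1)
n.obs <- 100
n.rep <- 2000
samples <- matrix(runif(n.obs * n.rep), nrow = n.obs, ncol = n.rep)
# one sample mean per column
means <- colMeans(samples)
# empirical standard deviation of the sample means
se.empirical <- sd(means)
# formula sigma/sqrt(n), with sigma = sqrt(1/12) for U(0,1)
se.formula <- sqrt(1 / 12) / sqrt(n.obs)
c(empirical = se.empirical, formula = se.formula)
```

The dispersion of the sample means across repetitions is close to the value given by the formula.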
Remark. The law of large numbers holds when performing the __same__ experiment a large number of times. This is not the case for financial and economic data: assuming that returns come from the same experiment every day ignores the different economic conditions, sentiment and environment of the different periods.
# Stochastic Process
A stochastic process can be defined as a collection of random variables indexed by time.
## Brownian Motion
Increments are normally distributed:
$$\Delta X_t \sim N(\mu, \sigma)$$
The following code snippet defines a function to simulate trajectories of a Brownian Motion.
```{r}
# n: number of trajectories to simulate
# t: number of periods for a single trajectory
# mu: mean of the increments
# sigma: standard deviation of the increments
# X.0: initial value
# plot: plot trajectories? default TRUE
BM.sim <- function(n, t = 252, mu = 0.001, sigma = 0.05, X.0 = 1, plot = TRUE){
# generate increments and store in a matrix with:
# - 't' rows (number of time points for each trajectory)
# - 'n' columns (number of trajectories to simulate)
r <- matrix(data = rnorm(n = n*t, mean = mu, sd = sigma), ncol = n, nrow = t)
# for each column (trajectory), sum increments starting from X.0
X.t <- apply(r, MARGIN = 2, function(x) X.0 + cumsum(x))
# plot and return
if(plot) plot(as.zoo(X.t), screens = 'single', ylab = 'Trajectory')
return(X.t)
}
```
Simulate the Brownian Motion.
```{r}
# simulate 100 trajectories
bm <- BM.sim(100)
```
The Brownian Motion can assume negative values. This is not well-suited for financial applications.
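How often this happens can be estimated by simulation. The sketch below re-creates the increments with the same default parameters as `BM.sim` above, but uses its own illustrative object names so that it stands alone. Since the increments are iid Normal, the terminal value has mean $X_0+\mu t$ and standard deviation $\sigma\sqrt{t}$.

```{r}
set.seed(5)
# same default parameters as BM.sim above; object names here are illustrative
n.paths <- 2000
n.steps <- 252
mu.bm <- 0.001
sigma.bm <- 0.05
x.start <- 1
# increments: one column per trajectory
inc <- matrix(rnorm(n.paths * n.steps, mean = mu.bm, sd = sigma.bm),
              nrow = n.steps, ncol = n.paths)
# terminal value and minimum of each trajectory
x.end <- x.start + colSums(inc)
x.min <- apply(inc, MARGIN = 2, function(x) min(x.start + cumsum(x)))
# terminal values: sample vs theoretical mean and standard deviation
c(mean = mean(x.end), theory = x.start + mu.bm * n.steps)
c(sd = sd(x.end), theory = sigma.bm * sqrt(n.steps))
# share of trajectories that become negative at least once
mean(x.min < 0)
```

With these parameters a non-negligible share of the trajectories crosses zero within one year, which a price process should never do.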
## Geometric Brownian Motion
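In a Geometric Brownian Motion it is the returns, rather than the increments, that are normally distributed. More precisely, in continuous time log-returns are normally distributed and gross returns $1+R$ are log-normally distributed; the simulation below uses the discrete-time approximation in which simple returns are drawn from a Normal distribution: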
$$\frac{\Delta X_t}{X_t} \sim N(\mu, \sigma)$$
The following code snippet defines a function to simulate trajectories of a Geometric Brownian Motion.
```{r}
# n: number of trajectories to simulate
# t: number of periods for a single trajectory
# mu: mean of the returns
# sigma: standard deviation of the returns
# X.0: initial value
# plot: plot trajectories? default TRUE
# log.scale: plot on a log-scale? default FALSE
GBM.sim <- function(n, t = 252, mu = 0.001, sigma = 0.05, X.0 = 1, log.scale = FALSE, plot = TRUE){
# generate returns and store in a matrix with:
# - 't' rows (number of time points for each trajectory)
# - 'n' columns (number of trajectories to simulate)
r <- matrix(data = rnorm(n = n*t, mean = mu, sd = sigma), ncol = n, nrow = t)
# for each column (trajectory), cumulate returns starting from X.0
X.t <- apply(r, MARGIN = 2, function(x) X.0 * cumprod(1+x))
# plot and return
if(plot){
if(log.scale) plot(as.zoo(X.t), screens = 'single', log = 'y', ylab = 'Trajectory')
else plot(as.zoo(X.t), screens = 'single', ylab = 'Trajectory')
}
return(X.t)
}
```
Simulate the Geometric Brownian Motion.
```{r}
# set RNG seed
set.seed(123)
# simulate 100 trajectories
gbm <- GBM.sim(100)
```
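The Geometric Brownian Motion is strictly positive (in this discrete simulation, as long as no single-period return falls below -100%, which is practically impossible with the parameters used). Plot the trajectories on a log-scale and check.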
```{r}
# set the same RNG seed
set.seed(123)
# simulate 100 trajectories and plot on a log-scale
gbm <- GBM.sim(100, log.scale = TRUE)
```
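Positivity can also be checked numerically. The sketch below re-creates the simulation with the same default parameters as `GBM.sim` above, using its own illustrative object names so that it stands alone. Since the simulated returns are independent with mean $\mu$, the expected terminal value is $X_0(1+\mu)^t$.

```{r}
set.seed(6)
# same default parameters as GBM.sim above; object names here are illustrative
n.paths.g <- 10000
n.steps.g <- 252
mu.g <- 0.001
sigma.g <- 0.05
x.start.g <- 1
# simple returns: one column per trajectory
ret.g <- matrix(rnorm(n.paths.g * n.steps.g, mean = mu.g, sd = sigma.g),
                nrow = n.steps.g, ncol = n.paths.g)
# trajectories obtained by compounding the returns
paths.g <- apply(ret.g, MARGIN = 2, function(x) x.start.g * cumprod(1 + x))
# every simulated value is positive as long as all returns exceed -100%
all(paths.g > 0)
# terminal values: sample mean vs expected value X.0 * (1 + mu)^t
c(mean = mean(paths.g[n.steps.g, ]), theory = x.start.g * (1 + mu.g)^n.steps.g)
```

Unlike the Brownian Motion, no trajectory crosses zero, because each step multiplies the current value by a positive factor.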
# Exercise
Compute the mean, standard deviation, skewness, and kurtosis for both returns and log-returns for each of the seven investment strategies that can be found in [this file](https://storage.guidotti.dev/course/asset-pricing-unine-2019-2020/basic-statistical-concepts-for-finance.csv). The file contains stock data from the [website of Kenneth R. French](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). Which is your favourite strategy?
```{r}
# import data
data <- read.csv('https://storage.guidotti.dev/course/asset-pricing-unine-2019-2020/basic-statistical-concepts-for-finance.csv', sep = ';', stringsAsFactors = FALSE)
# convert data to xts
data <- xts(data[,-1], order.by = as.yearmon(data[,1]))
# returns and log-returns
r <- data/lag(data,1)-1
r.log <- log(data)-log(lag(data,1))
# drop NA
r <- na.omit(r)
r.log <- na.omit(r.log)
# skewness function
skewness <- function(x){
mean((x-mean(x))^3)/sd(x)^3
}
# kurtosis function (excess kurtosis: zero for a Normal distribution)
kurtosis <- function(x){
mean((x-mean(x))^4)/var(x)^2 - 3
}
```
```{r}
# mean: returns
apply(r, MARGIN = 2, mean)
# mean: log-returns
apply(r.log, MARGIN = 2, mean)
# sd: returns
apply(r, MARGIN = 2, sd)
# sd: log-returns
apply(r.log, MARGIN = 2, sd)
# skewness: returns
apply(r, MARGIN = 2, skewness)
# skewness: log-returns
apply(r.log, MARGIN = 2, skewness)
# kurtosis: returns
apply(r, MARGIN = 2, kurtosis)
# kurtosis: log-returns
apply(r.log, MARGIN = 2, kurtosis)
# cumulative returns for each strategy
r.cum <- apply(r, MARGIN = 2, function(x) cumprod(1+x))
# convert to xts
r.cum <- xts(r.cum, as.yearmon(rownames(r.cum)))
# plot
plot(r.cum, legend.loc = 'topleft')
# plot on a log-scale
plot(log(r.cum), legend.loc = 'topleft')
```
# Code Download
Download the full code to generate this document and reproduce the examples. The file is in [R Markdown](https://rmarkdown.rstudio.com/), a format for making dynamic documents with R. An R Markdown document is written in markdown, an easy-to-write plain text format, and contains chunks of embedded R code. \
[Download](https://storage.guidotti.dev/course/asset-pricing-unine-2019-2020/basic-statistical-concepts-for-finance.Rmd)
__Exercise__: create the PDF version of this web page \
__Hint__: download the file above and have a look at the introductory 1-min video of the [official Rmarkdown guide](https://rmarkdown.rstudio.com/lesson-1.html).