A comma-separated values (CSV) file is a delimited text file that generally uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by the delimiter. CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. R makes it easy to export and import data in CSV format.
Export data to a CSV file with write.csv()
Import data from a CSV file with read.csv()
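A minimal sketch of the round trip, using the built-in mtcars dataset (consistent with the output below). Note that write.csv() saves row names by default, and read.csv() imports them as a first column named X:
# export the built-in mtcars data frame to a CSV file
write.csv(mtcars, file = 'mtcars.csv')
# import it back: the saved row names become the first column, named 'X'
x <- read.csv('mtcars.csv')
head(x)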
## X mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Some data providers offer data in CSV format on their websites. One of these is the STOXX website, a financial index provider. Open this link for the EURO STOXX 50 Index: the tab Data -> Historical Data provides some freely available files of historical prices. Clicking on EUR Price will open this link. The read.csv() function can read this file directly from the internet.
# read.csv is very flexible. For the full list of arguments type ?read.csv
x <- read.csv('https://www.stoxx.com/document/Indices/Current/HistoricalData/h_3msx5e.txt', sep = ';')
head(x)
## Date Symbol Indexvalue X
## 1 08.10.2019 SX5E 3432.76 NA
## 2 09.10.2019 SX5E 3462.11 NA
## 3 10.10.2019 SX5E 3493.96 NA
## 4 11.10.2019 SX5E 3569.92 NA
## 5 14.10.2019 SX5E 3556.26 NA
## 6 15.10.2019 SX5E 3598.65 NA
rownames(x) <- as.Date(x[,1], format = '%d.%m.%Y') # assign rownames
x[,c(1,ncol(x))] <- NULL # drop the first and last column
head(x) # print data
## Symbol Indexvalue
## 2019-10-08 SX5E 3432.76
## 2019-10-09 SX5E 3462.11
## 2019-10-10 SX5E 3493.96
## 2019-10-11 SX5E 3569.92
## 2019-10-14 SX5E 3556.26
## 2019-10-15 SX5E 3598.65
The quantmod package provides a very convenient function for downloading financial data from the web, called getSymbols(). The function works with a variety of sources. For stocks and shares, the yahoo source is used. Symbols can be found here.
library(quantmod) # load quantmod for getSymbols()
# retrieve Facebook quotes
x <- getSymbols(Symbols = 'FB', src = 'yahoo', auto.assign = FALSE)
tail(x)
## FB.Open FB.High FB.Low FB.Close FB.Volume FB.Adjusted
## 2019-12-27 208.67 208.93 206.59 208.10 10284200 208.10
## 2019-12-30 207.86 207.90 203.90 204.41 10524300 204.41
## 2019-12-31 204.00 205.56 203.60 205.25 8953500 205.25
## 2020-01-02 206.75 209.79 206.27 209.78 12077100 209.78
## 2020-01-03 207.21 210.40 206.95 208.67 11188400 208.67
## 2020-01-06 206.70 212.78 206.52 212.60 17058900 212.60
For currencies and metals, the oanda source is used. Symbols are the instruments’ ISO codes separated by /. ISO codes can be found here.
# retrieve the historical euro/dollar exchange rate
x <- getSymbols(Symbols = 'EUR/USD', src = 'oanda', auto.assign = FALSE)
tail(x)
## EUR.USD
## 2020-01-01 1.121254
## 2020-01-02 1.119305
## 2020-01-03 1.115983
## 2020-01-04 1.115940
## 2020-01-05 1.115970
## 2020-01-06 1.118231
For economic series, the FRED source is used. Symbols can be found here.
# retrieve the historical Gross Domestic Product for Japan
x <- getSymbols(Symbols = 'JPNNGDP', src = 'FRED', auto.assign = FALSE)
tail(x)
## JPNNGDP
## 2018-04-01 549364.8
## 2018-07-01 546061.4
## 2018-10-01 545914.5
## 2019-01-01 552838.9
## 2019-04-01 555897.8
## 2019-07-01 559222.7
An Application Programming Interface (API) is basically a messenger that takes a request, tells a system what you want to do, and returns the response back to you. A RESTful API is an API that uses HTTP requests to GET, PUT, POST, and DELETE data. The httr R package is a useful tool for working with HTTP. Each API has its own specific usage and documentation.
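As a minimal sketch of the request/response cycle with httr (httpbin.org is a public echo service, used here only for illustration):
library(httr)
# send a GET request with one query parameter
resp <- GET('https://httpbin.org/get', query = list(q = 'example'))
status_code(resp) # HTTP status code: 200 indicates success
content(resp) # parse the response body (JSON is converted to an R list)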
The API of the CRAN downloads database. Documentation is available here.
Example. Which was the most downloaded package of the last month?
baseurl <- 'https://cranlogs.r-pkg.org/' # API base url. See documentation
endpoint <- 'top/' # API endpoint. See documentation
period <- 'last-month/' # API parameter. See documentation
count <- 1 # API parameter. See documentation
url <- paste0(baseurl, endpoint, period, count) # build full url
x <- GET(url) # retrieve url
data <- content(x) # extract data
data # print data
## $start
## [1] "2019-12-08T00:00:00.000Z"
##
## $end
## [1] "2020-01-06T00:00:00.000Z"
##
## $downloads
## $downloads[[1]]
## $downloads[[1]]$package
## [1] "magrittr"
##
## $downloads[[1]]$downloads
## [1] "4221675"
The most downloaded package between 2019-12-08 and 2020-01-06 was magrittr, with a total of 4,221,675 downloads.
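As a small follow-up, the nested list returned by content() can be flattened into a data frame; a sketch assuming the structure shown above, with one list element per package (raising the count parameter would return more rows):
# bind the per-package list elements into one data frame
downloads <- do.call(rbind, lapply(data$downloads, as.data.frame))
downloads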
The API of KuCoin, a cryptocurrency exchange. Documentation is available here.
Example. Retrieve and plot Bitcoin price every minute in the last 24 hours.
library(xts) # the xts package is used below to build the time series object
# set GMT timezone. See documentation
Sys.setenv(TZ = 'GMT')
# API base url. See documentation
baseurl <- 'https://api.kucoin.com'
# API endpoint. See documentation
endpoint <- '/api/v1/market/candles'
# today and yesterday in seconds
today <- as.integer(as.numeric(Sys.time()))
yesterday <- today - 24*60*60
# API parameters. See documentation
param <- c(symbol = 'BTC-USDT', type = '1min', startAt = yesterday, endAt = today)
# build full url. See documentation
url <- paste0(baseurl, endpoint, '?', paste(names(param), param, sep = '=', collapse = '&'))
# retrieve url
x <- GET(url)
# extract data
x <- content(x)
data <- x$data
# formatting
data <- sapply(seq_along(data), function(i) {
  # extract single candle
  candle <- as.numeric(data[[i]])
  # name the fields: time, open, close, high, low. See documentation
  return( c(time = candle[1], open = candle[2], close = candle[3], high = candle[4], low = candle[5]) )
})
# convert to xts
datetime <- as.POSIXct(data[1,], origin = '1970-01-01')
data <- xts(t(data[-1,]), order.by = datetime)
# plot closing values
plot(data$close, main = 'Bitcoin price in dollars')
Web scraping is a technique for converting data that is available on the web in an unstructured format (HTML) into a structured format that can easily be accessed and used. The rvest package is a useful tool for scraping information from web pages.
Example. Write a function to retrieve articles from Google Scholar given a generic query string q.
library(rvest) # provides read_html(), html_nodes(), html_text() and the pipe
getArticles <- function(q){
  # build url
  url <- paste0('https://scholar.google.com/scholar?hl=en&q=', q)
  # sanitize url
  url <- URLencode(url)
  # get results
  res <- read_html(url) %>%          # get url
    html_nodes('div.gs_ri h3 a') %>% # select titles by css selector
    html_text()                      # extract text
  # return results
  return(res)
}
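The call that produced the listing below is not shown on the page; presumably it was along these lines (the query string is an assumption based on the results):
# query string assumed from the results shown below
getArticles('web scraping')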
## [1] "Automated data collection with R: A practical guide to web scraping and text mining"
## [2] "Web Scraping with Python: Collecting More Data from the Modern Web"
## [3] "A primer on theory-driven web scraping: Automatic extraction of big data from the Internet for use in psychological research."
## [4] "Web scraping and Naïve Bayes classification for job search engine"
## [5] "The use of web-scraping software in searching for grey literature"
## [6] "RCrawler: An R package for parallel web crawling and scraping"
## [7] "Web scraping with Python"
## [8] "Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation"
## [9] "Web scraping made simple with sitescraper"
## [10] "Web Scraping With R"
Download the full code used to generate this document and reproduce the examples. The file is in R Markdown, a format for making dynamic documents with R. An R Markdown document is written in markdown, an easy-to-write plain-text format, and contains chunks of embedded R code.
Download
Exercise: create the PDF version of this web page.
Hint: download the file above and have a look at the introductory 1-minute video of the official R Markdown guide.
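To get started, a minimal R Markdown skeleton that knits to PDF looks roughly like this (a sketch with placeholder metadata, not the actual source of this page):
---
title: "My document"
output: pdf_document
---

```{r}
# chunks of R code sit between ```{r} and ``` markers
head(mtcars)
```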