A comma-separated values (CSV) file is a delimited text file that generally uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by the delimiter. CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. R makes it easy to export and import data in CSV format.
Export data to a CSV file with write.csv()
Import data from a CSV file with read.csv()
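A minimal sketch of the round trip, using the built-in mtcars dataset (consistent with the output below). Note that write.csv() saves row names by default, and read.csv() imports them as a first column named X:
# export the built-in mtcars data frame to a CSV file
write.csv(mtcars, file = 'mtcars.csv')
# import it back: the saved row names become the first column, named 'X'
x <- read.csv('mtcars.csv')
head(x)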
## X mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Some data providers offer data in CSV format on their websites. One of these is the STOXX website, a financial index provider. Open this link for the EURO STOXX 50 Index: the tab Data -> Historical Data provides some freely available files of historical prices. Clicking on EUR Price will open this link. The read.csv() function can read this file directly from the internet.
# read.csv is very flexible. For the full list of arguments type ?read.csv
x <- read.csv('https://www.stoxx.com/document/Indices/Current/HistoricalData/h_3msx5e.txt', sep = ';')
head(x)
## Date Symbol Indexvalue X
## 1 08.10.2019 SX5E 3432.76 NA
## 2 09.10.2019 SX5E 3462.11 NA
## 3 10.10.2019 SX5E 3493.96 NA
## 4 11.10.2019 SX5E 3569.92 NA
## 5 14.10.2019 SX5E 3556.26 NA
## 6 15.10.2019 SX5E 3598.65 NA
rownames(x) <- as.Date(x[,1], format = '%d.%m.%Y') # assign rownames
x[,c(1,ncol(x))] <- NULL # drop the first and last column
head(x) # print data
## Symbol Indexvalue
## 2019-10-08 SX5E 3432.76
## 2019-10-09 SX5E 3462.11
## 2019-10-10 SX5E 3493.96
## 2019-10-11 SX5E 3569.92
## 2019-10-14 SX5E 3556.26
## 2019-10-15 SX5E 3598.65
The quantmod package provides a very convenient function for downloading financial data from the web, called getSymbols(). The function works with a variety of sources. For stocks and shares, the yahoo source is used. Symbols can be found here.
library(quantmod) # load quantmod for getSymbols()
# retrieve Facebook quotes
x <- getSymbols(Symbols = 'FB', src = 'yahoo', auto.assign = FALSE)
tail(x)
## FB.Open FB.High FB.Low FB.Close FB.Volume FB.Adjusted
## 2019-12-27 208.67 208.93 206.59 208.10 10284200 208.10
## 2019-12-30 207.86 207.90 203.90 204.41 10524300 204.41
## 2019-12-31 204.00 205.56 203.60 205.25 8953500 205.25
## 2020-01-02 206.75 209.79 206.27 209.78 12077100 209.78
## 2020-01-03 207.21 210.40 206.95 208.67 11188400 208.67
## 2020-01-06 206.70 212.78 206.52 212.60 17058900 212.60
For currencies and metals, the oanda source is used. Symbols are the instruments’ ISO codes separated by /. ISO codes can be found here.
# retrieve the historical euro/dollar exchange rate
x <- getSymbols(Symbols = 'EUR/USD', src = 'oanda', auto.assign = FALSE)
tail(x)
## EUR.USD
## 2020-01-01 1.121254
## 2020-01-02 1.119305
## 2020-01-03 1.115983
## 2020-01-04 1.115940
## 2020-01-05 1.115970
## 2020-01-06 1.118231
For economic series, the FRED source is used. Symbols can be found here.
# retrieve the historical Gross Domestic Product for Japan
x <- getSymbols(Symbols = 'JPNNGDP', src = 'FRED', auto.assign = FALSE)
tail(x)
## JPNNGDP
## 2018-04-01 549364.8
## 2018-07-01 546061.4
## 2018-10-01 545914.5
## 2019-01-01 552838.9
## 2019-04-01 555897.8
## 2019-07-01 559222.7
An Application Programming Interface (API) is basically a messenger that takes a request, tells a system what you want to do, and returns the response back to you. A RESTful API is an API that uses HTTP requests to GET, PUT, POST, and DELETE data. The httr R package is a useful tool for working with HTTP. Each API has its own specific usage and documentation.
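As a minimal sketch of the request/response cycle with httr (httpbin.org is a public echo service, used here only for illustration):
library(httr)
# send a GET request with one query parameter
resp <- GET('https://httpbin.org/get', query = list(q = 'example'))
status_code(resp) # HTTP status code: 200 indicates success
content(resp) # parse the response body (JSON is converted to an R list)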
The API of the CRAN downloads database. Documentation is available here.
Example. Which was the most downloaded package of the last month?
baseurl <- 'https://cranlogs.r-pkg.org/' # API base url. See documentation
endpoint <- 'top/' # API endpoint. See documentation
period <- 'last-month/' # API parameter. See documentation
count <- 1 # API parameter. See documentation
url <- paste0(baseurl, endpoint, period, count) # build full url
x <- GET(url) # retrieve url
data <- content(x) # extract data
data # print data
## $start
## [1] "2019-12-08T00:00:00.000Z"
##
## $end
## [1] "2020-01-06T00:00:00.000Z"
##
## $downloads
## $downloads[[1]]
## $downloads[[1]]$package
## [1] "magrittr"
##
## $downloads[[1]]$downloads
## [1] "4221675"
The most downloaded package between 2019-12-08 and 2020-01-06 was magrittr, with a total of 4,221,675 downloads.
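As a small follow-up, the nested list returned by content() can be flattened into a data frame; a sketch assuming the structure shown above, with one list element per package (raising the count parameter would return more rows):
# bind the per-package list elements into one data frame
downloads <- do.call(rbind, lapply(data$downloads, as.data.frame))
downloads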
The API of KuCoin, a cryptocurrency exchange. Documentation is available here.
Example. Retrieve and plot Bitcoin price every minute in the last 24 hours.
library(xts) # the xts package is used below to build the time series object
# set GMT timezone. See documentation
Sys.setenv(TZ = 'GMT')
# API base url. See documentation
baseurl <- 'https://api.kucoin.com'
# API endpoint. See documentation
endpoint <- '/api/v1/market/candles'
# today and yesterday in seconds
today <- as.integer(as.numeric(Sys.time()))
yesterday <- today - 24*60*60
# API parameters. See documentation
param <- c(symbol = 'BTC-USDT', type = '1min', startAt = yesterday, endAt = today)
# build full url. See documentation
url <- paste0(baseurl, endpoint, '?', paste(names(param), param, sep = '=', collapse = '&'))
# retrieve url
x <- GET(url)
# extract data
x <- content(x)
data <- x$data
# formatting
data <- sapply(seq_along(data), function(i) {
  # extract single candle
  candle <- as.numeric(data[[i]])
  # name the fields: time, open, close, high, low. See documentation
  return( c(time = candle[1], open = candle[2], close = candle[3], high = candle[4], low = candle[5]) )
})
# convert to xts
datetime <- as.POSIXct(data[1,], origin = '1970-01-01')
data <- xts(t(data[-1,]), order.by = datetime)
# plot closing values
plot(data$close, main = 'Bitcoin price in dollars')
Web scraping is a technique for converting data that is available on the web in an unstructured format (HTML) into a structured format that can easily be accessed and used. The rvest package is a useful tool for scraping information from web pages.
Example. Write a function to retrieve articles from Google Scholar given a generic query string q.
library(rvest) # provides read_html(), html_nodes(), html_text() and the pipe
getArticles <- function(q){
  # build url
  url <- paste0('https://scholar.google.com/scholar?hl=en&q=', q)
  # sanitize url
  url <- URLencode(url)
  # get results
  res <- read_html(url) %>%          # get url
    html_nodes('div.gs_ri h3 a') %>% # select titles by css selector
    html_text()                      # extract text
  # return results
  return(res)
}
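The call that produced the listing below is not shown on the page; presumably it was along these lines (the query string is an assumption based on the results):
# query string assumed from the results shown below
getArticles('web scraping')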
## [1] "Automated data collection with R: A practical guide to web scraping and text mining"
## [2] "Web Scraping with Python: Collecting More Data from the Modern Web"
## [3] "A primer on theory-driven web scraping: Automatic extraction of big data from the Internet for use in psychological research."
## [4] "Web scraping and Naïve Bayes classification for job search engine"
## [5] "The use of web-scraping software in searching for grey literature"
## [6] "RCrawler: An R package for parallel web crawling and scraping"
## [7] "Web scraping with Python"
## [8] "Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation"
## [9] "Web scraping made simple with sitescraper"
## [10] "Web Scraping With R"
Download the full code used to generate this document and reproduce the examples. The file is in R Markdown, a format for making dynamic documents with R. An R Markdown document is written in markdown, an easy-to-write plain-text format, and contains chunks of embedded R code.
Download
Exercise: create the PDF version of this web page.
Hint: download the file above and have a look at the introductory 1-minute video of the official R Markdown guide.
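To get started, a minimal R Markdown skeleton that knits to PDF looks roughly like this (a sketch with placeholder metadata, not the actual source of this page):
---
title: "My document"
output: pdf_document
---

```{r}
# chunks of R code sit between ```{r} and ``` markers
head(mtcars)
```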