Introduction to the R programming language

class: center, middle, inverse, title-slide

# Introduction to the R programming language
## IOS Regensburg
### Christoph Rust
### January 07/08 2020

---

## Copyright

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

---
## Prerequisites

- Laptop with R and RStudio installed
- We will make use of the following directories:
  + R_Code/
  + R_Data/
  + R_Graphics/
  + R_Tables/

---
## Aim of this course

- Basic understanding of the R language (and how it differs from using Stata)
- Enable you to find help on your own
- How to prepare data with `dplyr`
- Data visualization with `ggplot2` and base graphics
- Econometric/statistical models in R
- Making use of spatial data
- We will work on several problem sets throughout the course

---
## Agenda

1. [General information](#general)

2. [Introduction](#introduction)

3. [Objects in R and language elements](#objects)

4. [Input/Output](#io)

5. [Data preparation (dplyr)](#data)

6. [Data visualization (ggplot2)](#graphics)

7. [Econometric models in R](#models)

8. [Further topics](#further-topics)

---
## References

### Some textbooks

- Ligges, U. (2008), [Programmieren mit R](https://www.springer.com/de/book/9783540799979), Springer.
- Kleiber, C. & Zeileis, A. (2008), [Applied Econometrics with R](https://www.springer.com/de/book/9780387773162), Springer.
- Braun, J. & Murdoch, D. (2007), [A first course in statistical programming with R](http://einspem.upm.edu.my/wopr2017/2016.pdf)
- Wickham, H., [Advanced R](https://adv-r.hadley.nz/)
- Slides: [www.christophrust.de/R-intro/slides](http://www.christophrust.de/R-intro/slides/index.html)
- [R Website](https://www.r-project.org/)

---
class: inverse, center, middle
name: general
# 1. General information

---
## General intro

**R** is...

- a free implementation of the programming language **S**; the purposes of **S** were:
  + working interactively with data
  + letting the user easily become a programmer
  + nice graphics for data visualisation
  + making code reusable
- an interpreted language (the REPL evaluates expressions)
- a *functional* language (functions are first-class objects)
- an *object-oriented* language (has classes and methods)
- a *vectorized* language (objects are internally represented as vectors)

- development of R started in 1992, version 2.0.0 released in 2004
- influenced by Scheme (a Lisp variant)
- R itself developed by the R core team, many user-contributed packages

---
### Advantages of R

- free (open-source, GPL-2/GPL-3) software; the full source code can be viewed and checked
- very close to (statistical) research
- easily extendable with packages
- runs on almost all platforms

### Disadvantages of R

- no built-in graphical user interface (but RStudio)
- no interactive graphics (but Shiny)
- interpreted language, therefore sometimes slow compared to compiled languages.
  Compiled code (C/C++, Fortran, Rust) can be included to get around this.

---
class: inverse, center, middle
name: introduction
# 2. Introduction

---
## How to work with R

- The program R itself is an interpreter for the R language
- Similar to the Stata console, "commands" are entered and evaluated when *Return* is hit (newline character)
- "commands" are *expressions* (a symbolic description for what to do) which return a *value* when evaluated
- Interpreter stores a history, accessible via arrow keys <em>(&#8593;, &#8595;)</em>

```r
> sin(0)
> 2 - 1
> 0/0        # -> NaN (not a number)
> Inf-Inf    # -> NaN (not a number)
> 2 + 3*4    # PEMDAS (parentheses, exponents, multiplication/division,
>            #        addition/subtraction)
> 2 +
+   1        # interpreter evaluates only when expression complete
```

```
## [1] 0
## [1] 1
## [1] NaN
## [1] NaN
## [1] 14
## [1] 3
```

---
## Simple graphics:

```r
set.seed(123)
x <- runif(100)
y <- x + rnorm(100, sd = 0.1)
plot(x,y)
```

---
## R scripts

There are a lot of simple calculations possible on the command line, but as things get more complicated, *scripts* should be used.

A script is a text file (usually ending with '.R') containing R code.

Example:

```r
## change working directory
setwd("C:/Users/Max/R-Code")

## load some data
myData <- read.table("all_important_data.csv",
                     sep = ";", header = TRUE)

## Summary
summary(myData)
```

---
## Editors

Text files are edited with text editors. Some editors make working with R easier:

- RStudio (used in this course)
  + Closest to a GUI
  + Plot, overview of objects and packages in one window
  + Extra click functions like loading data
  + Highly recommended especially for beginners

- Notepad++ in conjunction with NppToR (Notepad++ to R)
  + Flexible editor for all possible programming languages, txt files etc.
  + Flexibly extensible

---
## Editors

- (X)Emacs with ESS
  + Extremely versatile and powerful editor that requires some training
  + Freely configurable
  + Runs on all platforms
  + Editor is suitable for all possible use cases (LaTeX, email, git, ...)

---
## RStudio

- Structure: 4 windows
  + top left: code editor, data.frame view (**Ctrl + 1**)
  + lower left: R console (**Ctrl + 2**)
  + top right: Display of objects in the global environment (**Ctrl + 8**), History (**Ctrl + 4**)
  + lower right: Help (**Ctrl + 3**), Plots (**Ctrl + 6**), Packages (**Ctrl + 7**),...

---
## Finding help on a specific topic

To get help on a *known* function, you can either search for the function in the RStudio Help tab or enter the following in the R console:

```r
?getwd
```
<img src="figures/getwd-help.png" style="width: 60%" align="center" />

---
## Finding help on a specific topic

If you are looking for an *unknown* function (e.g. a function that performs the t-test), a web search (e.g. with Google) is usually the best starting point.

Example:
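
Besides a web search, R's own search tools can be tried first. A minimal sketch using base R functions only; the search term `"t-test"` and the restriction to the `stats` package are just illustrative choices:

```r
## full-text search in the installed help pages (shorthand: ??"t-test");
## print 'hits' in an interactive session to browse the results
hits <- help.search("t-test", package = "stats")

## list objects on the search path whose name matches a pattern
apropos("^t\\.test")

## once the name is known, the help page can be opened with ?t.test
exists("t.test")
```

`help.search()` only covers installed packages, so it complements rather than replaces a web search.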

---
## Finding help on a specific topic

- Each package has its own documentation (contains the help files), often vignettes (more detailed explanations)
- [stackoverflow.com](https://stackoverflow.com)
- Mailing list *R-help*

---
## R as a calculator

```r
1 + 2          # -> 3
1 + (2 * 4)    # -> 9
a <- 3
b <- 3 * a     # -> 9
sqrt(b)        # -> 3
```

---
## Functions

Almost all *expressions* make use of *functions*. Operators, assignments, flow control, ... are functions.

- Functions are called by entering the function name followed by its arguments in parentheses, separated by commas:
`functionname(argument1 = value1, argument2 = value2, ...)`

- Arguments do not necessarily have to be passed by name:
`functionname(value1, value2, ...)`

- In that case, however, the arguments must be given in the correct order

- Very often, functions have default arguments, which only have to be specified if one wants to pass a value different from the default
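
The calling conventions can be illustrated with the base function `seq()`, which has the arguments `from`, `to` and `by` (the numbers are arbitrary):

```r
seq(from = 2, to = 8, by = 2)   # arguments passed by name       -> 2 4 6 8
seq(2, 8, 2)                    # passed by position, same result
seq(by = 2, to = 8, from = 2)   # named arguments may come in any order
seq(2, 8)                       # 'by' is omitted, the default step 1 is used
```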

---
## Assignment

The symbol `<-` is used to assign values to variables (symbolic names for objects stored in memory, not variables of a data frame)
- the value of the right-hand side (rhs) is stored in the variable named on the left-hand side (lhs)
- any object (data structure, function) can be bound to a name by assignment

```r
a <- (3 * 4)
(3 * 4) -> a    # the same but not recommended!
a = (3 * 4)     # can also be used but is not equivalent to '<-'
a<-(3*4)        # not so easy to read
```

- For readability, there should always be spaces around the assignment operator.
- Variable names must begin with a letter (or a period that is not followed by a digit) and may also contain digits, periods, and underscores
- A uniform naming scheme for objects is recommended; more on this later

---
## Environments

- Every variable name is bound to an environment, a data structure on its own that powers *scoping* (next slide)
- Most often we will assign variables in the global environment (`.GlobalEnv`)
- During runtime, functions have their own environment
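
A minimal sketch of the last point, with made-up names (`counter`, `increment`): an assignment inside a function creates a binding in the function's own environment and leaves the outer variable alone.

```r
counter <- 0                      # bound in the calling environment
increment <- function() {
    counter <- counter + 1        # creates a local 'counter' in the
    counter                       # function's own environment
}
increment()                       # 1
increment()                       # 1 again: every call gets a fresh environment
counter                           # 0: the outer variable is untouched
environmentName(globalenv())      # "R_GlobalEnv"
```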

---
## Scoping

- A set of rules describing *how* (not when) R looks up variables (the values bound to given symbols)
- `search()` returns the search path, a set of environments. If a variable is not found in the current environment, it is looked up in the next one in the search path

```r
f <- function(x) x + z
f(1)     # error: z not found (neither in function's env nor .GlobalEnv)
z <- 2
f(2)     # now found in .GlobalEnv
```
- More examples to come

---
## Logical operators

Examples:

```r
TRUE & FALSE   # FALSE
TRUE & TRUE    # TRUE
TRUE | FALSE   # TRUE
!TRUE | FALSE  # FALSE
FALSE && TRUE  # the second operand (TRUE) is not evaluated
TRUE && TRUE   # the second operand (TRUE) is evaluated
TRUE || FALSE  # the second operand (FALSE) is not evaluated
```

---
## Logical operators

More examples:

```r
c(TRUE, FALSE) & c(TRUE, TRUE)  # [1] TRUE FALSE -> vectorized
c(TRUE, FALSE) && c(TRUE, TRUE) # not vectorized: operands must have length 1
                                # (error since R 4.3.0, older versions: TRUE)
```

Quantifiers:

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, TRUE)
any(a)         # [1] TRUE
all(a)         # [1] FALSE
all(b)         # [1] TRUE
```

---
## Logical operators

A test such as `x == NA` does not work because any comparison with `NA` yields `NA`. To check for `NA`, use the function `is.na()`.

Examples:

```r
a <- c(TRUE, NA, TRUE)
a == NA           # [1] NA NA NA
is.na(a)          # [1] FALSE TRUE FALSE
```
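A few related base R idioms for missing values, shown as a short sketch with a made-up vector:

```r
x <- c(4, NA, 6)
anyNA(x)                 # TRUE: at least one missing value
sum(is.na(x))            # 1: number of missing values
mean(x)                  # NA: missing values propagate
mean(x, na.rm = TRUE)    # 5: missing values are removed first
x[!is.na(x)]             # 4 6: keep only the observed values
```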

---
## Some useful functions

- `ls()` shows the existing objects in the global environment
- `str()` shows the structure of an object
- `rm()` deletes an object from the global workspace
- `getwd()` shows the current working directory
- `setwd()` changes the working directory
  + Under Windows the path separator is either `/` or `\\`
  + Under Linux/Mac always `/`
- `save()` saves objects as a `.RData` file.
- `load()` loads objects into the global environment
- `list.files()` displays files in the specified directory
- `source()` executes an R-script
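
A short sketch of how some of these functions work together. The object `scores` and the file name are hypothetical, and a temporary directory is used so that nothing is written to your project folders:

```r
scores <- c(1.5, 2.5)                          # hypothetical example object
rdata_file <- file.path(tempdir(), "scores.RData")

save(scores, file = rdata_file)                # write the object to disk
rm(scores)                                     # remove it from the workspace
exists("scores")                               # FALSE
load(rdata_file)                               # restore it under its original name
scores                                         # 1.5 2.5
list.files(tempdir(), pattern = "\\.RData$")   # the saved file is listed here
```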

---
## Extensibility via R packages

On [CRAN](https://cran.r-project.org/) there are a lot of different packages for all possible applications. Thus the (relatively small) basic system can be extended at will. R is delivered with some standard packages but for specific topics, packages have to be installed later.

```r
install.packages("AER")  # installs a package that is not yet installed
library(AER)             # loads and attaches the package (exported objects
                         # can be accessed)
data(CASchools)          # e.g. data frame "CASchools"
?ivreg                   # help for function ivreg
```

- `search()` shows the search path, this includes loaded packages:

```r
search()
```

```
##  [1] ".GlobalEnv"             "package:dplyr"          "package:ggplot2"       
##  [4] "package:R.utils"        "package:R.oo"           "package:R.methodsS3"   
##  [7] "package:RNetCDF"        "package:xaringanthemer" "package:stats"         
## [10] "package:graphics"       "package:grDevices"      "package:utils"         
## [13] "package:datasets"       "package:methods"        "Autoloads"             
## [16] "package:base"
```

---
If you use Linux, it is likely that R packages are packaged for your distribution. Debian, for instance, has most of the CRAN packages in the testing branch; fewer (and more outdated) packages are available in the stable branch. I always do

```bash
sudo apt install -t testing r-cran-{pkgname}
```
For this to work, you have to add testing to your sources list and adjust the apt settings:

```bash
echo 'APT::Default-Release "stable";' \
   | sudo tee -a /etc/apt/apt.conf.d/99defaultrelease
echo 'deb http://ftp.de.debian.org/debian/ testing main contrib non-free' \
   | sudo tee -a /etc/apt/sources.list.d/testing.list
echo 'deb-src http://ftp.de.debian.org/debian/ testing main contrib non-free' \
   | sudo tee -a /etc/apt/sources.list.d/testing.list
sudo apt update
```

---
## Differences from Stata
- Stata (as a language) is more procedural, R is more functional

- In Stata, you can only work with one data set at a time; in R, one can create as many objects as one wants (data frames, estimation objects, ...)

- Stata has macros (``local'` (available in the do-file) and `$global`), R has variables bound to environments (availability has nothing to do with the text files where they were created)

- R is closer to Mata

---
## Exercise 2.1

Create a file `test.R` in your code folder. The script in this file should assign the number `5` to the object `x` and create the object `y` with the value `6`. Before you run `test.R` with the `source()` function, clear your entire workspace. After that, inspect the workspace, calculate the product of the two numbers, delete the object `x`, and save the rest of the workspace in ".RData" format in the R_Data folder.

Note: In general, it is better not to switch the working directory back and forth,
 but to pass the explicit path of the code folder to the `source()` command.

## Exercise 2.2*
Test R as a calculator:
1. calculate the value of the sine function at the position 0
2. define `x` as the number 2 and calculate twice the third power of `x`

---
## Exercise 2.3
Look for an R package that provides functions to test linear hypotheses in the multiple regression model. Install the package and display the help for a function.

---
## Packages used in this course

To be able to run all the code in these slides, install the following packages:

```r
pkglist <- c("AER", "car", "dplyr", "ggplot2", "lmtest",
             "sandwich", "plm", "rgdal", "RNetCDF", "R.utils",
             "readxl", "tidyr", "wbstats", "tmap", "stargazer")
install.packages(pkglist)
```

If you are on Debian stable and have testing in your sources list, you can do something like

```bash
sudo apt install -t testing\
    $(for v in aer car dplyr ggplot2 lmtest sandwich plm rgdal\
    rnetcdf r.utils readxl sf units raster rcolorbrewer\
    viridislite classint htmltools lwgeom; do echo r-cran-$v; done)

R -e 'install.packages(c("tidyr","tmap", "wbstats", "stargazer"))'
```

---
class: inverse, center, middle
name: objects
# 3. Objects in R and language elements

---
class: middle
> To understand computations in R, two slogans are helpful:
>
> * Everything that exists is an object.
> * Everything that happens is a function call.
>
> --- John Chambers

---
## Probably the most important object: function

A function is a program construct that executes a procedure on provided objects and returns a result.

Several function call types are available in R:
- __prefix__: the majority of functions in R like `sum(a, b)`
- __infix__: all operators are infix functions, e.g. `2 + 3` or `10^2`
- __replacement__: these functions modify their argument, e.g. `names(x) <- c("first", "second")`
- __special__: special language elements like `if`, `for`, `while`, `[[`,...

For every function, there is a prefix variant.

```r
log(2.3)
sin(2)
2 + 3    # infix
`+`(2,3) # prefix
```
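The same holds for the other call types. A short sketch with arbitrary values; the backtick-quoted names `` `names<-` ``, `` `if` `` and `` `[[` `` are the regular function names behind the respective syntax:

```r
x <- c(1, 2)
names(x) <- c("first", "second")                 # replacement call, usual form
y <- `names<-`(c(1, 2), c("first", "second"))    # the same in prefix form
identical(x, y)                                  # TRUE

`if`(2 > 1, "yes", "no")       # "yes": even 'if' is a function
`[[`(list(10, 20), 2)          # 20: extraction in prefix form
sapply(1:3, `-`)               # -1 -2 -3: operators can be passed like any function
```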

---
## Functions

There are functions with and without side effects:

- Functions without side effects accept objects, perform an operation on them, and return the result (and nothing else). Example: `log()`

- Functions with side effects also change state outside the function, e.g. objects in the global workspace. Examples: `setwd()`, `'<-'()` (assignment is also a function!)

`$\Rightarrow$` When developing functions, side effects should be avoided if possible (unless they are explicitly desired)
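
A minimal sketch of the difference, using two made-up functions. The operator `<<-` (assignment in an enclosing environment) is used here only to produce a side effect on purpose:

```r
total <- 0

## no side effect: the result depends only on the arguments
add_pure <- function(total, x) total + x

## side effect: '<<-' modifies 'total' outside the function
add_impure <- function(x) {
    total <<- total + x
    invisible(total)
}

add_pure(total, 5)    # 5
total                 # 0: unchanged
add_impure(5)
total                 # 5: changed by the call
```

The second variant is harder to reason about because its effect depends on, and alters, the state outside the function.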

---
## How to define a function

```r
## simple function
product1 <- function(x1, x2) x1 * x2

## default arguments
product2 <- function(x1 = 1, x2 = 2) x1 * x2

## curly brackets are useful with more lines of code
product3 <- function(x1 = 1, x2 = 2){
  x1 * x2
}
```
- In the above example, curly braces are not necessary, but as soon as the function performs several operations, they should be used

- Functions either return the last evaluated expression or the argument of the `return()` function.

---
## Function call

```r
f1 <- function(x1, x2) x1 + 2* x2
f1                      # print the function definition
                        #    (print-method on the function)
f1()                    # error: f1 requires 2 args

f1(x1=2, x2=5)          # -> 12
f1(x1=2, x2=5, y=5)     # error: three args are too many
```

#### What happens when you call the function?
`f1(x1 = 1,x2 = 2)`

During the runtime of the function, a new environment is created in which two variables are available: `x1` with the value 1 and `x2` with the value 2. These are used to evaluate `x1 + 2 * x2`, and the result (here 5) is returned.

---

Not all functions require arguments, for example `getwd()`

```r
f2 <- function() x1 * x2
x1 <- 2
x2 <- 3
f2              # show function definition
f2()            # function call
f2(x1=2)        # produces an error, function does
                # not accept any argument
f2(2)           # error for the same reason
```
`f2` has no arguments. This means that the variables `x1` and `x2` are not available in the environment created at runtime. The scoping rules then define that these objects are looked up in the enclosing environment (here the global environment). If `x1` and `x2` do not exist in any environment, an error is raised:

```r
rm(x1,x2)
f2()            # error
```

---

Example for a somewhat more complicated function

```r
f3 <- function(x1,x2) {
    z <- x1 + x2
    abc <- x1/z
    return(abc)   # return explicitly the value abc and
                  # terminate function
    abc <- 123    # not evaluated any more
}
f3(1,2)             # -> 0.3333333
```

---
## Three dots ellipsis

Something one finds relatively often is the so-called *three-dots ellipsis* (`...`).
It means that the function is designed to take any number of named or unnamed arguments and pass them on to inner function calls:

```r
f4 <- function(x, ...) {
    log(x, ...)^2
}
```
All arguments not named in the function definition of `f4` are passed to `log()` via the `...`:

```r
f4(5)
log(5, base=10)^2             # log() has an argument "base"
f4(5, base = 10)              # is passed here
f4(5, base = 10, arg3 = "a")  # error: arg3 is not an arg of log()
```
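Two further uses of `...`, as a sketch with hypothetical helper functions (`count_args`, `scaled_mean`): collecting the arguments with `list(...)` and forwarding them next to named arguments.

```r
count_args <- function(...) {
    args <- list(...)          # collect everything passed via '...'
    length(args)
}
count_args(1, "a", TRUE)       # 3

scaled_mean <- function(x, scale = 1, ...) {
    mean(x, ...) * scale       # '...' is forwarded to mean()
}
scaled_mean(c(1, NA, 3), scale = 10, na.rm = TRUE)   # 20
```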

---
## Methods

There are also *generic* functions that dispatch to methods, especially `print()`, `summary()`, `plot()`. These perform different operations depending on the class of the passed object.

```r
plot(cos, -1, 1)  # 'plot' called on object 'cos'
x <- 1:5
y <- 6:10
plot(x, y)        # 'plot' called on two vectors 'x' and 'y'
```

There are three class systems in R: S3, S4, and reference classes (the latter being closest to classical OOP).
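
A minimal sketch of S3 dispatch. The class `"temperature"` and its constructor are invented for illustration; the naming pattern `generic.class` is what connects a method to its generic:

```r
## constructor for a hypothetical S3 class "temperature"
temperature <- function(celsius) {
    structure(list(celsius = celsius), class = "temperature")
}

## print method: called by print() for objects of this class
print.temperature <- function(x, ...) {
    cat("Temperature:", x$celsius, "degrees Celsius\n")
    invisible(x)
}

## summary method for the same class
summary.temperature <- function(object, ...) {
    c(min = min(object$celsius), max = max(object$celsius))
}

t1 <- temperature(c(18.5, 21, 19.5))
print(t1)       # dispatches to print.temperature()
summary(t1)     # dispatches to summary.temperature()
class(t1)       # "temperature"
```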

---
## Look up source code of functions

R is not a black box, see also the [article (p. 43)](https://cran.r-project.org/doc/Rnews/Rnews_2006-4.pdf) by Uwe Ligges:

>"When looking at R source code, sometimes calls to one of the following functions show up:
>
>`.C()`, `.Call()`, `.Fortran()`, `.External()`, or `.Internal()` and `.Primitive()`. These functions are calling entry points in compiled code such as shared objects, static libraries or dynamic link libraries. Therefore, it is necessary to look into the sources of the compiled code, if complete understanding of the code is required. ...
>The first step is to look up the entry point in file `$R_HOME/src/main/names.c`, if the calling R function is either `.Primitive()` or `.Internal()`."

---
## Exercise 3.1
1. Create a function named "getSquaredSum" which takes two numbers
   and calculates and returns the squared sum of the numbers.
   Test this function in two ways.
   Now set the default value of the second argument to zero.

2. 1st version: create a function where in the function body another function is defined and called.<br>
   2nd version: now define the "inner" function outside the outer. Which version do you like better?
   Check that everything works and gives the same values!

---
## Atomic types

Atomic data types are the building blocks of more complex data structures (vectors, matrices, lists,...)

- `NULL`: empty set
- `logical`: `TRUE`/`FALSE`
- `integer`: -2,-1,0,1,2,...
- `numeric`: real numbers (double precision; `typeof()` reports `"double"`)
- `complex`: complex numbers
- `character`: character strings
- see also `?typeof`

```r
typeof(2.3)    # double
typeof(TRUE)   # logical
typeof("abc")  # character
typeof(log)    # special
```

To test for a specific type, use `is.{type}()`; for instance, `is.numeric(1.23)` returns `TRUE`. To convert (if possible) to a specific type, use `as.{type}()`; `as.character(1.23)` returns the character string `"1.23"`.
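
A few such tests and conversions, as a short sketch with arbitrary values:

```r
typeof(1L)                              # "integer"
typeof(1)                               # "double"
is.numeric(1.23)                        # TRUE
is.character(1.23)                      # FALSE
as.character(1.23)                      # "1.23"
as.integer("12")                        # 12
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA
suppressWarnings(as.numeric("abc"))     # NA: conversion not possible
```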

---
## Vectors

Vectors are the basic structure in R and consist of several elements of an atomic data type. With the function `c()` vectors can be generated:

```r
x <- c(1, 2.3, 5, 1)
x <- c(2, 2, 2, x)      # extend vector x
y <- c("Test", "Hallo")
y <- c(y, x)            # everything is converted to 'character',
                        # the most general type
x <- c(x, NA)           # but NA does not change type
```

---
## Construction of vectors

Several possibilities to construct vectors
- integer sequences: `1:5`, alternatively: `seq(1,5)`
- any sequence: `seq(from, to, by)`
- repetitions: `rep()`

```r
2:4              # -> 2,3,4
seq(2,8,2)       # -> 2,4,6,8
rep(2, 4)        # -> 2,2,2,2
x <- 1:3
rep(x, 2)        # -> 1,2,3,1,2,3
rep(x, each = 2) # -> 1,1,2,2,3,3
```

It is possible to give names to the elements of vectors:

```r
x <- c(one = 2.4, two = 3, three = 4, last = 2)
```
---
## Vectorized operations

All basic mathematical operations operate *vectorized*

```r
c(1,2,3) + c(1,1,1)  # -> 2,3,4
c(1,2) * c(1,4)      # -> 1,8
```

Take care: if the two operands in a vectorized operation are not of the same length, R recycles the shorter one to the length of the longer vector. You only get a warning if the longer length is not a multiple of the shorter one.

```r
c(1,2,4) * 2         # 2,4,8 -> second obj is recycled to c(2,2,2)
c(1,2,4) * c(2,3)    # 2,6,8 -> warning
c(1,2,4,8) * c(2,3)  # 2,6,8,24 -> works fine
```

---
## Indexing of vectors

To access elements of vectors, one has several possibilities:
- numerical indexing (works also vectorized): `x[c(2,3,7)]` returns 2nd, 3rd and 7th element of vector `x`
- logical indexing: `x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]` returns 1st and 3rd element from `x` (given that `x` contains 5 elements)
- elements of named vectors can also be accessed by their names: `x["one"]` returns element with name "one"
- `x[-1]` returns all elements but first
- `x[ x  > 2 ]` returns all elements larger than 2 (given `x` is of type numeric)
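
The indexing variants applied to a small named vector (the same one as on the vector construction slide); `which()` and the assignment in the last lines are additions for illustration:

```r
x <- c(one = 2.4, two = 3, three = 4, last = 2)
x[c(2, 3)]                        # 2nd and 3rd element
x[c(TRUE, FALSE, TRUE, FALSE)]    # 1st and 3rd element
x["one"]                          # element named "one"
x[-1]                             # everything except the first element
x[x > 2]                          # elements larger than 2
which(x > 2)                      # positions of those elements: 1 2 3
x[x > 2] <- 0                     # indexing also works on the lhs of an assignment
x                                 # 0 0 0 2
```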

---
## Some useful vector functions
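
A short, non-exhaustive selection of base R functions that are often used with vectors (the example vector is made up):

```r
x <- c(3, 1, 4, 1, 5)
length(x)      # 5: number of elements
sum(x)         # 14
mean(x)        # 2.8
range(x)       # 1 5: minimum and maximum
sort(x)        # 1 1 3 4 5
rev(x)         # 5 1 4 1 3: reversed order
unique(x)      # 3 1 4 5: duplicates removed
cumsum(x)      # 3 4 8 9 14: cumulative sums
which.max(x)   # 5: position of the largest element
```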

---
## Exercise 3.2
Write a function `vectorSum` that calculates the sum of two vectors and returns as a value (not as print output) the character "The sum is [value]".
Test with the vectors:

```r
xy <- c(1,2,3)
yx <- c(4,5,6)
vectorSum(xy,yx)   # -> "The sum is [5,7,9]"
```
Note: see the help for the `paste()` function.

---
## Matrices

Matrices are vectors with an additional dimension info:
- construction: `matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE)`

```r
matrix(data = 1:9, ncol = 3, nrow = 3) # column-major
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
```
If `data` is not long enough, it is again recycled:

```r
matrix(data = 1, ncol = 3, nrow = 3)
```

```
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
## [3,]    1    1    1
```

---
## Matrices - Indexing

Indexing is analogous to indexing of vectors, only that we need column and row index

```r
x <- matrix(data = 1:9, ncol = 3, nrow = 3, byrow = TRUE)
x[1,3]    # returns element in 1st row, 3rd column
x[1,]     # returns first row as vector
x[,2]     # returns second column as vector
x[,2, drop = FALSE] # 2nd column as 3 x 1 matrix
x[x>2]    # returns elements of x larger than 2 (as vector)
```

- With `cbind()` and `rbind()` one can stack matrices column- and row-wise
- `dim()` returns the dimensions of the matrix
---
## Some useful functions for matrices
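
A short, non-exhaustive selection of base R functions for matrices, shown on a small made-up example:

```r
A <- matrix(1:4, nrow = 2)     # columns: (1, 2) and (3, 4)
B <- diag(2)                   # 2 x 2 identity matrix
dim(A)                         # 2 2
t(A)                           # transpose
A %*% B                        # matrix product (here equal to A)
A * A                          # element-wise product, not the matrix product
solve(A)                       # inverse of A
det(A)                         # determinant: -2
cbind(A, c(5, 6))              # append a column -> 2 x 3
rbind(A, c(5, 6))              # append a row    -> 3 x 2
```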

---
## Arrays

Matrices are two-dimensional objects, there are also multi-dimensional objects (arrays): `array(data, dim)`, where `dim` is a vector specifying how large the array is in each dimension.

```r
array(1:30, dim = c(3,3,5))
```
is an array of dimension `$3 \times 3 \times 5$`.

- indexing is the same as with matrices
- basic mathematical operations operate elementwise:

```r
A <- matrix(1:6, nrow = 2)
A^2
```

```
##      [,1] [,2] [,3]
## [1,]    1    9   25
## [2,]    4   16   36
```
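Indexing with three dimensions, as a brief sketch on a made-up array (elements are filled column-major, first index running fastest):

```r
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)          # 2 3 4
arr[2, 3, 4]      # 24: a single element
arr[1, 1, 2]      # 7
arr[, , 1]        # first 2 x 3 slice, returned as a matrix
dim(arr * 2)      # 2 3 4: arithmetic is element-wise and keeps the shape
```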

---
## Exercise 3.3
Calculate for the matrix `A = Mat_A`

`Mat_A <- matrix(1:9, ncol = 3)`

and the vector `b = vec_b`

`vec_b <- 12:14`

the matrix product `$A\cdot b$` and the element-wise (component-wise) product. Explain the differences.

---
## Exercise 3.4

Enter the objects `$y = (3,5,2,8,6,4,7)'$` and the matrix `$X$`, whose first column consists of ones and whose second column contains the entries `$(4,3,7,1,3,7,5)'$`, in R.

1. Print the 3rd observation of `X` and `y`, respectively

2. Calculate the quantities `$X'X$`, `$X'y$`, `$(X'X)^{-1}$` and the OLS estimator of the linear regression model.

3. Create a function which, given arguments `y` and `X`, returns a vector with the OLS estimate.

---
## Lists

Vectors and matrices have one restriction, namely, all elements have to be of the same atomic type.

A more flexible object is a list. Lists may contain anything. They can be created with the function `list()`.

```r
L1 <- list(
  a = 1:3,
  A = matrix(1:9,3,3),
  w = "Hallo!"        # named list with different atomic types
)
```

Lists again may contain lists:

```r
L2 <- list(
  a = 1:3,
  l1 = L1 # list in another list
)
```

---
## Lists - indexing

In order to access elements of a list, double brackets `[[]]` are necessary; single brackets (`[]`) would return only a sublist. Again, indexing can be done numerically, logically or by name:

```r
L1[[1]]    # 1,2,3     -> vector
L1[1]      # list(1:3) -> still a list (sublist of L1)
L1[["w"]]  # "Hallo!"
```

Elements of named lists can also be accessed via `$` operator

```r
L1$w       # "Hallo!"
```
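Access can be chained for nested lists, and assignment adds or removes elements. A sketch that rebuilds the two lists from the previous slide so that it runs on its own:

```r
L1 <- list(a = 1:3, A = matrix(1:9, 3, 3), w = "Hallo!")
L2 <- list(a = 1:3, l1 = L1)

L2$l1$w                     # "Hallo!": chained access with $
L2[["l1"]][["a"]][2]        # 2: second entry of 'a' inside the inner list
L1$new <- c(TRUE, FALSE)    # assignment adds a new element
names(L1)                   # "a" "A" "w" "new"
L1$w <- NULL                # assigning NULL removes an element
length(L1)                  # 3
```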

---
## Exercise 3.5
- Create a named list with three different elements.

- Access the elements by different ways.

- Use the function `str()` on your list

- Now create a list containing a list which again contains a list.

---
## Dataframes

Important data structure: `data.frame`:
- a list with the restriction that all elements have to be of the same length
- also understands matrix indexing, if given two dimensional index
- created with `data.frame()`:

```r
Customers <- data.frame(
  FirstName = c("Patrick","Mario","Claudio","Mario"),
  LastName =  c("Meyer","Schröder","Müller","Schmidt"),
  DateBirth = c("1994-03-03","1986-05-21","1978-10-03","1985-07-10"),
  Age = c(25,33,41,34),
  Childs = c(2,3,0,1),
  stringsAsFactors = FALSE      # names shall be represented as
                                # character, not "factor"
)
```

---
## Dataframes
Indexing:

```r
names(Customers)         # variable names
Customers$FirstName      # indexing via list-"language"
Customers[ ,1]           # numeric indexing like matrix
Customers[ ,"FirstName"] # by name
Customers[Customers$LastName=="Müller" , ] # logical
```

Add variables to the data frame:

```r
Customers$Gender <- c("m", "m", "m", "m")
```

---
## Dataframes

For working with data frames (data preparation) we will use the package `dplyr`.

However, some useful functions for working with `data.frames`:
- `summary()`: summary statistics for every variable
- `head()`: prints first rows of data frame
- `tail()`: prints last rows of data frame
- `attach()`: makes the variables of a data frame accessible by name (adds the data frame to the search path)
- `split()`: splits the data frame into groups of rows defined by a grouping variable
- `merge()`: merges two data frames containing different information on the same individuals
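
A sketch of `head()`, `split()` and `merge()` on two small made-up data frames (`people` and `income` are hypothetical):

```r
people <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)
income <- data.frame(id = c(1, 2, 4), income = c(1000, 1500, 1200))

head(people, 2)                          # first two rows
groups <- split(people, people$group)    # list with one data frame per group
names(groups)                            # "a" "b"
nrow(groups$a)                           # 2

merged     <- merge(people, income, by = "id")                # matching ids only
merged_all <- merge(people, income, by = "id", all.x = TRUE)  # keep all people
nrow(merged)                             # 3
is.na(merged_all$income)                 # FALSE FALSE TRUE FALSE
```

With `all.x = TRUE`, individuals without a match keep their row and get `NA` in the added columns.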

---
## Exercise 3.6

Use the data frame "Customers" created above and do the following:
 1. change the column name "FirstName" to "Prename"
 2. access the birth dates in two different ways (see data.frame-matrix-list similarity).
 3. use logical indexing to extract all customers who have 2 or more children.
 4. add a new observation (row) (for example yourself) with arbitrary data
 5. add a new variable (column) to the data frame
 6. select all customers who have more than one child and are younger than 32 years old.

---
## Special data objects useful for real world data

### factor

- useful for categorical data
- technically similar to *labelled numeric* variables in Stata
- useful in regressions (automatic dummy expansion)
- some methods implemented like `summary()`, `plot()`
- created with `factor()`:

```r
x <- rep(c(1,2), 4)
factor(x, labels=c("female", "male"))
Z <- c("Yes", "Yes", "No", "Yes", "Maybe", "No",
       "Maybe", "Yes", "Yes", "No")
factor(Z)
```
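What a factor stores internally can be inspected with `levels()` and `as.integer()`. A short sketch with a shortened version of the vector above:

```r
Z <- c("Yes", "Yes", "No", "Yes", "Maybe", "No")
f <- factor(Z)
levels(f)          # "Maybe" "No" "Yes": sorted alphabetically by default
as.integer(f)      # 3 3 2 3 1 2: the underlying integer codes
table(f)           # counts per level

## set the order of the levels explicitly
f2 <- factor(Z, levels = c("No", "Maybe", "Yes"))
levels(f2)         # "No" "Maybe" "Yes"
```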

---

```r
x <- sample(x=1:2, size=100, replace = TRUE)
x <- factor(x, labels = c("male", "female"))
plot(x)
```

![](index_files/figure-html/list45-1.png)

```r
summary(x)
```

```
##   male female 
##     57     43
```

---
### Data objects not covered in this course:

- ordered factors, see [`?ordered()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html)

- date and time, see [`?as.Date()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.Date.html) and [`?DateTimeClasses`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)

- time series data, for example see [`?ts()`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ts.html) or package [`zoo`](https://cran.r-project.org/web/packages/zoo/index.html)

---
## Flow control

An essential part of a programming language are constructs that allow conditional evaluation or performing similar tasks many times.

In R, these constructs are

- conditional evaluation (`if`,  `else`)
- loops (`for`, `while`)
- instead of loops, functions of the `apply()` family often offer a more convenient way to repeat some code

---
## Conditional evaluation

```r
if ( expr ) {
    ## some code evaluated if as.logical(expr) == TRUE
} else {
    ## some other code
}
```

The whole construct is an expression itself and, therefore, also has a value, which is the value of the last evaluated expression inside that construct:

```r
result <- if ( 2 > 1 ) {
              2       ## last evaluated expression
          } else {
              1
          }   ## result has value 2
```

---
The condition of the `if` construct must be of length one. Since R 4.2.0, a longer condition raises an error; older versions used only the first element and issued a warning.

```r
y <- c(5, 3, 2)
y > 3
if(y > 0) "Look!"   ## condition of length 3: error since R 4.2.0
```

If one wants to check whether all (or any) entries of a vector fulfil a certain condition, the quantifiers `all()` and `any()` are helpful

```r
if ( all(y > 0) ) {  ## all() evaluates to TRUE if all entries are TRUE
    print("All entries of y are larger than 0")
} else {             ## 'else' must follow the closing brace on the same line
    print("At least one element of y is equal to 0 or smaller!")
}
```

---
More than two different possibilities:

```r
stepfunction <- function(x){
  if ( x <= 0 ) {
      0
  } else if (x <= 5 ) {
      4
  } else {
      6
  }
}
```

see also [`?switch()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/switch.html).
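
A minimal sketch of `switch()` with a character argument; the function `describe_size` and its categories are invented for illustration:

```r
describe_size <- function(size) {
    switch(size,
           small  = "up to 10 observations",
           medium = "up to 100 observations",
           large  = "more than 100 observations",
           "unknown size")       # unnamed last argument: the default
}
describe_size("medium")    # "up to 100 observations"
describe_size("huge")      # "unknown size"
```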

---
There is also a vectorized `ifelse()` returning a vector

```r
x <- c(3, NA, 2)
ifelse( is.na(x), "Missing", "Not Missing")

## absolute value
x <- c(-3, 5, -8, 2)
ifelse( x < 0, -x, x)
```

---
## Exercise 3.7

Write a function `if_test` which takes two arguments, `x` and `y`, and checks whether `x` is numeric and `y` is character. It returns "super!" if both conditions are fulfilled and otherwise prints which of the objects `x`/`y` does not fulfil the required property.

Test your function with

```r
if_test(5, "char")

if_test("abc", 2)

if_test(list(1,2), matrix("a", 3, 3))
```

---
## Loops

R has three different loops and two control commands:

- `repeat {code}`: repeats the evaluations of the expression `code` until `break` is called

- `while (cond){ code }`: `code` is evaluated as long as condition `cond` evaluates to `TRUE`

- `for (v in values){ code }`: `code` is evaluated as many times as the number of entries in the object `values`. In the i-th iteration, `v` has the value of the i-th entry of `values`

`break` stops a loop and `next` skips directly to the next iteration.

---

### Examples:

```r
vec <- c("One","Two","Three")
for (v in vec) print(v)

for (i in 1:10) {
  print(i + 2)
}

## list with functions:
flist <- list(function(x) x,
              function(x) x+1,
              function(x) x+2)

for (f in flist) print(f(10))
```
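
The remaining constructs (`while`, `repeat`, `break`, `next`) work analogously. A minimal sketch with made-up values:

```r
## while: halve x until it drops below 1
x <- 20
steps <- 0
while (x >= 1) {
    x <- x / 2
    steps <- steps + 1
}
steps                      ## 5

## repeat with break: stop once the counter reaches 3
i <- 0
repeat {
    i <- i + 1
    if (i >= 3) break
}
i                          ## 3

## next: skip even numbers, sum the odd ones in 1:10
total <- 0
for (k in 1:10) {
    if (k %% 2 == 0) next
    total <- total + k
}
total                      ## 25
```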

---
## Exercise 3.8

Compute the matrix product of the matrices `$A$` and `$B$`
   
   ```r
   A <- matrix(1:12,4,3)
   B <- matrix(1:9,3,3)
   ```
   using loops and check whether the result is equivalent to `A %*% B`

---
## Apply constructs

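Loops in R are relatively slow because R is an interpreted language (code is evaluated at run time instead of being compiled to machine code beforehand). Many operations performed with loops can be written more compactly, and often faster, with vectorized functions or an `apply` construct (exception: recursive operations, where each step depends on the previous result).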

- `apply()`: perform operations on subdimensions of matrices or arrays

- `lapply()`: perform operations on vectors or lists and return a list

- further versions of the above two: `mapply()`, `sapply()`, `vapply()`
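
A short sketch of two of the further versions, using made-up inputs: `vapply()` behaves like `sapply()` but requires the type and length of each result to be declared, and `mapply()` iterates over several vectors in parallel.

```r
values <- list(a = 1:3, b = c(2.5, 4))   ## made-up example list

## vapply(): each result must be a numeric vector of length 1
means_v <- vapply(values, mean, FUN.VALUE = numeric(1))
means_v                                  ## a = 2, b = 3.25

## mapply(): apply a function to the i-th elements of two vectors
sums_m <- mapply(function(x, y) x + y, 1:3, c(10, 20, 30))
sums_m                                   ## 11 22 33
```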

---
## Example: `apply()`

```r
A <- matrix(c(2, 6, 3, 4, 5, 7, 8, 4, 1), ncol = 3)
## maximum over each row
apply(X = A,            ## array or matrix
      MARGIN = 1,       ## dimension where operation is applied on
                        ##  (1-> rows, 2-> columns)
      FUN = max         ## function to be applied
)

X <- matrix(rnorm(1000), ncol = 10)

apply(X, 2, var)        ## variances of columns
```

---
## Example: `lapply()`

```r
List <- list(a=c(4,8,7),
  b=seq(0,100,5),
  c=c(TRUE,TRUE,FALSE,TRUE)
)

ListSum <- lapply(List, sum)
class(ListSum)       ## again a list
```
Sometimes, the result of such an operation is relatively simple and can be stored in a simpler data structure:

```r
ListSum <- sapply(List, sum)
class(ListSum)         ## "numeric"
```

---
If the applied function requires additional arguments, they can be passed via the `...` argument:

```r
## function with several args
funnyFun <- function(x, m, std)
    sum(x)/( 2 * rnorm(1, mean=m, sd=std))

## 'm' and 'std' can be passed to funnyFun via '...':
sapply(X=List, FUN=funnyFun, m=2, std=5)

## any function can be applied, e.g. plot()
par(mfrow=c(length(List),1))
lapply(List, plot, main="Title", type="l", lwd=2)
```

---
## Any operation is a function:

In Stata, you have probably created macro names dynamically, depending on the context:

```stata
foreach var of varlist _all {
    quietly summarize `var'
    local mean`var' = r(mean)
}
```

In R, you can do the same using `assign()` and `get()`:

```r
## equivalent assignments:
x <- 2
assign("x", 2)
objectName <- "x"       ## name of the object, stored as character
assign(objectName, 2)

## get the value of object "x"
get(objectName)   ## -> 2

for (var in names(data)){ ## var is a character
    assign(paste0("mean", var), mean(data[,var]))
}
```
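
To make the loop concrete, here is a sketch with a made-up data frame `data`. It also shows a common alternative that is not part of the example above: keeping the results in one named vector instead of creating many separate objects.

```r
data <- data.frame(wage = c(12, 13, 14), age = c(30, 40, 50))  ## made-up data

## variant with assign(): creates the objects 'meanwage' and 'meanage'
for (var in names(data)) {
    assign(paste0("mean", var), mean(data[, var]))
}
meanwage                          ## 13

## alternative: one named vector instead of several objects
means <- sapply(data, mean)
means["age"]                      ## 40
```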

---
## Error handling

Sometimes, expressions may raise an error and you do not want the execution of your script to stop:

```r
value <- tryCatch({
  ## some code to be evaluated
  expr
}, warning = function(w){
  ## code evaluated if evaluating expr leads to a warning
  print(w)
}, error = function(e){
  ## code evaluated if evaluating expr leads to an error
  print(e)
})
```
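
A runnable sketch of this pattern; the wrapper `safe_log()` is a made-up helper that returns `NA` instead of warning or failing:

```r
## hypothetical helper: log() that never stops the script
safe_log <- function(x) {
    tryCatch({
        log(x)
    }, warning = function(w) {
        message("Warning caught: ", conditionMessage(w))
        NA_real_
    }, error = function(e) {
        message("Error caught: ", conditionMessage(e))
        NA_real_
    })
}

safe_log(10)      ## 2.302585
safe_log(-1)      ## warning (NaNs produced) is caught, returns NA
safe_log("a")     ## error (non-numeric argument) is caught, returns NA
```

The value of `tryCatch()` is the value of `expr` if nothing goes wrong, and otherwise the return value of the handler that was called.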

---
name: io
class: inverse, center, middle
# 4. Input/Output

---
## Reading data into R workspace

Several options, depending on the available file format:

- if the data is available as an R image (`.RData`), the function `load()` loads the objects in the image into the global environment
- very often, data sets are available as structured text files; for instance, the data frame `Customers` from above could be saved in a text file with the content

```c
"FirstName";"LastName";"DateBirth";"Age";"Childs"
"Patrick";"Meyer";"1994-03-03";25;2
"Mario";"Schröder";"1986-05-21";33;1
"Claudio";"Müller";"1978-10-03";41;0
"Mario";"Schmidt";"1985-07-10";34;1
```
To read such text files, the functions `read.table()`, `read.csv()`, `read.csv2()`, `read.delim()` and `read.delim2()` are available. The latter four are wrappers around the first one, each with different defaults.
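
A minimal sketch of how the defaults differ, using two rows of the text shown above. The `text =` argument stands in for a file on disk, and the object names are made up for illustration:

```r
## the text= argument lets us try this without a file on disk
csv_text <- '"FirstName";"LastName";"DateBirth";"Age";"Childs"
"Patrick";"Meyer";"1994-03-03";25;2
"Mario";"Schmidt";"1985-07-10";34;1'

## read.csv2() defaults to header = TRUE, sep = ";" and dec = ","
customers <- read.csv2(text = csv_text, stringsAsFactors = FALSE)

## the same with read.table(), spelling out the arguments
customers2 <- read.table(text = csv_text, header = TRUE, sep = ";",
                         stringsAsFactors = FALSE)

str(customers)
customers$Age        ## 25 34
```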

---
## Reading data into R workspace

- `xls`/`xlsx` files: the package `readxl` has the functions `read_xls()` and `read_xlsx()`
- to read files of other statistical software (Stata, SPSS, Eviews, SAS,...), there are the packages `foreign` and `haven` with appropriate functions

Example

```r
## download zip archive with some data examples
curl::curl_download(url = "http://www.christophrust.de/example_data.zip",
                    destfile = "example_data.zip")
## unpack zip
res <- unzip("example_data.zip")

install.packages(c("readxl", "haven")) ## install packages readxl and haven
library(readxl)

excel_data <- read_xls("Africa.xls", skip = 7,
                       col_names = c("country","pop", "larea","pop_dens",
                                     "gdpppp", "gdppcppp","gdpgrwth"))

dax <- read.csv("dax.csv") ## correct defaults for sep, dec
schools_treat <- haven::read_dta("TreatmentSchools.dta") ## namespace via ::
```

---
name: esri
## Reading data into R workspace

- for reading spatial data (ESRI shapefiles,...), the package `rgdal` has the function `readOGR()`:

```r
install.packages("rgdal")
library(rgdal)

kreise <- readOGR("vg2500_krs.shp")

class(kreise)
str(kreise, max = 2)
```

---
## Writing data from R workspace to file

- for almost every function that reads data, there is a function for the reverse direction: `write.csv()`, `write.table()`, `haven::write_dta()`,...
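
A small round-trip sketch with made-up data, writing to a temporary file and reading it back:

```r
df <- data.frame(id = 1:3, wage = c(12.5, 13, 14.25))   ## made-up data

path <- tempfile(fileext = ".csv")      ## temporary file, removed below
## row.names = FALSE avoids writing the row names as an extra column
write.csv(df, path, row.names = FALSE)

df_back <- read.csv(path)
all.equal(df, df_back)                  ## TRUE
unlink(path)
```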

## More input/output

- there are also more low-level functions for interacting with text/binary files: `readLines()`/`writeLines()`, `scan()`/`write()`,...

- for connecting to databases, there is the `odbc` package

- network resources can be accessed with `url()`, compressed files with `unz()`,...

See also `?connections`
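
A minimal sketch of the low-level text functions and of an explicit connection, using a temporary file with made-up content:

```r
path <- tempfile(fileext = ".txt")          ## temporary file for the demo
writeLines(c("first line", "second line", "third line"), path)

readLines(path, n = 2)                      ## only the first two lines

## the same through an explicit connection, which remembers its position
con <- file(path, open = "r")
first  <- readLines(con, n = 1)             ## "first line"
second <- readLines(con, n = 1)             ## "second line"
close(con)
unlink(path)
```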

---
name: data
class: inverse, center, middle

# 5. Data preparation (dplyr)

---
## tidyverse

For working with data, a collection of several R packages is very useful: [tidyverse](https://www.tidyverse.org/). The following packages are part of the tidyverse:

- [dplyr](https://dplyr.tidyverse.org/): "Grammar of Data Manipulation"

- [ggplot2](https://ggplot2.tidyverse.org/): "Grammar of Graphics"

- [readr](https://readr.tidyverse.org/): "fast and friendly way to read rectangular data"

- [tibble](https://tibble.tidyverse.org/): "A tibble, or tbl_df, is a modern reimagining of the data.frame"

- [tidyr](https://tidyr.tidyverse.org/): "create tidy data. Tidy data is data where:
  1. Every column is [a] variable.
  2. Every row is an observation.
  3. Every cell is a single value."

- [purrr](https://purrr.tidyverse.org/): "enhance R’s functional programming toolkit"

---
## Pipe operator

R is a functional language and many operations consist of compositions of different functions (nested function calls):

```r
f <- function(x) x * 2 
g <- function(x) x + 10
h <- function(x) x ^ 4
a <- 2
h(g(f(a)))         # (2*a + 10)^4 -> 38416
```
The problem with this is poor readability. Therefore, `dplyr` makes heavy use of the pipe operator `%>%` (from the `magrittr` package). This is often very useful, in particular for data preparation.

The composition from above can be written using `%>%` as

```r
library(dplyr)

a %>% f %>% g %>% h
```

---
## Pipe operator
The pipe operator forwards the result (or value of the expression) on the LHS (by default) as first argument of the function (or expression) on the RHS.

The value of the LHS can also be accessed via the expression `.`:

```r
2 %>%
    log(16, base = .)
```
The RHS can also be an expression using `.`:

```r
2 %>% {
    a <- .^2
    a * 5
}
```

---
## `dplyr`

`dplyr` provides several functions for manipulating data frames:

- [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html): add new variables to data frame

- [`select()`](https://dplyr.tidyverse.org/reference/select.html): select (extract) columns (variables) of the data frame (Stata: `keep varlist`)

- [`filter()`](https://dplyr.tidyverse.org/reference/filter.html): extract rows (observations) from data frame (Stata: `keep if`)

- [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html): sort observations

- [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html): condense values of a variable to a single value (Stata: `collapse`)

- [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html): perform operations on groups (Stata: `by`)

- `left_join()`, `inner_join()`, `full_join()`,...: merge two data frames (Stata: `merge`)

For most of these functions there are additional helper functions (e.g. for selecting variables by regex)
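
A minimal sketch of how these verbs chain together, using a small made-up data frame (`wages` and its columns are assumptions for illustration):

```r
library(dplyr)

## made-up data
wages <- data.frame(id     = 1:6,
                    gender = c("f", "m", "f", "m", "f", "m"),
                    wage   = c(12, 15, 18, 11, 14, 20))

result <- wages %>%
    mutate(high = wage > 14) %>%            ## add a variable
    filter(wage > 11) %>%                   ## keep rows (Stata: keep if)
    select(gender, wage, high) %>%          ## keep columns
    group_by(gender) %>%                    ## operate by group
    summarise(mean_wage  = mean(wage),
              share_high = mean(high),
              n          = n()) %>%         ## one row per group
    arrange(desc(mean_wage))                ## sort, highest mean first

result
```

Here the row with `wage == 11` is dropped by `filter()`, so the summary is based on three observations for `"f"` and two for `"m"`.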

---
## Tidy data

For working with packages from the `tidyverse`, the data should be *tidy* (every observation is a row, every variable a column) or *long*. Very often, however, data is published in *wide* format.

To get from `wide` to `long` there is the function `pivot_longer()` from `tidyr` (>=1.0.0):

```r
library(tidyr) ## at least version 1.0.0
data_wide <- data.frame(id = c(1,2,3,4),
                        wage90 = c(12,13,14,11),
                        wage95 = c(14,16,13,18))

data_long <- data_wide %>%
* pivot_longer(cols = starts_with("wage"), # columns with values
*              names_to = "year",          # name of new variable
*              values_to = "wage")         # name of variable with values

class(data_long) # tbl_df, tbl, data.frame
```

---
## Real data example

```r
library(tidyr)

county_elections <-
    read.table("btw_gemeinden.csv", skip = 10,
                colClasses = c("character", "integer","character",
                               rep("numeric", 10 )),
                sep = ";", dec = ",", na.strings = c("-", "DG"),
                quote = "",  nrows = 41654) %>%
    setNames(c("date", "ags_gem", "name", "electorate", "participation",
             "vote_tot", "CDU/CSU", "SPD", "GRÜNE",
             "FDP", "DIE LINKE", "AfD", "others")) %>%
    pivot_longer(cols = c("CDU/CSU", "SPD", "GRÜNE",
                          "FDP", "DIE LINKE", "AfD", "others"),
                 names_to = "party", values_to = "votes") %>%
    filter(ags_gem > 1000000 & ags_gem < 17000000) %>%
    mutate(ags_kkz = trunc(ags_gem / 1e3),
           date = as.Date(date, format = "%d.%m.%Y"),
           party = factor(party)) %>%
    mutate(year = as.numeric(format(date, "%Y"))) %>%
    group_by(party, year, ags_kkz) %>%
    summarize(votes_share = sum(votes, na.rm = TRUE)/sum(vote_tot, na.rm = TRUE))
```

---

Get GDP data for German counties (Kreise):

```r
county_gdp <-
    read.table("gdp_counties.csv", skip = 10, sep = ";",
                dec = ",", na.strings = c("-", "DG", "."),
                quote = "",
                colClasses = c("integer", "integer", "character", "numeric",
                               "NULL", "numeric", rep("NULL", 8))) %>%
    setNames(c("year", "ags_kkz", "name", "gdp", "gdppc")) %>%
    filter(ags_kkz > 1000 & ags_kkz < 20000) %>%
    mutate(year = as.numeric(year))

## now join both data sources
county_data <- county_elections %>%
    left_join(county_gdp, by = c("year", "ags_kkz"))  ## common key columns
```

---
## Data from NASA (NetCDF)

```r
library(RNetCDF)
library(R.utils)
library(ggplot2)

url <- "https://data.giss.nasa.gov/pub/precipdai/precip1900-1988.nc.gz"
curl::curl_download(url,
                    destfile = "precip1900-1988.nc.gz")
gunzip("precip1900-1988.nc.gz")

p <- (res <- read.nc(open.nc("precip1900-1988.nc")) %>% {

    ## make a long data frame from the included array
    df <- data.frame(expand.grid(.$lon, .$lat, .$month, .$year),
                     as.vector(.$precip))

    colnames(df) <- c("lon", "lat", "month", "year", "precip")
    df

}) %>%
*   mutate(precip = na_if(precip, -99999)) %>%
*   filter(year == 1980 & month == 13) %>%     ## year average
    ggplot(aes(lon, lat, fill = precip)) +
    geom_tile(na.rm = TRUE) +
    scale_fill_continuous(low="thistle2", high="darkred", na.value="white") +
    theme_bw()
p
```

---
![](index_files/figure-html/example-nasa-precipitation-fig-1.png)

---
## World Bank Data

For loading data from the World Bank, for instance the World Development Indicators (WDI), there is the package `wbstats`:

```r
library(wbstats)

## Population, total
wb(indicator = "SP.POP.TOTL",
   country = c("DEU", "FRA", "GBR"), startdate = 2000, enddate = 2015) %>%
    head()
```

```
##   iso3c date    value indicatorID         indicator iso2c country
## 1   DEU 2015 81686611 SP.POP.TOTL Population, total    DE Germany
## 2   DEU 2014 80982500 SP.POP.TOTL Population, total    DE Germany
## 3   DEU 2013 80645605 SP.POP.TOTL Population, total    DE Germany
## 4   DEU 2012 80425823 SP.POP.TOTL Population, total    DE Germany
## 5   DEU 2011 80274983 SP.POP.TOTL Population, total    DE Germany
## 6   DEU 2010 81776930 SP.POP.TOTL Population, total    DE Germany
```

---
## Exercise 5.1

Use the data frame CPS1985, available as csv at `www.christophrust.de/cps1985.csv`, and

1. add a variable `log_wage` to the data frame by using `mutate()` and the pipe operator

2. compute average log wages for the interaction of the groups `gender` and `occupation`

3. filter, for both men and women, the observation with the highest wage

---
class: inverse, middle, center
name: graphics
# 6. Data visualization (ggplot)

---
## Grammar of Graphics

To build up a graph for visualizing data, the following generic components can be used:

- **data**, provided as *tidy* data.frame

- **aesthetic**: which variables do you want to plot and what is their role

- **geom**: what do you want to plot? points, lines, polygons,...

- **scale**: how are data values graphically represented? colors, axis scaling,...

- **stat**: is the data to be transformed?

- **facet**: plot small subfigures for grouping variable

---

Let's use this: