---
title: "Introduction to R and RStudio"
format: html
number-sections: true
theme: pandoc
html-math-method: mathml
minimal: true
from: markdown+lists_without_preceding_blankline
---


## Installing R and RStudio
To install R, read the official [cran page],
which provides detailed instructions regarding
how to install R on different OSs (Linux, macOS and Windows).

[cran page]: https://cran.r-project.org/

<!-- - For windows users, you probably need to install -->


RStudio is a great environment for writing and
executing R scripts in this course.
You can download RStudio freely
from its [official page](https://posit.co/download/rstudio-desktop/).
- If you are new to programming, I recommend RStudio
  as it works outside of the box.
  You already have everything you need for this course after
  installing R and RStudio.


## About R
R creates an interactive environment for you to play with data and statistical models.
If you have no exposure to computer-based statistical computing before, I dare say that learning R will be the **most valuable** experience for you in this course.

After installing R and RStudio on your computer,
you can play with the following scripts in RStudio.
There are three ways to run R code in RStudio:

1. Use the R console. By default, it should be the left-bottom sub-window of RStudio.
1.  Write your R scripts in the code editor (it should be the left-top sub-window). Then,
   run your R Scripts line by line using the shortcut `Ctrl-Enter`/`Command-Enter` (or maybe `Shift-Enter`, depending on whether your are using windows, mac or Linux).
1.  Write your R Scripts in Quarto. Indeed, this guide is written in Quarto. You can find the Quarto source by following the link in the index page.


## Use R as a calculator

```{r}
## In R, comments start with ## symbol
## This line will not be run
```

```{r}
num = 3
num + 2

### vector
x = c(2, 7, 5) ## Here "c" is short for combine
x

### a vector (sequence) from 1 to 5
y = seq(1, 5)
y

### a sequence from 1 to 5, with each step increasing by 2
seq(1, 5, 2) ## seq(from=1, to=5, by=2)
?seq

x + y
```

```{r}
x / y

### extracting sub-vector (also called subsetting)
x[1]
c(x[1], x[3])
x[1:3] ## same as x as length(x) == 3
length(x)
x[-1]
```

```{r}
### matrix
z = matrix(seq(1,12), nrow= 3)
z
dim(z)
length(z)
```

## Generate random data

```{r}
set.seed(2020)
x = runif(20)
x
e = runif(20)
y = 0.5 + x * 2 + e
plot(x,y)
```

## Linear model

The function `lm()` is for fitting linear models.
Here we simply regress `y` on `x`.

First, we create a dataframe for our generated data:

```{r}
my_data = data.frame(y = y, x = x)
```


```{r}
my_lm = lm(y ~ x, data = my_data)
summary(my_lm)
```

```{r}
my_lm["coefficients"]
## To know other options besides coefficients, run
## str(my_lm)
## Also, try running:
## my_lm$coefficients
```

```{r}
#create scatterplot
plot(y ~ x, data=my_data)

#add fitted regression line to scatterplot
abline(my_lm)
```

## Loading packages

- Base R is powerful enough for many routine statistical tasks.
You can also use third-party R packages to make
your work easier.

- **ggplot2** is a package for plotting, which provides a more sophisticated layout than the base `plot()` function:

```{r, eval=FALSE}
## install ggplot2. You only need to do it once.
install.packages("ggplot2")
```

```{r}
library(ggplot2)
#create scatterplot with fitted regression line
ggplot(my_data, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", se=FALSE)
```

## Further readings

You should have a basic understanding about using R now.
For further knowledge, I recommend the following materials:

- [Hands-On Programming with R](https://rstudio-education.github.io/hopr/)

- [R for Data Science](https://r4ds.had.co.nz/)

## Review questions

1. How do you create a vector $(3,4,5)$ in R?

1. Given a vector `x`, how to get its length?

1. How to create a vector $(2, 5, 8, 11)$ in R?

1. Given `x=c(1,2,3)` and `y=c(1,5)`, what's
`x-y`?

1. Given a vector `x`, how do you get its third element? What if `length(x) == 2`?

1. How do you generate a sequence $\{x_i\}_{i=1}^{20}$ where each $x_i \sim N(0,1)$?

1. what's a seed? Why do we use `set.seed(2020)`?

1. Given a data set `data` with two variables `x` and `y`, how do you fit the linear model
$y = \beta_0 + \beta_1 x$?

1. Explain the output of `summary(my_lm)` in our example:

    ```
    Call:
    lm(formula = y ~ x, data = my_data)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.56217 -0.30050  0.00473  0.40064  0.44601

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    0.946      0.176   5.374 4.16e-05 ***
    x              2.132      0.323   6.602 3.36e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3703 on 18 degrees of freedom
    Multiple R-squared:  0.7077,    Adjusted R-squared:  0.6915
    F-statistic: 43.59 on 1 and 18 DF,  p-value: 3.362e-06
    ```