Homework 1

Q1: Overfitting

A key idea in statistical learning is to select models based on data. Moreover, to overcome overfitting, practitioners use separate datasets for model training (training dataset) and model selection (validation dataset). Later in this course, we will see cross-validation, a technique that takes this idea to the next level.
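
To make the training/validation idea concrete, here is a minimal sketch in R. Everything in it is an assumption made for illustration: the simulated data, the sine-shaped "true" model, the 50/50 split, and the range of polynomial degrees are not taken from the slides.

```r
set.seed(1)                                  # make the simulation reproducible

# Simulated data: a hypothetical true model plus noise (assumption, not from the slides)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Random 50/50 split into a training set and a validation set
train_idx <- sample(n, size = n / 2)
train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

# Mean squared prediction error of a fitted model on a given dataset
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Fit polynomials of increasing flexibility on the training set only
degrees <- 1:10
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), valid_mse = mse(fit, valid))
}))
results <- as.data.frame(results)

# Model selection: pick the degree with the smallest validation error
best_degree <- results$degree[which.min(results$valid_mse)]
print(results)
print(best_degree)
```

In a run like this, the training error can only go down as the degree grows, because each polynomial nests the previous one. The validation error is the one to watch: it typically falls at first and then stops improving or rises once the model starts fitting noise.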

To help you gain a better understanding of overfitting, read the three examples in the slides (starting on page 33 here). Then, use your own words to explain the flexibility (or complexity) of the true model versus the models used for prediction, and whether overfitting/underfitting becomes a problem in each example (and if so, why).

Remark: if you can read and understand some math, try reading the explanation of the bias-variance tradeoff in the textbook as well as in the slides, and explain the three examples from the perspective of the bias-variance tradeoff. See the last page of the slides.

Q2: R assignments

Below is some fundamental R knowledge that you should be familiar with. You should be able to answer (most of) the questions after reading/learning the Intro to R materials. Feel free to use an LLM to check your answers; after all, the goal is to help you learn and remember the basics.

Create and manipulate R vectors:
1. Create a vector of temperatures in Celsius: $(0, 10, 20, 30, 40)$. Name the created vector C.
2. Convert them to Fahrenheit. Hint: F = C * 9/5 + 32
3. Check how many of the Fahrenheit values are above 70. Can you do that using a one-liner?
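
The same pattern (create a vector, transform it element by element, count the values that pass a test) can be sketched on a different quantity. The heights, the 170 cm threshold, and the variable names below are hypothetical and chosen only for illustration.

```r
# Hypothetical heights in inches (illustrative values, not part of the assignment)
inches <- c(60, 64, 68, 72, 76)

# Arithmetic on a vector is applied to every element
cm <- inches * 2.54

# A comparison returns a logical vector, one TRUE/FALSE per element
tall <- cm > 170

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_tall <- sum(tall)
print(n_tall)
```

The last three steps can be collapsed into a single expression, since `sum()` accepts the comparison directly.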
What’s an (atomic) vector in R? Can an atomic vector in R contain elements of different types (e.g., numeric, character, and logical)? What will happen if you run c("a", 1, TRUE)?
What’s a data frame? Is a data frame an atomic vector?
Which function does R provide for creating a data frame?
Given x <- c(3, 8, 5, 12, 7) and y <- c(2, 6, 4, 9, 10).
1. Compute x + y, x * y, and x^2 - y.
2. Use the mean() function to compute the means of x and y.
3. Replace the third element of y with NA. Then compute the mean of y again. What happens?
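
As a rough illustration of element-wise arithmetic and missing values, here is a short sketch on two made-up vectors. The vectors `a` and `b` are assumptions for this example only; `na.rm` is a standard argument of `mean()`.

```r
# Two made-up vectors of equal length (illustrative, not the x and y above)
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Arithmetic operators work element by element
s <- a + b
p <- a * b
print(s)
print(p)

# Introduce a missing value at position 2
a[2] <- NA

# By default, a missing value propagates through mean()
m_default <- mean(a)

# na.rm = TRUE drops missing values before averaging
m_clean <- mean(a, na.rm = TRUE)

print(m_default)
print(m_clean)
```

Whether dropping missing values is appropriate depends on why they are missing, so `na.rm = TRUE` is a choice to make deliberately rather than a default fix.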
In R, a formula is a special way to describe the relationship between variables. The basic syntax is y ~ x, which roughly means “y depends on x.” Where have we used formulas in the Intro to R lecture?
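
For reference, a formula is an ordinary R object that modeling functions interpret. The sketch below uses the built-in `cars` dataset and `lm()` as one common example; this choice is an assumption and may differ from what the lecture shows.

```r
# A formula can be stored in a variable like any other object
f <- dist ~ speed

# It has its own class, and all.vars() lists the variables it mentions
print(class(f))
print(all.vars(f))

# Modeling functions look up the formula's variables in the data argument.
# Here: stopping distance modeled as a linear function of speed (built-in cars data)
fit <- lm(f, data = cars)
print(coef(fit))
```

The formula itself carries no data; it only names the response (left of `~`) and the predictors (right of `~`), which is why the same formula can be reused with different datasets.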