A key idea in statistical learning is to select models based on data. Moreover, to overcome overfitting, practitioners use separate datasets for model training (training dataset) and model selection (validation dataset). Later in this course, we will see cross-validation, a technique that takes this idea to the next level.
To help you gain a better understanding of overfitting, read the three examples in the slides (starting on page 33 here). Then, use your own words to explain the flexibility (or complexity) of the true model versus the models used for prediction, and whether overfitting/underfitting becomes a problem in each example (and if so, why).
Below are questions covering fundamental R knowledge that you should be familiar with. You should be able to answer (most of) them after reading/learning the Intro to R materials. Feel free to use an LLM to check your answers; after all, the goal is to help you learn and remember the basics anyway.
Create and manipulate R vectors:
- Convert a vector of Celsius temperatures C to Fahrenheit using F = C * 9/5 + 32. Can you do that using a one-liner?
- What’s an (atomic) vector in R? Can an atomic vector in R contain elements of different types (numeric, string, and boolean)? What will happen if you run c("a", 1, TRUE)?
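A minimal sketch of the two items above (the Celsius values are made up for illustration):

```r
# Arithmetic on a vector applies element-wise, so the Celsius-to-
# Fahrenheit conversion is a one-liner.
celsius <- c(0, 25, 100)          # illustrative temperatures
fahrenheit <- celsius * 9/5 + 32  # one-liner conversion
print(fahrenheit)                 # 32 77 212

# An atomic vector holds elements of a single type. Mixing types in
# c() silently coerces everything to the most flexible common type
# (logical < numeric < character), so all elements become strings.
v <- c("a", 1, TRUE)
print(class(v))                   # "character"
print(v)                          # "a" "1" "TRUE"
```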
- What’s a data frame? Is a data frame an atomic vector?
- Which function does R provide for creating a data frame?
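A quick sketch answering the data-frame items above (the example columns are made up):

```r
# data.frame() is the base R function for creating a data frame.
# A data frame is NOT an atomic vector: under the hood it is a list
# of equal-length column vectors with extra structure.
df <- data.frame(name = c("ann", "bob"), score = c(90, 85))
print(is.atomic(df))   # FALSE -- not an atomic vector
print(is.list(df))     # TRUE  -- it is a list of columns
str(df)                # 2 obs. of 2 variables
```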
- Given x <- c(3, 8, 5, 12, 7) and y <- c(2, 6, 4, 9, 10):
  - Compute x + y, x * y, and x^2 - y.
  - Use the mean() function to compute the means of x and y.
  - Replace one element of y with NA, then compute the mean of y again. What happens?
- In R, a formula is a special way to describe the relationship
between variables. The basic syntax is y ~ x, which
basically means “y depends on x.” Where have we used formulas in the
Intro to R lecture?
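The exercises above can be sketched as follows (the lm() call is one illustrative use of a formula; treat it as an assumption about how the lecture used formulas):

```r
x <- c(3, 8, 5, 12, 7)
y <- c(2, 6, 4, 9, 10)

print(x + y)    # element-wise: 5 14 9 21 17
print(x * y)    # element-wise: 6 48 20 108 70
print(mean(y))  # 6.2

y[2] <- NA      # replace one element with NA
print(mean(y))  # NA -- a single NA makes the mean NA...
print(mean(y, na.rm = TRUE))  # ...unless NAs are dropped: 6.25

# A formula such as y ~ x says "y depends on x"; lm() takes a
# formula to fit a linear regression (rows with NA are dropped
# by the default na.action).
fit <- lm(y ~ x)
print(coef(fit))
```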