--- title: "Introductory Econometrics Examples" author: "Justin M Shea" date: ' ' output: rmarkdown::html_vignette: toc: yes vignette: > %\VignetteIndexEntry{Introductory Econometrics Examples} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- \newpage ## Introduction This vignette reproduces examples from various chapters of _Introductory Econometrics: A Modern Approach, 7e_ by Jeffrey M. Wooldridge. Each example illustrates how to load data, build econometric models, and compute estimates with **R**. In addition, the **Appendix** cites a few sources using **R** for econometrics. Of note, in 2020 Florian Heiss published a 2nd edition of [_Using R for Introductory Econometrics_](http://www.urfie.net/); it is excellent. The Heiss text is a companion to wooldridge for `R` users, and offers an in depth treatment with several worked examples from each chapter. Indeed, his example use this `wooldridge` package as well. Now, install and load the `wooldridge` package and lets get started! ```{r, echo = TRUE, eval = FALSE, warning=FALSE} install.packages("wooldridge") ``` ```{r, echo = TRUE, eval = TRUE, warning=FALSE, message=FALSE} library(wooldridge) ``` ```{r, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE} library(stargazer) library(knitr) ``` \newpage ## Chapter 2: The Simple Regression Model ### **`Example 2.10:` A Log Wage Equation** Load the `wage1` data and check out the documentation. ```{r, message=FALSE, eval=FALSE} data("wage1") ?wage1 ``` The documentation indicates these are data from the 1976 Current Population Survey, collected by Henry Farber when he and Wooldridge were colleagues at MIT in 1988. **$educ$:** years of education **$wage$:** average hourly earnings **$lwage$:** log of the average hourly earnings First, make a scatter-plot of the two variables and look for possible patterns in the relationship between them. ```{r, echo=FALSE} plot(y = wage1$wage, x = wage1$educ, col = "darkgreen", pch = 21, bg = "lightgrey", cex=1.25, xaxt="n", frame = FALSE, main = "Wages vs. Education, 1976", xlab = "years of education", ylab = "Hourly wages") axis(side = 1, at = c(0,6,12,18)) rug(wage1$wage, side=2, col="darkgreen") ``` It appears that _**on average**_, more years of education, leads to higher wages. The example in the text is interested in the _return to another year of education_, or what the _**percentage**_ change in wages one might expect for each additional year of education. To do so, one must use the $log($`wage`$)$. This has already been computed in the data set and is defined as `lwage`. The textbook provides excellent discussions around these topics, so please consult it. Build a linear model to estimate the relationship between the _log of wage_ (`lwage`) and _education_ (`educ`). $$\widehat{log(wage)} = \beta_0 + \beta_1educ$$ ```{r} log_wage_model <- lm(lwage ~ educ, data = wage1) ``` Print the `summary` of the results. ```{r, echo = TRUE, eval = FALSE, warning=FALSE} summary(log_wage_model) ``` ```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE} stargazer(type = "html", log_wage_model, single.row = TRUE, header = FALSE, digits = 5) ``` Plot the $log($`wage`$)$ vs `educ`. The blue line represents the least squares fit. 
Plot the $log($`wage`$)$ vs `educ`. The blue line represents the least squares fit.

```{r, echo=FALSE}
plot(y = wage1$lwage, x = wage1$educ, main = "A Log Wage Equation",
     col = "darkgreen", pch = 21, bg = "lightgrey", cex = 1.25,
     xlab = "years of education", ylab = "log of average hourly wages",
     xaxt = "n", frame = FALSE)
axis(side = 1, at = c(0, 6, 12, 18))
abline(log_wage_model, col = "blue", lwd = 2)
rug(wage1$lwage, side = 2, col = "darkgreen")
```

\newpage

## Chapter 3: Multiple Regression Analysis: Estimation

### **`Example 3.2:` Hourly Wage Equation**

Check the documentation for variable information.

```{r, eval=FALSE}
?wage1
```

**$lwage$:** log of the average hourly earnings

**$educ$:** years of education

**$exper$:** years of potential experience

**$tenure$:** years with current employer

Plot each variable against `lwage` and compare their distributions and the slope ($\beta$) of the simple regression lines.

```{r, fig.height=3, echo=FALSE}
par(mfrow = c(1, 3))

plot(y = wage1$lwage, x = wage1$educ, col = "darkgreen", xaxt = "n",
     frame = FALSE, main = "years of education", xlab = "", ylab = "")
mtext(side = 2, line = 2.5, "Hourly wages", cex = 1.25)
axis(side = 1, at = c(0, 6, 12, 18))
abline(lm(lwage ~ educ, data = wage1), col = "darkblue", lwd = 2)

plot(y = wage1$lwage, x = wage1$exper, col = "darkgreen", xaxt = "n",
     frame = FALSE, main = "years of experience", xlab = "", ylab = "")
axis(side = 1, at = c(0, 12.5, 25, 37.5, 50))
abline(lm(lwage ~ exper, data = wage1), col = "darkblue", lwd = 2)

plot(y = wage1$lwage, x = wage1$tenure, col = "darkgreen", xaxt = "n",
     frame = FALSE, main = "years with employer", xlab = "", ylab = "")
axis(side = 1, at = c(0, 11, 22, 33, 44))
abline(lm(lwage ~ tenure, data = wage1), col = "darkblue", lwd = 2)
```

Estimate the model regressing _log(wage)_ on _educ_, _exper_, and _tenure_.

$$\widehat{log(wage)} = \beta_0 + \beta_1educ + \beta_2exper + \beta_3tenure$$

```{r}
hourly_wage_model <- lm(lwage ~ educ + exper + tenure, data = wage1)
```

Print the estimated model coefficients:

```{r, eval=FALSE}
coefficients(hourly_wage_model)
```

```{r, echo=FALSE}
kable(coefficients(hourly_wage_model), digits = 4, col.names = "Coefficients", align = 'l')
```

Plot the coefficients, representing the percentage impact of each variable on $log($`wage`$)$, for a quick comparison.

```{r, echo=FALSE}
barplot(sort(100 * hourly_wage_model$coefficients[-1]), horiz = TRUE, las = 1,
        ylab = " ", main = "Coefficients of Hourly Wage Equation")
```

## Chapter 4: Multiple Regression Analysis: Inference

### **`Example 4.1:` Hourly Wage Equation**

Using the same model estimated in **`example 3.2`**, examine and compare the standard errors associated with each coefficient. As in the textbook, these are contained in parentheses next to each associated coefficient.

```{r, eval=FALSE}
summary(hourly_wage_model)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", hourly_wage_model, single.row = TRUE, header = FALSE, digits = 5)
```

For the years of experience variable, `exper`, use the coefficient and standard error to compute the $t$ statistic:

$$t_{exper} = \frac{0.004121}{0.001723} = 2.391$$

Fortunately, `R` includes $t$ statistics in the `summary` of model diagnostics.
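As a check, the same $t$ statistic can be reproduced by hand from the coefficient matrix returned by `summary`; a minimal sketch:

```{r, eval=FALSE}
# t statistic for exper: estimate divided by its standard error
coefs <- summary(hourly_wage_model)$coefficients
coefs["exper", "Estimate"] / coefs["exper", "Std. Error"]
```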
```{r, eval=FALSE}
summary(hourly_wage_model)$coefficients
```

```{r, echo=FALSE}
kable(summary(hourly_wage_model)$coefficients, align = "l", digits = 5)
```

```{r, fig.height=8, eval=FALSE, echo=FALSE}
par(mfrow = c(2, 2))

plot(y = hourly_wage_model$residuals, x = hourly_wage_model$fitted.values,
     col = "darkgreen", xaxt = "n", frame = FALSE, main = "Fitted Values",
     xlab = "", ylab = "")
mtext(side = 2, line = 2.5, "Model Residuals", cex = 1.25)
abline(0, 0, col = "darkblue", lty = 2, lwd = 2)

plot(y = hourly_wage_model$residuals, x = wage1$educ, col = "darkgreen",
     xaxt = "n", frame = FALSE, main = "years of education", xlab = "", ylab = "")
axis(side = 1, at = c(0, 6, 12, 18))
abline(0, 0, col = "darkblue", lty = 2, lwd = 2)

plot(y = hourly_wage_model$residuals, x = wage1$exper, col = "darkgreen",
     xaxt = "n", frame = FALSE, main = "years of experience", xlab = "", ylab = "")
mtext(side = 2, line = 2.5, "Model Residuals", cex = 1.25)
axis(side = 1, at = c(0, 12.5, 25, 37.5, 50))
abline(0, 0, col = "darkblue", lty = 2, lwd = 2)

plot(y = hourly_wage_model$residuals, x = wage1$tenure, col = "darkgreen",
     xaxt = "n", frame = FALSE, main = "years with employer", xlab = "", ylab = "")
axis(side = 1, at = c(0, 11, 22, 33, 44))
abline(0, 0, col = "darkblue", lty = 2, lwd = 2)
```

Plot the $t$ statistics for a visual comparison:

```{r, echo=FALSE}
barplot(sort(summary(hourly_wage_model)$coefficients[-1, "t value"]), horiz = TRUE,
        las = 1, ylab = " ", main = "t statistics of Hourly Wage Equation")
```

### **`Example 4.7:` Effect of Job Training on Firm Scrap Rates**

Load the `jtrain` data set.

```{r, echo = TRUE, eval = TRUE, warning=FALSE, message=FALSE}
data("jtrain")
```

```{r, echo = TRUE, eval = FALSE, warning=FALSE}
?jtrain
```

From H. Holzer, R. Block, M. Cheatham, and J. Knott (1993), _Are Training Subsidies Effective? The Michigan Experience_, Industrial and Labor Relations Review 46, 625-636. The authors kindly provided the data.

**$year:$** 1987, 1988, or 1989

**$union:$** =1 if unionized

**$lscrap:$** Log(scrap rate per 100 items)

**$hrsemp:$** (total hours training) / (total employees trained)

**$lsales:$** Log(annual sales, $)

**$lemploy:$** Log(number of employees at plant)

First, use the `subset` function and its argument of the same name to return observations which occurred in **1987** and are not **union**. At the same time, use the `select` argument to return only the variables of interest for this problem.

```{r}
jtrain_subset <- subset(jtrain, subset = (year == 1987 & union == 0),
                        select = c(year, union, lscrap, hrsemp, lsales, lemploy))
```

Next, test for missing values. One can "eyeball" these with RStudio's `View` function, but a more precise approach combines the `sum` and `is.na` functions to return the total number of observations equal to `NA`.

```{r}
sum(is.na(jtrain_subset))
```

While `R`'s `lm` function will automatically remove observations with missing `NA` values, eliminating these manually will produce more clearly proportioned graphs for exploratory analysis. Call the `na.omit` function to remove all missing values and assign the new `data.frame` object the name **`jtrain_clean`**.

```{r}
jtrain_clean <- na.omit(jtrain_subset)
```
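To confirm how many incomplete observations were dropped, one can compare row counts before and after the call to `na.omit`; a quick sketch:

```{r, eval=FALSE}
# Number of rows before and after dropping observations with NA values
c(before = nrow(jtrain_subset), after = nrow(jtrain_clean))
```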
Use `jtrain_clean` to plot the variables of interest against `lscrap`. Visually observe the respective distributions for each variable, and compare the slope ($\beta$) of the simple regression lines.

```{r, echo=FALSE, fig.height=3}
par(mfrow = c(1, 3))
point_size <- 1.75

plot(y = jtrain_clean$lscrap, x = jtrain_clean$hrsemp, frame = FALSE,
     main = "Total (hours/employees) trained", ylab = "", xlab = "",
     pch = 21, bg = "lightgrey", cex = point_size)
mtext(side = 2, line = 2, "Log(scrap rate)", cex = 1.25)
abline(lm(lscrap ~ hrsemp, data = jtrain_clean), col = "blue", lwd = 2)

plot(y = jtrain_clean$lscrap, x = jtrain_clean$lsales, frame = FALSE,
     main = "Log(annual sales $)", ylab = " ", xlab = "",
     pch = 21, bg = "lightgrey", cex = point_size)
abline(lm(lscrap ~ lsales, data = jtrain_clean), col = "blue", lwd = 2)

plot(y = jtrain_clean$lscrap, x = jtrain_clean$lemploy, frame = FALSE,
     main = "Log(# employees at plant)", ylab = " ", xlab = "",
     pch = 21, bg = "lightgrey", cex = point_size)
abline(lm(lscrap ~ lemploy, data = jtrain_clean), col = "blue", lwd = 2)
```

Now create the linear model regressing `hrsemp` (total hours training/total employees trained), `lsales` (log of annual sales), and `lemploy` (log of the number of employees), against `lscrap` (log of the scrap rate).

$$lscrap = \alpha + \beta_1 hrsemp + \beta_2 lsales + \beta_3 lemploy$$

```{r}
linear_model <- lm(lscrap ~ hrsemp + lsales + lemploy, data = jtrain_clean)
```

Finally, print the complete summary diagnostics of the model.

```{r, eval=FALSE, warning=FALSE, message=FALSE}
summary(linear_model)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", linear_model, single.row = TRUE, header = FALSE, digits = 5)
```

```{r, echo=FALSE, eval=FALSE}
# Plot the coefficients, representing the impact of each variable on log(scrap),
# for a quick comparison. As you can observe, for some variables the confidence
# intervals are wider than others.
coefficient <- coef(linear_model)[-1]
confidence <- confint(linear_model, level = 0.95)[-1, ]
graph <- drop(barplot(coefficient, ylim = range(c(confidence)),
                      main = "Coefficients & 95% C.I. of variables on Firm Scrap Rates"))
arrows(graph, coefficient, graph, confidence[, 1], angle = 90, length = 0.55, col = "blue", lwd = 2)
arrows(graph, coefficient, graph, confidence[, 2], angle = 90, length = 0.55, col = "blue", lwd = 2)
```

## Chapter 5: Multiple Regression Analysis: OLS Asymptotics

### **`Example 5.1:` Housing Prices and Distance From an Incinerator**

Load the `hprice3` data set.

```{r}
data("hprice3")
```

**$lprice:$** Log(selling price)

**$ldist:$** Log(distance from house to incinerator, feet)

**$larea:$** Log(square footage of house)

Graph the prices of housing against distance from an incinerator:

```{r, echo=FALSE, fig.align='center'}
par(mfrow = c(1, 2))
plot(y = hprice3$price, x = hprice3$dist, main = " ",
     xlab = "Distance to Incinerator in feet", ylab = "Selling Price",
     frame = FALSE, pch = 21, bg = "lightgrey")
abline(lm(price ~ dist, data = hprice3), col = "blue", lwd = 2)
```

Next, model the $log($`price`$)$ against the $log($`dist`$)$ to estimate the percentage relationship between the two.

$$log(price) = \alpha + \beta_1 log(dist)$$

```{r}
price_dist_model <- lm(lprice ~ ldist, data = hprice3)
```

Create another model that controls for "quality" variables, such as square footage `area` per house.

$$log(price) = \alpha + \beta_1 log(dist) + \beta_2 log(area)$$

```{r}
price_area_model <- lm(lprice ~ ldist + larea, data = hprice3)
```
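Before looking at the full summaries, the distance elasticity can be pulled from each fit directly; a minimal sketch:

```{r, eval=FALSE}
# Elasticity of price with respect to distance,
# with and without controlling for house size
coef(price_dist_model)["ldist"]
coef(price_area_model)["ldist"]
```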
Compare the coefficients of both models. Notice that adding `area` improves the quality of the model, but also reduces the coefficient size of `dist`.

```{r, eval=FALSE}
summary(price_dist_model)
summary(price_area_model)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", price_dist_model, price_area_model, single.row = TRUE, header = FALSE, digits = 5)
```

Graphing illustrates the larger coefficient for `area`.

```{r, echo=FALSE}
par(mfrow = c(1, 2))
point_size <- 0.80

plot(y = hprice3$lprice, x = hprice3$ldist, frame = FALSE,
     main = "Log(distance from incinerator)", ylab = "", xlab = "",
     pch = 21, bg = "lightgrey", cex = point_size)
mtext(side = 2, line = 2, "Log( selling price )", cex = 1.25)
abline(lm(lprice ~ ldist, data = hprice3), col = "blue", lwd = 2)

plot(y = hprice3$lprice, x = hprice3$larea, frame = FALSE,
     main = "Log(square footage of house)", ylab = " ", xlab = "",
     pch = 21, bg = "lightgrey", cex = point_size)
abline(lm(lprice ~ larea, data = hprice3), col = "blue", lwd = 2)
```

\newpage

## Chapter 6: Multiple Regression: Further Issues

### **`Example 6.1:` Effects of Pollution on Housing Prices, standardized.**

Load the `hprice2` data and view the documentation.

```{r, message=FALSE, eval=FALSE}
data("hprice2")
?hprice2
```

Data from _Hedonic Housing Prices and the Demand for Clean Air_, by Harrison, D. and D.L. Rubinfeld, Journal of Environmental Economics and Management 5, 81-102. Diego Garcia, a former Ph.D. student in economics at MIT, kindly provided these data, which he obtained from the book _Regression Diagnostics: Identifying Influential Data and Sources of Collinearity_, by D.A. Belsley, E. Kuh, and R. Welsch, 1990. New York: Wiley.

$price$: median housing price.

$nox$: nitrous oxide concentration, parts per million.

$crime$: number of reported crimes per capita.

$rooms$: average number of rooms in houses in the community.

$dist$: weighted distance of the community to 5 employment centers.

$stratio$: average student-teacher ratio of schools in the community.

$$price = \beta_0 + \beta_1nox + \beta_2crime + \beta_3rooms + \beta_4dist + \beta_5stratio + \mu$$

Estimate the usual `lm` model.

```{r}
housing_level <- lm(price ~ nox + crime + rooms + dist + stratio, data = hprice2)
```

Estimate the same model, but with standardized coefficients, by wrapping each variable with R's `scale` function:

$$\widehat{zprice} = \beta_1znox + \beta_2zcrime + \beta_3zrooms + \beta_4zdist + \beta_5zstratio$$

```{r}
housing_standardized <- lm(scale(price) ~ 0 + scale(nox) + scale(crime) + scale(rooms) + scale(dist) + scale(stratio), data = hprice2)
```

Compare the results of both models.

```{r, eval=FALSE}
summary(housing_level)
summary(housing_standardized)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", housing_level, housing_standardized, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

### **`Example 6.2:` Effects of Pollution on Housing Prices, Quadratic Term**

Modify the housing model from **`example 4.5`**, adding a quadratic term in _rooms_:

$$log(price) = \beta_0 + \beta_1log(nox) + \beta_2log(dist) + \beta_3rooms + \beta_4rooms^2 + \beta_5stratio + \mu$$

```{r}
housing_model_4.5 <- lm(lprice ~ lnox + log(dist) + rooms + stratio, data = hprice2)
housing_model_6.2 <- lm(lprice ~ lnox + log(dist) + rooms + I(rooms^2) + stratio, data = hprice2)
```
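Since the two specifications are nested, an $F$ test of the quadratic term offers a formal comparison; a minimal sketch using base R's `anova`:

```{r, eval=FALSE}
# Test the restriction that the coefficient on rooms^2 is zero
anova(housing_model_4.5, housing_model_6.2)
```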
Compare the results with the model from `example 4.5`.

```{r, eval=FALSE}
summary(housing_model_4.5)
summary(housing_model_6.2)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", housing_model_4.5, housing_model_6.2, single.row = TRUE, header = FALSE, digits = 5)
```

Estimate the turning point at which the effect of `rooms` changes from negative to positive.

$$x^* = \left|\frac{\hat{\beta}_1}{2\hat{\beta}_2}\right|$$

```{r}
beta_1 <- summary(housing_model_6.2)$coefficients["rooms", 1]
beta_2 <- summary(housing_model_6.2)$coefficients["I(rooms^2)", 1]

turning_point <- abs(beta_1 / (2 * beta_2))
print(turning_point)
```

Compute the percent change across a range of average rooms. Include the smallest value, the turning point, and the largest value.

```{r}
Rooms <- c(min(hprice2$rooms), 4, turning_point, 5, 5.5, 6.45, 7.5, max(hprice2$rooms))

Percent.Change <- 100 * (beta_1 + 2 * beta_2 * Rooms)

kable(data.frame(Rooms, Percent.Change))
```

```{r, echo=FALSE}
from <- min(hprice2$rooms)
to <- max(hprice2$rooms)
rooms <- seq(from = from, to = to, by = ((to - from) / (NROW(hprice2) - 1)))
quadratic <- abs(100 * summary(housing_model_6.2)$coefficients["rooms", 1] + 200 * summary(housing_model_6.2)$coefficients["I(rooms^2)", 1] * rooms)
housing_model_frame <- model.frame(housing_model_6.2)
housing_sq <- abs(beta_1 * housing_model_frame[, "rooms"]) + beta_2 * housing_model_frame[, "I(rooms^2)"]
```

Graph the log of the selling price against the number of rooms. Superimpose a simple model as well as a quadratic model and examine the difference.

```{r, echo=FALSE}
rooms_interaction <- lm(lprice ~ rooms + I(rooms^2), data = hprice2)
par(mfrow = c(1, 2))

plot(y = hprice2$lprice, x = hprice2$rooms, xaxt = "n", pch = 21, bg = "lightgrey",
     frame = FALSE, main = "lprice ~ rooms", xlab = "Rooms", ylab = "")
mtext(side = 2, line = 2, "Log( selling price )", cex = 1.25)
axis(side = 1, at = c(min(hprice2$rooms), 4, 5, 6, 7, 8, max(hprice2$rooms)))
abline(lm(lprice ~ rooms, data = hprice2), col = "red", lwd = 2.5)

plot(y = hprice2$lprice, x = hprice2$rooms, xaxt = "n", pch = 21, bg = "lightgrey",
     frame = FALSE, main = "lprice ~ rooms + I(rooms^2)", xlab = "Rooms", ylab = " ")
axis(side = 1, at = c(min(hprice2$rooms), 4, 5, 6, 7, 8, max(hprice2$rooms)))
lines(sort(hprice2$rooms), sort(fitted(rooms_interaction)), col = "red", lwd = 2.5)
```

\newpage

## Chapter 7: Multiple Regression Analysis with Qualitative Information

### **`Example 7.4:` Housing Price Regression, Qualitative Binary variable**

This time, use the `hprice1` data.

```{r}
data("hprice1")
```

```{r, eval=FALSE}
?hprice1
```

Data collected from the real estate pages of the Boston Globe during 1990. These are homes that sold in the Boston, MA area.

**$lprice:$** Log(house price, $1000s)

**$llotsize:$** Log(size of lot in square feet)

**$lsqrft:$** Log(size of house in square feet)

**$bdrms:$** number of bedrooms

**$colonial:$** =1 if home is colonial style
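Since `colonial` is a binary indicator, its coefficient simply shifts the regression intercept for colonial-style homes. A quick tabulation shows how the sample splits; a minimal sketch:

```{r, eval=FALSE}
# Count of non-colonial (0) versus colonial (1) homes
table(hprice1$colonial)
```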
```{r, fig.height=8, eval=FALSE, echo=FALSE}
par(mfrow = c(2, 2))
palette(rainbow(6, alpha = 0.8))

plot(y = hprice1$lprice, x = hprice1$llotsize, col = hprice1$bdrms, pch = 19,
     frame = FALSE, main = "Log(lot size)", xlab = "", ylab = "")
mtext(side = 2, line = 2, "Log( selling price )", cex = 1.25)

plot(y = hprice1$lprice, x = hprice1$lsqrft, col = hprice1$bdrms, pch = 19,
     frame = FALSE, main = "Log(home size)", xlab = "Rooms", ylab = " ")
legend(8, 5.8, sort(unique(hprice1$bdrms)), col = 1:length(hprice1$bdrms),
       pch = 19, title = "bdrms")

hprice1$colonial <- as.factor(hprice1$colonial)
palette(rainbow(2, alpha = 0.8))

plot(y = hprice1$lprice, x = hprice1$llotsize, col = hprice1$colonial, pch = 19,
     bg = "lightgrey", frame = FALSE, main = "Log(lot size)", xlab = "", ylab = "")
mtext(side = 2, line = 2, "Log( selling price )", cex = 1.25)

plot(y = hprice1$lprice, x = hprice1$lsqrft, col = hprice1$colonial, pch = 19,
     frame = FALSE, main = "Log(home size)", xlab = "Rooms", ylab = " ")
legend(8, 5.25, unique(hprice1$colonial), col = 1:length(hprice1$colonial),
       pch = 19, title = "colonial")
```

$$\widehat{log(price)} = \beta_0 + \beta_1log(lotsize) + \beta_2log(sqrft) + \beta_3bdrms + \beta_4colonial$$

Estimate the coefficients of the above linear model on the `hprice1` data set.

```{r}
housing_qualitative <- lm(lprice ~ llotsize + lsqrft + bdrms + colonial, data = hprice1)
```

```{r, eval=FALSE}
summary(housing_qualitative)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", housing_qualitative, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

## Chapter 8: Heteroskedasticity

### **`Example 8.9:` Determinants of Personal Computer Ownership**

$$\widehat{PC} = \beta_0 + \beta_1hsGPA + \beta_2ACT + \beta_3parcoll$$

Christopher Lemmon, a former MSU undergraduate, collected these data from a survey he took of MSU students in Fall 1994. Load `gpa1` and create a new variable, `parcoll`, combining `fathcoll` and `mothcoll`. This new column indicates if at least one parent went to college.

```{r, message=FALSE}
data("gpa1")

gpa1$parcoll <- as.integer(gpa1$fathcoll == 1 | gpa1$mothcoll == 1)

GPA_OLS <- lm(PC ~ hsGPA + ACT + parcoll, data = gpa1)
```

Calculate the weights from the OLS fitted values, then pass their inverse to the `weights` argument.

```{r}
weights <- GPA_OLS$fitted.values * (1 - GPA_OLS$fitted.values)

GPA_WLS <- lm(PC ~ hsGPA + ACT + parcoll, data = gpa1, weights = 1/weights)
```

Compare the OLS and WLS model in the table below:

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", GPA_OLS, GPA_WLS, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

## Chapter 9: More on Specification and Data Issues

### **`Example 9.8:` R&D Intensity and Firm Size**

$$rdintens = \beta_0 + \beta_1sales + \beta_2profmarg + \mu$$

From _Businessweek R&D Scoreboard_, October 25, 1991. Load the data and estimate the model.

```{r, message=FALSE}
data("rdchem")

all_rdchem <- lm(rdintens ~ sales + profmarg, data = rdchem)
```
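Before plotting, the dominant observation can be located directly; a small sketch pulling out the firm with the largest sales:

```{r, eval=FALSE}
# The observation with the largest sales value
rdchem[which.max(rdchem$sales), c("sales", "rdintens", "profmarg")]
```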
Plotting the data reveals an outlier on the far right of the plot, which will skew the results of our model.

```{r, echo=FALSE}
plot_title <- "FIGURE 9.1: Scatterplot of R&D intensity against firm sales"
x_axis <- "firm sales (in millions of dollars)"
y_axis <- "R&D as a percentage of sales"

plot(rdintens ~ sales, pch = 21, bg = "lightgrey", data = rdchem,
     main = plot_title, xlab = x_axis, ylab = y_axis)
```

So, we can estimate the model without that data point to gain a better understanding of how `sales` and `profmarg` describe `rdintens` for most firms. We can use the `subset` argument of the linear model function to indicate that we only want to estimate the model using observations with sales less than the maximum.

```{r}
smallest_rdchem <- lm(rdintens ~ sales + profmarg, data = rdchem,
                      subset = (sales < max(sales)))
```

The table below compares the results of both models side by side. By removing the outlier firm, $sales$ becomes a more significant determinant of R&D expenditures.

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", all_rdchem, smallest_rdchem, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

## Chapter 10: Basic Regression Analysis with Time Series Data

### **`Example 10.2:` Effects of Inflation and Deficits on Interest Rates**

$$\widehat{i3} = \beta_0 + \beta_1inf_t + \beta_2def_t$$

Data from the _Economic Report of the President, 2004_, Tables B-64, B-73, and B-79.

```{r, message=FALSE}
data("intdef") # load data

# load the eXtensible Time Series package.
# xts is excellent for time series plots and
# properly indexing time series.
library(xts)

# create xts object from data.frame
# First, index year as yearmon class of monthly data.
# Note: I add 11/12 to set the month to December, end of year.
index <- zoo::as.yearmon(intdef$year + 11/12)

# Next, create the xts object, ordering by the index above.
intdef.xts <- xts(intdef[ ,-1], order.by = index)

# extract 3-month Tbill, inflation, and deficit data
intdef.xts <- intdef.xts[ ,c("i3", "inf", "def")]

# rename with clearer names
colnames(intdef.xts) <- c("Tbill3mo", "cpi", "deficit")

# plot the object, add a title, and place legend at top left.
plot(x = intdef.xts, main = "Inflation, Deficits, and Interest Rates",
     legend.loc = "topleft")

# Run a linear regression model
tbill_model <- lm(Tbill3mo ~ cpi + deficit, data = intdef.xts)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", tbill_model, single.row = TRUE, header = FALSE, digits = 5)
```

Now let's update the example with current data, pulled from Federal Reserve Economic Data (FRED) using the [quantmod package](https://CRAN.R-project.org/package=quantmod). Other than the convenient API, the package also formats time series data into [xts: eXtensible Time Series](https://CRAN.R-project.org/package=xts) objects, which add many features and benefits when working with time series.

```{r, eval=FALSE, message=FALSE, warning=FALSE}
# DO NOT RUN
library(quantmod)

# Tbill, 3 month
getSymbols("TB3MS", src = "FRED")

# convert to annual observations and convert the index to type `yearmon`.
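# (A note on this step: to.yearly() collapses the monthly series to one
#  observation per year, and OHLC = FALSE keeps the original single column
#  rather than open/high/low/close columns.)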
TB3MS <- to.yearly(TB3MS, OHLC = FALSE, drop.time = TRUE)
index(TB3MS) <- zoo::as.yearmon(index(TB3MS))

# Inflation
getSymbols("FPCPITOTLZGUSA", src = "FRED")

# Convert the index to yearmon and shift FRED's Jan 1st to Dec
index(FPCPITOTLZGUSA) <- zoo::as.yearmon(index(FPCPITOTLZGUSA)) + 11/12

# Rename and update column names
inflation <- FPCPITOTLZGUSA
colnames(inflation) <- "inflation"

## Deficit, percent of GDP: Federal outlays - federal receipts

# Download outlays
getSymbols("FYFRGDA188S", src = "FRED")

# Let's move the index from Jan 1st to Dec 30th/31st
index(FYFRGDA188S) <- zoo::as.yearmon(index(FYFRGDA188S)) + 11/12

# Rename and update column names
outlays <- FYFRGDA188S
colnames(outlays) <- "outlays"

# Download receipts
getSymbols("FYONGDA188S", src = "FRED")

# Let's move the index from Jan 1st to Dec 30th/31st
index(FYONGDA188S) <- zoo::as.yearmon(index(FYONGDA188S)) + 11/12

# Rename and update column names
receipts <- FYONGDA188S
colnames(receipts) <- "receipts"
```

Now that all the data have been downloaded, we can calculate the deficit from the federal `outlays` and `receipts` data. Next, we will merge our new `deficit` variable with the `inflation` and `TB3MS` variables. As these are all `xts` time series objects, the `merge` function will automatically key off each series' date index, ensuring integrity and alignment between each observation and its respective date. Additionally, xts provides easy chart construction with its plot method.

```{r, eval=FALSE, message=FALSE, warning=FALSE}
# DO NOT RUN

# create deficit from outlays - receipts;
# xts objects align on their date index when combined
deficit <- outlays - receipts
colnames(deficit) <- "deficit"

# Merge and remove leading and trailing NAs for a balanced data matrix
intdef_updated <- merge(TB3MS, inflation, deficit)
intdef_updated <- zoo::na.trim(intdef_updated)

# Plot all
plot(intdef_updated, main = "T-bill (3mo rate), inflation, and deficit (% of GDP)",
     legend.loc = "topright")
```

Now let's run the model again. Inflation plays a much more prominent role in the 3-month T-bill rate than the deficit.

```{r, eval=FALSE}
# DO NOT RUN
updated_model <- lm(TB3MS ~ inflation + deficit, data = intdef_updated)
```

```{r, eval=FALSE, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
# DO NOT RUN
updated_model <- lm(TB3MS ~ inflation + deficit, data = intdef_updated)
stargazer(type = "html", updated_model, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

## Chapter 11: Further Issues in Using OLS with Time Series Data

### **`Example 11.7:` Wages and Productivity**

$$\widehat{log(hrwage_t)} = \beta_0 + \beta_1log(outphr_t) + \beta_2t$$

Data from the _Economic Report of the President, 1989_, Table B-47. The data are for the non-farm business sector. Estimate the model in levels with a linear time trend, then re-estimate it in first differences.

```{r, message=FALSE}
data("earns")

wage_time <- lm(lhrwage ~ loutphr + t, data = earns)
```

```{r}
wage_diff <- lm(diff(lhrwage) ~ diff(loutphr), data = earns)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", wage_time, wage_diff, single.row = TRUE, header = FALSE, digits = 5)
```
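The differenced specification can also be built explicitly, which makes the transformation easy to inspect; a minimal sketch, equivalent to the `wage_diff` model above (the `d_` objects are introduced here only for illustration):

```{r, eval=FALSE}
# Construct the first differences by hand and re-estimate
d_lhrwage <- diff(earns$lhrwage)
d_loutphr <- diff(earns$loutphr)
summary(lm(d_lhrwage ~ d_loutphr))
```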
\newpage

## Chapter 12: Serial Correlation and Heteroskedasticity in Time Series Regressions

### **`Example 12.8:` Heteroskedasticity and the Efficient Markets Hypothesis**

These are Wednesday closing prices of the value-weighted NYSE average, available in many publications. Wooldridge does not recall the particular source used when he collected these data at MIT, but notes that probably the easiest way to get similar data is to go to the NYSE web site, [www.nyse.com](https://www.nyse.com/data-and-tech).

$$return_t = \beta_0 + \beta_1return_{t-1} + \mu_t$$

```{r, message=FALSE, eval=FALSE}
data("nyse")
?nyse
```

```{r}
return_AR1 <- lm(return ~ return_1, data = nyse)
```

$$\hat{\mu}^2_t = \beta_0 + \beta_1return_{t-1} + residual_t$$

```{r}
return_mu <- residuals(return_AR1)

mu2_hat_model <- lm(return_mu^2 ~ return_1, data = return_AR1$model)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", return_AR1, mu2_hat_model, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

### **`Example 12.9:` ARCH in Stock Returns**

$$\hat{\mu}^2_t = \beta_0 + \beta_1\hat{\mu}^2_{t-1} + residual_t$$

We still have `return_mu` in the working environment, so we can use it to create $\hat{\mu}^2_t$ (`mu2_hat`) and $\hat{\mu}^2_{t-1}$ (`mu2_hat_1`). Notice the use of `R`'s vector subsetting to perform the lag operation: to create `mu2_hat`, we drop the first observation of `return_mu` and square the result; to create the lag `mu2_hat_1`, we use negative indexing with a call to the `NROW` function to drop the last observation, and square that result. Now both contain $688$ observations, and we can estimate a standard linear model.

```{r}
mu2_hat <- return_mu[-1]^2

mu2_hat_1 <- return_mu[-NROW(return_mu)]^2

arch_model <- lm(mu2_hat ~ mu2_hat_1)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", arch_model, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

## Chapter 13: Pooling Cross Sections across Time: Simple Panel Data Methods

### **`Example 13.7:` Effect of Drunk Driving Laws on Traffic Fatalities**

Wooldridge collected these data from two sources, the 1992 _Statistical Abstract of the United States_ (Tables 1009, 1012) and _A Digest of State Alcohol-Highway Safety Related Legislation_, 1985 and 1990, published by the U.S. National Highway Traffic Safety Administration.

$$\widehat{\Delta{dthrte}} = \beta_0 + \beta_1\Delta{open} + \beta_2\Delta{admn}$$

```{r, message=FALSE, eval=FALSE}
data("traffic1")
?traffic1
```

```{r}
DD_model <- lm(cdthrte ~ copen + cadmn, data = traffic1)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", DD_model, single.row = TRUE, header = FALSE, digits = 5)
```

\newpage

## Chapter 18: Advanced Time Series Topics

### **`Example 18.8:` Forecasting the U.S. Unemployment Rate**

Data from the _Economic Report of the President, 2004_, Tables B-42 and B-64.

```{r, message=FALSE, eval=FALSE}
data("phillips")
?phillips
```

$$\widehat{unem_t} = \beta_0 + \beta_1unem_{t-1}$$

Estimate the linear model in the usual way, and note the use of the `subset` argument to restrict the data to years up to and including 1996.

```{r}
phillips_train <- subset(phillips, year <= 1996)

unem_AR1 <- lm(unem ~ unem_1, data = phillips_train)
```

$$\widehat{unem_t} = \beta_0 + \beta_1unem_{t-1} + \beta_2inf_{t-1}$$

```{r}
unem_inf_VAR1 <- lm(unem ~ unem_1 + inf_1, data = phillips_train)
```

```{r, results='asis', echo=FALSE, warning=FALSE, message=FALSE}
stargazer(type = "html", unem_AR1, unem_inf_VAR1, single.row = TRUE, header = FALSE, digits = 5)
```

Now, use the `subset` argument to create our testing data set, containing observations after 1996. Next, pass each model object along with the test set to the `predict` function. Finally, `cbind` or "column bind" both forecasts, as well as the year and unemployment rate of the test set.

```{r, warning=FALSE, message=FALSE, echo=TRUE}
phillips_test <- subset(phillips, year >= 1997)

AR1_forecast <- predict.lm(unem_AR1, newdata = phillips_test)

VAR1_forecast <- predict.lm(unem_inf_VAR1, newdata = phillips_test)

kable(cbind(phillips_test[ ,c("year", "unem")], AR1_forecast, VAR1_forecast))
```
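To summarize out-of-sample accuracy in a single number, one can compute each model's root mean squared forecast error, where smaller values indicate better forecasts; a minimal sketch, defining a small `rmse` helper for illustration:

```{r, eval=FALSE}
# Root mean squared forecast error for each model
rmse <- function(error) sqrt(mean(error^2))

c(AR1  = rmse(phillips_test$unem - AR1_forecast),
  VAR1 = rmse(phillips_test$unem - VAR1_forecast))
```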
\newpage

# Appendix

### Using R for Introductory Econometrics

This is an excellent open source text complementary to "Introductory Econometrics" by Jeffrey M. Wooldridge, and it should be your number one resource. The following excerpt is from the book's website:

> This book introduces the popular, powerful and free programming language and software package R with a focus on the implementation of standard tools and methods used in econometrics. Unlike other books on similar topics, it does not attempt to provide a self-contained discussion of econometric models and methods. Instead, it builds on the excellent and popular textbook "Introductory Econometrics" by Jeffrey M. Wooldridge.

Heiss, Florian. _Using R for Introductory Econometrics_. ISBN: 979-8648424364, CreateSpace Independent Publishing Platform, 2020, Dusseldorf, Germany. [http://www.urfie.net/](http://www.urfie.net/).

### Applied Econometrics with R

From the publisher's website:

> This is the first book on applied econometrics using the R system for statistical computing and graphics. It presents hands-on examples for a wide range of econometric models, from classical linear regression models for cross-section, time series or panel data and the common non-linear models of microeconometrics such as logit, probit and tobit models, to recent semiparametric extensions. In addition, it provides a chapter on programming, including simulations, optimization, and an introduction to R tools enabling reproducible econometric research. An R package accompanying this book, AER, is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=AER.

Kleiber, Christian and Achim Zeileis. _Applied Econometrics with R_. ISBN: 978-0-387-77316-2, Springer-Verlag, 2008, New York. [https://link.springer.com/book/10.1007/978-0-387-77318-6](https://link.springer.com/book/10.1007/978-0-387-77318-6)

\newpage

## Bibliography

Jeffrey M. Wooldridge (2020). _Introductory Econometrics: A Modern Approach, 7th edition_. ISBN-13: 978-1-337-55886-0. Mason, Ohio: South-Western Cengage Learning.

Jeffrey A. Ryan and Joshua M. Ulrich (2020). _quantmod: Quantitative Financial Modelling Framework_. R package version 0.4.18. https://CRAN.R-project.org/package=quantmod

R Core Team (2021). _R: A language and environment for statistical computing_. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Marek Hlavac (2018). _stargazer: Well-Formatted Regression and Summary Statistics Tables_. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

Mark van der Loo (2020). "A method for deriving information from running R code." _The R Journal_, accepted for publication.

Yihui Xie (2021). _knitr: A General-Purpose Package for Dynamic Report Generation in R_. R package version 1.33. https://CRAN.R-project.org/package=knitr