Undergraduate Econometrics: Problem Set 4
Due: October 9th at 12:00 pm
0. Book Problems 3.4, 3.12, 3.13 (i,ii), 4.5, 4.10, 4.11
1. Playing with OVB. In this problem we learn about omitted variable bias through a simulation
exercise. The file ovbSimulation.R, which is attached to this assignment, contains code that
simulates the impact of running a single-variable regression when U and X are correlated. In
particular, it begins with the following regression equation:
$Y = \beta_0 + \beta_1 X + U$
and allows for $\rho_{XU} > 0$. The code generates an estimate $\hat{\beta}_1$ in each of 100 simulated datasets and
stores the results in a data frame called betaData. This data frame contains two variables: BETAHAT
and BIAS_DUMMY. The first variable is an estimated $\hat{\beta}_1$ from some sample and the second variable is
a 0/1 variable for whether this estimate is drawn from a biased or unbiased sample.
a) Show that $E(U \mid X) = 0$ implies that $E(UX) = 0$. Then, explain why this means that OLS does
not work if $\rho_{XU} \neq 0$. Hint: Use the law of total expectations on $E(UX \mid X)$.
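As a reminder for the hint in (a), the general identities involved can be written as follows. This is only a restatement of standard results, not a worked solution; the functions $h$ and $g$ are notation introduced here for illustration.

```latex
\begin{align*}
E\big[h(X,U)\big] &= E\Big[\,E\big[h(X,U) \mid X\big]\Big]
  && \text{(law of total expectations)} \\
E\big[g(X)\,U \mid X\big] &= g(X)\,E\big[U \mid X\big]
  && \text{(functions of $X$ are constant given $X$)} \\
\operatorname{Cov}(X,U) &= E(UX) - E(X)\,E(U)
  && \text{(definition of covariance)}
\end{align*}
```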
b) At the bottom of the code is some space labeled Student Analysis. Here, fill in code to
calculate the mean of the $\hat{\beta}_1$ estimates in each of the biased and unbiased samples. Also fill in the
code to plot the overlapping densities. Both of these are commands from previous assignments.
Hint: Remember that to index a subset of data we can write
betaData$BETAHAT[betaData$BIAS_DUMMY==?]. What goes in place of "?"?
c) At the top of the code is a variable $\rho$ which governs the correlation between X and U. First,
set $\rho = 0$ and run the code. Report the estimated means in the biased and unbiased samples. The
true value is $\beta_1 = 2$. Perform a two-sided t-test for each sample on whether or not $\hat{\beta}_1$ is statistically
different from 2. As a reminder, to do a t-test, first calculate the mean and the standard deviation,
then construct the t-statistic. You should report two separate t-statistics for this exercise (one for
each of the 0/1 groups).
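The reminder in (c) can be sketched in R as below. This is a minimal illustration under stated assumptions: `betaHats` is a hypothetical vector of simulated draws standing in for one group's estimates (it is not output from ovbSimulation.R), and the sketch assumes the test concerns the mean of the estimates, so the standard error is the standard deviation divided by the square root of the number of estimates.

```r
# Hypothetical stand-in for one group's estimates (NOT produced by ovbSimulation.R)
set.seed(1)
betaHats <- rnorm(100, mean = 2, sd = 0.1)

nullValue <- 2                        # hypothesized value under the null
m <- mean(betaHats)                   # sample mean of the estimates
s <- sd(betaHats)                     # sample standard deviation of the estimates
n <- length(betaHats)                 # number of estimates

# t-statistic for H0: mean of the estimates equals nullValue
tStat <- (m - nullValue) / (s / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
pValue <- 2 * pt(-abs(tStat), df = n - 1)

print(c(tStat = tStat, pValue = pValue))
```

The built-in `t.test(betaHats, mu = 2)` reports the same statistic, which is a convenient check on the hand calculation.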
d) Now set $\rho = 0.55$. Plot the distribution of $\hat{\beta}_1$ by group. Then repeat (c) for the biased group. Is
$\hat{\beta}_1$ now statistically significantly different from 2?
e) Repeat (d) for $\rho = 0.01$. Relative to the sampling variation in $\hat{\beta}_1$, does the bias seem very important
here? (No rigorous answer is required.)
f) Increase the sample size to N = 500 and rerun the code for $\rho = 0.5$. What happens to the
variance of $\hat{\beta}_1$ in both the biased and the unbiased samples? What does this exercise suggest about the
usefulness of having a large sample size when your estimator is biased?
g) Optional problem for those interested in exploring computation. Increase the number of
samples to 500 (this is the S variable in the code). Fix the number of observations at 100 and let
$\rho = 0.25$. In this case, calculate the fraction of the time that you would find $\hat{\beta}_1$ to be statistically
indistinguishable from 2 despite the bias. Hint: You will first need to calculate the
upper and lower bounds on $\hat{\beta}_1$ between which you would not reject the null hypothesis, with $\beta_1 = 2$ and the
standard deviation $\sigma$ as calculated in the sample.
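The idea behind the hint in (g) can be sketched as follows. This is an illustration only: `betaHats` is a hypothetical vector of draws centered away from 2 to mimic bias (it is not output from ovbSimulation.R), and the sketch assumes a 5% two-sided test with a normal critical value and uses the standard deviation of the estimates as the measure of sampling variation.

```r
# Hypothetical biased estimates (NOT produced by ovbSimulation.R)
set.seed(2)
betaHats <- rnorm(500, mean = 2.1, sd = 0.1)

nullValue <- 2
sigmaHat  <- sd(betaHats)             # spread of the estimates across samples
critVal   <- qnorm(0.975)             # two-sided 5% critical value (assumed level)

# Bounds between which an estimate would not lead to rejecting H0
lower <- nullValue - critVal * sigmaHat
upper <- nullValue + critVal * sigmaHat

# Fraction of estimates falling inside the non-rejection region
fracNotRejected <- mean(betaHats >= lower & betaHats <= upper)
print(fracNotRejected)
```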
2. Do Doctors Affect Drinking? In this problem we will use the drinkData.Rdata dataset.
These data are taken from "The Effect of Physician Advice on Alcohol Consumption: Count
Regression with an Endogenous Treatment Effect," by Donald S. Kenkel and Joseph V. Terza (Journal
of Applied Econometrics, 16: 165-184 (2001)). The goal of this paper was to understand whether doctors
could influence people's drinking. The authors do some sophisticated work to try to deal
with concerns about causality. We will not replicate their methods. Instead, we will ignore issues
related to omitted variable bias and focus on the tools of multiple regression. There is a complete
description of the dataset at the back of this problem set.
a) First let us get a feel for the dataset. The variable DRINKS is the number of drinks an
individual has had and the variable ADVISE is a 0/1 variable for whether a person's doctor has
told them to drink less. Report the mean number of drinks per person in each group. Similarly,
calculate the mean education and income by group. Finally, what fraction of individuals in each
group are between 30 and 40, and what fraction are between 40 and 50? Do a t-test for a difference in
group means of income and education. Do they appear to be different?
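Group means and a two-sample t-test of this kind can be sketched in R as below. The data frame `toy` is simulated for illustration and is not the drinkData.Rdata dataset; only the column names are borrowed from the variable list in this problem set, and the distributions used to generate them are arbitrary assumptions.

```r
# Simulated stand-in data (NOT drinkData.Rdata); column names follow the variable list
set.seed(3)
n <- 200
toy <- data.frame(
  ADVISE  = rbinom(n, 1, 0.3),          # 0/1 group indicator
  EDUC    = round(rnorm(n, 13, 2)),     # years of schooling
  EDITINC = rexp(n, rate = 1 / 2.5),    # monthly income in $1000
  AGE30   = rbinom(n, 1, 0.25)          # 0/1 age dummy
)

# Mean of each variable within each ADVISE group;
# the mean of a 0/1 dummy is the fraction of the group with that characteristic
groupMeans <- aggregate(cbind(EDUC, EDITINC, AGE30) ~ ADVISE, data = toy, FUN = mean)
print(groupMeans)

# Two-sample t-test for a difference in mean income between the groups
# (t.test uses the unequal-variance Welch version by default)
incTest <- t.test(EDITINC ~ ADVISE, data = toy)
print(incTest$statistic)
```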
b) What is a possible source of omitted variable bias in a regression of DRINKS on ADVISE?
You may think about the variables above or something else. Remember: omitted variable bias has
c) Regress DRINKS on ADVISE and do a one-sided significance test on ADVISE. What is the sign
of the coefficient on ADVISE? Why do you think this might be the case?
d) Now run the same regression but include income, education and all the age dummies as
controls. Report the results (by hand). What happens to the coefficient on ADVISE?
e) Do an F-test to determine if age does not matter for drinking habits.
f) Create a variable called EVERDRINK that is a 0/1 variable for DRINKS being positive.
Regress this on all age variables. Do an F-test to determine if the choice of whether to drink at all
depends on age.
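A joint F-test on a set of dummy variables compares a restricted regression (without the dummies) to an unrestricted one (with them). The sketch below shows the mechanics on simulated data. The data frame `toy` is hypothetical and is not drinkData.Rdata, only two age dummies are used to keep it short, and the outcome is generated arbitrarily.

```r
# Simulated stand-in data (NOT drinkData.Rdata); names mimic the variable list
set.seed(4)
n <- 300
ageGroup <- sample(c("under30", "AGE30", "AGE40"), n, replace = TRUE)
toy <- data.frame(
  ADVISE = rbinom(n, 1, 0.3),
  AGE30  = as.numeric(ageGroup == "AGE30"),
  AGE40  = as.numeric(ageGroup == "AGE40")
)
toy$DRINKS <- rpois(n, lambda = 8 + 2 * toy$ADVISE)

# Restricted model omits the age dummies; unrestricted model includes them
restricted   <- lm(DRINKS ~ ADVISE, data = toy)
unrestricted <- lm(DRINKS ~ ADVISE + AGE30 + AGE40, data = toy)

# F-test of H0: all age coefficients are zero
fTest <- anova(restricted, unrestricted)
print(fTest)

# The same statistic by hand from the sums of squared residuals
ssrR <- sum(resid(restricted)^2)
ssrU <- sum(resid(unrestricted)^2)
q    <- 2                               # number of restrictions tested
fManual <- ((ssrR - ssrU) / q) / (ssrU / df.residual(unrestricted))
print(fManual)
```

Computing the statistic both ways is a useful check that the restricted and unrestricted models differ only in the variables being tested.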
Variable    Description
DRINKS      Total drinks over a two-week period
ADVISE      Dummy variable for whether the individual has been told to drink less by a doctor
EDITINC     Monthly income ($1000)
AGE30       30 < age ≤ 40
AGE40       40 < age ≤ 50
AGE50       50 < age ≤ 60
AGE60       60 < age ≤ 70
AGEGT70     70 < age
EDUC        Years of schooling
OTHER       Non-white, non-black
DIVSEP      Divorced or separated
MEDICARE    Insurance through Medicare
MEDICAID    Insurance through Medicaid
CHAMPUS     Military insurance
HLTHINS     Health insurance
REGMED      Regular source of care
DRI         See same doctor
MAIORLIM    Limits on major daily activities
SOMELIM     Limits on some daily activities
HVDIAB      Have diabetes
HHRTCOND    Have heart condition
HADSTROKE   Had stroke