Undergraduate Econometrics: Problem Set 4

Due: October 9th at 12:00 pm

0. Book Problems 3.4, 3.12, 3.13 (i,ii), 4.5, 4.10, 4.11

1. Playing with OVB. In this file we learn about omitted variable bias through a simulation exercise. The file ovbSimulation.R, which is attached to this assignment, contains code that simulates the impact of running a single-variable regression when U and X are correlated. In particular, it begins with the following regression equation:

$$Y = \beta_0 + \beta_1 X + U$$

and allows for $\rho_{XU} > 0$. The code itself generates an estimate of $\hat{\beta}$ in 100 simulated datasets and stores the results in a data frame called betaData. This data frame contains two variables: BETAHAT and BIAS_DUMMY. The first variable is an estimated $\hat{\beta}$ from some sample and the second variable is a 0/1 variable for whether this estimate is drawn from a biased or unbiased sample.
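For orientation, a simulation of this kind could be sketched as follows. This is not the contents of ovbSimulation.R: the names rho, N, S, beta0, beta1, and simulateBetaHat are illustrative assumptions, as are the standard normal draws for X and U and the intercept of 1.

```r
# Hypothetical sketch of an OVB simulation; NOT the actual ovbSimulation.R.
set.seed(1)
rho   <- 0.5   # assumed correlation between X and U in the biased samples
N     <- 100   # observations per simulated dataset
S     <- 100   # number of simulated datasets per group
beta0 <- 1     # assumed intercept
beta1 <- 2     # true slope

# Draw one dataset with corr(X, U) = r and return the OLS slope estimate.
simulateBetaHat <- function(r) {
  X <- rnorm(N)
  U <- r * X + sqrt(1 - r^2) * rnorm(N)  # Var(U) = 1 and corr(X, U) = r
  Y <- beta0 + beta1 * X + U
  unname(coef(lm(Y ~ X))[2])
}

# Stack the unbiased (r = 0) and biased (r = rho) estimates in one data frame.
betaData <- data.frame(
  BETAHAT    = c(replicate(S, simulateBetaHat(0)),
                 replicate(S, simulateBetaHat(rho))),
  BIAS_DUMMY = rep(c(0, 1), each = S)
)
```

Under these assumed values, Var(X) = 1 and Cov(X, U) = rho, so the biased slope estimates center near beta1 + rho = 2.5 rather than 2.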

a) Show that $E(U \mid X) = 0$ implies that $E(UX) = 0$. Then, explain why this means that OLS does not work if $\rho_{XU} \neq 0$. Hint: Use the law of total expectations on $E(UX \mid X)$.
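One way the hint might be unpacked is sketched below. It is only an outline of the argument, using the same symbols as the problem; the final step, relating this to OLS, is left to the reader.

```latex
\begin{align*}
E(UX) &= E\big[\,E(UX \mid X)\,\big]  && \text{law of total expectations} \\
      &= E\big[\,X \, E(U \mid X)\,\big] && \text{$X$ is known given $X$} \\
      &= E[\,X \cdot 0\,] = 0. \\[4pt]
E(U)  &= E\big[\,E(U \mid X)\,\big] = 0
      \quad\Longrightarrow\quad
      \operatorname{Cov}(X, U) = E(UX) - E(U)\,E(X) = 0.
\end{align*}
```

Read in reverse, a nonzero covariance (and hence a nonzero $\rho_{XU}$) is incompatible with $E(U \mid X) = 0$.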

b) At the bottom of the code is some space labeled Student Analysis. Here, fill in code to calculate the mean of the $\hat{\beta}$ estimates in each of the biased and unbiased samples. Also fill in the code to plot the overlapping densities. Both of these are commands from previous assignments. Hint: Remember that to index a subset of data we can use betaData$BETAHAT[betaData$BIAS_DUMMY==?]. What goes in “?”
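The indexing pattern in the hint can be illustrated on a toy data frame. Everything here (toy, VALUE, GROUP, and the simulated numbers) is hypothetical and chosen so as not to give away the answer for betaData. The plotting calls are base R and may differ from the commands used in previous assignments.

```r
# Toy data frame (hypothetical; not betaData) to illustrate logical indexing.
set.seed(2)
toy <- data.frame(
  VALUE = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
  GROUP = rep(c(0, 1), each = 50)
)

# Mean of VALUE within each group via logical indexing.
meanGroup0 <- mean(toy$VALUE[toy$GROUP == 0])
meanGroup1 <- mean(toy$VALUE[toy$GROUP == 1])

# Overlapping densities: draw one density, then add the other with lines().
d0 <- density(toy$VALUE[toy$GROUP == 0])
d1 <- density(toy$VALUE[toy$GROUP == 1])
plot(d0, xlim = range(c(d0$x, d1$x)), ylim = c(0, max(d0$y, d1$y)),
     main = "Density of VALUE by GROUP")
lines(d1, lty = 2)
legend("topright", legend = c("GROUP = 0", "GROUP = 1"), lty = c(1, 2))
```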

c) At the top of the code is a variable $\rho$ which governs the correlation between X and U. First, set $\rho = 0$ and run the code. Report the estimated means in the biased and unbiased samples. The true value of $\beta = 2$. Perform a two-sided t-test for each sample on whether or not $\hat{\beta}$ is statistically different from 2. As a reminder, to do a t-test, first calculate the mean and the standard deviation, then construct the t-statistic. You should report two separate t-statistics for this exercise (one for each of the 0/1 groups).
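The reminder above can be sketched in R on made-up numbers. The helper tStat is hypothetical, and the sketch assumes the standard error of the mean is the standard deviation divided by the square root of the number of values, with a 5% significance level.

```r
# Hypothetical helper: t-statistic for H0: mean(x) = mu0,
# assuming the standard error of the mean is sd(x) / sqrt(n).
tStat <- function(x, mu0) {
  (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
}

set.seed(3)
x <- rnorm(100, mean = 2, sd = 0.1)  # made-up estimates, not from betaData

tManual  <- tStat(x, mu0 = 2)
tBuiltIn <- unname(t.test(x, mu = 2)$statistic)  # built-in check of tManual

# Two-sided test at the (assumed) 5% level.
reject <- abs(tManual) > qt(0.975, df = length(x) - 1)
```

Comparing tManual with the value from t.test is a quick way to confirm the hand calculation.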

d) Now set $\rho = .55$. Plot the distribution of $\hat{\beta}$ by group. Then repeat (c) for the biased group. Is $\hat{\beta}$ now statistically significantly different from 2?


e) Repeat (d) for $\rho = .01$. Relative to the sampling variation in $\hat{\beta}$, does the bias seem too important here? (No rigorous answer).

f) Increase the sample size to N = 500 and rerun the code for $\rho = .5$. What happens to the variance of $\hat{\beta}$ in both the biased and the unbiased samples? What does this exercise suggest about the usefulness of having a large sample size when your estimator is biased?

g) Optional problem for those interested in exploring computation. Increase the number of samples to 500 (this is the S variable in the code). Fix the number of observations to 100 and let $\rho = .25$. In this case, calculate the fraction of the time that you would estimate $\hat{\beta}$ to be statistically indistinguishable from 2 despite the bias. Hint: You will first need to calculate the upper and lower bounds on $\hat{\beta}$ at which you would not reject the null hypothesis with $\beta = 2$ and the $\hat{\sigma}^2$ as calculated in the sample.
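The bounds in the hint could be computed along the following lines. This is a sketch on made-up numbers, not a solution: it assumes the sigma-hat in the hint is the standard deviation of the estimates across simulated samples, uses a normal critical value at the 5% level, and the names betaHatToy, sigmaHat, lower, upper, and fracNotRejected are all hypothetical.

```r
set.seed(6)
betaHatToy <- rnorm(500, mean = 2.25, sd = 0.1)  # made-up biased estimates

# Assumed interpretation: sigma-hat is the spread of the estimates themselves.
sigmaHat <- sd(betaHatToy)

# Non-rejection region for H0: beta = 2 (two-sided, assumed 5% level).
lower <- 2 - qnorm(0.975) * sigmaHat
upper <- 2 + qnorm(0.975) * sigmaHat

# Share of estimates for which H0 would not be rejected.
fracNotRejected <- mean(betaHatToy >= lower & betaHatToy <= upper)
```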

2. Do Doctors Affect Drinking? In this problem we will exploit the drinkData.Rdata dataset. This data is taken from “The Effect of Physician Advice on Alcohol Consumption: Count Regression with an Endogenous Treatment Effect,” by Donald S. Kenkel and Joseph V. Terza (Journal of Applied Econometrics, 16: 165-184 (2001)). The goal of this paper was to understand whether doctors could affect people’s drinking behavior. The authors do some sophisticated work to try to deal with concerns about causality. We will not replicate their methods. Instead, we will ignore issues related to omitted variable bias and focus on the tools of multiple regression. There is a complete description of the dataset at the back of this problem set.

a) First let us get a feel for the dataset. The variable DRINKS is the number of drinks an individual has had and the variable ADVISE is a 0/1 variable for whether a person’s doctor has told them to drink less. Report the mean number of drinks per person in each group. Similarly, calculate the mean education and income by group. Finally, what fraction of individuals in each group are between 30 and 40, and what fraction are between 40 and 50? Do a t-test for a difference in group means of income and education. Do they appear to be different?
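Group means, group fractions, and a difference-in-means t-test can be sketched on simulated data as follows. The data frame fakeDrink is made up and only borrows column names from the variable list; none of its values come from drinkData.Rdata. Note that t.test defaults to the Welch (unequal variance) version, which may differ from the version taught in class.

```r
set.seed(4)
# Simulated stand-in data; values are NOT from drinkData.Rdata.
fakeDrink <- data.frame(
  ADVISE = rbinom(200, 1, 0.3),
  EDUC   = round(rnorm(200, mean = 13, sd = 2)),
  AGE30  = rbinom(200, 1, 0.25)
)

# Mean of a variable within each ADVISE group.
educMeans <- tapply(fakeDrink$EDUC, fakeDrink$ADVISE, mean)

# The mean of a 0/1 dummy is the fraction of the group with the dummy equal to 1.
age30Fractions <- tapply(fakeDrink$AGE30, fakeDrink$ADVISE, mean)

# Two-sample t-test for a difference in group means (Welch by default).
tt <- t.test(EDUC ~ ADVISE, data = fakeDrink)
tt$statistic
tt$p.value
```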

b) What is a possible source of omitted variable bias in a regression of DRINKS on ADVISE?

You may think about the variables above or something else. Remember: omitted variable bias has

two ingredients.

c) Regress DRINKS on ADVISE and do a one-sided significance test on ADVISE. What is the sign of the coefficient on ADVISE? Why do you think this might be the case?

d) Now run the same regression but including income, education and all the age dummies as controls. Report the results (by hand). What happens to the coefficient on ADVISE?

e) Do an F-test to determine if age does not matter for drinking habits.
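A joint F-test of this kind compares a restricted regression (without the variables being tested) to an unrestricted one. The sketch below uses simulated data with hypothetical names (sim, x, d1, d2, y) and assumes homoskedastic errors. It computes the statistic by hand from the sums of squared residuals and checks it against anova().

```r
set.seed(5)
n <- 300
# Simulated data with hypothetical variable names; not drinkData.
sim <- data.frame(
  x  = rnorm(n),
  d1 = rbinom(n, 1, 0.3),
  d2 = rbinom(n, 1, 0.3)
)
sim$y <- 1 + 0.5 * sim$x + rnorm(n)  # d1 and d2 truly have no effect here

unrestricted <- lm(y ~ x + d1 + d2, data = sim)
restricted   <- lm(y ~ x, data = sim)  # imposes H0: coefficients on d1, d2 are 0

# Manual F-statistic from the two sums of squared residuals.
q    <- 2  # number of restrictions
ssrU <- sum(resid(unrestricted)^2)
ssrR <- sum(resid(restricted)^2)
fManual <- ((ssrR - ssrU) / q) / (ssrU / df.residual(unrestricted))

# anova() on nested models reports the same F-statistic and its p-value.
fTable   <- anova(restricted, unrestricted)
fBuiltIn <- fTable$F[2]
pValue   <- fTable$`Pr(>F)`[2]
```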

f) Create a variable called EVERDRINK that is a 0/1 variable for DRINKS being positive.

Regress this on all age variables. Do an F-test to determine if the choice of whether to drink at all

depends on age.
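A 0/1 indicator for a positive value can be built from a logical comparison. The vector below is made up for illustration and is not taken from drinkData.Rdata.

```r
drinksToy    <- c(0, 3, 0, 12, 1)           # made-up counts, not from drinkData
everDrinkToy <- as.numeric(drinksToy > 0)   # 1 if positive, 0 otherwise
everDrinkToy                                # 0 1 0 1 1
```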


Variable    Description
DRINKS      Total drinks over a two-week period
ADVISE      Dummy variable for whether the individual has been told to drink less by a doctor
EDITINC     Monthly income ($1000)
AGE30       30 < age ≤ 40
AGE40       40 < age ≤ 50
AGE50       50 < age ≤ 60
AGE60       60 < age ≤ 70
AGEGT70     70 < age
EDUC        Years of schooling
BLACK       Black
OTHER       Non-white, non-black
MARRIED     Married
WIDOW       Widowed
DIVSEP      Divorced or separated
EMPLOYED    Employed
UNEMPLOY    Unemployed
NORTHE      Northeast
MIDWEST     Midwest
SOUTH       South
MEDICARE    Insurance through Medicare
MEDICAID    Insurance through Medicaid
CHAMPUS     Military insurance
HLTHINS     Health insurance
REGMED      Regular source of care
DRI         See same doctor
MAIORLIM    Limits on major daily activities
SOMELIM     Limits on some daily activities
HVDIAB      Have diabetes
HHRTCOND    Have heart condition
HADSTROKE   Had stroke
