---
title: "Lab 5"
subtitle: "External Validity"
author: |
  | **Name:** Your name here
  | **Mac ID:** Your Mac ID here
date: "**Due:** Friday, February 17, 5 PM"
output:
  pdf_document:
    highlight: espresso
    fig_caption: yes
urlcolor: blue
header-includes:
- \usepackage{setspace}
- \doublespacing
- \usepackage{float}
- \floatplacement{figure}{t}
- \floatplacement{table}{t}
- \usepackage{flafter}
- \usepackage[T1]{fontenc}
- \usepackage[utf8]{inputenc}
- \usepackage{ragged2e}
- \usepackage{booktabs}
- \usepackage{amsmath}
fontsize: 12pt
---
```{r setup, include=FALSE}
# Global options for the knitting behavior of all subsequent code chunks
# Adding an option to compile PDF even if code has errors
knitr::opts_chunk$set(echo = TRUE, error = TRUE)
# Packages
library(tidyverse)
library(DeclareDesign)
# Add extra packages here if needed
```
# External Validity
In the readings for this week, [Coppock et al. (2018)](https://doi.org/10.1073/pnas.1808083115) note that how closely effect estimates from convenience samples correspond to those from representative samples depends on the distribution of individual treatment effects.
The following design simulates a model with heterogeneous treatment effects and compares the results of survey experiments conducted with a representative sample and with a convenience sample.
```{r}
# Parameters
N = 1000 # population
n = 100 # sample
effect = 0.5
# Model
model = declare_model(
  N = N,
  U = rnorm(N),
  X = runif(N), # observed covariate
  potential_outcomes(
    Y ~ Z * effect * X + U
  )
)
```
We also specify `X` as an observed covariate that moderates the treatment effect, something like digital literacy. It is generated by random draws from a uniform distribution between 0 and 1 (hence the `runif` function). If `X` is 1, the unit experiences the full effect. If it is 0, the effect disappears. Values in between scale the treatment effect proportionally. This is one way to simulate heterogeneous treatment effects.
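To make the scaling concrete, here is a minimal base R sketch of the unit-level treatment effects implied by the model above. The `_demo` names and the seed are illustrative assumptions and are not part of the lab design.

```{r}
# Illustration only: objects ending in _demo are not used by the design
set.seed(1)
X_demo = runif(1000)       # same distribution as X in the model
tau_demo = 0.5 * X_demo    # unit-level effect: effect * X, with effect = 0.5
# Effects run from about 0 (X near 0) up to about 0.5 (X near 1)
summary(tau_demo)
```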
The inquiry is standard fare:
```{r}
# Inquiry
inquiry = declare_inquiry(
ATE = mean(Y_Z_1 - Y_Z_0)
)
```
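As a sketch of what this inquiry equals under the model above, write $Y_i(1)$ and $Y_i(0)$ for `Y_Z_1` and `Y_Z_0`, and $\bar{X}$ for the population mean of `X` (this notation is introduced here for illustration only). The noise term `U` appears in both potential outcomes and cancels, so:

$$
\text{ATE} = \frac{1}{N}\sum_{i=1}^{N}\bigl[Y_i(1) - Y_i(0)\bigr] = \frac{1}{N}\sum_{i=1}^{N} \text{effect}\cdot X_i = \text{effect}\cdot\bar{X}
$$

With `effect = 0.5` and `X` uniform between 0 and 1, this is 0.25 in expectation, although the realized value varies slightly with each simulated population.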
Next we compare two data strategies: a survey experiment with a representative (random) sample and one with a convenience sample. At this point the research design branches into two paths, since each data strategy is paired with its own, analogous answer strategy. They are essentially two different designs, but we can recycle some components.
This is how it looks for the representative sample:
```{r}
# Data strategy: sampling, assignment, and outcome measurement
r_sampling = declare_sampling(S = complete_rs(N, n = n))
assignment = declare_assignment(Z = complete_ra(N))
measurement = declare_measurement(Y = reveal_outcomes(Y ~ Z))
# Answer strategy
estimator = declare_estimator(
  Y ~ Z,
  inquiry = "ATE"
)
```
Then put everything together:
```{r}
r_design = model + inquiry +
r_sampling + assignment +
measurement + estimator
```
To create our convenience sample, we need a custom sampling function under which units with higher `X` are more likely to be drawn.
```{r}
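convenience_sampling = function(data){
  # Draw n IDs without replacement; selection chances increase with X
  id = sample(data$ID, size = n, prob = data$X)
  # Flag sampled units (1 = sampled, 0 = not sampled)
  data$S = ifelse(data$ID %in% id, 1, 0)
  # Keep only the sampled rows
  data[data$S == 1, ]
}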
```
Then we pass this custom function to `declare_sampling`.
```{r}
c_sampling = declare_sampling(handler = convenience_sampling)
```
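To see what the `prob = data$X` argument does, the following base R sketch compares the average `X` in an equal-probability draw and in a draw weighted by `X`. It works on a stand-alone vector rather than on the design itself, and the `_demo` names and the seed are illustrative assumptions.

```{r}
# Illustration only: objects ending in _demo are not used by the design
set.seed(2)
pop_X_demo = runif(1000)
ids_demo = seq_along(pop_X_demo)
# Equal chances for every unit
random_ids_demo = sample(ids_demo, size = 100)
# Chances of selection rise with X
weighted_ids_demo = sample(ids_demo, size = 100, prob = pop_X_demo)
# Average X in the population and in each draw
c(
  population = mean(pop_X_demo),
  random = mean(pop_X_demo[random_ids_demo]),
  weighted = mean(pop_X_demo[weighted_ids_demo])
)
```

The weighted draw should show a noticeably higher average `X` than the population, which is worth keeping in mind when comparing the two designs.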
And now we can create a separate design for our convenience sample:
```{r}
c_design = model + inquiry +
c_sampling + assignment +
measurement + estimator
```
Then we can diagnose each design:
```{r}
# remember to replace with student number
set.seed(123)
r_diag = diagnose_design(r_design)
c_diag = diagnose_design(c_design)
```
And we can use the following code to collect the bias and RMSE of each design in one table.
```{r}
diagnosands = rbind(
  r_diag$diagnosands_df %>%
    select(design, bias, rmse),
  c_diag$diagnosands_df %>%
    select(design, bias, rmse)
)
diagnosands
```
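For reference, bias and RMSE both summarize how far estimates fall from the estimand across simulations. The sketch below computes them by hand using the usual definitions: bias is the average error, and RMSE is the square root of the average squared error. The four estimates and the estimand are made-up numbers for illustration, not output from the designs above.

```{r}
# Illustration only: hypothetical estimates from four simulation runs
estimates_demo = c(0.30, 0.20, 0.25, 0.35)
estimand_demo = 0.25
errors_demo = estimates_demo - estimand_demo
bias_demo = mean(errors_demo)            # average error: 0.025
rmse_demo = sqrt(mean(errors_demo^2))    # root mean squared error: about 0.061
c(bias = bias_demo, rmse = rmse_demo)
```

Errors of opposite sign cancel in the bias but not in the RMSE, so a design can have near-zero bias and still have a sizeable RMSE.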
## TASK 1
**Which design is better in terms of bias and RMSE? What explains this?**
## TASK 2
**What happens to the bias and RMSE of both designs as the sample size** `n` **decreases but the population** `N` **remains constant? What explains this?**
## TASK 3
**What happens to the bias and RMSE when the population and sample sizes are the same? What explains this?**
**Hint:** *It may be faster to calculate this by choosing a number in between the original population and sample sizes.*
# Answers
## TASK 1
## TASK 2
## TASK 3