04_MarginalEffects.Rmd · TourismDeforestationSubmission

---
title: "04_Marginal_effects"
output: html_document
date: "2023-11-30"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

The asssumption of homogeneous treatment effects/ common trends posits that, when treatment effects are homogeneous, the two-way fixed effects model assumes that the relationship between the residualized outcome variable (deforestation) and the residualized treatment variable (tourism) is linear (Jakiela 2021). Therefore, we tested for linearity of residualized deforestation and tourism for the inside-PA, outside-PA (buffer), and combined data. 

```{r Load data and packages}
#load packages
library(tidyverse)
library(lfe)
library(ggpubr)

#load data 
master_full <- read.csv("DataKeep/MasterSheet_full.csv")
master_full$Tourism <- master_full$Tourism/1000
master_full$precipitation <- master_full$precipitation*12
master_full$PercentDeforest <- master_full$PercentDeforest*100
master_full <- master_full %>% filter(Year >= 2001)

master_buffer <-read.csv("DataKeep/master_buffer_analysis_3km.csv")
master_buffer_combined <- read.csv("DataKeep/master_buffer_combined.csv")
master_buffer_combined$Tourism <- master_buffer_combined$Tourism/1000


```

```{r Visualize Residuals}
#Inside-PA
fem_full_formula_resid <- as.formula("PercentDeforest ~ PopDensity + precipitation + OilPriceAdj | Year + Name | 0 | Regions")
fem_full_formula_resid1 <- as.formula("Tourism ~ PopDensity + precipitation + OilPriceAdj | Year + Name | 0 | Regions")
tourism_resid <- residuals(felm(fem_full_formula_resid1, data = master_full[!is.na(master_full$PercentDeforest),]))
deforestation_resid <- residuals(felm(fem_full_formula_resid, data = master_full[!is.na(master_full$PercentDeforest),]))

residuals <-data.frame(tourism_resid, deforestation_resid, master_full[!is.na(master_full$PercentDeforest)&!is.na(master_full$precipitation),]$Year, master_full[!is.na(master_full$PercentDeforest)&!is.na(master_full$precipitation),]$Tourism,
master_full[!is.na(master_full$PercentDeforest)&!is.na(master_full$precipitation),]$PercentDeforest)
colnames(residuals)[3]<- "Year"
colnames(residuals)[4]<- "Tourism2"
colnames(residuals)[5]<- "Deforestation"

summary(lm(data= residuals, deforestation_resid ~ tourism_resid)) # not linear - this is just our main model, shows no significant linear relationship (before clustering standard errors) , does not suggest that whatever relationship exists is non-linear

residual_plot <- ggplot(data = residuals, aes(x = tourism_resid, y = deforestation_resid, color=Tourism2))+ geom_point()+theme_classic()+ylab("Residualized Deforestation")+xlab("Resizualized Tourism")+geom_smooth(color="magenta4")+labs(title="Inside-PA")+  scale_color_distiller(type = "seq", direction = -1,palette = "Greys", name = "Tourism")

#Outside-PA (Buffer)
fem_full_formula_resid3 <- as.formula("PercentDeforest ~ PopDensity + Precipitation + OilPriceAdj | Year + Name | 0 | Regions")
fem_full_formula_resid4 <- as.formula("Tourism ~ PopDensity + Precipitation + OilPriceAdj | Year + Name | 0 | Regions")


tourism_resid_buffer <- residuals(felm(fem_full_formula_resid4, data = master_buffer[!is.na(master_buffer$PercentDeforest),]))

deforestation_resid_buffer <- residuals(felm(fem_full_formula_resid3, data = master_buffer[!is.na(master_buffer$PercentDeforest),]))

residuals_buffer <-data.frame(tourism_resid_buffer, deforestation_resid_buffer, master_buffer$Year, master_buffer$Tourism,
master_buffer$PercentDeforest)
colnames(residuals_buffer)[3]<- "Year"
colnames(residuals_buffer)[4]<- "Tourism2"
colnames(residuals_buffer)[5]<- "Deforestation"

summary(lm(deforestation_resid_buffer ~ tourism_resid_buffer)) 

residual_plot_buffer <- ggplot(data = residuals_buffer, aes(x = tourism_resid_buffer, y = deforestation_resid_buffer, color=Tourism2))+ geom_point()+theme_classic()+ylab("Residualized Deforestation")+xlab("Resizualized Tourism")+geom_smooth(color="magenta4")+labs(title="Outside-PA (Buffer)")+ scale_color_distiller(type = "seq", direction = -1,palette = "Greys", name = "Tourism")


#Combined inside and outside 

fem_full_formula_resid5 <- as.formula("PercentDeforest ~ PopDensity + combined_precip + OilPriceAdj | Year + Name | 0 | Regions")
fem_full_formula_resid6 <- as.formula("Tourism ~ PopDensity + combined_precip + OilPriceAdj | Year + Name | 0 | Regions")

tourism_resid_buffer_combined <- residuals(felm(fem_full_formula_resid6, data =master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),]))

deforestation_resid_buffer_combined <- residuals(felm(fem_full_formula_resid5, data = master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),]))

residuals_buffer_combined <-data.frame(tourism_resid_buffer_combined, deforestation_resid_buffer_combined, master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),]$Year, master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),]$Tourism,
master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),]$PercentDeforest)
colnames(residuals_buffer_combined)[3]<- "Year"
colnames(residuals_buffer_combined)[4]<- "Tourism2"
colnames(residuals_buffer_combined)[5]<- "Deforestation"

summary(lm(deforestation_resid_buffer_combined ~ tourism_resid_buffer_combined)) # not linear 

residual_plot_buffer_combined <- ggplot(data = residuals_buffer_combined, aes(x = tourism_resid_buffer_combined, y = deforestation_resid_buffer_combined, color=Tourism2))+ geom_point()+theme_classic()+ylab("Residualized Deforestation")+xlab("Resizualized Tourism")+geom_smooth(color="magenta4")+labs(title="Combined")+  scale_color_distiller(type = "seq", direction = -1,palette = "Greys", name = "Tourism")


#Combine plots
residuals_plot <- ggarrange(residual_plot, residual_plot_buffer,residual_plot_buffer_combined, labels=c("a", "b", "c"), ncol=3, common.legend = T)

```

Based on this initial visualization, the data look heteroskedastic but not necesserily nonlinear. However, we need to look at residualized binned scatterplots to better visualize what is happening.

```{r Try with binscatter (produces same results)}
#Funcation
binscatter <- function(formula, key_var, data, bins=20, partial=FALSE){
  
require(lfe)
require(ggplot2)

if(partial==TRUE){
  y <- unlist(strsplit(formula, "~"))[1]
  x <- unlist(strsplit(formula, "~"))[2]
  controls <- gsub(paste("[[:punct:]]*",key_var,"[[:space:]]*[[:punct:]]*",sep=""),
                   "",x)
  
  reg_all <- felm(formula(formula),data=data)
  reg_y <- felm(formula(paste(y, "~", controls, sep="")), data=data)
  reg_x <- felm(formula(paste(key_var, "~", controls, sep="")), data=data)
  resid_all <- resid(reg_all)
  resid_y <- resid(reg_y)
  resid_x <- resid(reg_x)
  df <- data.frame(resid_y, resid_x)
  cluster_grp <- trimws(unlist(strsplit(formula, "\\|"))[4])
  if(is.na(cluster_grp)){
    reg <- felm(resid_y ~ resid_x)  
  } else{
    data$resid_y <- resid_y
    data$resid_x <- resid_x
    reg <- felm(formula(paste("resid_y ~ resid_x | 0 | 0 |",
                              cluster_grp, sep="")), data=data)
  }
  newdata <- df
  colnames(df) <- c(paste("residual",names(df)[1]), paste("residual",names(df)[2]))
  
} else if(partial==FALSE){
  reg <- felm(formula(formula),data=data)
  y <- trimws(unlist(strsplit(formula, "~"))[1])
  df <- data[,c(y,key_var)]
  newdata <- df  # To calculate CI of predicted mean
  #  colnames(df) <- c("resid_y", "resid_x")
}
intercept <- coef(reg)[1]
slope <- coef(reg)[2]

### Use cluster vcov from the correct model, if not available, use robust
if(is.null(reg$clustervcv)){
  vcov <- reg$robustvcv
  se_type <- "robust"
} else {
  vcov <- reg$clustervcv
  se_type <- "cluster"
}

Terms <- terms(reg)
m.mat <- model.matrix(Terms,data=newdata)
fit <- as.vector(m.mat %*% coef(reg))
se.fit <- sqrt(rowSums((m.mat %*% vcov) * m.mat)) 
## Much faster alternative to sqrt(diag(m.mat%*%vcov%*%t(m.mat))) and works fine since
## we only want the diagonal
df$upper_ci <- fit + 1.96*se.fit
df$lower_ci <- fit - 1.96*se.fit

min_x <- min(df[,2])
max_x <- max(df[,2])
min_y <- intercept + min_x*slope
max_y <- intercept + max_x*slope

df_bin <- aggregate(df,by=list(cut(as.matrix(df[,2]),bins)), mean)

ggplot(data=df, aes(x=df[,2], y=df[,1])) +
 # geom_point(alpha=0.2) + 
  geom_segment(aes(x = min_x, y = min_y, xend = max_x, yend = max_y),
               color="black", size=1) +
  geom_ribbon(aes(ymin=lower_ci, ymax=upper_ci),alpha=0.18) +
  geom_point(data=df_bin, aes(x=df_bin[,3], y=df_bin[,2]), color="magenta4", size=2) +
  labs(caption = paste(" slope = ", signif(slope,2), sep=""),
       x = names(df)[2], y = names(df)[1]) +
  theme_classic()
}

#Inside-PA
inside_binscatter <- binscatter(formula = "PercentDeforest ~Tourism+ PopDensity + precipitation + OilPriceAdj | Year + Name | 0 | Regions", data = master_full[!is.na(master_full$PercentDeforest),], key_var = "Tourism", bins=20, partial=TRUE)+ ylab("Residualized Deforestation") + xlab("Residualized Tourism")

#Outside-PA
outside_binscatter <- binscatter(formula = "PercentDeforest ~Tourism+ PopDensity + Precipitation + OilPriceAdj | Year + Name | 0 | Regions", data = master_buffer[!is.na(master_buffer$PercentDeforest),], key_var = "Tourism", bins=20, partial=TRUE)+ ylab("Residualized Deforestation") + xlab("Residualized Tourism")

#Combined
combined_binscatter <- binscatter(formula = "PercentDeforest ~Tourism+ PopDensity + combined_precip + OilPriceAdj | Year + Name | 0 | Regions", data = master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),], key_var = "Tourism", bins=20, partial=TRUE)+ ylab("Residualized Deforestation") + xlab("Residualized Tourism")

binscatter_plot <- ggarrange(inside_binscatter, outside_binscatter, combined_binscatter, labels = c("d", "e", "f"), ncol=3)

```

Based on the binned scatterplots, is seems like the relationship between residualized deforestation and residualized tourism is relatively linear for inside the PAs. However, it does look like there may be a quadratic term for both the outside-PA and combined data. Because of this potential nonlinearity, we conducted the models with an added quadratic term for tourism and then examine the marginal effects. 

```{r Marginal effects}
#Inside-PA
master_full$TourismSq <- master_full$Tourism^2
fem_full_formula_new <- as.formula("PercentDeforest ~ Tourism +TourismSq +PopDensity + precipitation + OilPriceAdj | Year + Name | 0 | Regions")
dd_reg_full_new <- felm(fem_full_formula_new, data = master_full)
summary_dd_reg_full_new <- summary(dd_reg_full_new)
inside_quadratic<- as.data.frame(summary_dd_reg_full_new$coefficients)
inside_quadratic$Location <- "Inside PA"

#Outside-PA
master_buffer_analysis_3km$TourismSq <- master_buffer_analysis_3km$Tourism^2

fem_full_formula_3km_new <- as.formula("PercentDeforest ~ Tourism + TourismSq+ PopDensity + Precipitation + OilPriceAdj | Year + Name | 0 | Regions")
dd_reg_full_3km_new <- felm(fem_full_formula_3km_new, data = master_buffer_analysis_3km)
summary_dd_reg_full_3km_new <- summary(dd_reg_full_3km_new)
outside_quadratic <- as.data.frame(summary_dd_reg_full_3km_new$coefficients)
outside_quadratic$Location <- "Buffer 3 km"

#Combined
master_buffer_combined$TourismSq <- master_buffer_combined$Tourism^2

fem_full_formula_combined_new <- as.formula("PercentDeforest ~ Tourism + TourismSq+ PopDensity + combined_precip + OilPriceAdj | Year + Name | 0 | Regions")
dd_reg_full_combined_new <- felm(fem_full_formula_combined_new, data = master_buffer_combined)
summary_dd_reg_full_combined_new <- summary(dd_reg_full_combined_new)
combined_quadratic <- as.data.frame(summary_dd_reg_full_combined_new$coefficients)
combined_quadratic$Location <- "Inside PA + Buffer 3 km"

#MARGINAL EFFECTS INSIDE PA
x <- master_full$Tourism
x2 <- master_full$TourismSq
beta.hat <- coef(dd_reg_full_new)
vcov1 <- vcov(dd_reg_full_new)
z0 <- seq(min(x), max(x), length.out = 1000)
dy.dx <- beta.hat[1] + 2*beta.hat[2]*z0
se.dy.dx <- sqrt(vcov1[1, 1] + 4*(z0^2)*vcov1[2, 2] + 4*z0*vcov1[1, 2])
upr <- dy.dx + 1.96*se.dy.dx
lwr <- dy.dx - 1.96*se.dy.dx
upr2 <- dy.dx + 1.645*se.dy.dx
lwr2 <- dy.dx - 1.645*se.dy.dx
inside_plot <- ggplot(data=NULL, aes(x=z0, y=dy.dx)) +
  labs(x="Tourism (1,000s of tourists)", y="Marginal Effects",
       cex=4) +
  geom_line(aes(z0, dy.dx),size = 1) +
  geom_line(aes(z0, lwr), size = 1, linetype = 1, color="magenta4") +
  geom_line(aes(z0, upr), size = 1, linetype = 1, color="magenta4") +
   geom_line(aes(z0, lwr2), size = 1, linetype = 2, color="magenta4") +
  geom_line(aes(z0, upr2), size = 1, linetype = 2, color="magenta4") +
  geom_hline(yintercept=0, size = 1, linetype=3) +
  geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.3)+
  theme_classic()
# #Marginal Effects of tourism on deforestation inside the PA vary with tourism, but effects are only marginally statistically significant.
#When you have more tourists, the effect of tourism on deforestation become more negative. 

#MARGINAL EFFECTS OUTSIDE-PA (BUFFER)
x_buffer <- master_buffer_analysis_3km$Tourism
x2_buffer <- master_buffer_analysis_3km$TourismSq
beta.hat_buffer <- coef(dd_reg_full_3km_new)
vcov1_buffer <- vcov(dd_reg_full_3km_new)
z0_buffer <- seq(min(x_buffer), max(x_buffer), length.out = 1000)
dy.dx_buffer <- beta.hat_buffer[1] + 2*beta.hat_buffer[2]*z0_buffer
se.dy.dx_buffer <- sqrt(vcov1_buffer[1, 1] + 4*(z0_buffer^2)*vcov1_buffer[2, 2] + 4*z0_buffer*vcov1_buffer[1, 2])
upr_buffer <- dy.dx_buffer + 1.96*se.dy.dx_buffer
lwr_buffer <- dy.dx_buffer - 1.96*se.dy.dx_buffer
upr2_buffer <- dy.dx_buffer + 1.645*se.dy.dx_buffer
lwr2_buffer <- dy.dx_buffer - 1.645*se.dy.dx_buffer
outside_plot <- ggplot(data=NULL, aes(x=z0_buffer, y=dy.dx_buffer)) +
  labs(x="Tourism (1,000s of tourists)", y="Marginal Effects",
       cex=4) +
  geom_line(aes(z0_buffer, dy.dx_buffer),size = 1) +
  geom_line(aes(z0_buffer, lwr_buffer), size = 1, linetype = 1, color="magenta4") +
  geom_line(aes(z0_buffer, upr_buffer), size = 1, linetype = 1, color="magenta4") +
   geom_line(aes(z0_buffer, lwr2_buffer), size = 1, linetype = 2, color="magenta4") +
  geom_line(aes(z0_buffer, upr2_buffer), size = 1, linetype = 2, color="magenta4") +
  geom_hline(yintercept=0, size = 1, linetype=3) +
  geom_ribbon(aes(ymin=lwr_buffer, ymax=upr_buffer), alpha=0.3)+
  theme_classic()
#Marginal Effects of tourism on deforestation vary with tourism
#As you increase the level of tourism, the effect of tourism on deforestation increases
# We can look at sensitivity to a quadratic term
# There might be a quadratic term
# If we include the quadratic term, the effect is positive
# The effect of increasing tourism in the buffer area gets worse as there is increasing tourism
# This shows the effect of tourism that varies over the level of tourism
# Effect for a park with the average level of tourism
# WHen there are very few tourists, there is a marginally negative effect of tourism on deforestation

#MARGINAL EFFECTS OUTSIDE-PA (BUFFER)
x_buffer_combined <- master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),]$Tourism
x2_buffer_combined <- master_buffer_combined[!is.na(master_buffer_combined$PercentDeforest),]$TourismSq
beta.hat_buffer_combined <- coef(dd_reg_full_combined_new)
vcov1_buffer_combined <- vcov(dd_reg_full_combined_new)
z0_buffer_combined <- seq(min(x_buffer_combined), max(x_buffer_combined), length.out = 1000)
dy.dx_buffer_combined <- beta.hat_buffer_combined[1] + 2*beta.hat_buffer_combined[2]*z0_buffer_combined
se.dy.dx_buffer_combined <- sqrt(vcov1_buffer_combined[1, 1] + 4*(z0_buffer_combined^2)*vcov1_buffer_combined[2, 2] + 4*z0_buffer_combined*vcov1_buffer_combined[1, 2])
upr_buffer_combined <- dy.dx_buffer_combined + 1.96*se.dy.dx_buffer_combined
lwr_buffer_combined <- dy.dx_buffer_combined - 1.96*se.dy.dx_buffer_combined
upr2_buffer_combined <- dy.dx_buffer_combined + 1.645*se.dy.dx_buffer_combined
lwr2_buffer_combined <- dy.dx_buffer_combined - 1.645*se.dy.dx_buffer_combined
outside_plot_combined <- ggplot(data=NULL, aes(x=z0_buffer_combined, y=dy.dx_buffer_combined)) +
  labs(x="Tourism (1,000s of tourists)", y="Marginal Effects",
       cex=4) +
  geom_line(aes(z0_buffer_combined, dy.dx_buffer_combined),size = 1) +
  geom_line(aes(z0_buffer_combined, lwr_buffer_combined), size = 1, linetype = 1, color="magenta4") +
  geom_line(aes(z0_buffer_combined, upr_buffer_combined), size = 1, linetype = 1, color="magenta4") +
   geom_line(aes(z0_buffer_combined, lwr2_buffer_combined), size = 1, linetype = 2, color="magenta4") +
  geom_line(aes(z0_buffer_combined, upr2_buffer_combined), size = 1, linetype = 2, color="magenta4") +
  geom_hline(yintercept=0, size = 1, linetype=3) +
  geom_ribbon(aes(ymin=lwr_buffer_combined, ymax=upr_buffer_combined), alpha=0.3)+
  theme_classic()

marginal_plots <- ggarrange(inside_plot, outside_plot,outside_plot_combined, labels = c("g","h", "i"), ncol=3) # solid purple lines show 95% confidence intervals and dased lines show 90% confidence intervals (i.e., marginal significance)

summary(lm(z0_buffer_combined~ dy.dx_buffer_combined))

plots_total <-  ggarrange(residuals_plot, binscatter_plot, marginal_plots, ncol = 1)

```
When we include the quadratic tourism term in the model, neither tourism nor tourism^2 is statistically significant. The marginal effects of tourism on deforestation inside the PA vary with tourism, but effects are only marginally statistically significant when tourism levels are at their highest extreme. These results show that, when you have more tourists, the effect of tourism on deforestation may become more negative, although the marginal effects are weak. 

Outside the PA, the marginal effect of tourism on deforestation is positively related to tourism. Tourism is marginally statistically significant and tourism^2 is statistically significant. The marginal effects of tourism on deforestation vary according to the level of tourism; as the level of tourism increases, the effect of tourism on deforestation increases. However, this is only statistically significant when tourism levels are high. When there are very few tourists, there is a marginally negative effect of tourism on deforestation outside the PAs.

In combined inside- and outside-PA, the effect of tourism is not statistically significant but the effect of tourism^2 is statistically significant. While the the marginal effects of tourism on deforestation are positively related to tourism, the effect is only marginally significant at the extreme levels of tourism. 

```{r Dataframe Compilation}
regression_table_quadratic <- rbind(inside_quadratic, outside_quadratic, combined_quadratic)
regression_table_quadratic$Variable <- rownames(regression_table_quadratic)
regression_table_quadratic$Estimate <- round(regression_table_quadratic$Estimate, 3)
regression_table_quadratic$`Cluster s.e.`<- round(regression_table_quadratic$`Cluster s.e.`, 3)
regression_table_quadratic$`t value` <- round(regression_table_quadratic$`t value`, 3)
regression_table_quadratic$`Pr(>|t|)` <- round(regression_table_quadratic$`Pr(>|t|)`, 3)

write.csv(regression_table_quadratic, "Outputs/regression_table_quadratic.csv")


```