523 C
  • Home

Lab 1 - COVID Trends

Authors

Completed by: Colleen Hoskins

Created by: Mike Johnson, PhD

In this lab you will practice data wrangling and visualization skills using COVID-19 data curated by the New York Times. This data is a large dataset measuring the cases and deaths per US county across the lifespan of COVID from its early beginnings to just past the peak. The data stored in daily cummulative counts, is a great example of data that needs to be wrangled and cleaned before any analysis can be done.

libraries:

library(tidyverse)
library(flextable)
library(zoo)
library(RColorBrewer)
library(tigris)
library(patchwork)
library(maps)
library(sf)
library(lubridate)
library(broom)
library(readr)
library(dplyr)
library(httr)

Question 1. Daily Summary

  1. read in the data from the NY-Times URL
data <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
  1. create an object called my.date and set it as “2022-02-01”
  2. create an object called my.state and set it to “Colorado”
my.date <- as.Date("2022-02-01")
my.state <- "Colorado"
  1. make a subset that limits the data to Colorado and add a new column with daily new cases. do the same for new deaths
co_data <- data %>%
  filter(state == my.state) %>%
  group_by(county) %>%
  mutate(new_cases = cases - lag(cases, n = 1),
         new_deaths = deaths - lag(deaths)) %>%
  drop_na() %>%
  ungroup()
  1. generate 2 tables. the first should show the 5 counties with the most cummulative cases on your date of interest, and the second should show the 5 counties with the most new cases on that same date
today_data <- filter(co_data, date == my.date)

slice_max(today_data, n = 5, order_by = cases) %>%
  select(county, state, cases) %>%
  flextable() %>%
  set_header_labels(
    cases = "cases") %>%
  set_caption("Top 5 Counties by Cummulative COVID-19 Cases")

county

state

cases

El Paso

Colorado

170,673

Denver

Colorado

159,022

Arapahoe

Colorado

144,255

Adams

Colorado

126,768

Jefferson

Colorado

113,240

slice_max(today_data, n = 5, order_by = new_cases) %>%
  select(county, state, new_cases) %>%
  flextable() %>%
  set_header_labels(
    new_cases = "cases") %>%
  set_caption("Top 5 Counties by New COVID-19 Cases")

county

state

cases

El Paso

Colorado

630

Arapahoe

Colorado

401

Denver

Colorado

389

Adams

Colorado

326

Jefferson

Colorado

291

Question 2. Evaluating Census Data (EDA)

  1. read in the population data and create a five digit FIP variable. keep only columns that contain “NAME” or “2021” and remove all state level rows
# Load population data from the given URL
pop_url <- read_csv("co-est2023-alldata.csv")
# Read and process the population data
cd <- pop_url %>%
  filter(COUNTY != "000") %>%               # Filter out rows with COUNTY = "000"
  mutate(fips = paste0(STATE, COUNTY)) %>%   # Create a new FIPS code column
  select(STNAME, COUNTY, fips, contains("2021"))  # Select relevant columns
b. Data Exploration: Attributes are state names and numbers. I am able to see that there are columns for state, county, fips, population, births, and deaths. Fips matches one of the columns in the Covid data. The dimensions have been modified to include counties that do not have the identification number “000”, and includes columns that have information for the year 2021.

Question 3: Per Capita Summary

  1. join the population data to the Colorado COVID data and compute the per capita cumulative cases, per capita new cases, and per capita new deaths. generate 2 new tables. the first should show the 5 counties with the most cumulative cases per capita on your date, and the second should show the 5 counties with the most new cases per capita on the same date
co_join <- inner_join(co_data, cd, by = "fips") 
per_capita <- co_join %>%
  group_by(county) %>%
  mutate(cases_per_capita = (cases/POPESTIMATE2021) * 100000,
         new_cases_per_capita = (new_cases/POPESTIMATE2021) * 100000,
         new_deaths_per_capita = (deaths/POPESTIMATE2021) * 100000) %>%
  drop_na() %>%
  ungroup() 
  
capita_my_date <- per_capita %>%
  filter(date == my.date)

slice_max(capita_my_date, n = 5, order_by = cases_per_capita) %>%
  select(county, cases_per_capita) %>%
  flextable() %>%
  set_header_labels(
    cases_per_capita = "cases per capita") %>%
  set_caption("Top 5 Colorado Counties")

county

cases per capita

Crowley

51,176.98

Bent

41,187.49

Pitkin

34,296.59

Lincoln

34,240.82

Logan

30,477.01

slice_max(capita_my_date, n = 5, order_by = new_cases_per_capita) %>%
  select(county, new_cases_per_capita) %>%
  flextable() %>%
  set_header_labels(
    new_cases_per_capita = "new cases per capita") %>%
  set_caption("Top 5 Colorado Counties by New Cases")

county

new cases per capita

Crowley

976.4603

Bent

412.0622

Sedgwick

386.9304

Washington

287.5924

Las Animas

265.1039

Question 4: Rolling Thresholds

  1. filter the merged COVID/Population data for Colorado to include the last 14 days. determine the total number of new cases in the last 14 days per 100,000 people. print a table of the top 5 counties and report the number of counties that meet the watch list condition
threshold <- per_capita %>%
  filter(date >= (my.date - 14) & date <= my.date) %>%  
  group_by(county) %>%
  summarize(
    total_new_cases = sum(new_cases),  
    population = sum(POPESTIMATE2021)) %>%
  mutate(new_cases_threshold = (total_new_cases / population) * 100000) %>%  
  drop_na() %>%
  ungroup()

slice_max(threshold, n = 5, order_by = new_cases_threshold) %>%
  select(county, new_cases_threshold) %>%
  flextable() %>%
  set_header_labels(new_cases_threshold = "new cases") %>%
  set_caption("Top 5 Colorado Watch List Counties")

county

new cases

Crowley

365.0102

Lincoln

250.9288

Alamosa

250.5177

Mineral

222.4614

Conejos

222.4567

threshold %>%
  filter(new_cases_threshold > 100) %>%
  nrow()
[1] 53
53 Colorado counties meet the watch list condition

Question 5: Death toll

  1. determine the percentage of deaths in each county that were attributed to COVID last year(2021). plot a visualization of all counties where COVID deaths account for 20% or more of the annual death toll
death_toll_2021 <- co_join %>%
  mutate(year = year(date)) %>%
  filter(year == 2021) %>%
  group_by(county) %>%
  summarize(covid_death = sum(new_deaths),
            total_deaths = first(DEATHS2021)) %>% 
  mutate(death_toll = covid_death / total_deaths * 100)

death_20 <- death_toll_2021 %>%
  filter(death_toll > 20)

death_20 %>%
  ggplot(aes(x = death_toll, y = reorder(county, death_toll))) + 
  geom_point(size = 3, color = "#1B9E77") +  # Use the first color from Set2
  labs(x = "death toll (%)", 
       y = "county",
       caption = "Colorado counties where COVID-19 accounted for more than 20% of total deaths in 2021") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.caption = element_text(hjust = 0))

Question 6: Multi-state

  1. group/summarize county level data to the state level, filter it to the four states of interest, and calculate the number of daily new cases and the 7-day rolling mean
state_data <- data %>%
  filter(state %in% c("Alabama", "Ohio", "Colorado", "New York")) %>%
  group_by(state, date) %>%
  summarize(fips = first(fips),
            cases = sum(cases, na.rm = TRUE), 
            deaths = sum(deaths, na.rm = TRUE), 
            .groups = "drop") %>%  
  group_by(state) %>%  
  arrange(state, date) %>%  
  mutate(new_cases = pmax(0, cases - lag(cases, n = 1)),  
         new_deaths = pmax(0, deaths - lag(deaths, n = 1)),
         new_cases_mean = rollmean(new_cases, 7, fill = NA, align = "right"),
         new_deaths_mean = rollmean(new_deaths, 7, fill = NA, align = "right")) %>%
  drop_na() %>%  
  ungroup()  
  1. facet plot the daily new cases and the 7-day rolling mean
ggplot(state_data, aes(x = date, y = new_cases)) +  
  geom_col(aes(fill = state)) +  
  facet_wrap(~state, scales = "free_y") +  
  labs(title = "Daily New COVID-19 Cases",
       x = "date", 
       y = "new cases") +
  scale_fill_brewer(palette = "Set2") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(state_data, aes(x = date, y = new_cases_mean)) +  
  geom_col(aes(fill = state)) +  
  facet_wrap(~state, scales = "free_y") +  
  labs(title = "COVID-19 Cases 7-Day Rolling Mean",
       x = "date", 
       y = "new cases") +
  scale_fill_brewer(palette = "Set2") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  1. explore the cases per capita of each state and calculate the 7-day rolling mean of the new cases per capita
states_join <- inner_join(state_data, cd, by = "fips") %>%
  mutate(new_cases_capita = new_cases / POPESTIMATE2021,  
         new_cases_mean = rollmean(new_cases_capita, 7, fill = NA, align = "right")) %>%
  mutate(state = factor(state, levels = c("Alabama", "Ohio", "Colorado", "New York"))) 
  1. plot the 7-day rolling averages overlying each other
ggplot(states_join, aes(x = date, y = new_cases_mean, fill = state, group = state)) +  
  geom_col() +  
  labs(title = "7-Day Rolling Average of New COVID-19 Cases Per Capita",
       x = "date", 
       y = "rolling average") +
  scale_fill_brewer(palette = "Set2") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

  1. Briefly describe the influence scaling by population had on the analysis? Does it make some states look better? Some worse? How so?
Scaling by population shows that per capita rolling averages are higher in Alabama and Ohio. This differs from the state wide daily cases and 7-day rolling averages, which showed New York as the state with the highest COVID-19 cases.

Question 7. Time and Space

  1. calculate the Weighted Mean Center of the COVID-19 outbreak
county_centroids <- readr::read_csv('https://raw.githubusercontent.com/mikejohnson51/csu-ess-330/refs/heads/main/resources/county-centroids.csv')
county_join <- inner_join(county_centroids, data, by = "fips") %>%
  group_by(date) %>%  
  summarise(
    weighted_x_cases = sum(LON * cases, na.rm = TRUE) / sum(cases, na.rm = TRUE),
    weighted_y_cases = sum(LAT * cases, na.rm = TRUE) / sum(cases, na.rm = TRUE),
    weighted_x_deaths = sum(LON * deaths, na.rm = TRUE) / sum(deaths, na.rm = TRUE),
    weighted_y_deaths = sum(LAT * deaths, na.rm = TRUE) / sum(deaths, na.rm = TRUE),
    cases = sum(cases, na.rm = TRUE),
    deaths = sum(deaths, na.rm = TRUE)
  ) %>%
  drop_na() 
  1. make two plots next to each other showing cases in navy and deaths in red. describe the differences in the plots and what they mean about the spatial patterns seen with COVID impacts
a <- ggplot() +
  borders("state", fill = "gray90", colour = "white") +
  geom_point(data = county_join, aes(x = weighted_x_cases, y = weighted_y_cases, size = cases), color = "#377EB8") +
  labs(title = "COVID-19 Cases") +
  theme_minimal() +  
  theme(
    axis.title = element_blank(), 
    axis.text = element_blank(),   
    axis.ticks = element_blank(),  
    panel.grid = element_blank()   
  )

b <- ggplot() +
  borders("state", fill = "gray90", colour = "white") +
  geom_point(data = county_join, aes(x = weighted_x_deaths, y = weighted_y_deaths, size=deaths), color = "#E41A1C") +
  labs(title = "COVID-19 Deaths") +
  theme_minimal() +
  theme(
    axis.title = element_blank(),  
    axis.text = element_blank(),   
    axis.ticks = element_blank(),  
    panel.grid = element_blank()   
  )

a + b

The weighted mean of both COVID cases and Deaths are highest around Arkansaw, Missouri, Tenessee, Kentucky, and Illinois. Weighted Cases appear to be highest in middle North America, whereas death spread towards the Pacific Northwest. There are higher numbers of cases than deaths.

Question 8: Trends

Data Visualization

  1. compute county level daily new cases and deaths, and then join it to the census data
  2. add a new column to the data for year, month, and season
  3. group the data by state, year, and season and summarize the total population, new cases, and new deaths per grouping
  4. apply a log transformation to cases, deaths, and population
trends <- data %>%
  group_by(fips) %>%
  mutate(new_cases = pmax(0, cases - lag(cases, n = 1)),
         new_deaths = pmax(0, deaths - lag(deaths))) %>%
  ungroup() %>%
  left_join(cd, by = "fips") %>%
  mutate(year = lubridate::year(date),
         month = lubridate::month(date),
         season = dplyr::case_when(
           month %in% 3:5 ~ "Spring",    
           month %in% 6:8 ~ "Summer",    
           month %in% 9:11 ~ "Fall",     
           month %in% c(12, 1, 2) ~ "Winter")) %>%
  group_by(state, year, season) %>%
  summarize(
    population = sum(POPESTIMATE2021),   
    new_cases = sum(new_cases, na.rm = TRUE),
    new_deaths = sum(new_deaths, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(log_cases = log(new_cases + 1),          
         log_deaths = log(new_deaths + 1),       
         log_population = log(population))

Model Building

  1. build a linear model to predict the log of cases using the log of deaths, the log of population, and the season.
  2. Once the model is built, summarize it (summary) and report the R-squared value and the p-value of the model. What does this mean for the value of its application?
lm_model <- lm(log_cases ~ log_deaths*log_population + season, data = trends)

summary(lm_model)

Call:
lm(formula = log_cases ~ log_deaths * log_population + season, 
    data = trends)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.31412 -0.31982 -0.02291  0.34272  1.37613 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -2.69871    3.21708  -0.839  0.40274    
log_deaths                 0.77857    0.39852   1.954  0.05242 .  
log_population             0.53675    0.18023   2.978  0.00333 ** 
seasonSpring              -0.79528    0.12221  -6.507 8.55e-10 ***
seasonSummer              -0.31156    0.13100  -2.378  0.01852 *  
seasonWinter               0.37357    0.11530   3.240  0.00144 ** 
log_deaths:log_population -0.01318    0.02103  -0.627  0.53161    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4979 on 167 degrees of freedom
  (380 observations deleted due to missingness)
Multiple R-squared:  0.8764,    Adjusted R-squared:  0.8719 
F-statistic: 197.3 on 6 and 167 DF,  p-value: < 2.2e-16
Multiple R-squared: 0.8764, Adjusted R-squared: 0.8719, p-value: < 2.2e-16
The R-squared value is high, indicating a strong model fit. The p-value is low, indicating that the model is statistically significant.

Question 9: Evaluation

  1. generate a data frame of predictions and residuals
  2. create a scatter plot of predicted cases vs. actual cases. add a line of best fit to the plot, and make the plot as appealing as possible. Describe the relationship that you see…are you happy with the model?
Yes, I am happy with the model. The scatter plot shows a strong linear relationship between predicted and actual log cases, with the points following the line of best fit.
  1. create a histogram of the residuals to visually check for residual normality. how does the distribution look? was a linear model appropriate for this case?
The histogram looks to be normally distributed. A linear model was appropriate for this case.
model_eval <- broom::augment(lm_model, trends = trends)

ggplot(model_eval, aes(x = log_cases, y = .fitted)) +
  geom_point() +
  geom_smooth(method = "lm", color = "#1B9E77") +
  labs(title = "Predicted vs Actual Cases",
       x = "actual",     
       y = "predicted") + 
  theme_minimal()

ggplot(model_eval, aes(x = .resid)) +
  geom_histogram(fill = "#1B9E77", color = "black", bins = 30) +  
  labs(title = "Histogram of Residuals",
       x = "residuals") +  
  theme_minimal()