
Lab 5 - Machine Learning in Hydrology
In this lab, we will explore predictive modeling in hydrology using the tidymodels framework and the CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) dataset.
libraries:
library(tidyverse)
library(tidymodels)
library(powerjoin)
library(glue)
library(vip)
library(baguette)
library(ggthemes)
library(ranger)
library(xgboost)
library(ggplot2)What is tidymodels?
tidymodels is an R framework designed for machine learning and statistical modeling. Built on the principles of the tidyverse, tidymodels provides a consistent and modular approach to tasks like data preprocessing, model training, evaluation, and validation. By leveraging the strengths of packages such as recipes, parsnip, and yardstick, tidymodels streamlines the modeling workflow, making it easier to experiment with different models while maintaining reproducibility and interpretability.
What is the CAMELS dataset?
The CAMELS dataset is a widely used resource in hydrology and environmental science, providing data on over 500 self-draining river basins across the United States. It includes meteorological forcings, streamflow observations, and catchment attributes such as land cover, topography, and soil properties. This dataset is particularly valuable for large-sample hydrology studies, enabling researchers to develop and test models across diverse climatic and physiographic conditions.
In this lab, we will focus on predicting mean streamflow for these basins using their associated characteristics. CAMELS has been instrumental in various hydrologic and machine learning applications, including:
Calibrating Hydrologic Models – Used for parameter tuning in models like SAC-SMA, VIC, and HBV, improving regional and large-sample studies.
Training Machine Learning Models – Supports deep learning (e.g., LSTMs) and regression-based streamflow predictions, often outperforming traditional methods.
Understanding Model Behavior – Assists in assessing model generalization, uncertainty analysis, and the role of catchment attributes.
Benchmarking & Regionalization – Facilitates large-scale model comparisons and parameter transfer to ungauged basins.
Hybrid Modeling – Enhances physics-based models with machine learning for bias correction and improved hydrologic simulations.
A notable study by Kratzert et al. (2019) demonstrated that LSTMs can outperform conceptual models in streamflow prediction. As part of this lab, we will explore how to programmatically download and load the data into R.
What’s in the data?
Each record in the CAMELS dataset represents a unique river basin, identified by an outlet USGS NWIS gauge_id. The dataset contains a mix of continuous and categorical variables, including meteorological, catchment, and streamflow summaries.
The data we are going to downloaded are the basin level summaries. For example, if we looked at row 1 of the data (Gage: 01013500) all of the values are the areal average for the drainage basin seen below, while the flow metrics are associated with the outlet gage (in red):
# Use the `findNLDI` function to get the basin and flowlines for the first gauge
basin <- dataRetrieval::findNLDI(nwis = "01013500",
# Navigate the "upper tributaries" of the basin
nav = "UT",
# Return the basin and flowlines
find = c("basin", "flowlines"))
# Plot the basin, flowlines, and gauge ...
ggplot() +
geom_sf(data = basin$basin, fill = "lightblue") +
geom_sf(data = basin$UT_flowlines, color = "blue") +
geom_sf(data = basin$origin, color = "red") +
theme_minimal()
Lab Goals
In this lab, you will:
- Learn how to programatically download and access data.
- Practice using tidymodels for predictive modeling.
- Train and evaluate models to predict mean streamflow across the country.
- Interpret and compare model performance using workflows.
By the end of this lab, you will have hands-on experience applying machine learning techniques to real-world data, helping to bridge the gap between statistical modeling and environmental science.
Data Download
The CAMELS dataset is hosted by NCAR and can be accessed here under the “Individual Files” section. The root URL for all data seen on the “Individual Files” page is:
root <- 'https://gdex.ucar.edu/dataset/camels/file'Near the bottom of that page, there are many .txt files that contain the data we want. Some hold climate data for each basin, some hold geology data, some hold soil data, etc. There is also a PDF with descriptions of the columns in each file. We are going to download all of the .txt files and the PDF.
Getting the documentation PDF
We can download the documentation PDF which provides a descriptions for the various columns as many are not self-explanatory. Here we can use download.file to download the PDF to our data directory.
download.file('https://gdex.ucar.edu/dataset/camels/file/camels_attributes_v2.0.pdf',
'data/camels_attributes_v2.0.pdf')Getting Basin characteristics
Now we want to download the .txt files that store the actual data documented in the PDF. Doing this file by file (like we did with the PDF) is possible, but lets look at a better/easier way…
- Lets create a vector storing the data types/file names we want to download:
types <- c("clim", "geol", "soil", "topo", "vege", "hydro")- Using glue, we can construct the needed URLs and file names for the data we want to download:
# Where the files live online ...
remote_files <- glue('{root}/camels_{types}.txt')
# where we want to download the data ...
local_files <- glue('data/camels_{types}.txt')- Now we can download the data: walk2 comes from the purrr package and is used to apply a function to multiple arguments in parallel (much like map2 works over paired lists). Here, we are asking walk2 to pass the first element of remote_files and the first element of local_files to the download.file function to download the data, and setting quiet = TRUE to suppress output. The process is then iterated for the second element of each vector, and so on.
walk2(remote_files, local_files, download.file, quiet = TRUE)- Once downloaded, the data can be read it into R using readr::read_delim(), again instead of applying this to each file individually, we can use map to apply the function to each element of the local_files list.
# Read and merge data
camels <- map(local_files, read_delim, show_col_types = FALSE) - This gives us a list of data.frames, one for each file that we want to merge into a single table. So far in class we have focused on *_join functions to merge data based on a primary and foreign key relationship.
In this current list, we have >2 tables, but, have a shared column called gauge_id that we can use to merge the data. However, since we have more then a left and right hand table, we need a more robust tool. We will use the powerjoin package to merge the data into a single data frame. powerjoin is a flexible package for joining lists of data.frames. It provides a wide range of join types, including inner, left, right, full, semi, anti, and cross joins making it a versatile tool for data manipulation and analysis, and one that should feel familiar to users of dplyr.
In this case, we are join to merge every data.frame in the list (n = 6) by the shared gauge_id column. Since we want to keep all data, we want a full join.
camels <- power_full_join(camels ,by = 'gauge_id')Alternatively, we could have read straight form the urls. Strongly consider the implications of this approach as the longevity and persistence of the data is not guaranteed.
# Read and merge data
camels <- map(remote_files, read_delim, show_col_types = FALSE) |>
power_full_join(by = 'gauge_id')Question 1
- Make sure all data and the PDF are downloaded into you data directory
- From the documentation PDF, report what zero_q_freq represents
- zero_q_freq is the % frequency of days with Q = 0 mm/day
Exploratory Data Analysis
First, lets make a map of the sites. Use the borders() ggplot function to add state boundaries to the map and initially color the points by the mean flow (q_mean) at each site.
ggplot(data = camels, aes(x = gauge_lon, y = gauge_lat)) +
borders("state", colour = "gray50") +
geom_point(aes(color = q_mean)) +
scale_color_gradient(low = "pink", high = "dodgerblue") +
ggthemes::theme_map()
We can take a moment here to are learn a few new things about ggplot2!
Color scales In ggplot, sometimes you want different things (like bars, dots, or lines) to have different colors. But how does R know which colors to use? That’s where color scales come in!
scales can be used to map data values to colors (scale_color_) or fill aesthetics (scale_fill_). There are two main types of color scales:
Discrete color scales – for things that are categories, like “apples,” “bananas,” and “cherries.” Each gets its own separate color.
scale_color_manual(values = c("red", "yellow", "pink")) #lets you pick your own colors.<ggproto object: Class ScaleDiscrete, Scale, gg>
aesthetics: colour
axis_order: function
break_info: function
break_positions: function
breaks: waiver
call: call
clone: function
dimension: function
drop: TRUE
expand: waiver
get_breaks: function
get_breaks_minor: function
get_labels: function
get_limits: function
get_transformation: function
guide: legend
is_discrete: function
is_empty: function
labels: waiver
limits: NULL
make_sec_title: function
make_title: function
map: function
map_df: function
n.breaks.cache: NULL
na.translate: TRUE
na.value: grey50
name: waiver
palette: function
palette.cache: NULL
position: left
range: environment
rescale: function
reset: function
train: function
train_df: function
transform: function
transform_df: function
super: <ggproto object: Class ScaleDiscrete, Scale, gg>
Or
scale_color_brewer(palette = "Set1") #uses a built-in color set.<ggproto object: Class ScaleDiscrete, Scale, gg>
aesthetics: colour
axis_order: function
break_info: function
break_positions: function
breaks: waiver
call: call
clone: function
dimension: function
drop: TRUE
expand: waiver
get_breaks: function
get_breaks_minor: function
get_labels: function
get_limits: function
get_transformation: function
guide: legend
is_discrete: function
is_empty: function
labels: waiver
limits: NULL
make_sec_title: function
make_title: function
map: function
map_df: function
n.breaks.cache: NULL
na.translate: TRUE
na.value: NA
name: waiver
palette: function
palette.cache: NULL
position: left
range: environment
rescale: function
reset: function
train: function
train_df: function
transform: function
transform_df: function
super: <ggproto object: Class ScaleDiscrete, Scale, gg>
Continuous color scales – for numbers, like temperature (cold to hot) or height (short to tall). The color changes smoothly.
scale_color_gradient(low = "blue", high = "red") #makes small numbers blue and big numbers red.<ScaleContinuous>
Range:
Limits: 0 -- 1
Question 2
- Make 2 maps of the sites, coloring the points by the aridty and p_mean column
- Add clear labels, titles, and a color scale that makes sense for each parameter.
- Ensure these render as a single image with your choice of facet_*, patchwork, or ggpubr
aridity <- ggplot(data = camels, aes(x = gauge_lon, y = gauge_lat)) +
borders("state", colour = "gray50") +
geom_point(aes(color = aridity)) +
scale_color_viridis_c(option = "magma", direction = -1) +
ggthemes::theme_map()
p_mean <- ggplot(data = camels, aes(x = gauge_lon, y = gauge_lat)) +
borders("state", colour = "gray50") +
geom_point(aes(color = p_mean)) +
scale_color_viridis_c(option = "viridis", direction = -1) +
ggthemes::theme_map()
# Patchwork plot
library(patchwork)
(aridity + p_mean) + plot_annotation(title = "CAMELS Site Maps by Aridity and Precipitation")
# ggpubr plot
library(ggpubr)
ggarrange(aridity, p_mean,
ncol = 2, nrow = 1,
common.legend = FALSE)
# Facet
camels_long <- camels %>%
select(gauge_lon, gauge_lat, aridity, p_mean) %>%
pivot_longer(cols = c(aridity, p_mean), names_to = "variable", values_to = "value")
# Faceted plot
ggplot(data = camels_long, aes(x = gauge_lon, y = gauge_lat)) +
borders("state", colour = "gray50") +
geom_point(aes(color = value)) +
scale_color_gradient(low = "blue", high = "red")+
facet_wrap(~ variable, scales = "free") +
ggthemes::theme_map()
Model Preparation
As an initial analysis, lets look at the relationship between aridity, rainfall and mean flow. First, lets make sure there is not significant correlation between these variables. Here, we make sure to drop NAs and only view the 3 columns of interest.
camels |>
select(aridity, p_mean, q_mean) |>
drop_na() |>
cor() aridity p_mean q_mean
aridity 1.0000000 -0.7550090 -0.5817771
p_mean -0.7550090 1.0000000 0.8865757
q_mean -0.5817771 0.8865757 1.0000000
As expected, there is a strong correlation between rainfall and mean flow, and an inverse correlation between aridity and rainfall. While both are high, we are going see if we can build a model to predict mean flow using aridity and rainfall.
Visual EDA
- Lets start by looking that the 3 dimensions (variables) of this data. We’ll start with a XY plot of aridity and rainfall. We are going to use the scale_color_viridis_c() function to color the points by the q_mean column. This scale functions maps the color of the points to the values in the q_mean column along the viridis continuous (c) palette. Because a scale_color_* function is applied, it maps to the known color aesthetic in the plot.
# Create a scatter plot of aridity vs rainfall
ggplot(camels, aes(x = aridity, y = p_mean)) +
# Add points colored by mean flow
geom_point(aes(color = q_mean)) +
# Add a linear regression line
geom_smooth(method = "lm", color = "red", linetype = 2) +
# Apply the viridis color scale
scale_color_viridis_c() +
# Add a title, axis labels, and theme (w/ legend on the bottom)
theme_linedraw() +
theme(legend.position = "bottom") +
labs(title = "Aridity vs Rainfall vs Runnoff",
x = "Aridity",
y = "Rainfall",
color = "Mean Flow")
Ok! so it looks like there is a relationship between rainfall, aridity, and rainfall but it looks like an exponential decay function and is certainly not linear.
To test a transformation, we can log transform the x and y axes using the scale_x_log10() and scale_y_log10() functions:
ggplot(camels, aes(x = aridity, y = p_mean)) +
geom_point(aes(color = q_mean)) +
geom_smooth(method = "lm") +
scale_color_viridis_c() +
# Apply log transformations to the x and y axes
scale_x_log10() +
scale_y_log10() +
theme_linedraw() +
theme(legend.position = "bottom") +
labs(title = "Aridity vs Rainfall vs Runnoff",
x = "Aridity",
y = "Rainfall",
color = "Mean Flow")
Great! We can see a log-log relationship between aridity and rainfall provides a more linear relationship. This is a common relationship in hydrology and is often used to estimate rainfall in ungauged basins. However, once the data is transformed, the lack of spread in the streamflow data is quite evident with high mean flow values being compressed to the low end of aridity/high end of rainfall.
To address this, we can visualize how a log transform may benifit the q_mean data as well. Since the data is represented by color, rather then an axis, we can use the trans (transform) argument in the scale_color_viridis_c() function to log transform the color scale.
ggplot(camels, aes(x = aridity, y = p_mean)) +
geom_point(aes(color = q_mean)) +
geom_smooth(method = "lm") +
# Apply a log transformation to the color scale
scale_color_viridis_c(trans = "log") +
scale_x_log10() +
scale_y_log10() +
theme_linedraw() +
theme(legend.position = "bottom",
# Expand the legend width ...
legend.key.width = unit(2.5, "cm"),
legend.key.height = unit(.5, "cm")) +
labs(title = "Aridity vs Rainfall vs Runnoff",
x = "Aridity",
y = "Rainfall",
color = "Mean Flow") 
Excellent! Treating these three right skewed variables as log transformed, we can see a more evenly spread relationship between aridity, rainfall, and mean flow. This is a good sign for building a model to predict mean flow using aridity and rainfall.
Model Building
Let’s start by splitting the data
First, we set a seed for reproducabilty, then transform the q_mean column to a log scale. Remember it is error prone to apply transformations to the outcome variable within a recipe. So, we’ll do it a prioi.
Once set, we can split the data into a training and testing set. We are going to use 80% of the data for training and 20% for testing with no stratification.
Additionally, we are going to create a 10-fold cross validation dataset to help us evaluate multi-model setups.
set.seed(123)
# Bad form to perform simple transformations on the outcome variable within a
# recipe. So, we'll do it here.
camels <- camels |>
mutate(logQmean = log(q_mean))
# Generate the split
camels_split <- initial_split(camels, prop = 0.8)
camels_train <- training(camels_split)
camels_test <- testing(camels_split)
camels_cv <- vfold_cv(camels_train, v = 10)Preprocessor: recipe
In lecture, we have focused on using formulas as a workflow preprocessor. Separately we have used the recipe function to define a series of data preprocessing steps. Here, we are going to use the recipe function to define a series of data preprocessing steps.
We learned quite a lot about the data in the visual EDA. We know that the q_mean, aridity and p_mean columns are right skewed and can be helped by log transformations. We also know that the relationship between aridity and p_mean is non-linear and can be helped by adding an interaction term to the model. To implement these, lets build a recipe!
# Create a recipe to preprocess the data
rec <- recipe(logQmean ~ aridity + p_mean, data = camels_train) %>%
# Log transform the predictor variables (aridity and p_mean)
step_log(all_predictors()) %>%
# Add an interaction term between aridity and p_mean
step_interact(terms = ~ aridity:p_mean) |>
# Drop any rows with missing values in the pred
step_naomit(all_predictors(), all_outcomes())Naive base lm approach:
Ok, to start, lets do what we are comfortable with … fitting a linear model to the data. First, we use prep and bake on the training data to apply the recipe. Then, we fit a linear model to the data.
# Prepare the data
baked_data <- prep(rec, camels_train) |>
bake(new_data = NULL)
# Interaction with lm
# Base lm sets interaction terms with the * symbol
lm_base <- lm(logQmean ~ aridity * p_mean, data = baked_data)
summary(lm_base)
Call:
lm(formula = logQmean ~ aridity * p_mean, data = baked_data)
Residuals:
Min 1Q Median 3Q Max
-2.91162 -0.21601 -0.00716 0.21230 2.85706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.77586 0.16365 -10.852 < 2e-16 ***
aridity -0.88397 0.16145 -5.475 6.75e-08 ***
p_mean 1.48438 0.15511 9.570 < 2e-16 ***
aridity:p_mean 0.10484 0.07198 1.457 0.146
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7684
F-statistic: 591.6 on 3 and 531 DF, p-value: < 2.2e-16
# Sanity Interaction term from recipe ... these should be equal!!
summary(lm(logQmean ~ aridity + p_mean + aridity_x_p_mean, data = baked_data))
Call:
lm(formula = logQmean ~ aridity + p_mean + aridity_x_p_mean,
data = baked_data)
Residuals:
Min 1Q Median 3Q Max
-2.91162 -0.21601 -0.00716 0.21230 2.85706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.77586 0.16365 -10.852 < 2e-16 ***
aridity -0.88397 0.16145 -5.475 6.75e-08 ***
p_mean 1.48438 0.15511 9.570 < 2e-16 ***
aridity_x_p_mean 0.10484 0.07198 1.457 0.146
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7684
F-statistic: 591.6 on 3 and 531 DF, p-value: < 2.2e-16
Where things get a little messy…
Ok so now we have our trained model lm_base and want to validate it on the test data.
Remember a models ability to predict on new data is the most important part of the modeling process. It really doesnt matter how well it does on data it has already seen!
We have to be careful about how we do this with the base R approach:
Wrong version 1: augment
The broom package provides a convenient way to extract model predictions and residuals. We can use the augment function to add predicted values to the test data. However, if we use augment directly on the test data, we will get incorrect results because the preprocessing steps defined in the recipe object have not been applied to the test data.
nrow(camels_test)[1] 135
nrow(camels_train)[1] 536
broom::augment(lm_base, new_data = camels_test)# A tibble: 535 × 9
logQmean aridity p_mean .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.0230 -0.172 1.18 0.107 -0.130 0.00252 0.570 0.0000328 -0.228
2 -0.496 0.481 1.08 -0.549 0.0534 0.0179 0.570 0.0000408 0.0945
3 0.530 -0.362 1.44 0.631 -0.101 0.00367 0.570 0.0000291 -0.178
4 0.314 0.573 0.665 -1.26 1.57 0.00562 0.566 0.0108 2.76
5 0.658 -0.466 1.28 0.470 0.188 0.00422 0.570 0.000116 0.331
6 0.00495 -0.146 1.33 0.308 -0.303 0.00475 0.570 0.000339 -0.533
7 -0.0146 -0.135 0.871 -0.376 0.362 0.00847 0.570 0.000868 0.638
8 0.413 -0.338 1.29 0.389 0.0238 0.00264 0.570 0.00000116 0.0419
9 1.12 -0.402 1.52 0.769 0.349 0.00479 0.570 0.000454 0.614
10 0.552 -0.506 1.34 0.591 -0.0388 0.00378 0.570 0.00000442 -0.0682
# ℹ 525 more rows
Wrong version 2: predict
The predict function can be used to make predictions on new data. However, if we use predict directly on the test data, we will get incorrect results because the preprocessing steps defined in the recipe object have not been applied to the test data.
camels_test$p2 = predict(lm_base, newdata = camels_test)
## Scales way off!
ggplot(camels_test, aes(x = p2, y = logQmean)) +
geom_point() +
# Linear fit line, no error bands
geom_smooth(method = "lm", se = FALSE, size =1) +
# 1:1 line
geom_abline(color = "red", size = 1) +
labs(title = "Linear Model Using `predict()`",
x = "Predicted Log Mean Flow",
y = "Observed Log Mean Flow") +
theme_linedraw()
Correct version: prep -> bake -> predict
To correctly evaluate the model on the test data, we need to apply the same preprocessing steps to the test data that we applied to the training data. We can do this using the prep and bake functions with the recipe object. This ensures the test data is transformed in the same way as the training data before making predictions.
test_data <- bake(prep(rec), new_data = camels_test)
test_data$lm_pred <- predict(lm_base, newdata = test_data)Model Evaluation: statistical and visual
Now that we have the predicted values, we can evaluate the model using the metrics function from the yardstick package. This function calculates common regression metrics such as RMSE, R-squared, and MAE between the observed and predicted values.
metrics(test_data, truth = logQmean, estimate = lm_pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.583
2 rsq standard 0.742
3 mae standard 0.390
ggplot(test_data, aes(x = logQmean, y = lm_pred, colour = aridity)) +
# Apply a gradient color scale
scale_color_gradient2(low = "brown", mid = "orange", high = "darkgreen") +
geom_point() +
geom_abline(linetype = 2) +
theme_linedraw() +
labs(title = "Linear Model: Observed vs Predicted",
x = "Observed Log Mean Flow",
y = "Predicted Log Mean Flow",
color = "Aridity")
Ok so that was a bit burdensome, is really error prone (fragile), and is worthless if we wanted to test a different algorithm… lets look at a better approach!
Using a workflow instead
tidymodels provides a framework for building and evaluating models using a consistent and modular workflow. The workflows package allows you to define a series of modeling steps, including data preprocessing, model fitting, and model fitting, in a single object. This makes it easier to experiment with different models, compare performance, and ensure reproducibility.
workflows are built from a model, a preprocessor, and a execution. Here, we are going to use the linear_reg function to define a linear regression model, set the engine to lm, and the mode to regression. We then add our recipe to the workflow, fit the model to the training data, and extract the model coefficients.
# Define model
lm_model <- linear_reg() %>%
# define the engine
set_engine("lm") %>%
# define the mode
set_mode("regression")
# Instantiate a workflow ...
lm_wf <- workflow() %>%
# Add the recipe
add_recipe(rec) %>%
# Add the model
add_model(lm_model) %>%
# Fit the model to the training data
fit(data = camels_train)
# Extract the model coefficients from the workflow
summary(extract_fit_engine(lm_wf))$coefficients Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.7758557 0.16364755 -10.851710 6.463654e-25
aridity -0.8839738 0.16144589 -5.475357 6.745512e-08
p_mean 1.4843771 0.15511117 9.569762 4.022500e-20
aridity_x_p_mean 0.1048449 0.07198145 1.456555 1.458304e-01
# From the base implementation
summary(lm_base)$coefficients Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.7758557 0.16364755 -10.851710 6.463654e-25
aridity -0.8839738 0.16144589 -5.475357 6.745512e-08
p_mean 1.4843771 0.15511117 9.569762 4.022500e-20
aridity:p_mean 0.1048449 0.07198145 1.456555 1.458304e-01
Making Predictions
Now that lm_wf is a workflow, data is not embedded in the model, we can use augment with the new_data argument to make predictions on the test data.
lm_data <- augment(lm_wf, new_data = camels_test)
dim(lm_data)[1] 135 62
Model Evaluation: statistical and visual
As with EDA, applying for graphical and statistical evaluation of the model is a key Here, we use the metrics function to extract the default metrics (rmse, rsq, mae) between the observed and predicted mean streamflow values.
We then create a scatter plot of the observed vs predicted values, colored by aridity, to visualize the model performance.
metrics(lm_data, truth = logQmean, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.583
2 rsq standard 0.742
3 mae standard 0.390
ggplot(lm_data, aes(x = logQmean, y = .pred, colour = aridity)) +
scale_color_viridis_c() +
geom_point() +
geom_abline() +
theme_linedraw()
Switch it up!
The real power of this approach is that we can easily switch out the models/recipes and see how it performs. Here, we are going to instead use a random forest model to predict mean streamflow. We define a random forest model using the rand_forest function, set the engine to ranger, and the mode to regression. We then add the recipe, fit the model, and evaluate the skill.
rf_model <- rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
rf_wf <- workflow() %>%
# Add the recipe
add_recipe(rec) %>%
# Add the model
add_model(rf_model) %>%
# Fit the model
fit(data = camels_train) Predictions
Make predictions on the test data using the augment function and the new_data argument.
rf_data <- augment(rf_wf, new_data = camels_test)
dim(rf_data)[1] 135 61
Model Evaluation: statistical and visual
Evaluate the model using the metrics function and create a scatter plot of the observed vs predicted values, colored by aridity.
metrics(rf_data, truth = logQmean, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.596
2 rsq standard 0.733
3 mae standard 0.370
ggplot(rf_data, aes(x = logQmean, y = .pred, colour = aridity)) +
scale_color_viridis_c() +
geom_point() +
geom_abline() +
theme_linedraw()
Awesome! We just set up a completely new model and were able to utilize all of the things we had done for the linear model. This is the power of the tidymodels framework!
That said, we still can reduce some to the repetition. Further, we are not really able to compare these models to one another as they
A workflowset approach
workflow_set is a powerful tool for comparing multiple models on the same data. It allows you to define a set of workflows, fit them to the same data, and evaluate their performance using a common metric. Here, we are going to create a workflow_set object with the linear regression and random forest models, fit them to the training data, and compare their performance using the autoplot and rank_results functions.
wf <- workflow_set(list(rec), list(lm_model, rf_model)) %>%
workflow_map('fit_resamples', resamples = camels_cv)
autoplot(wf)
rank_results(wf, rank_metric = "rsq", select_best = TRUE)# A tibble: 4 × 9
wflow_id .config .metric mean std_err n preprocessor model rank
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
1 recipe_rand_fore… Prepro… rmse 0.565 0.0243 10 recipe rand… 1
2 recipe_rand_fore… Prepro… rsq 0.770 0.0255 10 recipe rand… 1
3 recipe_linear_reg Prepro… rmse 0.569 0.0260 10 recipe line… 2
4 recipe_linear_reg Prepro… rsq 0.770 0.0223 10 recipe line… 2
Overall it seems the random forest model is outperforming the linear model. This is not surprising given the non-linear relationship between the predictors and the outcome :)
Final Model
rf_fin <- rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
final <- workflow() |>
add_recipe(rec) |>
add_model(rf_fin) |>
fit(data = camels_train) Evaluation
As a last step, lets evaluate the Random Forest model’s performance in predicting streamflow using the vip, augment, and ggplot2. We’ll starts by computing variable importance (vip::vip()) to understand which predictors most influence the model.
Next, we’ll apply the trained model (final) to the test dataset using augment to append predictions to the test data.
Model performance is then assessed using metrics(), comparing the actual (logQmean) and predicted (.pred) log-transformed mean streamflow values.
Finally, a scatter plot is generated , visualizing the observed vs. predicted values, color-coded by aridity. The plot includes a 1:1 reference line (geom_abline()) to indicate perfect predictions and uses the viridis color scale to improve readability.
# VIP:
vip::vip(final)
## Prediction
rf_data <- augment(final, new_data = camels_test)
## Evaluation
metrics(rf_data, truth = logQmean, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.591
2 rsq standard 0.738
3 mae standard 0.366
ggplot(rf_data, aes(x = logQmean, y = .pred, colour = aridity)) +
scale_color_viridis_c() +
geom_point() +
geom_abline() +
geom_smooth(method = "lm", col = 'red', lty = 2, se = FALSE) +
theme_linedraw() +
labs(title = "Random Forest Model: Observed vs Predicted",
x = "Observed Log Mean Flow",
y = "Predicted Log Mean Flow",
color = "Aridity")
Question 3
- Build a xgboost (engine) regression (mode) model using boost_tree
- Build a neural network model using the nnet engine from the baguette package using the bag_mlp function
- Add this to the above workflow
- Evaluate the model and compare it to the linear and random forest models
- Which of the 4 models would you move forward with?
xgboost
# Re set recipe
rec <- recipe(logQmean ~ aridity + p_mean, data = camels_train) %>%
step_log(all_predictors()) %>%
step_interact(terms = ~ aridity:p_mean) %>%
step_naomit(all_predictors(), all_outcomes())
# Define model
xgb_spec <- boost_tree(
trees = 1000,
learn_rate = 0.05,
tree_depth = 6
) %>%
set_engine("xgboost") %>%
set_mode("regression")
# Instantiate a workflow...
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_recipe(rec) %>%
fit(data = camels_train)neural network
# Define model
nn_spec <- bag_mlp(
hidden_units = 10,
epochs = 100
) %>%
set_engine("nnet") %>%
set_mode("regression")
# Instantiate a workflow...
nn_wf <- workflow() %>%
add_model(nn_spec) %>%
add_recipe(rec) %>%
fit(data = camels_train)prediction and evaluation
xgb_data <- augment(xgb_wf, new_data = camels_test)
nn_data <- augment(nn_wf, new_data = camels_test)
#xgboost evaluation
metrics(xgb_data, truth = logQmean, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.685
2 rsq standard 0.664
3 mae standard 0.421
#neural network evaluation
metrics(nn_data, truth = logQmean, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.547
2 rsq standard 0.771
3 mae standard 0.338
#lm evaluation
metrics(lm_data, truth = logQmean, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.583
2 rsq standard 0.742
3 mae standard 0.390
#random forest evaluation
metrics(rf_data, truth = logQmean, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.591
2 rsq standard 0.738
3 mae standard 0.366
The neural network model appears to preform the best because it has the lowest rmse and mae and the highest rsq.
Question 4: Build your own model
Data Splitting
# Set a seed
set.seed(4185)
# Create an initial split with 75% used for training and 25% for testing
camels_split <- initial_split(camels, prop = 0.8)
# Extract your training and testing sets
camels_train <- training(camels_split)
camels_test <- testing(camels_split)
# Build a 10-fold CV dataset as well
camels_cv <- vfold_cv(camels_train, v = 10)Visualization
# Create a scatter plot of PET vs rainfall
ggplot(camels, aes(x = pet_mean, y = p_mean)) +
# Add points colored by mean flow
geom_point(aes(color = q_mean)) +
# Add a linear regression line
geom_smooth(method = "lm", color = "red", linetype = 2) +
# Apply the viridis color scale
scale_color_viridis_c() +
# Add a title, axis labels, and theme (w/ legend on the bottom)
theme_linedraw() +
theme(legend.position = "bottom") +
labs(title = "PET vs Rainfall vs Runnoff",
x = "PET",
y = "Rainfall",
color = "Mean Flow")
ggplot(camels, aes(x = pet_mean, y = p_mean)) +
geom_point(aes(color = q_mean)) +
geom_smooth(method = "lm") +
scale_color_viridis_c() +
# Apply log transformations to the x and y axes
scale_x_log10() +
scale_y_log10() +
theme_linedraw() +
theme(legend.position = "bottom") +
labs(title = "PET vs Rainfall vs Runnoff",
x = "PET",
y = "Rainfall",
color = "Mean Flow")
Recipe
# Define a formula you want to use to predict logQmean
## I want to use pet_mean and p_mean to predict logQmean
# Describe in words why you are choosing the formula you are. Consult the downloaded PDF for the data to help you make this decision
## Based on the initial visualizations, I do not know if PET will be a better predictor than aridity. But, I will model it for this lab to practice.
# Build a recipe that you feel handles the predictors chosen well
rec <- recipe(logQmean ~ pet_mean + p_mean, data = camels_train) %>%
step_log(all_predictors()) %>%
step_interact(terms = ~ pet_mean:p_mean) %>%
step_naomit(all_predictors(), all_outcomes())Define 3 models
# Define a random forest model using the rand_forest function
# Set the engine to ranger and the mode to regression
# Define two other models of your choice
rf <- rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# xgboost
xgb <- boost_tree(
trees = 1000,
learn_rate = 0.05,
tree_depth = 6
) %>%
set_engine("xgboost") %>%
set_mode("regression")
# neural network
nn <- bag_mlp(
hidden_units = 10,
epochs = 100
) %>%
set_engine("nnet") %>%
set_mode("regression")workflowset()
# Create a workflow object
# Add the recipe
# Add the model(s)
rf_wf <- workflow() %>%
add_model(rf) %>%
add_recipe(rec) %>%
fit(data = camels_train)
nn_wf <- workflow() %>%
add_model(nn) %>%
add_recipe(rec) %>%
fit(data = camels_train)
xgb_wf <- workflow() %>%
add_model(xgb) %>%
add_recipe(rec) %>%
fit(data = camels_train)
# fit the model to the resamples
rf_fit <- augment(rf_wf, new_data = camels_test)
nn_fit <- augment(nn_wf, new_data = camels_test)
xgb_fit <- augment(xgb_wf, new_data = camels_test)Evaluation
# Use autoplot and rank_results to compare the models
model_set <- workflow_set(
preproc = list(simple_recipe = rec),
models = list(
rf_model = rf,
xgb_model = xgb,
nn_model = nn))
model_res <- model_set %>%
workflow_map("fit_resamples", resamples = camels_cv)
autoplot(model_res)
# Describe what model you think is best and why
## I think that random forest is best because it has a higher rsq and lower rmse than xgboost while having a smaller range of metrics than the neural net. Tune the best model
# Use the tune-grid function to tune at least one of the model hyperparameters
rf_tune <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 1000
) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
rf_tune_wf <- workflow() %>%
add_model(rf_tune) %>%
add_recipe(rec)
rf_grid <- grid_regular(
mtry(range = c(1, 2)),
min_n(range = c(2, 10)),
levels = 5)
set.seed(4185)
rf_tuned_results <- tune_grid(
rf_tune_wf,
resamples = camels_cv,
grid = rf_grid,
metrics = metric_set(rmse, rsq))
# Use show_best to find the best hyperparameter values for the metric of your choice
show_best(rf_tuned_results, metric = "rmse")# A tibble: 5 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 10 rmse standard 0.553 10 0.0348 Preprocessor1_Model09
2 1 8 rmse standard 0.554 10 0.0351 Preprocessor1_Model07
3 1 4 rmse standard 0.554 10 0.0363 Preprocessor1_Model03
4 1 6 rmse standard 0.557 10 0.0355 Preprocessor1_Model05
5 1 2 rmse standard 0.561 10 0.0362 Preprocessor1_Model01
show_best(rf_tuned_results, metric = "rsq")# A tibble: 5 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 10 rsq standard 0.787 10 0.0213 Preprocessor1_Model09
2 1 8 rsq standard 0.786 10 0.0215 Preprocessor1_Model07
3 1 4 rsq standard 0.786 10 0.0216 Preprocessor1_Model03
4 1 6 rsq standard 0.784 10 0.0211 Preprocessor1_Model05
5 2 10 rsq standard 0.783 10 0.0220 Preprocessor1_Model10
# Use a workflow to fit your final, tuned, model
best_rf <- tibble(mtry = 1, min_n = 10)
final_rf_wf <- finalize_workflow(rf_tune_wf, best_rf)
final_model <- fit(final_rf_wf, data = camels_train)Look at VIP
# Use the vip::vip package to visualize the importance of your final model
# Describe what you think of the results and if they make sense
# If the model you elect can't provide VIP, instead discuss the pros and cons of a less interpretable model
vip::vip(final_model)
VIP shows that p_mean has the highest importance, then pet_mean x P_mean, and that pet_mean has the lease importance. This makes sense because precipitation will be of greater importance to streamflow than evapotranspiration, although both factors play a role in streamflow.
Extract and Evaluate
# Use augment to make predictions on the test data
# Use metrics to evaluate the model performance on the test data
# Create a plot of the observed vs predicted values with a clear title, axis labels, and a compelling color scale
# Describe what you think of the results!
final_rf_aug <- augment(final_model, new_data = camels_test)
test_metrics <- final_rf_aug %>%
metrics(truth = logQmean, estimate = .pred)
print(test_metrics)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.509
2 rsq standard 0.788
3 mae standard 0.328
ggplot(final_rf_aug, aes(x = logQmean, y = .pred, colour = pet_mean)) +
scale_color_viridis_c() +
geom_point() +
geom_abline() +
geom_smooth(method = "lm", col = 'red', lty = 2, se = FALSE) +
theme_linedraw() +
labs(title = "Random Forest: Observed vs Predicted Streamflow",
x = "Observed Log Mean Flow",
y = "Predicted Log Mean Flow",
color = "PET") +
theme_minimal()
Results: rsq is 0.731, and the prompt called for an rsq above 0.9, so some work needs to be done to fit a better model or use different predictor variables. It seems that values above 0 logQmean are better fit. I will attempt to create a model of values of loqQmean > 0.
# Filter to positive
camels_train_pos <- camels_train %>% filter(logQmean > 0)
camels_test_pos <- camels_test %>% filter(logQmean > 0)
# Rebuild the recipe based on the positive subset
rec_pos <- recipe(logQmean ~ pet_mean + p_mean, data = camels_train_pos) %>%
step_log(all_predictors()) %>%
step_interact(terms = ~ pet_mean:p_mean) %>%
step_naomit(all_predictors(), all_outcomes())
# Refinalize the rf model
best_rf <- tibble(mtry = 1, min_n = 10)
rf_tune <- rand_forest(
mtry = best_rf$mtry,
min_n = best_rf$min_n,
trees = 1000
) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# Workflow with updated recipe
final_rf_wf_pos <- workflow() %>%
add_model(rf_tune) %>%
add_recipe(rec_pos)
# Fit to training data
final_model_pos <- fit(final_rf_wf_pos, data = camels_train_pos)
# Augment
final_pos_aug <- augment(final_model_pos, new_data = camels_test_pos)
# Evaluate metrics
test_metrics_pos <- final_pos_aug %>%
metrics(truth = logQmean, estimate = .pred)
print(test_metrics_pos)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.174
2 rsq standard 0.891
3 mae standard 0.136
# Plot
ggplot(final_pos_aug, aes(x = logQmean, y = .pred, colour = pet_mean)) +
scale_color_viridis_c() +
geom_point() +
geom_abline() +
geom_smooth(method = "lm", col = 'red', lty = 2, se = FALSE) +
theme_linedraw() +
labs(title = "Random Forest: Observed vs Predicted Streamflow - Positive Values",
x = "Observed Log Mean Flow",
y = "Predicted Log Mean Flow",
color = "PET") 
Results: This model is a better fit, with a higher rsq and lower rmse.