Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I’m interested in the recent TidyTuesday data set about bee colony losses (2022 week 2), because my mom kept bees for several years until she lost the colony to some kind of parasite. Thanks to her “bee talks” during family get-togethers, I know that native bee populations are in a lot of trouble due to farm monocultures (=lack of food, because all the plants in the monoculture bloom at the same time), pesticides, and parasites.

This dataset looks at farmed and hobby bee colonies, not wild ones, but I am still interested in the stressors the colonies face. I’m particularly interested in two different questions:

1: Can I see any effect of climate change on bee colonies in Oregon? For example, is there any correlation between average temperature over a time span and the percentage of colonies lost? Since we’re in the middle of a multi-year drought, is there any correlation between precipitation over a time span and the percentage of colonies lost?

2: Are the types of stressors faced by bee colonies in Oregon and around the country changing over time? Specifically, is the percentage of colonies stressed by pesticides changing, and if so, how? Is the percentage of colonies stressed by varroa mites (a type of parasite) changing, and if so, how? Do trends in Oregon reflect those across the country?

Given your question, what is your expectation about the data?

In order to address my first research question, about climate change, I will have to make some assumptions. My major assumption is going to be that temperature trends in Hillsboro, OR reflect overall state temperature trends, so that I can use temperature data from the National Oceanic and Atmospheric Administration (NOAA) from Hillsboro to examine bee colony loss trends with regard to average temperatures. I chose Hillsboro because monthly average temperature data is available during this time period and because a lot of farming is conducted around the Hillsboro area, so I would expect farming trends there to be at least somewhat reflective of those in the state as a whole.

I expect that becauses the bee colony data is reported quarterly, and the data is only reported from 2015 on, it will be difficult to see climate trends reflected in bee colony losses. I don’t expect to see much correlation between average temperature and percentage of colonies lost, because there just isn’t enough change in temperature over this short time period. If anything, I expect to see a very slight increase in percentage of colonies lost in the summer in recent years, and a very slight decrease in percentage of colonies lost during the winter: as average winter temperatures increase slightly, the number of really cold days that can freeze colonies should decrease, and as summer average temperatures increase slightly, the number of really hot days that can overheat colonies should increase. (I think a correlation between number of extreme temperature days and colony losses might be easier to see, but data for the number of extreme temperature days was not readily available.) Furthermore, I’m not sure I’ll be able to see any correlation between precipitation and proportion of colony losses, but I would expect losses to increase with lower precipitation in the summer months. I will get temperature and precipitation data from https://www.weather.gov/pqr/cliplot.

I am also not sure what to expect with regard to the types of stressors colonies face. On the one hand, I think that as organic farming becomes more popular over time, colony stress due to pesticides might decrease; on the other hand, there might be increases in the types and total amount of pesticides applied over time, so I don’t know what trend to expect. Similarly, I don’t really know if varroa mites are becoming more or less of a problem over time. I would be interested to see if the problems differ by region of the country, too!

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Get the Data

# Read bee data in with tidytuesdayR package 
 #install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest
tuesdata <-tidytuesdayR::tt_load(2022, week = 2)

## --- Compiling #TidyTuesday Information for 2022-01-11 ----

## --- There are 2 files available ---

## --- Starting Download ---

## 
##  Downloading file 1 of 2: `colony.csv`
##  Downloading file 2 of 2: `stressor.csv`

## --- Download complete ---

colony <- tuesdata$colony
stressor <- tuesdata$stressor

#first let's look at the "colony" data, which reports losses for each region over time
glimpse(colony)

## Rows: 1,222
## Columns: 10
## $ year            <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months          <chr> "January-March", "January-March", "January-March", "Ja~
## $ state           <chr> "Alabama", "Arizona", "Arkansas", "California", "Color~
## $ colony_n        <dbl> 7000, 35000, 13000, 1440000, 3500, 3900, 305000, 10400~
## $ colony_max      <dbl> 7000, 35000, 14000, 1690000, 12500, 3900, 315000, 1050~
## $ colony_lost     <dbl> 1800, 4600, 1500, 255000, 1500, 870, 42000, 14500, 380~
## $ colony_lost_pct <dbl> 26, 13, 11, 15, 12, 22, 13, 14, 4, 4, 40, 22, 18, 23, ~
## $ colony_added    <dbl> 2800, 3400, 1200, 250000, 200, 290, 54000, 47000, 3400~
## $ colony_reno     <dbl> 250, 2100, 90, 124000, 140, NA, 25000, 9500, 760, 8000~
## $ colony_reno_pct <dbl> 4, 6, 1, 7, 1, NA, 8, 9, 7, 9, 4, 1, 2, 1, NA, 13, NA,~

skim(colony)

Data summary
Name	colony
Number of rows	1222
Number of columns	10
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
months	0	1	10	16	0	4	0
state	0	1	4	14	0	47	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2017.77	1.89	2015	2016	2018	2019	2021	▇▃▃▃▆
colony_n	47	0.96	123578.04	437835.18	1300	8000	17500	55500	3181180	▇▁▁▁▁
colony_max	72	0.94	79112.77	190823.42	1700	9000	21000	68750	1710000	▇▁▁▁▁
colony_lost	47	0.96	16551.32	60544.42	20	950	2200	6500	502350	▇▁▁▁▁
colony_lost_pct	54	0.96	11.38	7.23	1	6	10	15	52	▇▅▁▁▁
colony_added	83	0.93	17243.20	68167.65	10	420	1800	6500	736920	▇▁▁▁▁
colony_reno	131	0.89	15278.86	62588.04	10	260	960	4585	806170	▇▁▁▁▁
colony_reno_pct	260	0.79	9.10	9.66	1	2	6	12	77	▇▁▁▁▁

#I'm mainly interested in the year, months, state, and colony_lost_pct variables
#isn't it interesting that the colony_lost_pct variable is heavily right-skewed? 
#I wonder what stressor(s) are responsible for the very heavy losses.

#now let's look at the "stressor" data
glimpse(stressor)

## Rows: 7,332
## Columns: 5
## $ year       <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ months     <chr> "January-March", "January-March", "January-March", "January~
## $ state      <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ stressor   <chr> "Varroa mites", "Other pests/parasites", "Disesases", "Pest~
## $ stress_pct <dbl> 10.0, 5.4, NA, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, NA, 1.8, 3.1~

skim(stressor)

Data summary
Name	stressor
Number of rows	7332
Number of columns	5
_______________________
Column type frequency:
character	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
months	1	10	16	4
state	1	4	14	47
stressor	1	5	21	6

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2017.77	1.89	2015.0	2016.0	2018.0	2019.0	2021	▇▃▃▃▆
stress_pct	843	0.89	11.20	14.87	0.1	1.7	5.3	14.2	102	▇▁▁▁▁

#I can see that the variable stressor is currently character, not a factor. I will have to change that at some point.
#Interesting that the stress_pct variable is also heavily right-skewed!

#Read in average monthly temperature data for Hillsboro from NOAA
#this data is from:
#Note that missing is coded as M, sheet 1 is Temperature, sheet 2 is Precipitation.  On sheet 2, trace is coded as T, so I need to reset this to something small at some point.  I will assume it's zero.
hills_temps <- read_excel(here("data", "noaatempdata_monthly.xlsx"), sheet = "Temperature", na = "M")

hills_precip <- read_excel(here("data", "noaatempdata_monthly.xlsx"), sheet = "Precipitation", na = "M")

glimpse(hills_temps)

## Rows: 23
## Columns: 14
## $ Year   <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 201~
## $ Jan    <dbl> 38.5, 39.8, 39.9, 44.0, 37.8, 41.2, 43.3, 36.3, 36.6, 38.1, 44.~
## $ Feb    <dbl> 42.3, 39.9, 42.3, 42.9, 43.8, 41.3, 40.4, 43.6, 41.8, 39.0, 45.~
## $ Mar    <dbl> 43.7, 45.7, 42.9, 47.4, 49.3, 47.6, 44.4, 48.2, 42.2, 42.4, 45.~
## $ April  <dbl> 51.3, 46.9, 50.0, 49.0, 53.0, 49.3, 50.9, 49.4, 45.5, 48.7, 48.~
## $ May    <dbl> 55.3, 56.1, 53.1, 55.5, 57.1, 57.7, 56.6, 55.9, 56.5, 55.8, 53.~
## $ June   <dbl> 61.8, 57.2, 61.5, 63.8, 63.0, 58.6, 63.0, 60.0, 58.8, 61.9, 57.~
## $ July   <dbl> 64.0, 63.4, 67.1, 68.5, 68.9, 67.0, 69.1, 68.3, 65.4, 69.9, 64.~
## $ Aug    <dbl> 63.2, 65.5, 65.0, 67.1, 69.3, 67.5, 67.1, 65.4, 66.3, 66.1, 64.~
## $ Sept   <dbl> 60.4, 61.2, 60.7, 64.1, 59.9, 58.8, 62.1, 57.8, 61.3, 62.0, 61.~
## $ Oct    <dbl> 52.0, 49.9, 50.5, 55.9, 54.1, 53.5, 50.5, 48.1, 50.6, 50.9, 52.~
## $ Nov    <dbl> 40.0, 46.1, 45.1, 42.7, 43.6, 41.3, 44.8, 42.5, 46.9, 44.5, 43.~
## $ Dec    <dbl> 39.7, 40.5, 41.9, 41.2, 41.5, 38.8, 38.7, 38.9, 35.2, 34.7, 42.~
## $ Annual <dbl> 51.0, 51.0, 51.7, 53.5, 53.4, 51.9, 52.6, 51.2, 50.6, 51.2, 52.~

skim(hills_temps)

Data summary
Name	hills_temps
Number of rows	23
Number of columns	14
_______________________
Column type frequency:
numeric	14
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	0	1.00	2011.00	6.78	2000.0	2005.50	2011.00	2016.50	2022.0	▇▆▇▆▇
Jan	0	1.00	40.21	3.01	33.2	38.30	40.20	42.80	44.7	▁▅▇▆▇
Feb	1	0.96	41.77	2.56	36.9	40.32	41.75	42.75	47.6	▃▃▇▂▂
Mar	1	0.96	45.62	2.29	42.2	44.02	45.20	47.35	50.4	▇▇▅▇▂
April	1	0.96	50.00	2.26	45.5	49.05	50.10	51.27	55.5	▂▅▇▂▁
May	1	0.96	56.60	2.19	51.2	55.58	56.70	58.05	59.9	▁▂▆▇▅
June	1	0.96	61.63	2.64	57.2	59.25	61.90	63.00	67.4	▅▂▇▂▂
July	1	0.96	67.30	2.23	63.2	65.93	67.25	69.05	70.9	▅▂▇▅▆
Aug	1	0.96	67.37	1.96	63.2	66.03	67.40	68.95	71.3	▁▇▆▇▂
Sept	1	0.96	61.72	1.96	57.8	60.45	61.30	62.42	65.1	▂▇▇▃▇
Oct	1	0.96	52.39	2.53	48.1	50.60	52.20	53.98	57.9	▃▇▅▃▂
Nov	1	0.96	44.31	2.14	40.0	42.88	44.50	45.62	49.5	▂▇▇▆▁
Dec	1	0.96	39.32	2.79	33.7	38.00	40.10	41.50	43.3	▅▁▆▆▇
Annual	0	1.00	51.83	2.79	40.3	51.20	52.00	53.40	54.8	▁▁▁▆▇

#looks like all variables that should be numeric actually are numeric

glimpse(hills_precip)

## Rows: 28
## Columns: 14
## $ Year   <chr> "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",~
## $ Jan    <dbl> 6.92, 1.94, 7.31, 8.29, 5.90, 2.27, 11.90, 3.24, 5.38, 4.36, 5.~
## $ Feb    <dbl> 4.35, 1.58, 3.13, 2.93, 4.27, 0.68, 1.99, 3.80, 1.49, 1.08, 4.0~
## $ Mar    <dbl> 3.02, 2.33, 3.49, 5.16, 1.68, 4.42, 3.57, 2.39, 3.31, 2.40, 3.7~
## $ Apr    <dbl> 1.36, 1.86, 1.71, 5.91, 1.79, 2.56, 2.02, 1.96, 1.94, 1.24, 3.2~
## $ May    <dbl> 1.91, 0.85, 1.44, 0.75, 1.24, 4.35, 2.70, 1.29, 0.97, 2.92, 3.1~
## $ Jun    <dbl> 1.04, 1.20, 1.30, 0.15, 0.82, 1.55, 1.08, 0.97, 0.36, 1.34, 3.5~
## $ Jul    <chr> "0.08", "0.45", "0.32", "T", "T", "0.24", "0.14000000000000001"~
## $ Aug    <chr> NA, "0.79", "0.05", "0.55000000000000004", "2.31", "0.32", "0.0~
## $ Sep    <dbl> 1.27, 0.79, 0.83, 0.94, 1.37, 1.36, 0.59, 1.73, 0.22, 1.51, 2.2~
## $ Oct    <dbl> 3.00, 3.13, 0.43, 3.07, 3.55, 3.68, 0.90, 3.12, 1.69, 3.32, 3.9~
## $ Nov    <dbl> 2.16, 8.54, 2.61, 4.43, 2.61, 6.09, 12.88, 3.90, 4.51, 5.72, 5.~
## $ Dec    <dbl> 3.24, 6.98, 9.88, 7.93, 3.72, 9.09, 7.49, 8.94, NA, 3.96, 8.16,~
## $ Annual <dbl> NA, 30.44, 32.50, 40.11, 29.26, 36.61, 45.34, 32.27, NA, NA, 43~

skim(hills_precip)

Data summary
Name	hills_precip
Number of rows	28
Number of columns	14
_______________________
Column type frequency:
character	3
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Year	2	0.93	3	4	26
Jul	1	0.96	1	21	17
Aug	2	0.93	1	19	20

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Jan	1	0.96	153.76	534.89	1.47	3.18	5.18	7.42	2013	▇▁▁▁▁
Feb	1	0.96	152.21	535.77	0.68	1.93	3.80	4.46	2017	▇▁▁▁▁
Mar	1	0.96	152.95	537.85	1.28	2.36	3.57	5.35	2021	▇▁▁▁▁
Apr	1	0.96	151.28	536.32	0.45	1.75	2.33	3.37	2021	▇▁▁▁▁
May	2	0.93	156.38	546.14	0.11	0.88	1.72	2.87	2018	▇▁▁▁▁
Jun	1	0.96	149.78	535.16	0.15	0.74	1.20	1.52	2010	▇▁▁▁▁
Sep	1	0.96	150.54	536.67	0.04	0.81	1.28	1.97	2013	▇▁▁▁▁
Oct	1	0.96	151.88	535.29	0.43	1.78	3.32	4.01	2016	▇▁▁▁▁
Nov	1	0.96	153.88	535.72	1.16	2.78	5.23	7.01	2019	▇▁▁▁▁
Dec	2	0.93	160.79	545.58	1.08	4.15	6.52	8.75	2015	▇▁▁▁▁
Annual	5	0.82	206.88	568.84	24.57	29.85	36.17	42.88	2013	▇▁▁▁▁

#both months have missing data in 2022. Precipitation data has some other missings too.
#there is some ugly footer junk on the precipitation data sheet that needs to come off.

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Clean up

I need to clean up the precipitation data a little by removing the weird footer summary data. I also need to replace ‘T’ (trace) with zero. I set ‘M’ (missing) to NA for both the temperature and precipitation data during import.

Preparation for joining

My colony data reports losses quarterly, so I’ll need to wrangle my temperature and precipitation data into quarterly averages (for temperature) and sums (for precipitation). The temperature and precipitation data is wide data so I will also need to pivot it before joining.

Make sure your data types are correct!

I see that the stressor data has stressor as a character variable rather than a factor. I’ll want to change it to a factor so I can facet plots by type of stressor. I might also want to change the time period (“January - March” etc or year) to a factor for similar reasons, eventually.

All of the numeric variables seem to be properly numeric except for some of the precipitation data. I think changing the ’T’s (trace) to zeros will fix that.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

#Clean up Precipitation data

#let's take off the footer stuff from Precipitation data
hills_precip <- hills_precip %>% filter(!(Year %in% c("Min", "Max", "Mean", NA)))

#Now replace 'T' (trace) with zero in July and August
#have to do this in 2 steps because R doesn't like it if I don't use a string for str_replace
#first replace "T" with "0" (string verson of zero)
#hills_precip$Jul <- str_replace(hills_precip$Jul, "T", "0")
#hills_precip$Aug <- str_replace(hills_precip$Aug, "T", "0")

#Try to do this with across
hills_precip <- hills_precip %>% mutate(
  across(c(Jul,Aug),
         str_replace, "T", "0")
)

#Next change the variables to numeric
hills_precip <-hills_precip %>% mutate(
  across(c(Jul, Aug, Year),
         as.numeric)
)

glimpse(hills_precip)

## Rows: 23
## Columns: 14
## $ Year   <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 201~
## $ Jan    <dbl> 6.92, 1.94, 7.31, 8.29, 5.90, 2.27, 11.90, 3.24, 5.38, 4.36, 5.~
## $ Feb    <dbl> 4.35, 1.58, 3.13, 2.93, 4.27, 0.68, 1.99, 3.80, 1.49, 1.08, 4.0~
## $ Mar    <dbl> 3.02, 2.33, 3.49, 5.16, 1.68, 4.42, 3.57, 2.39, 3.31, 2.40, 3.7~
## $ Apr    <dbl> 1.36, 1.86, 1.71, 5.91, 1.79, 2.56, 2.02, 1.96, 1.94, 1.24, 3.2~
## $ May    <dbl> 1.91, 0.85, 1.44, 0.75, 1.24, 4.35, 2.70, 1.29, 0.97, 2.92, 3.1~
## $ Jun    <dbl> 1.04, 1.20, 1.30, 0.15, 0.82, 1.55, 1.08, 0.97, 0.36, 1.34, 3.5~
## $ Jul    <dbl> 0.08, 0.45, 0.32, 0.00, 0.00, 0.24, 0.14, 0.40, 0.09, 0.13, 0.4~
## $ Aug    <dbl> NA, 0.79, 0.05, 0.55, 2.31, 0.32, 0.08, 0.53, 1.37, 0.72, 0.17,~
## $ Sep    <dbl> 1.27, 0.79, 0.83, 0.94, 1.37, 1.36, 0.59, 1.73, 0.22, 1.51, 2.2~
## $ Oct    <dbl> 3.00, 3.13, 0.43, 3.07, 3.55, 3.68, 0.90, 3.12, 1.69, 3.32, 3.9~
## $ Nov    <dbl> 2.16, 8.54, 2.61, 4.43, 2.61, 6.09, 12.88, 3.90, 4.51, 5.72, 5.~
## $ Dec    <dbl> 3.24, 6.98, 9.88, 7.93, 3.72, 9.09, 7.49, 8.94, NA, 3.96, 8.16,~
## $ Annual <dbl> NA, 30.44, 32.50, 40.11, 29.26, 36.61, 45.34, 32.27, NA, NA, 43~

#Now all the variables are numeric and T's have been replaced with zeros.  YAY!!

#Get temperature data ready to join to colony data
#I realized after wrangling the temperature data that I could also have pivoted first, then used summarize to get the mean temperature for each quarter, except since I don't want annual average temps this seemed hard. Anyway, this method worked.

#get mean temperature per quarter for each year
hills_temps <- hills_temps %>%
  mutate(q1 = (Jan + Feb + Mar)/3,
         q2 = (April + May + June)/3,
         q3 = (July + Aug + Sept)/3,
         q4 = (Oct + Nov + Dec)/3)

#this is wide data. Now we need to pivot it to long data so we can join with the colony dataframe
hills_temps_long <- hills_temps %>%
  pivot_longer(
    cols = c(q1,q2, q3, q4),
    names_to = "months",
    values_to = "avg_temp"
  ) %>% select(!Jan:Dec)

#now let's replace strings so the `months` values match those in the colony data
hills_temps_long$months <- hills_temps_long$months %>%
  str_replace_all(c("q1" = "January-March",
              "q2" = "April-June",
              "q3" = "July-September",
              "q4" = "October-December"))

head(hills_temps_long)

## # A tibble: 6 x 4
##    Year Annual months           avg_temp
##   <dbl>  <dbl> <chr>               <dbl>
## 1  2000     51 January-March        41.5
## 2  2000     51 April-June           56.1
## 3  2000     51 July-September       62.5
## 4  2000     51 October-December     43.9
## 5  2001     51 January-March        41.8
## 6  2001     51 April-June           53.4

#now this dataframe is ready to join to the colony dataframe, hurray!

#Get precipitation data ready to join to colony data
#first get sum of precipitation for each quarter
hills_precip<- hills_precip%>%
  mutate(q1 = Jan + Feb + Mar, 
         q2 = Apr + May + Jun,
         q3 = Jul + Aug + Sep,
         q4 = Oct + Nov + Dec)
#I realized here that the NAs are a big problem, because then I get NAs for my sums.  I'm not sure how to use the sum function across some of the functions so I can use na.rm ...I'm sure there's a way to do it, but it's making my head hurt.  Let's just replace the NAs with zeros, which will give us a reasonable estimate for total precipitation even if not totally accurate. Using na.rm with sum would be doing the equivalent of replacing the NAs with zeros, here!

hills_precip[is.na(hills_precip)] <- 0 #put 0 wherever there's an na in hills_precip

#now let's try again to get precipitation sums for each quarter
hills_precip<- hills_precip%>%
  mutate(q1 = Jan + Feb + Mar, 
         q2 = Apr + May + Jun,
         q3 = Jul + Aug + Sep,
         q4 = Oct + Nov + Dec)

#pivot the data to long so we can join it to the colony dataframe
hills_precip_long <- hills_precip %>%
  pivot_longer(
    cols = c(q1,q2, q3, q4),
    names_to = "months",
    values_to = "precipitation"
  ) %>% select(!Jan:Dec)

#now let's replace strings so the `months` values match those in the colony data
hills_precip_long$months <- hills_precip_long$months %>%
  str_replace_all(c("q1" = "January-March",
              "q2" = "April-June",
              "q3" = "July-September",
              "q4" = "October-December"))

head(hills_precip_long)

## # A tibble: 6 x 4
##    Year Annual months           precipitation
##   <dbl>  <dbl> <chr>                    <dbl>
## 1  2000    0   January-March            14.3 
## 2  2000    0   April-June                4.31
## 3  2000    0   July-September            1.35
## 4  2000    0   October-December          8.4 
## 5  2001   30.4 January-March             5.85
## 6  2001   30.4 April-June                3.91

#Get colony data ready to join: We have temp data for Oregon only
colony_or <- colony %>% filter(state == "Oregon")

#Stressor data (includes some cleaning)
#We want the proportion of colonies stressed by each stressor for each quarter of each year, averaged over the entire country (including Oregon)
#stressor %>% group_by(year, months) %>% summarize(avg_stress_pct = mean(stress_pct, na.rm = T)) #this gives average of all stressors.  We want it broken down by type of stressor.  Need to pivot the data to wide so we can get that.
stressor_wide <- stressor %>%
  pivot_wider(
    names_from = "stressor", #column names from values of stressor
    values_from = "stress_pct" # the data points are in a column = stress_pct
  )
#clean name: want column to be "Diseases" not "Disesases"; want others easy to type
stressor_wide <- stressor_wide %>% rename(diseases = Disesases, other_pests = "Other pests/parasites", varroa_mites = "Varroa mites", pesticides = Pesticides, other = Other, unknown = Unknown)

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

I want to do a couple of joins of my datasets:

1. Left join of temperature data to Oregon colony data (colony_or on the left). The temperature data has many more years’ worth of data (2000-2022) than the colony data does (2015-2021), but I’m only interested in the years with colony data, so I will use the colony_or dataframe as the left one and do a left join.

2. Left join of precipitation data to Oregon colony data (colony_temps on the left). Similarly, the precipitation data has many more years’ worth of data (2000-2022) than the colony data does (2015-2021), but I’m only interested in the years with colony data. I can add it to the colony_temps dataframe that I created with the first join, because that is the same colony data, only with the average temperature for each quarter tacked on.

#Joining temperature data to colony data (left join) by year, months
colony_temps <- left_join(colony_or, hills_temps_long, by = c("year" = "Year", "months" = "months"))

#Joining precipitation data to colony data (left join) by year, months
#we can just slap this onto the same colony_temps dataframe if it works right!

colony_temps <- left_join(colony_temps, hills_precip_long, by = c("year" = "Year", "months" = "months"))

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

colony_temps %>% select(c(year, months,colony_lost_pct, avg_temp, precipitation)) %>% head()

## # A tibble: 6 x 5
##    year months           colony_lost_pct avg_temp precipitation
##   <dbl> <chr>                      <dbl>    <dbl>         <dbl>
## 1  2015 January-March                  8     46.9         12.3 
## 2  2015 April-June                     6     58.8          2.39
## 3  2015 July-September                 9     66.8          1.73
## 4  2015 October-December               8     46.8         22.0 
## 5  2016 January-March                  3     45.0         16.8 
## 6  2016 April-June                     3     59.6          4.01

Are the values what you expected for the variables? Why or Why not?

Well, the first several times I tried to join this data, they were sure not what I expected, particularly when I tried to add on the precipitation data. I managed to get the temperature data joined to the colony data, but then when I tried to join the precipitation data, R complained that the x and y key values were different types. I realized that in colony_temps, Year was a numeric variable, while in hills_precip_long, Year was a character variable. Whoops! After I fixed that, I could join the dataframes.

I made sure to join on both Year and months so there were no duplicate keys! The resulting colony_temps dataframe (after both joins) has the colony data, plus the average temperatures per quarter, plus the total precipitation per quarter. It looks okay to me!

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#Can we see climate change (increase in avg temperature over time)?
#group temperature data by year and take the average to see if it's increasing over time
temp_time <- colony_temps %>% group_by(year) %>% summarize(mean_t = mean(avg_temp, na.rm = T))
temp_time_table <- temp_time %>% gt::gt()


#Is the percent of colonies lost each year changing?  How?
#add up %colonies lost per quarter to get total each year
colony_temps_annual <- colony_temps %>% group_by(year) %>% summarize(total_lost_pct= sum(colony_lost_pct, na.rm = T))

colony_losses_annual_table <- colony_temps_annual %>% gt::gt()
#I can't see a clear trend.  Note that only 2 quarters of 2021 have data so the total yearly colony losses look artificially low

#Let's compare mean losses in Oregon and in the US each year
#for this we'll take the mean percent loss over the quarters of each year

#oregon average losses per quarter
or_losses <- colony_temps %>% group_by(year) %>% summarize(or_avg_lost_pct = mean(colony_lost_pct, na.rm = T))

#Preparation of national average data per year to compare to Oregon data:
colony_states <- colony %>% group_by(year) %>% summarize(us_avg_lost_pct = mean(colony_lost_pct, na.rm = T))

quarterly_losses <- left_join(x = or_losses, y = colony_states, by = "year")

comparison_table <- quarterly_losses %>% gt::gt()

#I still can't see a clear trend, but OR seems to be losing a slightly lower-than-average proportion of bee colonies each year.

What are your findings about the summary? Are they what you expected?

Research question 1

My first research question was about the effects of climate change on bee colonies in Oregon, and I expected that the time frame was too short to see much of a change in average temperature. That turned out to be true; there’s no discernable increasing trend in yearly average temperature in Oregon over this time period.

year	mean_t
2015	54.80833
2016	54.01667
2017	52.61667
2018	53.53333
2019	52.18333
2020	53.42500
2021	50.36667

I still think looking at the number of days of extreme temperature per quarter might give a better picture of climate change.

I will look at temperature vs percent of colonies lost per quarter, and precipitation vs colonies lost per quarter, again in the next section. Even though the average yearly temperature isn’t clearly increasing over this time period, there are still some interesting things to see!

There is no clear trend in the proportion of bee colonies that Oregon loses each year (note that data for 2021 is incomplete, leading to artificially low total losses):

year	total_lost_pct
2015	31
2016	21
2017	38
2018	32
2019	27
2020	30
2021	8

Oregon seems to be losing a slightly lower-than-average proportion of bee colonies per quarter, compared to the national average losses per quarter for each year:

year	or_avg_lost_pct	us_avg_lost_pct
2015	7.75	12.154255
2016	5.25	11.363636
2017	9.50	11.470588
2018	8.00	12.096257
2019	9.00	11.864286
2020	7.50	9.962366
2021	4.00	10.354839

#examining stressors on bee colonies in OR vs the rest of the country

#filter out just the oregon data
stressor_wide_or <- stressor_wide %>% filter(state == "Oregon")
glimpse(stressor_wide_or)

## Rows: 26
## Columns: 9
## $ year         <dbl> 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 201~
## $ months       <chr> "January-March", "April-June", "July-September", "October~
## $ state        <chr> "Oregon", "Oregon", "Oregon", "Oregon", "Oregon", "Oregon~
## $ varroa_mites <dbl> 20.2, 39.3, 56.0, 16.4, 20.9, 52.7, 52.9, 25.5, 28.7, 32.~
## $ other_pests  <dbl> 2.0, 20.9, 20.8, 1.3, 15.0, 14.5, 18.5, 2.8, 8.5, 9.0, 10~
## $ diseases     <dbl> 0.4, 13.2, 5.2, 1.5, 5.3, 9.9, 12.5, 13.0, 10.1, 10.2, 11~
## $ pesticides   <dbl> 0.6, 16.0, 3.1, 1.5, NA, 2.6, 9.6, 2.8, 9.1, 3.6, 13.2, 1~
## $ other        <dbl> 1.6, 5.8, 4.8, 1.7, 2.4, 16.7, 5.1, 2.6, 15.4, 14.8, 6.4,~
## $ unknown      <dbl> 1.1, 0.4, 3.8, 1.5, 0.3, 6.0, 1.5, 1.3, 8.5, 9.6, 1.8, 1.~

#let's get averages across the year
stressor_wide_or_avg <- stressor_wide_or %>% group_by(year) %>% 
summarize(across(c(diseases, varroa_mites, pesticides, other, unknown, other_pests),
          .fns = list(mean = mean), na.rm = TRUE))

stressor_table_or <- stressor_wide_or_avg %>% gt::gt()

#national data: group by year, months and then summarize for all states including OR
#want to summarize across all the the columns that are types of stressors
avg_stressor <- stressor_wide %>% group_by(year, state) %>% 
  summarize(across(c(diseases, varroa_mites, pesticides, other, unknown, other_pests), 
       .fns = list(mean = mean), na.rm = TRUE))

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

#check to make sure it looks right
head(avg_stressor, 10)

## # A tibble: 10 x 8
## # Groups:   year [1]
##     year state       diseases_mean varroa_mites_mean pesticides_mean other_mean
##    <dbl> <chr>               <dbl>             <dbl>           <dbl>      <dbl>
##  1  2015 Alabama             0.2                23.2            1.82       4.32
##  2  2015 Arizona             1.78               34.7           13.6        9.52
##  3  2015 Arkansas            2.38               47.7            9.8        9.67
##  4  2015 California          7.38               38.2           15.4       11.6 
##  5  2015 Colorado            7.58               37.8            9.9        4.53
##  6  2015 Connecticut         0.933              15.6            1.13       8.35
##  7  2015 Florida             3.18               33.3           14.1       10.9 
##  8  2015 Georgia             3.78               37.5           14.2        7.22
##  9  2015 Hawaii              0.55               52              0.1        2.57
## 10  2015 Idaho               6.25               37.8            6.2        6.9 
## # ... with 2 more variables: unknown_mean <dbl>, other_pests_mean <dbl>

#let's get just one mean value for each stressor each year so we can compare to OR
stressor_table_us <- stressor_wide %>% group_by(year) %>% summarize(across(c(diseases, varroa_mites, pesticides, other, unknown, other_pests), 
       .fns = list(mean = mean), na.rm = TRUE) ) %>% gt::gt()

Research question 2

My second research question was about the stressors on bee colonies in Oregon and across the US. The summary of percent of colonies in Oregon stressed by each type of stressor each year shows that while there’s some fluctuation, the stressor affecting the most colonies is Varroa mites:

year	diseases_mean	varroa_mites_mean	pesticides_mean	other_mean	unknown_mean	other_pests_mean
2015	5.075000	32.97500	5.300	3.47500	1.700000	11.25
2016	10.175000	38.00000	5.000	6.70000	2.275000	12.70
2017	14.900000	41.60000	10.025	10.12500	5.425000	12.85
2018	8.075000	45.80000	18.300	7.70000	0.550000	2.90
2019	8.233333	29.73333	5.700	11.96667	5.633333	5.00
2020	6.450000	30.82500	4.700	4.52500	1.825000	4.80
2021	3.800000	15.05000	10.100	3.85000	1.200000	9.50

There’s no clear pattern in the percent of colonies in Oregon affected by Varroa mites or by pesticides.

This is fairly reflective of trends in the US as a whole, where the stressor affecting the most colonies each year is Varroa mites:

year	diseases_mean	varroa_mites_mean	pesticides_mean	other_mean	unknown_mean	other_pests_mean
2015	3.665644	27.17287	7.693373	6.368280	4.246629	11.25598
2016	4.237576	29.36667	7.263473	5.806989	4.467778	10.38956
2017	5.758182	33.00213	7.961310	6.626738	4.443258	11.48846
2018	4.279532	34.91390	8.774118	7.779144	4.776923	14.65956
2019	4.381452	31.73121	9.218966	6.480714	4.194074	12.68657
2020	3.718571	28.88245	6.285714	5.134973	4.630588	10.36747
2021	3.336170	28.17000	5.623333	5.490698	4.082667	11.19867

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#let's first look at the total losses in bee colonies per year and per quarter in Oregon

#make quarters into a factor and plot % colonies lost vs year, faceted by quarter
colony_temps$months <- factor(colony_temps$months, levels = c("January-March", "April-June", "July-September", "October-December"))
colony_temps$year <-factor(colony_temps$year)

levels(colony_temps$months) #check it

## [1] "January-March"    "April-June"       "July-September"   "October-December"

#shows that many years >= 30% of colonies are lost
#most losses occur in summer and fall
ggplot(colony_temps, aes(x = year, y = colony_lost_pct, fill= months)) + geom_col() +
  labs(title = "Percent of bee colonies lost each year in Oregon",
       x = "Year", 
       y = "Percent of colonies lost")

## Warning: Removed 1 rows containing missing values (position_stack).

#same thing, different visualization
#ggplot(colony_temps, aes(x = year, y = colony_lost_pct, fill= months)) + geom_col(position = "dodge")

You can see that many years, around 30% of bee colonies in Oregon are lost in total. You can also see that the majority of losses occur in July-September (summer) and October-December (fall). This makes some intuitive sense, if bees don’t like extreme temperatures… the summer is the hottest time of the year, and fall would be when the bees are adapting to colder temperatures (although I’m a little surprised that the big losses don’t occur in winter).

#Now let's look at the effect of temperature on losses.  Is there any correlation?  
#Number of data points is super small so don't read too much into findings!
ggplot(colony_temps, aes(x = avg_temp, y = colony_lost_pct, color = year)) + geom_point() + facet_wrap(vars(months),  scales = "free_x") +
  labs(title = "Bee colony losses vs average quarterly temperature in Oregon",
        x = "Average quarterly temperature (F)",
        y = "Percent of colonies lost")

## Warning: Removed 1 rows containing missing values (geom_point).

#in January - March and October-December, as temperature goes down, colony losses increase (one big outlier in each)
#in July- Sept, as temperature goes up, colony losses increase
#this makes sense!  bees can't handle extremes.

You can see that while the losses in spring, fall and winter are not really all that well correlated with average quarterly temperature (at least just from eyeballing the data – it looks like losses in fall and winter might increase with decreasing temperature, but there are some points that don’t fit this trend), the summertime data shows a clear trend: higher average quarterly temperatures correspond with higher losses.

There are only six data points in the summertime data so I hesitate to read too much into this trend, but it indicates that as summers become warmer, bee colonies might face heavier losses. It will be interesting to see if future data bears out this association or not!

Can we get a (probably not particularly accurate, because our sample is tiny!) estimate of how much colony losses increase with increasing temperature? Let’s treat this as a simple linear regression and find the slope coefficient.

#select only the summer data
summer <- colony_temps %>% filter(months == "July-September")

#fit a simple linear regression model
summer_model <- lm(colony_lost_pct ~ avg_temp, data = summer)
summary(summer_model)

## 
## Call:
## lm(formula = colony_lost_pct ~ avg_temp, data = summer)
## 
## Residuals:
##        1        2        3        4        5        6 
## -1.35711  0.14924  0.22741  1.26624 -0.24036 -0.04543 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -145.7921    30.0360  -4.854  0.00832 **
## avg_temp       2.3376     0.4521   5.171  0.00665 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9459 on 4 degrees of freedom
## Multiple R-squared:  0.8699, Adjusted R-squared:  0.8373 
## F-statistic: 26.74 on 1 and 4 DF,  p-value: 0.006649

#plot it
ggplot(summer, aes(x = avg_temp, y = colony_lost_pct)) + 
  geom_point() +
  geom_smooth(method = lm, color = "red", se = T) +
  labs(title = "Simple linear regression of bee colony loss on average temperature",
       subtitle = "In Oregon; summertime data only",
       x = "Average quarterly temperature", 
       y = "Percent of colonies lost")

## `geom_smooth()` using formula 'y ~ x'

We have very few data points and this model is very simplistic so it should be taken with a big grain of salt! But this model predicts an increase of about 2.3% of bee colonies will be lost in the summer with each increase of one degree in average quarterly temperature.

#Looking at OR colony losses vs total precipitation in a quarter

#a scatterplot of colony losses vs precipitation in a quarter doesn't show any clear trends, so I won't show it here
#ggplot(colony_temps, aes(x = precipitation, y = colony_lost_pct, color = months)) + geom_point() + facet_wrap(vars(months), scales = "free_x")

#let's look at the distribution of colony losses for categories of precipitation in a quarter

#check the distribution of quarterly precipitation
#ggplot(colony_temps, aes(x = precipitation)) + geom_histogram()
#it's pretty heavily right-skewed!

#first bin the precipitation. We could also do this with case_when if we wanted to.  Try binning the precipitation by 5 inch increments
#colony_temps <- colony_temps %>%
  # Add a new column called 'bin': cut the initial 'precipitation' in bins
  #mutate( bin=cut_width(precipitation, width=5.0, boundary=0) )
  
# plot
  #ggplot(colony_temps, aes(x=bin, y=colony_lost_pct) ) +
   # geom_boxplot(fill = "lightblue") +
   # labs(title = "Distribution of bee colony losses vs quarterly precipitation in Oregon",
         #x = "Quarterly precipitation, by 5 in increments",
       #  y = "Percent of colonies lost")
#this graph is misleading because there are not equal numbers of data points in each bin!
  
#now let's try defining only 4 bins, of unequal widths: <=2 inches, 2 to 5 inches, 5 to 12 inches, greater than 12 inches
#I chose these values based on the precipitation histogram that I commented out above.
  
colony_temps <- colony_temps %>%
  mutate(precip_cat = case_when(
    precipitation <=2 ~ "low",
    (precipitation > 2) & (precipitation <= 5) ~ "medium",
    (precipitation > 5) & (precipitation <= 12) ~ "high" ,
    precipitation > 12 ~ "very high")) %>%
  mutate(precip_cat = factor(precip_cat, 
                  levels = c('low', 'medium', 'high', 'very high')))

#let's make sure there aren't wildly different numbers in each bin
colony_temps %>% select(precip_cat) %>% 
  summarize(n_low = sum(precip_cat == "low"),
            n_med = sum(precip_cat == "medium"),
            n_high = sum(precip_cat == "high"),
            n_veryhigh = sum(precip_cat == "very high"))

## # A tibble: 1 x 4
##   n_low n_med n_high n_veryhigh
##   <int> <int>  <int>      <int>
## 1     5     6      8          7

#now plot this
  ggplot(colony_temps, aes(x=precip_cat, y=colony_lost_pct) ) +
    geom_boxplot(fill = "lightblue") +
    labs(title = "Distribution of bee colony losses vs quarterly precipitation in Oregon",
         x = "Quarterly precipitation, by category",
         y = "Percent of colonies lost")

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Contrary to what I expected, the relationship between low precipitation and high bee colony losses is not really all that clear. I first tried binning precipitation in 5 inch increments, but the distribution of quarterly precipitation was so right-skewed that this seemed misleading. Next, I created a categorical variable to divide the quarterly precipitation into “low”, “medium” etc so that each category had a roughly equal number of data points. Plotting the percent of colonies lost per category shows that low precipitation quarters have a higher median percent colony loss, and medium precipitation quarters have the lowest median percent colony loss, but the distributions are pretty wide so this isn’t really all that convincing. I think since it tends to rain more in the fall/winter and less in the summer, this relationship may be due more to temperature than to precipitation.

#Stressor data

#now let's pivot these again so we can facet by type of stressor
stressor_long_or_avg <- stressor_wide_or_avg %>%
  pivot_longer(cols = c(diseases_mean, varroa_mites_mean, pesticides_mean, other_mean, unknown_mean, other_pests_mean), names_to= "stressor", values_to = "mean_stress_pct")

stressor_long_avg <- avg_stressor %>% 
  pivot_longer(cols = c(diseases_mean, varroa_mites_mean, pesticides_mean, other_mean, unknown_mean, other_pests_mean), names_to= "stressor", values_to = "mean_stress_pct")
#want to make a factor out of `year` so I can make multiple boxplots 
stressor_long_avg <- stressor_long_avg %>% 
  mutate(year_fct = factor(year))
#check that it worked
glimpse(stressor_long_avg)

## Rows: 1,974
## Columns: 5
## Groups: year [7]
## $ year            <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ stressor        <chr> "diseases_mean", "varroa_mites_mean", "pesticides_mean~
## $ mean_stress_pct <dbl> 0.20000, 23.22500, 1.82500, 4.32500, 8.27500, 31.22500~
## $ year_fct        <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~

stressor_long_or_avg <- stressor_long_or_avg %>% mutate(year_fct = factor(year))

#rename facets so the titles are pretty with labeller
facets <-c("Diseases", "Other", "Other Pests", "Pesticides", "Unknown", "Varroa Mites")
names(facets)<-c("diseases_mean", "other_mean", "other_pests_mean", "pesticides_mean", "unknown_mean", "varroa_mites_mean")
  
#comparing bee colony stressors in Oregon versus the distribution in the US
ggplot(data = stressor_long_avg, aes(x = year_fct, y = mean_stress_pct)) +   
  geom_boxplot(data = stressor_long_avg, aes(x = year_fct, y = mean_stress_pct, fill = stressor), show.legend = F) +
  geom_point(data = stressor_long_or_avg, aes(x = year_fct, y = mean_stress_pct), color = "yellow", fill = "red", shape = 21) +
  facet_wrap(vars(stressor), scales = "free_y", ncol =2, labeller = labeller(stressor = facets)) +
  labs(title = "Percent of bee colonies affected by stressors",
       subtitle = "Oregon (red point) versus all US states",
       x = "Year",
       y = "Percent of colonies stressed")

## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

The stressors affecting bee colonies in Oregon and in the US as a whole seem to fluctuate quite a lot each year; there are also outliers in many years where one state was disproportionately affected by a particular stressor. Varroa mites seem to be a particular problem, affecting an average of 30-40% of bee colonies across the country. From 2015 - 2018 Oregon was on the high end of the distribution for percent of colonies affected by Varroa mites, but the percent of affected colonies in Oregon seems to have declined relative to the US as a whole after 2018. Oregon also seems to have a consistently higher than average percent of colonies affected by disease! Pesticide stress on bee colonies does not seem to be changing in any definable way, either in Oregon or in the country as a whole.

#Compare levels of stressors in Oregon over time
ggplot(stressor_long_or_avg, aes(x = year, y = mean_stress_pct, fill = stressor)) + 
  geom_col(position = "dodge") + 
  scale_fill_discrete(name = "Stressor", labels = c("Diseases", "Other", "Other Pests", "Pesticides", "Unknown", "Varroa Mites")) + 
  labs(title = "Bee colony stressors in Oregon over time",
       x = "Year", 
       y = "Percent of colonies stressed")

Here you can clearly see that the biggest stressor in Oregon (in terms of percent of colonies affected) is Varroa mites.

Final Summary (10 points)

Summarize your research question and findings below.

Research question 1

I did not observe an increase in average temperature over time in Oregon; however, I did observe a correlation between summertime temperature and percent of colonies lost during the summer.

I did not observe any convincing relationship between quarterly precipitation and percent of colonies lost in Oregon. The differences observed may have been due to temperature rather than to amount of precipitation.

Research question 2

The stressor that affects the highest percent of bee colonies in Oregon is Varroa mites. I did not observe any trends in stressor levels over time. These Oregon findings are mirrored by those for the USA as a whole; however, I found that a consistently higher percentage of Oregon bee colonies faces stress by disease, compared to the nation as a whole.

Are your findings what you expected? Why or Why not?

I didn’t expect to be able to see any relationship between temperature and bee colony losses in Oregon – I’m a little worried that the observed relationship is spurious.

I was surprised that Varroa mites are the biggest stressor of bee colonies in Oregon and around the country! I’m not sure what I was expecting – maybe pesticides. I also didn’t expect that Oregon bees would be stressed by disease to a greater extent than the country average.

Midterm

Shauna Rakshe

2022-02-12

Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Define Your Research Question (10 points)

Loading the Data (10 points)

Transforming the data (15 points)

Visualizing and Summarizing the Data (15 points)

Final Summary (10 points)