Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
#data
channel on Slack..csv
file into your data
folder.Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join()
(i.e. left_join()
)pivot_*()
(i.e. pivot_longer()
)The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I am interested in determining what the most harmful stressor and time of year is for bee colonies. More specifically, I want to answer the following two questions.
Which stressor results in the largest loss of bee colonies?
Is there a time of year that appears to be most detrimental to bee colonies?
Given your question, what is your expectation about the data?
I expect that pesticides cause the largest loss of bee colonies and anticipate that Q4 (October-December) results in the highest losses of bee colonies compared to other calendar year quarters.
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
```{#r data_import_original} #note: commenting our this code chunk after first run to avoid re-downloading data and potential error (Error in github_sha(“static”) : response code 403: API rate limit exceeded)
#original data download and import (see “readme.txt” in data folder for data source and data dictionary details) #colony data colony <- readr::read_csv(‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv’) #stressor data stressor <- readr::read_csv(‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv’)
#save data to permanent location in data subfolder of project folder #colony data write.csv(colony, file = here(“data/colony.csv”)) #stressor data write.csv(stressor, file = here(“data/stressor.csv”))
```r
#load data from project folder
colony <- read.csv(file = "data/colony.csv")
stressor <- read.csv(file = "data/stressor.csv")
#check import as expected
colony %>% glimpse()
## Rows: 1,222
## Columns: 11
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,~
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months <chr> "January-March", "January-March", "January-March", "Ja~
## $ state <chr> "Alabama", "Arizona", "Arkansas", "California", "Color~
## $ colony_n <dbl> 7000, 35000, 13000, 1440000, 3500, 3900, 305000, 10400~
## $ colony_max <dbl> 7000, 35000, 14000, 1690000, 12500, 3900, 315000, 1050~
## $ colony_lost <int> 1800, 4600, 1500, 255000, 1500, 870, 42000, 14500, 380~
## $ colony_lost_pct <int> 26, 13, 11, 15, 12, 22, 13, 14, 4, 4, 40, 22, 18, 23, ~
## $ colony_added <dbl> 2800, 3400, 1200, 250000, 200, 290, 54000, 47000, 3400~
## $ colony_reno <dbl> 250, 2100, 90, 124000, 140, NA, 25000, 9500, 760, 8000~
## $ colony_reno_pct <int> 4, 6, 1, 7, 1, NA, 8, 9, 7, 9, 4, 1, 2, 1, NA, 13, NA,~
stressor %>% glimpse()
## Rows: 7,332
## Columns: 6
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ~
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ months <chr> "January-March", "January-March", "January-March", "January~
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ stressor <chr> "Varroa mites", "Other pests/parasites", "Disesases", "Pest~
## $ stress_pct <dbl> 10.0, 5.4, NA, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, NA, 1.8, 3.1~
#looks good - taking a closer look in next code chunk
#linking data sets by year, months, state
#stressor has many more rows than colony - colony:stressor appears to be 1:many
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
#colony
#confirm data types are correct
summary(colony)
## X year months state
## Min. : 1.0 Min. :2015 Length:1222 Length:1222
## 1st Qu.: 306.2 1st Qu.:2016 Class :character Class :character
## Median : 611.5 Median :2018 Mode :character Mode :character
## Mean : 611.5 Mean :2018
## 3rd Qu.: 916.8 3rd Qu.:2019
## Max. :1222.0 Max. :2021
##
## colony_n colony_max colony_lost colony_lost_pct
## Min. : 1300 Min. : 1700 Min. : 20 Min. : 1.00
## 1st Qu.: 8000 1st Qu.: 9000 1st Qu.: 950 1st Qu.: 6.00
## Median : 17500 Median : 21000 Median : 2200 Median :10.00
## Mean : 123578 Mean : 79113 Mean : 16551 Mean :11.38
## 3rd Qu.: 55500 3rd Qu.: 68750 3rd Qu.: 6500 3rd Qu.:15.00
## Max. :3181180 Max. :1710000 Max. :502350 Max. :52.00
## NA's :47 NA's :72 NA's :47 NA's :54
## colony_added colony_reno colony_reno_pct
## Min. : 10 Min. : 10 Min. : 1.000
## 1st Qu.: 420 1st Qu.: 260 1st Qu.: 2.000
## Median : 1800 Median : 960 Median : 6.000
## Mean : 17243 Mean : 15279 Mean : 9.098
## 3rd Qu.: 6500 3rd Qu.: 4585 3rd Qu.:12.000
## Max. :736920 Max. :806170 Max. :77.000
## NA's :83 NA's :131 NA's :260
#notes:
#data types are as expected
#all numeric variables, except year, contain "NA" entries
#per data documentation, "For operations with five or more colonies, data was collected on a quarterly basis; operations with less than five colonies were collected with an annual survey."
#check character variables for unexpected values
table(colony$months)
##
## April-June January-March July-September October-December
## 329 329 282 282
#notes:
#3 month bands are consistent throughout the data - will convert to quarters for plots in later step
#more measurements in first two quarters than in last two quarters - investigating further...
freq_yearq <- table(colony$year, colony$months)
freq_yearq
##
## April-June January-March July-September October-December
## 2015 47 47 47 47
## 2016 47 47 47 47
## 2017 47 47 47 47
## 2018 47 47 47 47
## 2019 47 47 47 47
## 2020 47 47 47 47
## 2021 47 47 0 0
#no records for 2021 July-September and 2021 October-December - data collection ended 2021 June-April
#stressor
summary(stressor)
## X year months state
## Min. : 1 Min. :2015 Length:7332 Length:7332
## 1st Qu.:1834 1st Qu.:2016 Class :character Class :character
## Median :3666 Median :2018 Mode :character Mode :character
## Mean :3666 Mean :2018
## 3rd Qu.:5499 3rd Qu.:2019
## Max. :7332 Max. :2021
##
## stressor stress_pct
## Length:7332 Min. : 0.1
## Class :character 1st Qu.: 1.7
## Mode :character Median : 5.3
## Mean : 11.2
## 3rd Qu.: 14.2
## Max. :102.0
## NA's :843
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
The colony data set (referring to as left) contains information relating to the percentage of colonies lost, and the stressor data set (referring to as right) contains information to what types of stressors the bee colonies were exposed to. To understand how stressors affect bee colony growth/decline, these two data sets need to be merged. These data sets can be linked by the following variables: year, months, state. A left_join will be used for this analysis because we want to retain all records relating to bee colony populations; however, stressor data without a corresponding bee colony population is not useful for this analysis and should be removed.
#merge colony and stressor
bee <- left_join(colony, stressor,
by = c("year" = "year", "months" = "months", "state" = "state")) %>% glimpse() #check merge as expected
## Rows: 7,332
## Columns: 14
## $ X.x <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, ~
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months <chr> "January-March", "January-March", "January-March", "Ja~
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ colony_n <dbl> 7000, 7000, 7000, 7000, 7000, 7000, 35000, 35000, 3500~
## $ colony_max <dbl> 7000, 7000, 7000, 7000, 7000, 7000, 35000, 35000, 3500~
## $ colony_lost <int> 1800, 1800, 1800, 1800, 1800, 1800, 4600, 4600, 4600, ~
## $ colony_lost_pct <int> 26, 26, 26, 26, 26, 26, 13, 13, 13, 13, 13, 13, 11, 11~
## $ colony_added <dbl> 2800, 2800, 2800, 2800, 2800, 2800, 3400, 3400, 3400, ~
## $ colony_reno <dbl> 250, 250, 250, 250, 250, 250, 2100, 2100, 2100, 2100, ~
## $ colony_reno_pct <int> 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, ~
## $ X.y <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,~
## $ stressor <chr> "Varroa mites", "Other pests/parasites", "Disesases", ~
## $ stress_pct <dbl> 10.0, 5.4, NA, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, NA, 1.8~
#prepare data for summarization
bee_cleaned <- bee %>%
mutate(
#convert month bands to quarters
quarter = case_when(
months == "January-March" ~ "Q1",
months == "April-June" ~ "Q2",
months == "July-September" ~ "Q3",
months == "October-December" ~ "Q4"
),
yearq = str_c(year, "-", quarter),
key = str_c(year, "-", quarter, "-", stressor),
num_col_lost = stress_pct*colony_lost
) %>% filter(!is.na(colony_max)) %>% # filter(!is.na(colony_lost)) %>%
filter(!is.na(num_col_lost)) %>%
glimpse() #check transformations as expected
## Rows: 6,339
## Columns: 18
## $ X.x <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, ~
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months <chr> "January-March", "January-March", "January-March", "Ja~
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ colony_n <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_max <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_lost <int> 1800, 1800, 1800, 1800, 1800, 4600, 4600, 4600, 4600, ~
## $ colony_lost_pct <int> 26, 26, 26, 26, 26, 13, 13, 13, 13, 13, 11, 11, 11, 11~
## $ colony_added <dbl> 2800, 2800, 2800, 2800, 2800, 3400, 3400, 3400, 3400, ~
## $ colony_reno <dbl> 250, 250, 250, 250, 250, 2100, 2100, 2100, 2100, 2100,~
## $ colony_reno_pct <int> 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 7, 7, ~
## $ X.y <int> 1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18~
## $ stressor <chr> "Varroa mites", "Other pests/parasites", "Pesticides",~
## $ stress_pct <dbl> 10.0, 5.4, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, 1.8, 3.1, 1~
## $ quarter <chr> "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", ~
## $ yearq <chr> "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1",~
## $ key <chr> "2015-Q1-Varroa mites", "2015-Q1-Other pests/parasites~
## $ num_col_lost <dbl> 18000, 9720, 3960, 16380, 16920, 123740, 94300, 460, 8~
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
#this table shows the merged colony and stressor table with quarter groupings
bee_cleaned %>%
glimpse() #check transformations as expected
## Rows: 6,339
## Columns: 18
## $ X.x <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, ~
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months <chr> "January-March", "January-March", "January-March", "Ja~
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ colony_n <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_max <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_lost <int> 1800, 1800, 1800, 1800, 1800, 4600, 4600, 4600, 4600, ~
## $ colony_lost_pct <int> 26, 26, 26, 26, 26, 13, 13, 13, 13, 13, 11, 11, 11, 11~
## $ colony_added <dbl> 2800, 2800, 2800, 2800, 2800, 3400, 3400, 3400, 3400, ~
## $ colony_reno <dbl> 250, 250, 250, 250, 250, 2100, 2100, 2100, 2100, 2100,~
## $ colony_reno_pct <int> 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 7, 7, ~
## $ X.y <int> 1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18~
## $ stressor <chr> "Varroa mites", "Other pests/parasites", "Pesticides",~
## $ stress_pct <dbl> 10.0, 5.4, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, 1.8, 3.1, 1~
## $ quarter <chr> "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", ~
## $ yearq <chr> "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1",~
## $ key <chr> "2015-Q1-Varroa mites", "2015-Q1-Other pests/parasites~
## $ num_col_lost <dbl> 18000, 9720, 3960, 16380, 16920, 123740, 94300, 460, 8~
Are the values what you expected for the variables? Why or Why not?
These data sets are clean. I did not find any unexpected values. However, I did come across NA values, which I removed to complete summary calculations.
Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
#summarize dat for question 1: Which stressors result in the largest loss of bee colonies?
ans1 <- bee_cleaned %>%
group_by(year, stressor) %>% #opting to use year in favor of yearq, after reviewing both versions of visualizations in later step, as yearq crowds the plot and does not add more insight
summarise(total_sum_lost = sum(num_col_lost)/100000) %>% #dividing by 100,000 to show bee colony loss in 100,000s for ease of viewing
glimpse()
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## Rows: 42
## Columns: 3
## Groups: year [7]
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2~
## $ stressor <chr> "Disesases", "Other", "Other pests/parasites", "Pestici~
## $ total_sum_lost <dbl> 76.45518, 130.30903, 209.51183, 181.99093, 76.06085, 53~
#remove 2021 for quarterly total calculation
bee_cleaned2<-subset(bee_cleaned, year!="2021")
ans2 <- bee_cleaned2 %>%
group_by(year, quarter) %>%
summarise(total_sum_lost = sum(num_col_lost)/100000) %>% #dividing by 100,000 to show bee colony loss in 100,000s for ease of viewing
glimpse()
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## Rows: 23
## Columns: 3
## Groups: year [6]
## $ year <int> 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2~
## $ quarter <chr> "Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4", "Q1", "~
## $ total_sum_lost <dbl> 237.4069, 291.5905, 405.3462, 270.1899, 282.7862, 264.4~
What are your findings about the summary? Are they what you expected?
Based on the summary, pesticides are not the largest contributor to bee colony loss over the years 2015-2021. Pesticides were the third leading cause of bee colony loss, behind varroa mites and other pests/parasites. In fact, varroa mites are the cause of the loss of approximately 400,000,000 more bee colonies than any other stressor.
Q4 (October to December) did not result in the highest loss of bee colony population; however, it was a close second to Q4 (July-September).
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
The figure below plots bee colony loss from the year 2015 to 2021 by stressor. Note that there appears to be a significant drop off in bee population loss in 2021. This is not due to improved bee conservation efforts but, rather, is due to incomplete data for the year 2021.
#to answer q1
ggplot(data = ans1) +
aes(x = year,
y = total_sum_lost,
color = stressor) +
geom_line() +
labs(title = "Bee Colony Loss vs Year by Stressor",
x = "Year",
y = "Bee Colony Loss (in 100,000s)")
The figure below is a boxplot of the total number of be colonies lost (in 100,000s) by quarter from the year 2015-2020. Note that the 2021 data have been removed, as they are missing Q3 and Q4.
#to answer q2
ggplot(data = ans2) +
aes(x = quarter,
y = total_sum_lost,
color = year) +
geom_boxplot() +
labs(title = "Aggregate Bee Colony Loss from 2015-2021 by Quarter",
x = "Quarter (Calendar Year)",
y = "Bee Colony Loss (in 100,000s)")
Summarize your research question and findings below. Are your findings what you expected? Why or Why not?
I was interested in determining which stressor and what time of year result in the highest loss of bee colony populations. I suspected that pesticides were the most harmful stressor and Q4 (October-December) were most detrimental to bee colony populations. However, through my analysis, I found that varro mites were the most harmful stressor, and July through September were the most dangerous months for bees.