Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
See other data sources in the #data channel on Slack.
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
You may use another dataset or your own data, but please make sure it is de-identified.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (i.e. left_join())
pivot_*() (i.e. pivot_longer())

The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I am interested in determining what the most harmful stressor and time of year is for bee colonies. More specifically, I want to answer the following two questions.

Which stressor results in the largest loss of bee colonies?
Is there a time of year that appears to be most detrimental to bee colonies?

Given your question, what is your expectation about the data?

I expect that pesticides cause the largest loss of bee colonies and anticipate that Q4 (October-December) results in the highest losses of bee colonies compared to other calendar year quarters.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

```{#r data_import_original} #note: commenting our this code chunk after first run to avoid re-downloading data and potential error (Error in github_sha(“static”) : response code 403: API rate limit exceeded)

#original data download and import (see “readme.txt” in data folder for data source and data dictionary details) #colony data colony <- readr::read_csv(‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv’) #stressor data stressor <- readr::read_csv(‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv’)

#save data to permanent location in data subfolder of project folder #colony data write.csv(colony, file = here(“data/colony.csv”)) #stressor data write.csv(stressor, file = here(“data/stressor.csv”))



```r
#load data from project folder
colony <- read.csv(file = "data/colony.csv")
stressor <- read.csv(file = "data/stressor.csv")

#check import as expected
colony %>% glimpse()

## Rows: 1,222
## Columns: 11
## $ X               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,~
## $ year            <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months          <chr> "January-March", "January-March", "January-March", "Ja~
## $ state           <chr> "Alabama", "Arizona", "Arkansas", "California", "Color~
## $ colony_n        <dbl> 7000, 35000, 13000, 1440000, 3500, 3900, 305000, 10400~
## $ colony_max      <dbl> 7000, 35000, 14000, 1690000, 12500, 3900, 315000, 1050~
## $ colony_lost     <int> 1800, 4600, 1500, 255000, 1500, 870, 42000, 14500, 380~
## $ colony_lost_pct <int> 26, 13, 11, 15, 12, 22, 13, 14, 4, 4, 40, 22, 18, 23, ~
## $ colony_added    <dbl> 2800, 3400, 1200, 250000, 200, 290, 54000, 47000, 3400~
## $ colony_reno     <dbl> 250, 2100, 90, 124000, 140, NA, 25000, 9500, 760, 8000~
## $ colony_reno_pct <int> 4, 6, 1, 7, 1, NA, 8, 9, 7, 9, 4, 1, 2, 1, NA, 13, NA,~

stressor %>% glimpse()

## Rows: 7,332
## Columns: 6
## $ X          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ~
## $ year       <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ months     <chr> "January-March", "January-March", "January-March", "January~
## $ state      <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ stressor   <chr> "Varroa mites", "Other pests/parasites", "Disesases", "Pest~
## $ stress_pct <dbl> 10.0, 5.4, NA, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, NA, 1.8, 3.1~

  #looks good - taking a closer look in next code chunk
  #linking data sets by year, months, state
      #stressor has many more rows than colony - colony:stressor appears to be 1:many

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

#colony
  #confirm data types are correct
  summary(colony)

##        X               year         months             state          
##  Min.   :   1.0   Min.   :2015   Length:1222        Length:1222       
##  1st Qu.: 306.2   1st Qu.:2016   Class :character   Class :character  
##  Median : 611.5   Median :2018   Mode  :character   Mode  :character  
##  Mean   : 611.5   Mean   :2018                                        
##  3rd Qu.: 916.8   3rd Qu.:2019                                        
##  Max.   :1222.0   Max.   :2021                                        
##                                                                       
##     colony_n         colony_max       colony_lost     colony_lost_pct
##  Min.   :   1300   Min.   :   1700   Min.   :    20   Min.   : 1.00  
##  1st Qu.:   8000   1st Qu.:   9000   1st Qu.:   950   1st Qu.: 6.00  
##  Median :  17500   Median :  21000   Median :  2200   Median :10.00  
##  Mean   : 123578   Mean   :  79113   Mean   : 16551   Mean   :11.38  
##  3rd Qu.:  55500   3rd Qu.:  68750   3rd Qu.:  6500   3rd Qu.:15.00  
##  Max.   :3181180   Max.   :1710000   Max.   :502350   Max.   :52.00  
##  NA's   :47        NA's   :72        NA's   :47       NA's   :54     
##   colony_added     colony_reno     colony_reno_pct 
##  Min.   :    10   Min.   :    10   Min.   : 1.000  
##  1st Qu.:   420   1st Qu.:   260   1st Qu.: 2.000  
##  Median :  1800   Median :   960   Median : 6.000  
##  Mean   : 17243   Mean   : 15279   Mean   : 9.098  
##  3rd Qu.:  6500   3rd Qu.:  4585   3rd Qu.:12.000  
##  Max.   :736920   Max.   :806170   Max.   :77.000  
##  NA's   :83       NA's   :131      NA's   :260

    #notes: 
      #data types are as expected
      #all numeric variables, except year, contain "NA" entries
        #per data documentation, "For operations with five or more colonies, data was collected on a quarterly basis; operations with less than five colonies were collected with an annual survey."
  #check character variables for unexpected values
  table(colony$months)

## 
##       April-June    January-March   July-September October-December 
##              329              329              282              282

    #notes:
      #3 month bands are consistent throughout the data - will convert to quarters for plots in later step
      #more measurements in first two quarters than in last two quarters - investigating further...
  freq_yearq <- table(colony$year, colony$months) 
  freq_yearq

##       
##        April-June January-March July-September October-December
##   2015         47            47             47               47
##   2016         47            47             47               47
##   2017         47            47             47               47
##   2018         47            47             47               47
##   2019         47            47             47               47
##   2020         47            47             47               47
##   2021         47            47              0                0

    #no records for  2021 July-September and 2021 October-December - data collection ended 2021 June-April
  
#stressor
  summary(stressor)

##        X             year         months             state          
##  Min.   :   1   Min.   :2015   Length:7332        Length:7332       
##  1st Qu.:1834   1st Qu.:2016   Class :character   Class :character  
##  Median :3666   Median :2018   Mode  :character   Mode  :character  
##  Mean   :3666   Mean   :2018                                        
##  3rd Qu.:5499   3rd Qu.:2019                                        
##  Max.   :7332   Max.   :2021                                        
##                                                                     
##    stressor           stress_pct   
##  Length:7332        Min.   :  0.1  
##  Class :character   1st Qu.:  1.7  
##  Mode  :character   Median :  5.3  
##                     Mean   : 11.2  
##                     3rd Qu.: 14.2  
##                     Max.   :102.0  
##                     NA's   :843

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

The colony data set (referring to as left) contains information relating to the percentage of colonies lost, and the stressor data set (referring to as right) contains information to what types of stressors the bee colonies were exposed to. To understand how stressors affect bee colony growth/decline, these two data sets need to be merged. These data sets can be linked by the following variables: year, months, state. A left_join will be used for this analysis because we want to retain all records relating to bee colony populations; however, stressor data without a corresponding bee colony population is not useful for this analysis and should be removed.

#merge colony and stressor
bee <- left_join(colony, stressor,
    by = c("year" = "year", "months" = "months", "state" = "state")) %>% glimpse() #check merge as expected

## Rows: 7,332
## Columns: 14
## $ X.x             <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, ~
## $ year            <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months          <chr> "January-March", "January-March", "January-March", "Ja~
## $ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ colony_n        <dbl> 7000, 7000, 7000, 7000, 7000, 7000, 35000, 35000, 3500~
## $ colony_max      <dbl> 7000, 7000, 7000, 7000, 7000, 7000, 35000, 35000, 3500~
## $ colony_lost     <int> 1800, 1800, 1800, 1800, 1800, 1800, 4600, 4600, 4600, ~
## $ colony_lost_pct <int> 26, 26, 26, 26, 26, 26, 13, 13, 13, 13, 13, 13, 11, 11~
## $ colony_added    <dbl> 2800, 2800, 2800, 2800, 2800, 2800, 3400, 3400, 3400, ~
## $ colony_reno     <dbl> 250, 250, 250, 250, 250, 250, 2100, 2100, 2100, 2100, ~
## $ colony_reno_pct <int> 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, ~
## $ X.y             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,~
## $ stressor        <chr> "Varroa mites", "Other pests/parasites", "Disesases", ~
## $ stress_pct      <dbl> 10.0, 5.4, NA, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, NA, 1.8~

#prepare data for summarization
 bee_cleaned <- bee %>%
  mutate(
    #convert month bands to quarters
    quarter = case_when(
        months == "January-March" ~ "Q1",
        months == "April-June" ~ "Q2",
        months == "July-September" ~ "Q3",
        months == "October-December" ~ "Q4"
    ),
    yearq = str_c(year, "-", quarter),
    key = str_c(year, "-", quarter, "-", stressor),
    num_col_lost = stress_pct*colony_lost
  ) %>% filter(!is.na(colony_max)) %>% #    filter(!is.na(colony_lost)) %>%
   filter(!is.na(num_col_lost)) %>%
  glimpse() #check transformations as expected

## Rows: 6,339
## Columns: 18
## $ X.x             <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, ~
## $ year            <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months          <chr> "January-March", "January-March", "January-March", "Ja~
## $ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ colony_n        <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_max      <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_lost     <int> 1800, 1800, 1800, 1800, 1800, 4600, 4600, 4600, 4600, ~
## $ colony_lost_pct <int> 26, 26, 26, 26, 26, 13, 13, 13, 13, 13, 11, 11, 11, 11~
## $ colony_added    <dbl> 2800, 2800, 2800, 2800, 2800, 3400, 3400, 3400, 3400, ~
## $ colony_reno     <dbl> 250, 250, 250, 250, 250, 2100, 2100, 2100, 2100, 2100,~
## $ colony_reno_pct <int> 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 7, 7, ~
## $ X.y             <int> 1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18~
## $ stressor        <chr> "Varroa mites", "Other pests/parasites", "Pesticides",~
## $ stress_pct      <dbl> 10.0, 5.4, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, 1.8, 3.1, 1~
## $ quarter         <chr> "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", ~
## $ yearq           <chr> "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1",~
## $ key             <chr> "2015-Q1-Varroa mites", "2015-Q1-Other pests/parasites~
## $ num_col_lost    <dbl> 18000, 9720, 3960, 16380, 16920, 123740, 94300, 460, 8~

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

#this table shows the merged colony and stressor table with quarter groupings
bee_cleaned %>% 
  glimpse() #check transformations as expected

## Rows: 6,339
## Columns: 18
## $ X.x             <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, ~
## $ year            <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months          <chr> "January-March", "January-March", "January-March", "Ja~
## $ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ colony_n        <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_max      <dbl> 7000, 7000, 7000, 7000, 7000, 35000, 35000, 35000, 350~
## $ colony_lost     <int> 1800, 1800, 1800, 1800, 1800, 4600, 4600, 4600, 4600, ~
## $ colony_lost_pct <int> 26, 26, 26, 26, 26, 13, 13, 13, 13, 13, 11, 11, 11, 11~
## $ colony_added    <dbl> 2800, 2800, 2800, 2800, 2800, 3400, 3400, 3400, 3400, ~
## $ colony_reno     <dbl> 250, 250, 250, 250, 250, 2100, 2100, 2100, 2100, 2100,~
## $ colony_reno_pct <int> 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 7, 7, ~
## $ X.y             <int> 1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18~
## $ stressor        <chr> "Varroa mites", "Other pests/parasites", "Pesticides",~
## $ stress_pct      <dbl> 10.0, 5.4, 2.2, 9.1, 9.4, 26.9, 20.5, 0.1, 1.8, 3.1, 1~
## $ quarter         <chr> "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", ~
## $ yearq           <chr> "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1", "2015-Q1",~
## $ key             <chr> "2015-Q1-Varroa mites", "2015-Q1-Other pests/parasites~
## $ num_col_lost    <dbl> 18000, 9720, 3960, 16380, 16920, 123740, 94300, 460, 8~

Are the values what you expected for the variables? Why or Why not?

These data sets are clean. I did not find any unexpected values. However, I did come across NA values, which I removed to complete summary calculations.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#summarize dat for question 1: Which stressors result in the largest loss of bee colonies?

ans1 <- bee_cleaned %>%
   group_by(year, stressor) %>% #opting to use year in favor of yearq, after reviewing both versions of visualizations in later step, as yearq crowds the plot and does not add more insight
   summarise(total_sum_lost = sum(num_col_lost)/100000) %>% #dividing by 100,000 to show bee colony loss in 100,000s for ease of viewing
   glimpse()

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

## Rows: 42
## Columns: 3
## Groups: year [7]
## $ year           <int> 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2~
## $ stressor       <chr> "Disesases", "Other", "Other pests/parasites", "Pestici~
## $ total_sum_lost <dbl> 76.45518, 130.30903, 209.51183, 181.99093, 76.06085, 53~

#remove 2021 for quarterly total calculation
bee_cleaned2<-subset(bee_cleaned, year!="2021")

ans2 <- bee_cleaned2 %>%
   group_by(year, quarter) %>% 
   summarise(total_sum_lost = sum(num_col_lost)/100000) %>% #dividing by 100,000 to show bee colony loss in 100,000s for ease of viewing
   glimpse()

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

## Rows: 23
## Columns: 3
## Groups: year [6]
## $ year           <int> 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2~
## $ quarter        <chr> "Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4", "Q1", "~
## $ total_sum_lost <dbl> 237.4069, 291.5905, 405.3462, 270.1899, 282.7862, 264.4~

What are your findings about the summary? Are they what you expected?

Based on the summary, pesticides are not the largest contributor to bee colony loss over the years 2015-2021. Pesticides were the third leading cause of bee colony loss, behind varroa mites and other pests/parasites. In fact, varroa mites are the cause of the loss of approximately 400,000,000 more bee colonies than any other stressor.

Q4 (October to December) did not result in the highest loss of bee colony population; however, it was a close second to Q4 (July-September).

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

The figure below plots bee colony loss from the year 2015 to 2021 by stressor. Note that there appears to be a significant drop off in bee population loss in 2021. This is not due to improved bee conservation efforts but, rather, is due to incomplete data for the year 2021.

#to answer q1
ggplot(data = ans1) +

  aes(x = year,
      y = total_sum_lost,  
      color = stressor) +
  
  geom_line() +
  
  labs(title = "Bee Colony Loss vs Year by Stressor",
       x = "Year",
       y = "Bee Colony Loss (in 100,000s)")

The figure below is a boxplot of the total number of be colonies lost (in 100,000s) by quarter from the year 2015-2020. Note that the 2021 data have been removed, as they are missing Q3 and Q4.

#to answer q2
ggplot(data = ans2) +

  aes(x = quarter,
      y = total_sum_lost,  
      color = year) +
  
  geom_boxplot() +
  
  labs(title = "Aggregate Bee Colony Loss from 2015-2021 by Quarter",
       x = "Quarter (Calendar Year)",
       y = "Bee Colony Loss (in 100,000s)")

Final Summary (10 points)

Summarize your research question and findings below. Are your findings what you expected? Why or Why not?

I was interested in determining which stressor and what time of year result in the highest loss of bee colony populations. I suspected that pesticides were the most harmful stressor and Q4 (October-December) were most detrimental to bee colony populations. However, through my analysis, I found that varro mites were the most harmful stressor, and July through September were the most dangerous months for bees.

Midterm

Kathryn Liu

2022-02-13