Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
  • See other data sources in the #data channel on Slack.
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
  • You may use another dataset or your own data, but please make sure it is de-identified.
  1. Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.

  2. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (i.e. left_join())
  • pivot_*() (i.e. pivot_longer())
  1. The code chunks below are guides, please add more code chunks to do what you need.

  2. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Since my family has kept bees in the past, I am very interested in understanding how colony collapse disorder and other disorders have affected the number of colonies within the apiaries throughout Oregon and the greater Northwest region (Oregon, Washington, and Idaho) during the years reported within the dataset. The data from the “colony” dataset will be used towards (hopefully) delineating any changes that have occurred to the apiaries in Oregon and throughout our region. My research question is–have the number of colonies decreased through the years?

Given your question, what is your expectation about the data?

Since the data are varied (some locations are listed as “United States”), I am concerned about the number of reported data for the state of Oregon. However, in case that there are not enough data, I will add the states of Washington and Idaho to the analysis as a “Northwest Cohort” to analyze the amount of colony losses within these three states.

Nevertheless, my expectation about the data is that there will be increases in the number of colonies in each state throughout the reported years.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')
## Rows: 1222 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): months, state
## dbl (8): year, colony_n, colony_max, colony_lost, colony_lost_pct, colony_ad...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
stressor <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv')
## Rows: 7332 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): months, state, stressor
## dbl (2): year, stress_pct
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(colony)
## Rows: 1,222
## Columns: 10
## $ year            <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months          <chr> "January-March", "January-March", "January-March", "Ja~
## $ state           <chr> "Alabama", "Arizona", "Arkansas", "California", "Color~
## $ colony_n        <dbl> 7000, 35000, 13000, 1440000, 3500, 3900, 305000, 10400~
## $ colony_max      <dbl> 7000, 35000, 14000, 1690000, 12500, 3900, 315000, 1050~
## $ colony_lost     <dbl> 1800, 4600, 1500, 255000, 1500, 870, 42000, 14500, 380~
## $ colony_lost_pct <dbl> 26, 13, 11, 15, 12, 22, 13, 14, 4, 4, 40, 22, 18, 23, ~
## $ colony_added    <dbl> 2800, 3400, 1200, 250000, 200, 290, 54000, 47000, 3400~
## $ colony_reno     <dbl> 250, 2100, 90, 124000, 140, NA, 25000, 9500, 760, 8000~
## $ colony_reno_pct <dbl> 4, 6, 1, 7, 1, NA, 8, 9, 7, 9, 4, 1, 2, 1, NA, 13, NA,~

Note: The stressor data may not be needed since there are data that may not contribute to answering the specific research question about colony losses.

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Character variables: Months and state

Integer variables: Year, number of colonies (colony_n), maximum colonies (colony_max), number of colonies that are lost (colony_lost), percentage of the number of colonies lost (colony_lost_pct), number of colonies added (colony_added), number of colonies that have been renovated (colony_reno), percentage of the number of colonies that have been renovated (colony_reno_pct)

Note: Variables that contain NA values: Colony_reno, colony_reno_pct, colony_max, colony_added AT LEAST. There is an entry from 2019 that is comletely “NA”, which means that downstream work will have to compensate for this. Filtering the data will be necessary in order to select for Idaho, Washington, and Oregon into the Northwest Honeybees cohort.

Colony dataset: Number of columns: 10 Number of rows: 1222

skim(colony)
Data summary
Name colony
Number of rows 1222
Number of columns 10
_______________________
Column type frequency:
character 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
months 0 1 10 16 0 4 0
state 0 1 4 14 0 47 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2017.77 1.89 2015 2016 2018 2019 2021 ▇▃▃▃▆
colony_n 47 0.96 123578.04 437835.18 1300 8000 17500 55500 3181180 ▇▁▁▁▁
colony_max 72 0.94 79112.77 190823.42 1700 9000 21000 68750 1710000 ▇▁▁▁▁
colony_lost 47 0.96 16551.32 60544.42 20 950 2200 6500 502350 ▇▁▁▁▁
colony_lost_pct 54 0.96 11.38 7.23 1 6 10 15 52 ▇▅▁▁▁
colony_added 83 0.93 17243.20 68167.65 10 420 1800 6500 736920 ▇▁▁▁▁
colony_reno 131 0.89 15278.86 62588.04 10 260 960 4585 806170 ▇▁▁▁▁
colony_reno_pct 260 0.79 9.10 9.66 1 2 6 12 77 ▇▁▁▁▁

Make sure your data types are correct!

Character valies:2 Numerical values:8 #REMEMBER CATEGORICAL/CONTINUOUS DATA#

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

#Notes: #Use Case_when or Across? #oregon, wa , id –color for NW to create pacific northwest #use filter

#Note: Filter is used to select for the states of Oregon, Washington, and Idaho to create a Northwest cohort since there are not enough data values for a single state.

Northwest_Honeybees <- colony %>% filter(state %in% c("Oregon","Washington", "Idaho")) 


Northwest_Honeybees
## # A tibble: 78 x 10
##     year months           state  colony_n colony_max colony_lost colony_lost_pct
##    <dbl> <chr>            <chr>     <dbl>      <dbl>       <dbl>           <dbl>
##  1  2015 January-March    Idaho     81000      88000        3700               4
##  2  2015 January-March    Oregon    77000      87000        6500               8
##  3  2015 January-March    Washi~    52000     105000       14000              13
##  4  2015 April-June       Idaho     62000      72000        6500               9
##  5  2015 April-June       Oregon    82000      95000        5500               6
##  6  2015 April-June       Washi~   105000     127000        5000               4
##  7  2015 July-September   Idaho     80000     128000       14000              11
##  8  2015 July-September   Oregon    68000     100000        8500               9
##  9  2015 July-September   Washi~    84000      97000       11500              12
## 10  2015 October-December Idaho    121000     145000       22000              15
## # ... with 68 more rows, and 3 more variables: colony_added <dbl>,
## #   colony_reno <dbl>, colony_reno_pct <dbl>

#Note: To confirm the data above

summary(Northwest_Honeybees)
##       year         months             state              colony_n     
##  Min.   :2015   Length:78          Length:78          Min.   : 41000  
##  1st Qu.:2016   Class :character   Class :character   1st Qu.: 69000  
##  Median :2018   Mode  :character   Mode  :character   Median : 84000  
##  Mean   :2018                                         Mean   : 85693  
##  3rd Qu.:2019                                         3rd Qu.: 96000  
##  Max.   :2021                                         Max.   :168000  
##                                                       NA's   :3       
##    colony_max      colony_lost    colony_lost_pct   colony_added  
##  Min.   : 67000   Min.   : 1500   Min.   : 2.000   Min.   :   70  
##  1st Qu.: 89500   1st Qu.: 5750   1st Qu.: 5.000   1st Qu.: 4000  
##  Median :103000   Median : 9000   Median : 9.000   Median : 7000  
##  Mean   :109747   Mean   : 9745   Mean   : 8.907   Mean   : 9427  
##  3rd Qu.:124000   3rd Qu.:13250   3rd Qu.:11.500   3rd Qu.:13250  
##  Max.   :196000   Max.   :22000   Max.   :22.000   Max.   :40000  
##  NA's   :3        NA's   :3       NA's   :3        NA's   :3      
##   colony_reno    colony_reno_pct
##  Min.   :  100   Min.   : 1.00  
##  1st Qu.: 1725   1st Qu.: 2.00  
##  Median : 5500   Median : 7.50  
##  Mean   : 9614   Mean   :10.25  
##  3rd Qu.:16375   3rd Qu.:16.00  
##  Max.   :49000   Max.   :42.00  
##  NA's   :4       NA's   :10

#Note: To find the total number of hives from the number that are lost.

colony_total <- Northwest_Honeybees %>% 
    mutate(colony_total = colony_n - colony_lost)

#Note: To clean up the data and remove NA values.

colony_total <- colony_total %>%
  mutate(
    across(where(is.numeric), replace_na, replace = 0))

glimpse(colony_total)
## Rows: 78
## Columns: 11
## $ year            <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months          <chr> "January-March", "January-March", "January-March", "Ap~
## $ state           <chr> "Idaho", "Oregon", "Washington", "Idaho", "Oregon", "W~
## $ colony_n        <dbl> 81000, 77000, 52000, 62000, 82000, 105000, 80000, 6800~
## $ colony_max      <dbl> 88000, 87000, 105000, 72000, 95000, 127000, 128000, 10~
## $ colony_lost     <dbl> 3700, 6500, 14000, 6500, 5500, 5000, 14000, 8500, 1150~
## $ colony_lost_pct <dbl> 4, 8, 13, 9, 6, 4, 11, 9, 12, 15, 8, 7, 7, 3, 10, 8, 3~
## $ colony_added    <dbl> 2600, 4300, 13500, 16500, 14500, 15000, 5500, 8000, 18~
## $ colony_reno     <dbl> 8000, 2400, 9000, 10500, 9500, 13000, 11500, 21000, 20~
## $ colony_reno_pct <dbl> 9, 3, 9, 15, 10, 10, 9, 21, 21, 4, 1, 0, 7, 2, 2, 7, 1~
## $ colony_total    <dbl> 77300, 70500, 38000, 55500, 76500, 100000, 66000, 5950~

#Note: To find the average number of colonies from each state (Idaho, Oregon, & Washington).

Colony_average <- colony_total %>% group_by(state) %>% summarise(across(where(is.numeric),mean))

Colony_average #%>% filter(state=="Oregon")
## # A tibble: 3 x 10
##   state       year colony_n colony_max colony_lost colony_lost_pct colony_added
##   <chr>      <dbl>    <dbl>      <dbl>       <dbl>           <dbl>        <dbl>
## 1 Idaho      2018.   92962.    120038.      12231.            9.85       10646.
## 2 Oregon     2018.   84692.    100692.       7492.            7.19        8558.
## 3 Washington 2018.   69538.     95846.       8388.            8.65        7989.
## # ... with 3 more variables: colony_reno <dbl>, colony_reno_pct <dbl>,
## #   colony_total <dbl>

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

#Note: To find the average (mean) of the number of colonies in each state for each year. The number of entries for each year more/less corresponds to the reported month range (i.e. January-March) in the tables above.

colony_state_year <- colony_total %>%
  group_by(state, year) %>%
  summarize(mean_colony_total = mean(colony_total, na.rm = TRUE))
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
colony_state_year
## # A tibble: 21 x 3
## # Groups:   state [3]
##    state   year mean_colony_total
##    <chr>  <dbl>             <dbl>
##  1 Idaho   2015             74450
##  2 Idaho   2016             77250
##  3 Idaho   2017             79750
##  4 Idaho   2018             87300
##  5 Idaho   2019             74875
##  6 Idaho   2020             76250
##  7 Idaho   2021            109750
##  8 Oregon  2015             74500
##  9 Oregon  2016             81975
## 10 Oregon  2017             76625
## # ... with 11 more rows

#Note: Skim works the best since the other tools displayed the information from this table in a much messier way.

skim(colony_state_year)
Data summary
Name colony_state_year
Number of rows 21
Number of columns 3
_______________________
Column type frequency:
numeric 2
________________________
Group variables state

Variable type: numeric

skim_variable state n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year Idaho 0 1 2018.00 2.16 2015 2016.5 2018 2019.5 2021 ▇▃▃▃▇
year Oregon 0 1 2018.00 2.16 2015 2016.5 2018 2019.5 2021 ▇▃▃▃▇
year Washington 0 1 2018.00 2.16 2015 2016.5 2018 2019.5 2021 ▇▃▃▃▇
mean_colony_total Idaho 0 1 82803.57 12662.86 74450 75562.5 77250 83525.0 109750 ▇▂▁▁▂
mean_colony_total Oregon 0 1 77028.57 9102.40 64125 73937.5 74800 79300.0 93800 ▂▇▂▂▂
mean_colony_total Washington 0 1 60535.71 9670.52 46125 53262.5 64500 66925.0 72750 ▂▅▁▇▂

Are the values what you expected for the variables? Why or Why not?

Yes, the values are expected since they have been step-wise arranged to perform the calculation of the averages, and they are arranged in the “mean_colony_total” column–this is the culmination of the work that has been done to transform the data from separate datapoints in each month-range of each year for each state. It makes sense that the values reflect the fluctuation of the number of the number of colonies throughout time.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#Please see above for summarize data for “colony_state_year”–it had to be moved because of R not “seeing it” down here, and was not allowing the file to knit.#

#Note: To find the average (mean) of the number of colonies in each state.

colony_total %>%
  group_by(state) %>%
  summarize(mean_colony = mean(colony_total, na.rm = TRUE))
## # A tibble: 3 x 2
##   state      mean_colony
##   <chr>            <dbl>
## 1 Idaho           80731.
## 2 Oregon          77200 
## 3 Washington      61150

#Note: This is another way of summarizing the data by grouping the “state” data with the “year” data and the rest of the data in order to show how the dataset has been worked with. It is not as succinct/informative as the summary above.

summary_colony_state_year <- colony_total %>%
  group_by(state, year) 

summary_colony_state_year
## # A tibble: 78 x 11
## # Groups:   state, year [21]
##     year months           state  colony_n colony_max colony_lost colony_lost_pct
##    <dbl> <chr>            <chr>     <dbl>      <dbl>       <dbl>           <dbl>
##  1  2015 January-March    Idaho     81000      88000        3700               4
##  2  2015 January-March    Oregon    77000      87000        6500               8
##  3  2015 January-March    Washi~    52000     105000       14000              13
##  4  2015 April-June       Idaho     62000      72000        6500               9
##  5  2015 April-June       Oregon    82000      95000        5500               6
##  6  2015 April-June       Washi~   105000     127000        5000               4
##  7  2015 July-September   Idaho     80000     128000       14000              11
##  8  2015 July-September   Oregon    68000     100000        8500               9
##  9  2015 July-September   Washi~    84000      97000       11500              12
## 10  2015 October-December Idaho    121000     145000       22000              15
## # ... with 68 more rows, and 4 more variables: colony_added <dbl>,
## #   colony_reno <dbl>, colony_reno_pct <dbl>, colony_total <dbl>

What are your findings about the summary? Are they what you expected?

When looking at the overall averages of the numbers of colonies of each state from the summary above, I was first surprised to see that the state of Washington has the fewest number of colonies within the Northwest honeybee cohort. Ultimately, this makes sense when you look at the averages for each year for the state of Washington in comparison to Oregon and Idaho.

As a side note, it is still surprising and a bit concerning since Washington is one of the largest producers of apples in the United States, which is an industry that relies on commercial apiarists to come and set up their hives to pollinate the trees. However, assumptions cannot truly be made since there are not enough data from enough years and there are too many variables to account for.

#Note: This is just to organize the data to see how the averages of the number of colonies in each state per year have changed.

colony_state_year %>% arrange(desc(year))
## # A tibble: 21 x 3
## # Groups:   state [3]
##    state       year mean_colony_total
##    <chr>      <dbl>             <dbl>
##  1 Idaho       2021            109750
##  2 Oregon      2021             74800
##  3 Washington  2021             52550
##  4 Idaho       2020             76250
##  5 Oregon      2020             93800
##  6 Washington  2020             66725
##  7 Idaho       2019             74875
##  8 Oregon      2019             64125
##  9 Washington  2019             46125
## 10 Idaho       2018             87300
## # ... with 11 more rows

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#Note: This plot is to visualize the average number of colonies in each state throughout the years (2015~2021). ##Special note: The subtitle “Aggregate Months Data = Year” refers to the fact that the original values were reported from month ranges (i.e. January-March), which makes up the whole year that is used to calculate the overall average.

ggplot(
  data = colony_state_year,  
       aes(y = mean_colony_total, x= year)) +
  geom_bar(stat = "identity") +
  facet_wrap(vars(state), 
             ncol = 3) +
  labs(y="Mean Colony Total", x="Year", title="Average Number of Reported Colonies per Year",
       subtitle = "Aggregate Months Data = Year") +
  theme_bw()

#Note: This plot is to display the number of colonies lost against versus the number of total colonies. The subtitle refers to the point that the “year” is based on aggregate data from the reported months range (i.e. January-March).

ggplot(colony_total) +
  
  aes(x = colony_total, 
      y = colony_lost, 
      color = year) +
  
  geom_point() + 
  labs(y="Number of Colonies Lost", x="Total Number of Colonies", title="Number of Reported Colonies Lost per Year Vs. Total Number of Colonies",
       subtitle = "Aggregate Months Data = Year") +
  
  facet_wrap(vars(state))

Final Summary (10 points)

Summarize your research question and findings below.

My research question set out to determine whether the number of colonies in the Pacific Northwest (Washington, Oregon, and Idaho) have decreased through the years. The findings from the data indicate that the number of colonies throughout the reported years (2015-2021) are not necessarily increasing or decreasing–I would be disingenuous to say that they are necessarily increasing/decreasing since it appears from the data that there is not enough evidence to make this statement.

These findings may possibly be due to there not being enough data since only seven years are reported. However, from the data a trend appears in that there are fewer number of colonies within the state of Washington than there are in Idaho and Oregon. However, there are too many variables to indicate why this may be the case.

Are your findings what you expected? Why or Why not?

The findings are not what I expected. My expectation was that there would be a trend showing that the number of colonies are actually increasing since apiarists are taking greater measures to protect their hives from pesticides and disease within the last few years. My expectation is also due to there being some increases in governmental regulations (depending on the state) to prevent the overuse of pesticides. However, the data do somewhat agree with the fact that apiarists across the country have been reporting cases of pesticide-resistant parasites and pests that have been harming the colonies and knocking down the bee populations.

In this way it can be assumed from the data (which do not indicate clear increases/decreases) that the fight is ongoing to maintain bee populations and the overall number of colonies throughout the Northwest towards hopefully seeing a positive trend in the future.