2022_02_02
Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
#data
channel on Slack..csv
file into your data
folder.Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join()
(i.e. left_join()
)pivot_*()
(i.e. pivot_longer()
)The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Since my family has kept bees in the past, I am very interested in understanding how colony collapse disorder and other disorders have affected the number of colonies within the apiaries throughout Oregon and the greater Northwest region (Oregon, Washington, and Idaho) during the years reported within the dataset. The data from the “colony” dataset will be used towards (hopefully) delineating any changes that have occurred to the apiaries in Oregon and throughout our region. My research question is–have the number of colonies decreased through the years?
Given your question, what is your expectation about the data?
Since the data are varied (some locations are listed as “United States”), I am concerned about the number of reported data for the state of Oregon. However, in case that there are not enough data, I will add the states of Washington and Idaho to the analysis as a “Northwest Cohort” to analyze the amount of colony losses within these three states.
Nevertheless, my expectation about the data is that there will be increases in the number of colonies in each state throughout the reported years.
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')
## Rows: 1222 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): months, state
## dbl (8): year, colony_n, colony_max, colony_lost, colony_lost_pct, colony_ad...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
stressor <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv')
## Rows: 7332 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): months, state, stressor
## dbl (2): year, stress_pct
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(colony)
## Rows: 1,222
## Columns: 10
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months <chr> "January-March", "January-March", "January-March", "Ja~
## $ state <chr> "Alabama", "Arizona", "Arkansas", "California", "Color~
## $ colony_n <dbl> 7000, 35000, 13000, 1440000, 3500, 3900, 305000, 10400~
## $ colony_max <dbl> 7000, 35000, 14000, 1690000, 12500, 3900, 315000, 1050~
## $ colony_lost <dbl> 1800, 4600, 1500, 255000, 1500, 870, 42000, 14500, 380~
## $ colony_lost_pct <dbl> 26, 13, 11, 15, 12, 22, 13, 14, 4, 4, 40, 22, 18, 23, ~
## $ colony_added <dbl> 2800, 3400, 1200, 250000, 200, 290, 54000, 47000, 3400~
## $ colony_reno <dbl> 250, 2100, 90, 124000, 140, NA, 25000, 9500, 760, 8000~
## $ colony_reno_pct <dbl> 4, 6, 1, 7, 1, NA, 8, 9, 7, 9, 4, 1, 2, 1, NA, 13, NA,~
Note: The stressor data may not be needed since there are data that may not contribute to answering the specific research question about colony losses.
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Character variables: Months and state
Integer variables: Year, number of colonies (colony_n), maximum colonies (colony_max), number of colonies that are lost (colony_lost), percentage of the number of colonies lost (colony_lost_pct), number of colonies added (colony_added), number of colonies that have been renovated (colony_reno), percentage of the number of colonies that have been renovated (colony_reno_pct)
Note: Variables that contain NA values: Colony_reno, colony_reno_pct, colony_max, colony_added AT LEAST. There is an entry from 2019 that is comletely “NA”, which means that downstream work will have to compensate for this. Filtering the data will be necessary in order to select for Idaho, Washington, and Oregon into the Northwest Honeybees cohort.
Colony dataset: Number of columns: 10 Number of rows: 1222
skim(colony)
Name | colony |
Number of rows | 1222 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
months | 0 | 1 | 10 | 16 | 0 | 4 | 0 |
state | 0 | 1 | 4 | 14 | 0 | 47 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2017.77 | 1.89 | 2015 | 2016 | 2018 | 2019 | 2021 | ▇▃▃▃▆ |
colony_n | 47 | 0.96 | 123578.04 | 437835.18 | 1300 | 8000 | 17500 | 55500 | 3181180 | ▇▁▁▁▁ |
colony_max | 72 | 0.94 | 79112.77 | 190823.42 | 1700 | 9000 | 21000 | 68750 | 1710000 | ▇▁▁▁▁ |
colony_lost | 47 | 0.96 | 16551.32 | 60544.42 | 20 | 950 | 2200 | 6500 | 502350 | ▇▁▁▁▁ |
colony_lost_pct | 54 | 0.96 | 11.38 | 7.23 | 1 | 6 | 10 | 15 | 52 | ▇▅▁▁▁ |
colony_added | 83 | 0.93 | 17243.20 | 68167.65 | 10 | 420 | 1800 | 6500 | 736920 | ▇▁▁▁▁ |
colony_reno | 131 | 0.89 | 15278.86 | 62588.04 | 10 | 260 | 960 | 4585 | 806170 | ▇▁▁▁▁ |
colony_reno_pct | 260 | 0.79 | 9.10 | 9.66 | 1 | 2 | 6 | 12 | 77 | ▇▁▁▁▁ |
Make sure your data types are correct!
Character valies:2 Numerical values:8 #REMEMBER CATEGORICAL/CONTINUOUS DATA#
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
#Notes: #Use Case_when or Across? #oregon, wa , id –color for NW to create pacific northwest #use filter
#Note: Filter is used to select for the states of Oregon, Washington, and Idaho to create a Northwest cohort since there are not enough data values for a single state.
Northwest_Honeybees <- colony %>% filter(state %in% c("Oregon","Washington", "Idaho"))
Northwest_Honeybees
## # A tibble: 78 x 10
## year months state colony_n colony_max colony_lost colony_lost_pct
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2015 January-March Idaho 81000 88000 3700 4
## 2 2015 January-March Oregon 77000 87000 6500 8
## 3 2015 January-March Washi~ 52000 105000 14000 13
## 4 2015 April-June Idaho 62000 72000 6500 9
## 5 2015 April-June Oregon 82000 95000 5500 6
## 6 2015 April-June Washi~ 105000 127000 5000 4
## 7 2015 July-September Idaho 80000 128000 14000 11
## 8 2015 July-September Oregon 68000 100000 8500 9
## 9 2015 July-September Washi~ 84000 97000 11500 12
## 10 2015 October-December Idaho 121000 145000 22000 15
## # ... with 68 more rows, and 3 more variables: colony_added <dbl>,
## # colony_reno <dbl>, colony_reno_pct <dbl>
#Note: To confirm the data above
summary(Northwest_Honeybees)
## year months state colony_n
## Min. :2015 Length:78 Length:78 Min. : 41000
## 1st Qu.:2016 Class :character Class :character 1st Qu.: 69000
## Median :2018 Mode :character Mode :character Median : 84000
## Mean :2018 Mean : 85693
## 3rd Qu.:2019 3rd Qu.: 96000
## Max. :2021 Max. :168000
## NA's :3
## colony_max colony_lost colony_lost_pct colony_added
## Min. : 67000 Min. : 1500 Min. : 2.000 Min. : 70
## 1st Qu.: 89500 1st Qu.: 5750 1st Qu.: 5.000 1st Qu.: 4000
## Median :103000 Median : 9000 Median : 9.000 Median : 7000
## Mean :109747 Mean : 9745 Mean : 8.907 Mean : 9427
## 3rd Qu.:124000 3rd Qu.:13250 3rd Qu.:11.500 3rd Qu.:13250
## Max. :196000 Max. :22000 Max. :22.000 Max. :40000
## NA's :3 NA's :3 NA's :3 NA's :3
## colony_reno colony_reno_pct
## Min. : 100 Min. : 1.00
## 1st Qu.: 1725 1st Qu.: 2.00
## Median : 5500 Median : 7.50
## Mean : 9614 Mean :10.25
## 3rd Qu.:16375 3rd Qu.:16.00
## Max. :49000 Max. :42.00
## NA's :4 NA's :10
#Note: To find the total number of hives from the number that are lost.
colony_total <- Northwest_Honeybees %>%
mutate(colony_total = colony_n - colony_lost)
#Note: To clean up the data and remove NA values.
colony_total <- colony_total %>%
mutate(
across(where(is.numeric), replace_na, replace = 0))
glimpse(colony_total)
## Rows: 78
## Columns: 11
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ~
## $ months <chr> "January-March", "January-March", "January-March", "Ap~
## $ state <chr> "Idaho", "Oregon", "Washington", "Idaho", "Oregon", "W~
## $ colony_n <dbl> 81000, 77000, 52000, 62000, 82000, 105000, 80000, 6800~
## $ colony_max <dbl> 88000, 87000, 105000, 72000, 95000, 127000, 128000, 10~
## $ colony_lost <dbl> 3700, 6500, 14000, 6500, 5500, 5000, 14000, 8500, 1150~
## $ colony_lost_pct <dbl> 4, 8, 13, 9, 6, 4, 11, 9, 12, 15, 8, 7, 7, 3, 10, 8, 3~
## $ colony_added <dbl> 2600, 4300, 13500, 16500, 14500, 15000, 5500, 8000, 18~
## $ colony_reno <dbl> 8000, 2400, 9000, 10500, 9500, 13000, 11500, 21000, 20~
## $ colony_reno_pct <dbl> 9, 3, 9, 15, 10, 10, 9, 21, 21, 4, 1, 0, 7, 2, 2, 7, 1~
## $ colony_total <dbl> 77300, 70500, 38000, 55500, 76500, 100000, 66000, 5950~
#Note: To find the average number of colonies from each state (Idaho, Oregon, & Washington).
Colony_average <- colony_total %>% group_by(state) %>% summarise(across(where(is.numeric),mean))
Colony_average #%>% filter(state=="Oregon")
## # A tibble: 3 x 10
## state year colony_n colony_max colony_lost colony_lost_pct colony_added
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Idaho 2018. 92962. 120038. 12231. 9.85 10646.
## 2 Oregon 2018. 84692. 100692. 7492. 7.19 8558.
## 3 Washington 2018. 69538. 95846. 8388. 8.65 7989.
## # ... with 3 more variables: colony_reno <dbl>, colony_reno_pct <dbl>,
## # colony_total <dbl>
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
#Note: To find the average (mean) of the number of colonies in each state for each year. The number of entries for each year more/less corresponds to the reported month range (i.e. January-March) in the tables above.
colony_state_year <- colony_total %>%
group_by(state, year) %>%
summarize(mean_colony_total = mean(colony_total, na.rm = TRUE))
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
colony_state_year
## # A tibble: 21 x 3
## # Groups: state [3]
## state year mean_colony_total
## <chr> <dbl> <dbl>
## 1 Idaho 2015 74450
## 2 Idaho 2016 77250
## 3 Idaho 2017 79750
## 4 Idaho 2018 87300
## 5 Idaho 2019 74875
## 6 Idaho 2020 76250
## 7 Idaho 2021 109750
## 8 Oregon 2015 74500
## 9 Oregon 2016 81975
## 10 Oregon 2017 76625
## # ... with 11 more rows
#Note: Skim works the best since the other tools displayed the information from this table in a much messier way.
skim(colony_state_year)
Name | colony_state_year |
Number of rows | 21 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | state |
Variable type: numeric
skim_variable | state | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
year | Idaho | 0 | 1 | 2018.00 | 2.16 | 2015 | 2016.5 | 2018 | 2019.5 | 2021 | ▇▃▃▃▇ |
year | Oregon | 0 | 1 | 2018.00 | 2.16 | 2015 | 2016.5 | 2018 | 2019.5 | 2021 | ▇▃▃▃▇ |
year | Washington | 0 | 1 | 2018.00 | 2.16 | 2015 | 2016.5 | 2018 | 2019.5 | 2021 | ▇▃▃▃▇ |
mean_colony_total | Idaho | 0 | 1 | 82803.57 | 12662.86 | 74450 | 75562.5 | 77250 | 83525.0 | 109750 | ▇▂▁▁▂ |
mean_colony_total | Oregon | 0 | 1 | 77028.57 | 9102.40 | 64125 | 73937.5 | 74800 | 79300.0 | 93800 | ▂▇▂▂▂ |
mean_colony_total | Washington | 0 | 1 | 60535.71 | 9670.52 | 46125 | 53262.5 | 64500 | 66925.0 | 72750 | ▂▅▁▇▂ |
Are the values what you expected for the variables? Why or Why not?
Yes, the values are expected since they have been step-wise arranged to perform the calculation of the averages, and they are arranged in the “mean_colony_total” column–this is the culmination of the work that has been done to transform the data from separate datapoints in each month-range of each year for each state. It makes sense that the values reflect the fluctuation of the number of the number of colonies throughout time.
Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
#Please see above for summarize data for “colony_state_year”–it had to be moved because of R not “seeing it” down here, and was not allowing the file to knit.#
#Note: To find the average (mean) of the number of colonies in each state.
colony_total %>%
group_by(state) %>%
summarize(mean_colony = mean(colony_total, na.rm = TRUE))
## # A tibble: 3 x 2
## state mean_colony
## <chr> <dbl>
## 1 Idaho 80731.
## 2 Oregon 77200
## 3 Washington 61150
#Note: This is another way of summarizing the data by grouping the “state” data with the “year” data and the rest of the data in order to show how the dataset has been worked with. It is not as succinct/informative as the summary above.
summary_colony_state_year <- colony_total %>%
group_by(state, year)
summary_colony_state_year
## # A tibble: 78 x 11
## # Groups: state, year [21]
## year months state colony_n colony_max colony_lost colony_lost_pct
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2015 January-March Idaho 81000 88000 3700 4
## 2 2015 January-March Oregon 77000 87000 6500 8
## 3 2015 January-March Washi~ 52000 105000 14000 13
## 4 2015 April-June Idaho 62000 72000 6500 9
## 5 2015 April-June Oregon 82000 95000 5500 6
## 6 2015 April-June Washi~ 105000 127000 5000 4
## 7 2015 July-September Idaho 80000 128000 14000 11
## 8 2015 July-September Oregon 68000 100000 8500 9
## 9 2015 July-September Washi~ 84000 97000 11500 12
## 10 2015 October-December Idaho 121000 145000 22000 15
## # ... with 68 more rows, and 4 more variables: colony_added <dbl>,
## # colony_reno <dbl>, colony_reno_pct <dbl>, colony_total <dbl>
What are your findings about the summary? Are they what you expected?
When looking at the overall averages of the numbers of colonies of each state from the summary above, I was first surprised to see that the state of Washington has the fewest number of colonies within the Northwest honeybee cohort. Ultimately, this makes sense when you look at the averages for each year for the state of Washington in comparison to Oregon and Idaho.
As a side note, it is still surprising and a bit concerning since Washington is one of the largest producers of apples in the United States, which is an industry that relies on commercial apiarists to come and set up their hives to pollinate the trees. However, assumptions cannot truly be made since there are not enough data from enough years and there are too many variables to account for.
#Note: This is just to organize the data to see how the averages of the number of colonies in each state per year have changed.
colony_state_year %>% arrange(desc(year))
## # A tibble: 21 x 3
## # Groups: state [3]
## state year mean_colony_total
## <chr> <dbl> <dbl>
## 1 Idaho 2021 109750
## 2 Oregon 2021 74800
## 3 Washington 2021 52550
## 4 Idaho 2020 76250
## 5 Oregon 2020 93800
## 6 Washington 2020 66725
## 7 Idaho 2019 74875
## 8 Oregon 2019 64125
## 9 Washington 2019 46125
## 10 Idaho 2018 87300
## # ... with 11 more rows
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
#Note: This plot is to visualize the average number of colonies in each state throughout the years (2015~2021). ##Special note: The subtitle “Aggregate Months Data = Year” refers to the fact that the original values were reported from month ranges (i.e. January-March), which makes up the whole year that is used to calculate the overall average.
ggplot(
data = colony_state_year,
aes(y = mean_colony_total, x= year)) +
geom_bar(stat = "identity") +
facet_wrap(vars(state),
ncol = 3) +
labs(y="Mean Colony Total", x="Year", title="Average Number of Reported Colonies per Year",
subtitle = "Aggregate Months Data = Year") +
theme_bw()
#Note: This plot is to display the number of colonies lost against versus the number of total colonies. The subtitle refers to the point that the “year” is based on aggregate data from the reported months range (i.e. January-March).
ggplot(colony_total) +
aes(x = colony_total,
y = colony_lost,
color = year) +
geom_point() +
labs(y="Number of Colonies Lost", x="Total Number of Colonies", title="Number of Reported Colonies Lost per Year Vs. Total Number of Colonies",
subtitle = "Aggregate Months Data = Year") +
facet_wrap(vars(state))
Summarize your research question and findings below.
My research question set out to determine whether the number of colonies in the Pacific Northwest (Washington, Oregon, and Idaho) have decreased through the years. The findings from the data indicate that the number of colonies throughout the reported years (2015-2021) are not necessarily increasing or decreasing–I would be disingenuous to say that they are necessarily increasing/decreasing since it appears from the data that there is not enough evidence to make this statement.
These findings may possibly be due to there not being enough data since only seven years are reported. However, from the data a trend appears in that there are fewer number of colonies within the state of Washington than there are in Idaho and Oregon. However, there are too many variables to indicate why this may be the case.
Are your findings what you expected? Why or Why not?
The findings are not what I expected. My expectation was that there would be a trend showing that the number of colonies are actually increasing since apiarists are taking greater measures to protect their hives from pesticides and disease within the last few years. My expectation is also due to there being some increases in governmental regulations (depending on the state) to prevent the overuse of pesticides. However, the data do somewhat agree with the fact that apiarists across the country have been reporting cases of pesticide-resistant parasites and pests that have been harming the colonies and knocking down the bee populations.
In this way it can be assumed from the data (which do not indicate clear increases/decreases) that the fight is ongoing to maintain bee populations and the overall number of colonies throughout the Northwest towards hopefully seeing a positive trend in the future.