Midterm (Due Sunday 2/13/2022 at 11:55 pm)

  • Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

  • 60 / 60 points total

  • The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.


Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
  • See other data sources in the #data channel on Slack.
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
  • You may use another dataset or your own data, but please make sure it is de-identified.
  1. Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.

  2. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()
  • and at least one of the following:
    • case_when()
    • across()
    • *_join() (i.e. left_join())
    • pivot_*() (i.e. pivot_longer())
  1. The code chunks below are guides, please add more code chunks to do what you need.

  2. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

  • I am fine with having my final project posted on the public website :)

  • You may remove these instructions from your final Rmd if you like


Working Together

  • If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

  • No acknowledgements of contributions = -10 points overall.


Please Note

  • I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.


Define Your Research Question (10 points)

  • Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
    • I will use linear regression to try to find which of the following variables are associated with total coffee rating: species (Arabica or Robusta), continent, number of bags tested, color of bean, method for processing, and the year which the beans were harvested.
  • Given your question, what is your expectation about the data?
    • I expect to see coffees from high altitude receive more total cup points, with average number of total cup points increasing with increasing altitude. Furthermore, I expect Robusta coffees to score more cup points than Arabica, as I know Robusta is more niche (less common) and will likely be produced by more devoted farmers. I do not expect to find a significant association between continent, harvest year, or number of bags tested as I think these are more administrative variables rather than being associated with quality of the coffee. I honestly don’t know enough about processing methods or bean color to make specific predictions, but I do not think these factors will be significantly associated with total cup points, as I have not heard much about these factors in my coffee experiences.


Loading the Data (10 points)

  • Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
    • I used the Coffee Ratings data set from Tidy Tuesday (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-07/readme.md). The code to read in this data set also came from the Tidy Tuesday website and is not my own code. This data set consists of basic information and ratings for 1339 types of coffee. Variables include rating (0-100), species (Arabica or Robusta), farm owner, country of origin, farm name, lot number, mill name, company name, region, altitude, number of bags tested, variety of beans, color of bean, method for processing, year beans were harvested, and different grades (aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean cup, sweetness, and moisture).
coffee_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   total_cup_points = col_double(),
##   number_of_bags = col_double(),
##   aroma = col_double(),
##   flavor = col_double(),
##   aftertaste = col_double(),
##   acidity = col_double(),
##   body = col_double(),
##   balance = col_double(),
##   uniformity = col_double(),
##   clean_cup = col_double(),
##   sweetness = col_double(),
##   cupper_points = col_double(),
##   moisture = col_double(),
##   category_one_defects = col_double(),
##   quakers = col_double(),
##   category_two_defects = col_double(),
##   altitude_low_meters = col_double(),
##   altitude_high_meters = col_double(),
##   altitude_mean_meters = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
dim(coffee_ratings)
## [1] 1339   43


  • This data set contains 1339 rows (coffees) and 43 columns (variables).


dplyr::glimpse(coffee_ratings)
## Rows: 1,339
## Columns: 43
## $ total_cup_points      <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75,…
## $ species               <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ara…
## $ owner                 <chr> "metad plc", "metad plc", "grounds for health ad…
## $ country_of_origin     <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia",…
## $ farm_name             <chr> "metad plc", "metad plc", "san marcos barrancas …
## $ lot_number            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mill                  <chr> "metad plc", "metad plc", NA, "wolensu", "metad …
## $ ico_number            <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", N…
## $ company               <chr> "metad agricultural developmet plc", "metad agri…
## $ altitude              <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "1800…
## $ region                <chr> "guji-hambela", "guji-hambela", NA, "oromia", "g…
## $ producer              <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabes…
## $ number_of_bags        <dbl> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, 3…
## $ bag_weight            <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg"…
## $ in_country_partner    <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ harvest_year          <chr> "2014", "2014", NA, "2014", "2014", "2013", "201…
## $ grading_date          <chr> "April 4th, 2015", "April 4th, 2015", "May 31st,…
## $ owner_1               <chr> "metad plc", "metad plc", "Grounds for Health Ad…
## $ variety               <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other"…
## $ processing_method     <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / D…
## $ aroma                 <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, …
## $ flavor                <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, …
## $ aftertaste            <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, …
## $ acidity               <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, …
## $ body                  <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, …
## $ balance               <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, …
## $ uniformity            <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ clean_cup             <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ sweetness             <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ cupper_points         <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00, …
## $ moisture              <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, …
## $ category_one_defects  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ quakers               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ color                 <chr> "Green", "Green", NA, "Green", "Green", "Bluish-…
## $ category_two_defects  <dbl> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, …
## $ expiration            <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st,…
## $ certification_body    <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ certification_address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "309…
## $ certification_contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19f…
## $ unit_of_measurement   <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
## $ altitude_low_meters   <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA, …
## $ altitude_high_meters  <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA, …
## $ altitude_mean_meters  <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, …


  • If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
    • The weirdest variable is definitely harvest year, which is a character variable, all written in various manners and units (precise day, month, season, range of years, etc.). This will need to be cleaned going forward in order to use this variable. I would like the variable to be numeric to the specific year.
  • Make sure your data types are correct!


Transforming the data (15 points)

  • If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Species

  • I made a binary variable for species, with Arabica as the referent group as it was the largest group.
# Making species into a binary variable with Arabica as the referent group
coffee_ratings <- coffee_ratings %>%
  mutate(species_binary = ifelse(species == "Arabica", 0, 1))


Continent

  • I made a variable for continent using fct_collapse() to reduce the number of levels for location from 36 countries to 4 continents.
# Making variable for continent
coffee_ratings <- coffee_ratings %>%
  mutate(continent = fct_collapse(country_of_origin,
    "South_America" = c("Brazil", "Colombia", "Ecuador", "Peru"),
    "North_America" = c("Costa Rica", "Nicaragua", "United States (Puerto Rico)",
                        "El Salvador", "Haiti", "Mexico", "Honduras", "United States",
                        "Guatemala", "Panama", "United States (Hawaii)"),
    "Asia" = c("India", "Laos", "Vietnam", "Indonesia", "Japan", "Myanmar", "Taiwan",
               "Philippines", "China", "Thailand", "Papua New Guinea"),
    "Africa" = c("Burundi", "Ethiopia", "Malawi", "Cote d?Ivoire", "Kenya", "Mauritius",
                 "Uganda", "Tanzania, United Republic Of", "Rwanda", "Zambia")
    ))


Altitude Variable

  • I had to re-code a few values into NAs because they were impossibly high (like over 100,000 meters when the height of Mt. Everest is just under 9,000 meters). I removed all values above 8,000 meters, a total of 4 coffees.
max(coffee_ratings$altitude_mean_meters, na.rm = TRUE)
## [1] 190164
coffee_ratings %>% filter(altitude_mean_meters > 8000) %>% nrow()
## [1] 4
# Removing impossibly large altitudes from the data set
coffee_ratings <- coffee_ratings %>%
  mutate(altitude_mean_meters = ifelse(altitude_mean_meters > 8000,
                                       NA, altitude_mean_meters))


Harvest Year

  • I made a variable for harvest year (from 2008 to 2018). This was a very difficult variable as there were ranges of years as well as a mixture of precise dates, months, and a few months.
  • With ranges, I used the first year listed (ex. 2008-2009 became 2008).
  • In regression, I will use this variable as a factor, as I do not expect any linear trend as years progress, but I would like to look for differences between years that the beans were harvested.
# Making a variable for harvest year
coffee_ratings$harvest_year <- gsub("March", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("September", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("Sept", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("April", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("May", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("June", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("July", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("August", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("December", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("January", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("Abril", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("Julio", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("Through", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("Spring", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("/", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("-", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("to", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub(" ", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("inColombia", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("Fall", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("4T10", "2010", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("mmm", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("0809crop", "2008", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("1t", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("1T", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("2t", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("2T", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("3t", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("3T", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("4t", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("4T", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("4T7", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("472010", "2010", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("232010", "2010", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("oa", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("TEST", "", coffee_ratings$harvest_year)
coffee_ratings$harvest_year <- gsub("72010", "2010", coffee_ratings$harvest_year)

coffee_ratings$harvest_year <- as.numeric(substr(coffee_ratings$harvest_year, 1, 4))


  • Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
    • My data set is comprised of only one data set, but, to practice merging, I created two data sets and re-merged them together using left_join().
coffee_ratings1 <- data.frame(coffee_ratings[,1], coffee_ratings[,5], coffee_ratings[,2:4], coffee_ratings[,6:10])

coffee_ratings2 <- data.frame(coffee_ratings[,1], coffee_ratings[,5], coffee_ratings[,(11:45)])

coffee_ratings <- coffee_ratings1 %>%
  left_join(coffee_ratings2,
             by = c("total_cup_points", "farm_name"))


  • Show your transformed table here.
    • Use tools such as glimpse(), skim() or head() to illustrate your point.
glimpse(coffee_ratings)
## Rows: 3,187
## Columns: 45
## $ total_cup_points      <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75,…
## $ farm_name             <chr> "metad plc", "metad plc", "san marcos barrancas …
## $ species               <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ara…
## $ owner                 <chr> "metad plc", "metad plc", "grounds for health ad…
## $ country_of_origin     <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia",…
## $ lot_number            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mill                  <chr> "metad plc", "metad plc", NA, "wolensu", "metad …
## $ ico_number            <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", N…
## $ company               <chr> "metad agricultural developmet plc", "metad agri…
## $ altitude              <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "1800…
## $ region                <chr> "guji-hambela", "guji-hambela", NA, "oromia", "g…
## $ producer              <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabes…
## $ number_of_bags        <dbl> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, 3…
## $ bag_weight            <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg"…
## $ in_country_partner    <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ harvest_year          <dbl> 2014, 2014, NA, 2014, 2014, 2013, 2012, 2010, 20…
## $ grading_date          <chr> "April 4th, 2015", "April 4th, 2015", "May 31st,…
## $ owner_1               <chr> "metad plc", "metad plc", "Grounds for Health Ad…
## $ variety               <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other"…
## $ processing_method     <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / D…
## $ aroma                 <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, …
## $ flavor                <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, …
## $ aftertaste            <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, …
## $ acidity               <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, …
## $ body                  <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, …
## $ balance               <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, …
## $ uniformity            <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ clean_cup             <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ sweetness             <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ cupper_points         <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00, …
## $ moisture              <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, …
## $ category_one_defects  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ quakers               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ color                 <chr> "Green", "Green", NA, "Green", "Green", "Bluish-…
## $ category_two_defects  <dbl> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, …
## $ expiration            <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st,…
## $ certification_body    <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ certification_address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "309…
## $ certification_contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19f…
## $ unit_of_measurement   <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
## $ altitude_low_meters   <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA, …
## $ altitude_high_meters  <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA, …
## $ altitude_mean_meters  <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, …
## $ species_binary        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ continent             <fct> Africa, Africa, North_America, Africa, Africa, S…


  • Are the values what you expected for the variables? Why or Why not?
    • All the variables I hope to look at are now all the correct type and appear to be in a usable form.


Visualizing and Summarizing the Data (15 points)

  • Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question
coffee_ratings %>%
  group_by(species) %>%
  summarise(Obs = n(),
            Mean_cup_points = mean(total_cup_points),
            SE_cup_points = round(sd(total_cup_points)/Obs, 5))
## # A tibble: 2 × 4
##   species   Obs Mean_cup_points SE_cup_points
##   <chr>   <int>           <dbl>         <dbl>
## 1 Arabica  3149            82.7       0.00078
## 2 Robusta    38            81.2       0.0589
  • The majority of coffees come from “Arabica” species. Due to this, I made this the reference group for later linear regression.
  • The mean number of cup points is high in the Arabica group as compared to the Robusta group. I personally prefer Arabica coffee so I can personally agree with this finding.


coffee_ratings$continent <- factor(coffee_ratings$continent,
                                  levels = c("North_America", "South_America",
                                           "Asia", "Africa", NA))

coffee_ratings %>%
  group_by(continent) %>%
  summarise(Obs = n(),
            Mean_cup_points = mean(total_cup_points),
            SE_cup_points = round(sd(total_cup_points)/Obs, 5))
## # A tibble: 5 × 4
##   continent       Obs Mean_cup_points SE_cup_points
##   <fct>         <int>           <dbl>         <dbl>
## 1 North_America  1140            81.9       0.00312
## 2 South_America  1492            83.1       0.00077
## 3 Asia            251            82.2       0.00723
## 4 Africa          303            83.7       0.00575
## 5 <NA>              1            79.1      NA
  • The majority of coffees come from North America. Due to this, I made this the reference group for later linear regression.
  • African coffees have the highest average number of cup points, followed by South American coffees, then Asian coffees, and finally North American coffees had the lowest average number of cup points.


coffee_ratings$processing_method <- factor(coffee_ratings$processing_method,
                                  levels = c("Washed / Wet", "Natural / Dry",
                                           "Semi-washed / Semi-pulped", "Pulped natural / honey",
                                           "Other"))

coffee_ratings %>%
  group_by(processing_method) %>%
  summarise(Obs = n(),
            Mean_cup_points = mean(total_cup_points),
            SE_cup_points = round(sd(total_cup_points)/Obs, 5))
## # A tibble: 6 × 4
##   processing_method           Obs Mean_cup_points SE_cup_points
##   <fct>                     <int>           <dbl>         <dbl>
## 1 Washed / Wet               1852            82.6       0.00112
## 2 Natural / Dry               663            82.8       0.00299
## 3 Semi-washed / Semi-pulped    76            82.6       0.0216 
## 4 Pulped natural / honey       25            82.4       0.0601 
## 5 Other                        27            81.4       0.146  
## 6 <NA>                        544            82.8       0.00716
  • The majority of coffees are processed “Washed / Wet.” Due to this, I made this the reference group for later linear regression.
  • “Natural / Dry” processed coffee appears to have the highest average total cup points, followed by “Semi-washed / Semi-pulped,” then “Washed / Wet,” then “Pulped natural / honey,” and finally “Other.” Interestingly, coffees with processing method missing actually had the highest average total cup points of all.


coffee_ratings$harvest_year <- factor(as.factor(coffee_ratings$harvest_year),
                                  levels = c("2012", "2008", "2009", "2010", "2011", "2013",
                                             "2014", "2015", "2016", "2017", "2018"))

coffee_ratings %>%
  group_by(harvest_year) %>%
  summarise(Obs = n(),
            Mean_cup_points = mean(total_cup_points),
            SE_cup_points = round(sd(total_cup_points)/Obs, 5))
## # A tibble: 12 × 4
##    harvest_year   Obs Mean_cup_points SE_cup_points
##    <fct>        <int>           <dbl>         <dbl>
##  1 2012           788            82.3       0.00295
##  2 2008             2            78.0       0.856  
##  3 2009            32            84.6       0.0515 
##  4 2010            98            83.2       0.0205 
##  5 2011           114            82.9       0.0116 
##  6 2013           546            82.7       0.00308
##  7 2014           575            82.7       0.00415
##  8 2015           308            83.1       0.00429
##  9 2016           340            82.9       0.00535
## 10 2017           152            82.1       0.0456 
## 11 2018             1            81.6      NA      
## 12 <NA>           231            83.0       0.00559
  • The largest number of coffees came from 2012, so this was made the referent year.
  • 2009 appears to have the highest average total cup points, followed by 2010, and then 2015.


coffee_ratings$color <- factor(coffee_ratings$color,
                                  levels = c("Green", "Bluish-Green", "Blue-Green", "None"))

coffee_ratings %>%
  group_by(color) %>%
  summarise(Obs = n(),
            Mean_cup_points = mean(total_cup_points),
            SE_cup_points = round(sd(total_cup_points)/Obs, 5))
## # A tibble: 5 × 4
##   color          Obs Mean_cup_points SE_cup_points
##   <fct>        <int>           <dbl>         <dbl>
## 1 Green         1948            82.6       0.00142
## 2 Bluish-Green   307            83.0       0.00512
## 3 Blue-Green     179            82.9       0.0103 
## 4 None            90            81.4       0.0316 
## 5 <NA>           663            83.0       0.00278
  • The largest number of coffees were green, so this was made the referent year.
  • Bluish-green coffees appear to have the highest average total cup points, followed by blue-green, and then green. Interestingly, coffees with color missing actually had the highest average total cup points of all colors.


coffee_model <- glm(total_cup_points ~ species_binary + altitude_mean_meters +
                      continent + number_of_bags + as.factor(harvest_year) +
                      processing_method + color,
                    data = coffee_ratings)

summary(coffee_model)
## 
## Call:
## glm(formula = total_cup_points ~ species_binary + altitude_mean_meters + 
##     continent + number_of_bags + as.factor(harvest_year) + processing_method + 
##     color, data = coffee_ratings)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -21.4660   -0.6042    0.1833    0.8176    6.7159  
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                 8.058e+01  1.907e-01 422.570
## species_binary                             -1.450e+00  7.300e-01  -1.986
## altitude_mean_meters                        4.860e-04  9.550e-05   5.089
## continentSouth_America                      1.665e+00  1.122e-01  14.845
## continentAsia                               7.756e-01  1.875e-01   4.136
## continentAfrica                             2.101e+00  1.615e-01  13.006
## number_of_bags                              8.429e-05  4.694e-04   0.180
## as.factor(harvest_year)2011                -3.006e-01  3.210e-01  -0.936
## as.factor(harvest_year)2013                 1.117e-01  1.302e-01   0.858
## as.factor(harvest_year)2014                 1.453e-01  1.328e-01   1.094
## as.factor(harvest_year)2015                 6.580e-01  1.528e-01   4.308
## as.factor(harvest_year)2016                 3.257e-01  1.859e-01   1.752
## as.factor(harvest_year)2017                 5.377e-01  1.935e-01   2.778
## as.factor(harvest_year)2018                -4.789e-02  1.878e+00  -0.025
## processing_methodNatural / Dry              1.645e-01  1.224e-01   1.343
## processing_methodSemi-washed / Semi-pulped  6.385e-01  2.482e-01   2.573
## processing_methodPulped natural / honey    -6.160e-01  5.718e-01  -1.077
## processing_methodOther                     -5.642e-01  4.112e-01  -1.372
## colorBluish-Green                           2.648e-01  1.560e-01   1.697
## colorBlue-Green                             4.375e-01  1.776e-01   2.464
## colorNone                                  -7.573e-01  2.653e-01  -2.855
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## species_binary                              0.04719 *  
## altitude_mean_meters                       3.96e-07 ***
## continentSouth_America                      < 2e-16 ***
## continentAsia                              3.69e-05 ***
## continentAfrica                             < 2e-16 ***
## number_of_bags                              0.85750    
## as.factor(harvest_year)2011                 0.34918    
## as.factor(harvest_year)2013                 0.39106    
## as.factor(harvest_year)2014                 0.27422    
## as.factor(harvest_year)2015                1.74e-05 ***
## as.factor(harvest_year)2016                 0.08002 .  
## as.factor(harvest_year)2017                 0.00552 ** 
## as.factor(harvest_year)2018                 0.97966    
## processing_methodNatural / Dry              0.17933    
## processing_methodSemi-washed / Semi-pulped  0.01017 *  
## processing_methodPulped natural / honey     0.28145    
## processing_methodOther                      0.17025    
## colorBluish-Green                           0.08992 .  
## colorBlue-Green                             0.01384 *  
## colorNone                                   0.00436 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.486427)
## 
##     Null deviance: 8135.4  on 1857  degrees of freedom
## Residual deviance: 6404.6  on 1837  degrees of freedom
##   (1329 observations deleted due to missingness)
## AIC: 7616.1
## 
## Number of Fisher Scoring iterations: 2
species_beta <- -1*(round(as.numeric(data.frame(coffee_model$coefficients)[2,1]), 2))
altitude_beta <- round(as.numeric(100*data.frame(coffee_model$coefficients)[3,1]), 2)
SA_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[4,1]), 2)
Asia_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[5,1]), 2)
Africa_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[6,1]), 2)
no_bags_beta <- round(as.numeric(10*data.frame(coffee_model$coefficients)[7,1]), 2)
y2011_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[8,1]), 2)
y2013_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[9,1]), 2)
y2014_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[10,1]), 2)
y2015_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[11,1]), 2)
y2016_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[12,1]), 2)
y2017_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[13,1]), 2)
y2018_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[14,1]), 2)
ND_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[15,1]), 2)
Semi_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[16,1]), 2)
Pulp_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[17,1]), 2)
Other_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[18,1]), 2)
BishG_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[19,1]), 2)
BG_beta <- round(as.numeric(data.frame(coffee_model$coefficients)[20,1]), 2)
None_beta <- -round(as.numeric(data.frame(coffee_model$coefficients)[21,1]), 2)
  • Relative to coffee grown in North America, coffee grown in South America, Africa, or Asia had significantly different total number of cup points, with all other variables kept constant (p < 0.01).
  • Relative to coffees grown in 2012, coffees grown in 2015, 2016, or 2017, on average, had significantly different total cup points, with all other variables kept constant (p < 0.05).
  • Altitude was significantly associated with total cup points (p < 0.0001).
  • Relative to Arabica coffees, Robusta coffees scored a significantly different number of total cup points, with all other variables kept constant (p < 0.05).
  • Relative to “Washed / Wet” coffee, “Semi-washed / Semi-pulped” scored a significantly different number of total cup points, with all other variables kept constant (p < 0.01).
  • Relative to green coffee beans, beans with color listed as “None” had significantly different total number of cup points (p < 0.05).
  • No other variables were significantly associated a significant difference in total number of cup points.


  • What are your findings about the summary? Are they what you expected?
    • The only variable that, across all categories, was significantly associated with cup points was continent. African coffees scored an average of \(2.1\) points higher than North American coffees (p < 0.0001). Asian coffees scored an average of \(0.78\) points higher than North American coffees (p < 0.0001). South American coffees scored an average of \(1.67\) points higher than North American coffees (p < 0.0001).
      • I did not expect continent to be significantly associated with total cup points, so this is a big surprise to me.
    • Every 100m increase in altitude was associated with an expected \(0.05\) point increase in total cup points, with all other variables constant (p < 0.0001).
      • I expected altitude to be significantly associated with total cup points, with points increasing with higher altitude, so this is not surprising to me.
    • Robusta coffees scored an average of \(1.45\) points fewer than Arabica coffees (p < 0.05).
      • I expected Robusta to score higher than Arabica, so this is surprising to me, having the opposite association.
    • Even though it did not achieve statistical significance, with every 10-unit increase in number of bags tested the average cup points decreased by \(0\) points (p = 0.901).
      • I did not expect number of bags tested to be significantly associated with total cup points so this is not surprising to me.
    • As compared to coffees grown in 2012, coffees grown in 2015, on average, had \(0.66\) more total cup points (p < 0.01). As compared to coffees grown in 2012, coffees grown in 2016, on average, had \(0.33\) more total cup points (p < 0.05). As compared to coffees grown in 2012, coffees grown in 2017, on average, had \(0.54\) more total cup points (p < 0.01). No other years had a significantly different number of total cup points than coffees grown in 2012.
      • I did not expect harvest year to be significantly associated with total cup points so this is somewhat surprising to me. I guess it makes sense that the harvest of certain years may be better than others.
    • Relative to “Washed / Wet” coffee, “Semi-washed / Semi-pulped” scored \(0.64\) more total cup points, with all other variables kept constant (p < 0.01). No other processing methods were associated with a significant difference in total cup points relative to “Washed / Wet” processed coffees.
      • I did not expect processing method to be significantly associated with total cup points so this is somewhat surprising to me.
    • As compared to green coffee beans, coffees with bean color listed as “None”, on average, had \(0.76\) fewer total cup points (p < 0.05). Although it did not achieve statistical significance, blue-green beans scored an average of \(0.44\) more total cup points than green beans (p = 0.073).
      • I did not expect bean color to have a significan association with total cup points, so this is interesting to me. I would like to learn more about the differences between green, bluish-green, and blue-green beans.


  • Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

Continent

ggplot(data = coffee_ratings,
       aes(x = continent,
           fill = continent)) +
  geom_bar() +
  scale_fill_viridis_d() +
  labs(
    title = "Number of Coffees From Each Continent",
    x = "Continent",
    y = "Number of Coffees Sampled"
  ) +
  theme_bw() +
  theme(legend.position = "none")

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = continent,
           y = total_cup_points,
           fill = continent)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(
    title = "Total Cup Points of Coffees",
    subtitle = "By Continent",
    x = "Continent",
    y = "Total Cup Points"
  ) +
  theme_bw() +
  theme(legend.position = "none")

  • South America produced the largest number of coffees, followed by North America, then Africa, and finally Asia. African countries appear to score the best, followed by South America, then Asia, and then North America. That being said, all continents appear to have a very broad range of total cup points.


Species

ggplot(data = coffee_ratings,
       aes(x = species,
           fill = species)) +
  geom_bar() +
  scale_fill_viridis_d() +
  labs(
    title = "Number of Coffees From Each Species",
    x = "Species",
    y = "Number of Coffees Sampled"
  ) +
  theme_bw() +
  theme(legend.position = "none")

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = species,
           y = total_cup_points,
           fill = species)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(
    title = "Total Cup Points of Coffees",
    subtitle = "By Species",
    x = "Species",
    y = "Total Cup Points"
  ) +
  theme_bw() +
  theme(legend.position = "none")

  • The vast majority of coffees were Arabica. Arabica coffees appear to score more total cup points, on average, than Robusta coffees. That being said, the range of total cup points for Arabica coffees is much wider than that of Robusta coffees.


Altitude

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = altitude_mean_meters,
           y = total_cup_points)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Total Cup Points of Coffees vs Altitude",
    x = "Altitude (m)",
    y = "Total Cup Points"
  ) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 860 rows containing non-finite values (stat_smooth).
## Warning: Removed 860 rows containing missing values (geom_point).

  • From the scatterplot, it appears that average cup points increases as altitude increases. The correlation appears to be fairly weak, though.


Number of Bags Tested

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = number_of_bags,
           y = total_cup_points)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Total Cup Points of Coffees vs Number of Bags Tested",
    x = "Number of Bags Tested",
    y = "Total Cup Points"
  ) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

  • As number of bags tested increases, so does the total cup points. That being said, the increase is very modest and likely not significant.


Harvest Year

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = harvest_year,
           y = total_cup_points)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Total Cup Points of Coffees vs Harvest Year",
    x = "Harvest Year",
    y = "Total Cup Points"
  ) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = coffee_ratings,
       aes(x = as.factor(harvest_year),
           fill = as.factor(harvest_year))) +
  geom_bar() +
  scale_fill_viridis_d() +
  labs(
    title = "Number of Coffees From Each Harvest Year",
    x = "Harvest Year",
    y = "Number of Coffees Sampled"
  ) +
  theme_bw() +
  theme(legend.position = "none")

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = as.factor(harvest_year),
           y = total_cup_points,
           fill = as.factor(harvest_year))) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(
    title = "Total Cup Points of Coffees",
    subtitle = "By Harvest Year",
    x = "Harvest Year",
    y = "Total Cup Points"
  ) +
  theme_bw() +
  theme(legend.position = "none")

  • Coffees harvested in 2008 appear to score significantly fewer total cup points than coffees harvested in any other year. There were also a lot fewer coffees from this year, so it could have just been a bad harvest year. Coffees from 2009 appear to score the highest number of total cup points, although this year had the second fewest coffees. 2012 had by far the most coffees, followed by 2014, and then 2013.


Processing Method

ggplot(data = coffee_ratings,
       aes(x = processing_method,
           fill = processing_method)) +
  geom_bar() +
  scale_fill_viridis_d() +
  labs(
    title = "Number of Coffees From Each Processing Method",
    x = "Processing Method",
    y = "Number of Coffees Sampled"
  ) +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 20,hjust=1))

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = processing_method,
           y = total_cup_points,
           fill = processing_method)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(
    title = "Total Cup Points of Coffees",
    subtitle = "By Processing Method",
    x = "Processing Method",
    y = "Total Cup Points"
  ) +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 20,hjust=1))

  • Looking at the boxplot above, there does not appear to be a significant difference in median total cup points across the different processing methods. That being said, the range of total cup points is much wider for washed/wet as compared to semi-washed/semi-pulped or pulped natural/honey. This could partly be due to a smaller sample size in the latter methods.


Bean Color

ggplot(data = coffee_ratings,
       aes(x = color,
           fill = color)) +
  geom_bar() +
  scale_fill_viridis_d() +
  labs(
    title = "Number of Coffees From Each Coffee Bean Color",
    x = "Coffee Bean Color",
    y = "Number of Coffees Sampled"
  ) +
  theme_bw() +
  theme(legend.position = "none")

ggplot(data = (coffee_ratings %>% filter(total_cup_points > 0)), # I removed an outlier
       aes(x = color,
           y = total_cup_points,
           fill = color)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(
    title = "Total Cup Points of Coffees",
    subtitle = "By Coffee Bean Color",
    x = "Coffee Bean Color",
    y = "Total Cup Points"
  ) +
  theme_bw() +
  theme(legend.position = "none")

  • From the boxpot above, the blue-green coffee beans appear to score higher, on average, than the green coffee beans. Those with color listed as “None” appear to score significantly lower than any of the coffees with color listed. The variation of total cup points for green coffee beans is much larger than any other color category.


Final Summary (10 points)

  • Summarize your research question and findings below.
    • Data was obtained from the Coffee Quality Database through Tidy Tuesday courtesy of James LeDoux. The data and information on the data set, variables, etc. can be obtained here: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-07/readme.md.
    • The data were analyzed to assess for which variables were associated with higher total cup points, or ratings. This topic was of interest to see what factors are influential in the overall perceived quality of the coffee, after accounting for other variables. Exploratory analyses were conducted followed by a full linear model. Visuals were created to assess the relationship between each variable and total cup points.
    • It was discovered that being harvested in Africa, South America, or Asia (versus North America), being harvested at higher altitude, being Arabica coffee, being blue-green (as opposed to green), having undergone semi-washed/semi-pulped processing (as opposed to wet/washed processing), or having been harvested in 2015-2017 (as opposed to 2012) is significantly associated with high average total cup score, holding all other variables constant (p < 0.05). Number of bags tested was not significantly associated with total cup points. Having bean color listed as “none” was significantly associated with fewer total cup points as compared to green coffee beans.
    • Armed with this new information, when looking for a nice coffee to purchase, I would likely look for blue-green coffees from high-elevation areas of Africa that have undergone semi-washed/semi-pulped processing. For future research, I would like to know why coffees from 2015-2017 were significantly better than those from 2012. Furthermore, it may be interesting to have a variable for time of year that the beans were harvested to see if there is a seasonal component to higher-rated coffees.


  • Are your findings what you expected? Why or Why not?
    • I had expected to find higher altitude to produce higher rated coffee. I had expected Robusta coffee to be higher rated than Arabica, but it was actually the opposite. It is possible that, since Arabica is so much more common, that the raters preferred this species due to familiarity. I did not expect continent to be significantly associated with total sup points, but all continents had significantly better rated coffee than North American coffees. I did not have any predictions on processing as I did not know anything about coffee processing, but I find it interesting that only one method, semi-washed/semi-pulped, was better rated than the most popular processing method, wet/washed.