Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before You Get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
See other data sources in the #data channel on Slack.
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
You may use another dataset or your own data, but please make sure it is de-identified.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())

The code chunks below are guides; please add more code chunks as needed.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

[As an avid candy lover, I am interested in which chocolate bars have the highest ratings and where their beans come from. I grew up near Hershey, Pennsylvania, and I was able to explore the Hershey factory and see how Hershey Kisses are made, so this assignment was nostalgic. My specific question is: Which region of bean origin has the highest ratings?]

Given your question, what is your expectation about the data? [I expect South America to have the highest ratings because a lot of cocoa beans are sourced from that region. I also think South America’s environment and agriculture favor cocoa bean growth.]

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

library(tidyverse)  # assumed setup: provides %>%, glimpse(), ggplot(), fct_reorder()

chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

## Rows: 2530 Columns: 10

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): company_manufacturer, company_location, country_of_bean_origin, spe...
## dbl (3): ref, review_date, rating

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(chocolate)

## Rows: 2,530
## Columns: 10
## $ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2~
## $ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150~
## $ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.~
## $ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2~
## $ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma~
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat~
## $ cocoa_percent                    <chr> "76%", "76%", "76%", "68%", "72%", "8~
## $ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "~
## $ most_memorable_characteristics   <chr> "rich cocoa, fatty, bready", "cocoa, ~
## $ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3~

If there are any quirks that you have to deal with (NA coded as something else, or multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section. [The data were fairly clean, with no NA values coded as something else. However, some country_of_bean_origin values, such as “Blend”, do not name a single country. These do not help answer my question, so I filtered them out.]

Make sure your data types are correct!
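
As one illustration of a type check: the glimpse() output above lists cocoa_percent as <chr> because each value carries a "%" sign. The sketch below shows one possible way to convert such a column to numeric with readr::parse_number(). The toy tibble and the names toy and toy_numeric are made up for this example; this analysis itself drops cocoa_percent, so the conversion is optional here.

```r
library(dplyr)
library(tibble)
library(readr)

# Toy rows mimicking the cocoa_percent column shown by glimpse() (hypothetical data)
toy <- tibble(cocoa_percent = c("76%", "68%", "72%"))

# parse_number() drops the non-numeric "%" and returns a double
toy_numeric <- toy %>%
  mutate(cocoa_percent = parse_number(cocoa_percent))

glimpse(toy_numeric)
```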

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# Drop blended bars, then remove the columns not needed for the research question
chocolate_filtered <- chocolate %>%
  filter(country_of_bean_origin != "Blend") %>%
  select(-ref, -company_manufacturer, -review_date, -cocoa_percent, -ingredients,
         -most_memorable_characteristics, -specific_bean_origin_or_bar_name)

# Keep only bars rated above 3.5
chocolate_ratings <- chocolate_filtered %>%
  filter(rating > 3.50)



chocolate_ratings <- chocolate_ratings %>%
  mutate(
    country_of_bean_origin = case_when(
      country_of_bean_origin == "Madagascar"~"Africa",
      country_of_bean_origin == "Peru"~"South America",
      country_of_bean_origin == "Ecuador"~"South America",
      country_of_bean_origin == "Venezuela"~"South America",
      country_of_bean_origin == "Sao Tome"~"Africa",
      country_of_bean_origin == "Mexico"~"North America",
      country_of_bean_origin == "Indonesia"~"South East Asia",
      country_of_bean_origin == "Dominican Republic"~"North America",
      country_of_bean_origin == "Papua New Guinea"~"Oceania",
      country_of_bean_origin == "Brazil"~"South America",
      country_of_bean_origin == "Belize"~"North America",
      country_of_bean_origin == "Nicaragua"~"North America",
      country_of_bean_origin == "Costa Rica"~"North America",
      country_of_bean_origin == "Bolivia"~"South America",
      country_of_bean_origin == "Haiti"~"North America",
      country_of_bean_origin == "Colombia"~"South America",
      country_of_bean_origin == "Philippines"~"South East Asia",
      country_of_bean_origin == "Tanzania"~"Africa",
      country_of_bean_origin == "Ghana"~"Africa",
      country_of_bean_origin == "India"~"Asia",
      country_of_bean_origin == "Guatemala"~"North America",
      country_of_bean_origin == "Solomon Islands"~"Oceania",
      country_of_bean_origin == "Fiji"~"Oceania",
      country_of_bean_origin == "Jamaica"~"North America",
      country_of_bean_origin == "U.S.A."~"North America",
      country_of_bean_origin == "Honduras"~"North America",
      country_of_bean_origin == "Vanuatu"~"Oceania",
      country_of_bean_origin == "Trinidad"~"North America",
      country_of_bean_origin == "St.Lucia"~"North America",
      country_of_bean_origin == "Cuba"~"North America",
      country_of_bean_origin == "Uganda"~"Africa",
      country_of_bean_origin == "Australia"~"Oceania",
      country_of_bean_origin == "Tobago"~"North America",
      TRUE ~ "other"
    )
  )

chocolate_ratings <- chocolate_ratings %>%
  filter(country_of_bean_origin != "other")

[To organize my data, I first removed the variables I am not interested in. I dropped the columns that were not part of my research question using select() with minus signs, and then filtered to ratings > 3.5 so that I would have less data. I removed “Blend” from country_of_bean_origin because it would not help answer my specific question. To narrow the origins down to regions, I used mutate() with case_when() to recode each country in country_of_bean_origin to its region (North America, South America, Africa, etc.). Finally, I removed the rows where country_of_bean_origin equals “other”.]

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
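
As a possible alternative to the long case_when() recoding in this section, the country-to-region mapping could live in a small lookup table that is joined onto the data. The sketch below is not what this analysis does. The names region_lookup and toy_bars, and the handful of rows in each, are made up for illustration. It also shows how the join choice matters: left_join() keeps unmatched bars with an NA region, while inner_join() drops them, much like filtering out “other”.

```r
library(dplyr)
library(tibble)

# Hypothetical lookup table; only a few countries are listed for illustration
region_lookup <- tribble(
  ~country_of_bean_origin, ~region,
  "Madagascar",            "Africa",
  "Peru",                  "South America",
  "Mexico",                "North America",
  "Fiji",                  "Oceania"
)

# Toy stand-in for chocolate_filtered (made-up rows)
toy_bars <- tibble(
  country_of_bean_origin = c("Peru", "Fiji", "Vietnam", "Madagascar"),
  rating = c(3.75, 4.00, 3.75, 4.00)
)

# left_join keeps every bar; countries missing from the lookup get region = NA
joined_left <- left_join(toy_bars, region_lookup, by = "country_of_bean_origin")

# inner_join keeps only bars whose country appears in the lookup table
joined_inner <- inner_join(toy_bars, region_lookup, by = "country_of_bean_origin")
```

A side benefit of this design is that the original country column is preserved alongside the new region column instead of being overwritten.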

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

view(chocolate_ratings)
glimpse(chocolate_ratings)

## Rows: 375
## Columns: 3
## $ company_location       <chr> "U.S.A.", "France", "France", "France", "France~
## $ country_of_bean_origin <chr> "Africa", "South America", "South America", "So~
## $ rating                 <dbl> 3.75, 3.75, 3.75, 4.00, 4.00, 3.75, 4.00, 3.75,~

chocolate_ratings %>% janitor::tabyl(country_of_bean_origin)

##  country_of_bean_origin   n     percent
##                  Africa  51 0.136000000
##                    Asia   1 0.002666667
##           North America 120 0.320000000
##                 Oceania  15 0.040000000
##           South America 183 0.488000000
##         South East Asia   5 0.013333333

Are the values what you expected for the variables? Why or Why not? [I think the values make sense, but we need to consider that the number of countries, and therefore the number of rated bars, varies across regions. The tabyl() output was mainly used to see how many rated bars fall in each region.]

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question. [I wanted to get the average rating for each region, so I calculated sum(rating)/length(rating), which gives the mean rating of all the bars in each region.]

chocolate_ratings %>%
  group_by(country_of_bean_origin) %>%
  summarize(rating = sum(rating) / length(rating)) %>%
  arrange(desc(rating))

## # A tibble: 6 x 2
##   country_of_bean_origin rating
##   <chr>                   <dbl>
## 1 South America            3.83
## 2 Oceania                  3.83
## 3 Africa                   3.82
## 4 South East Asia          3.8 
## 5 North America            3.79
## 6 Asia                     3.75

chocolateregions <- chocolate_ratings %>%
  group_by(country_of_bean_origin) %>%
  summarize(rating = sum(rating) / length(rating)) %>%
  arrange(desc(rating))
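
For reference, sum(rating)/length(rating) is the same quantity that mean(rating) computes, and reporting the group size with n() next to each mean shows how many bars each average rests on. The sketch below demonstrates this on made-up data; the names toy_ratings and toy_summary and the region labels "A" and "B" are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up ratings for two hypothetical regions
toy_ratings <- tibble(
  region = c("A", "A", "A", "B"),
  rating = c(3.75, 4.00, 3.75, 3.75)
)

toy_summary <- toy_ratings %>%
  group_by(region) %>%
  summarize(
    manual_mean = sum(rating) / length(rating),  # the formula used in this section
    mean_rating = mean(rating),                  # built-in equivalent
    n_bars      = n()                            # number of bars behind each mean
  )

toy_summary
```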

What are your findings about the summary? Are they what you expected?

The findings match what I hypothesized. I knew from general knowledge that a lot of cocoa beans and chocolate manufacturing come from South America. I am surprised that cocoa is grown in Oceania and that Asia is below North America. The North America region includes many Caribbean countries, which have favorable weather for cocoa bean growth.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

I had a difficult time creating the plots for this project. At first, I used the chocolateregions data set, but it has only 6 observations and the values are very close to each other. This is why my original plot showed only a line rather than a box plot. My mistake was using the summarized chocolateregions data instead of the chocolate_ratings data set, which has more observations.

chocolate_ratings %>%
  group_by(country_of_bean_origin) %>%
  summarize(rating = mean(rating)) %>%
  mutate(
    country_of_bean_origin = fct_reorder(country_of_bean_origin, rating, .desc = TRUE)
  ) %>%
  ggplot(aes(x = country_of_bean_origin, y = rating, fill = country_of_bean_origin)) +
  geom_point(shape = 21, size = 2) +
  theme(legend.position = "bottom") +
  labs(title = "Mean Rating by Region of Bean Origin",
       x = "Region of Bean Origin",
       y = "Mean Rating")

chocolate_ratings %>%
  mutate(
    country_of_bean_origin = fct_reorder(country_of_bean_origin, rating, .desc = TRUE)
  ) %>%
  ggplot(aes(x = country_of_bean_origin, y = rating)) +
  geom_violin() +
  geom_point() +
  labs(title = "Ratings by Region of Bean Origin",
       x = "Region of Bean Origin",
       y = "Rating")

## Warning: Groups with fewer than two data points have been dropped.
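
The warning above is what geom_violin() reports when a group has fewer than two observations. In the tabyl() output, Asia has only one bar, so no violin can be drawn for it. One possible way to make this explicit before plotting is to count the bars per group and keep only groups with at least two. The sketch below uses made-up data, and the names toy_regions, region_sizes, and plottable are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up data: one region has a single observation, as Asia does here
toy_regions <- tibble(
  region = c("South America", "South America", "Africa", "Africa", "Asia"),
  rating = c(4.00, 3.75, 3.75, 4.00, 3.75)
)

# Number of bars per region, to spot groups that are too small for a violin
region_sizes <- count(toy_regions, region, name = "n_bars")

# Keep only regions with at least two bars before drawing violins
plottable <- toy_regions %>%
  group_by(region) %>%
  filter(n() >= 2) %>%
  ungroup()
```

Whether to drop such a group or show it as a single point is a judgment call; either way, a mean based on one bar should be read with caution.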

Final Summary (10 points)

Summarize your research question and findings below.

Which region of bean origin has the highest ratings? From my data, I found that cocoa beans from South America had the highest average rating, followed by Oceania, Africa, South East Asia, North America, and Asia. The first plot shows the mean rating for each region. South America has the highest average, and the violin plot shows more of its ratings near 4.0, while North America has more ratings near 3.75.

Are your findings what you expected? Why or Why not?

My findings are what I expected from general knowledge. I knew that most cocoa beans and coffee beans come from South America and are preferred, so it is no surprise that it has the highest average rating. Climate and agricultural conditions most likely contribute to the high rating. Even though our data include more countries from North and South America, I think the results would look similar even if countries were equally distributed across regions.

Midterm

Barbara Cassese

2022-02-16