Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  1. Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.

  2. You must use each of the following functions at least once:

and at least one of the following:

  1. The code chunks below are guides, please add more code chunks to do what you need.

  2. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.


Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I am a chocolate lover and I enjoy all types of chocolate. Luckily I found the chocolatebar dataset from TidyTuesday on GitHub website so I could dedicate this project to chocolate. I’ve used this dataset for my function of the week, and I’d like to keep exploring this dataset in the midterm project.
The specific questions that I’m interested in is to see if there is a correlation between percentage of cocoa and the rating of the chocolate, and if adding different numbers of ingredients would correlate with the rating score as well. I also want to know where the most popular beans are among the countries of origin and who the company manufacturers are with relatively high rating.

Given your question, what is your expectation about the data?

Intuitively, customers do not like the chocolate with high cocoa percentage. More ingredients can maximize chocolate’s flavor, therefore the chocolate bars with more ingredients probably have better reviews scores. I’m expecting the most popular beans come from the countries where located in the southern hemisphere. I’ve never heard of the chocolate manufacturers in this datasets, so I’d like to know which of them have the best reviews.


Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

#tuesdata <- tt_load('2022-01-18')
#ChocolateBar <-  tuesdata$chocolate

ChocolateBar <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

glimpse(ChocolateBar)
## Rows: 2,530
## Columns: 10
## $ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2~
## $ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150~
## $ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.~
## $ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2~
## $ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma~
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat~
## $ cocoa_percent                    <chr> "76%", "76%", "76%", "68%", "72%", "8~
## $ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "~
## $ most_memorable_characteristics   <chr> "rich cocoa, fatty, bready", "cocoa, ~
## $ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3~
# Check all data frame columns and get a sum of NA values in them
cbind(lapply(lapply(ChocolateBar, is.na), sum))
##                                  [,1]
## ref                              0   
## company_manufacturer             0   
## company_location                 0   
## review_date                      0   
## country_of_bean_origin           0   
## specific_bean_origin_or_bar_name 0   
## cocoa_percent                    0   
## ingredients                      87  
## most_memorable_characteristics   0   
## rating                           0

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

After I had an general idea of the dataset, I made a list about what I wanted to do in the next step, transforming the data:


Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# Remove the variables that not needed
ChocolateBar <- ChocolateBar %>% 
  select(-ref, -company_location, -review_date, -specific_bean_origin_or_bar_name, -most_memorable_characteristics)
  
  
# Remove % sign 
ChocolateBar <- ChocolateBar %>% 
  mutate(cocoa_percent = as.numeric(gsub("%", "", cocoa_percent)))


# Separate ingredients variable
ChocolateBar <- ChocolateBar %>%
  separate(
    col = ingredients,
    into = c("ingredients", "type"),
    sep = "-") %>%
  mutate(ingredients = na_if(ingredients, "NA"),
         type = na_if(type, "NA")) %>% 
  select(-type) %>% 
  mutate(ingredients = as.factor(ingredients)) %>% 
  drop_na()

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

skim(ChocolateBar)
Data summary
Name ChocolateBar
Number of rows 2443
Number of columns 5
_______________________
Column type frequency:
character 2
factor 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
company_manufacturer 0 1 2 39 0 542 0
country_of_bean_origin 0 1 4 21 0 62 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ingredients 0 1 FALSE 6 3: 1023, 2: 750, 4: 469, 5: 191

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cocoa_percent 0 1 71.50 5.16 42 70 70.00 74.0 100 ▁▁▇▁▁
rating 0 1 3.21 0.43 1 3 3.25 3.5 4 ▁▁▅▇▇

Are the values what you expected for the variables? Why or Why not?

After transformed my variables as I planed, I got two character variables(company_manufacturer and country_of_bean_origin), two numeric variables (cocoa_percent and rating) and one factor variable(ingredients). They are the variables with right formats as I expected, and I’m going to use them for answering my research questions.


Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ChocolateBar %>%
  summarise(across(where(is.character), n_distinct))
## # A tibble: 1 x 2
##   company_manufacturer country_of_bean_origin
##                  <int>                  <int>
## 1                  542                     62
# country_of_bean_origin
ChocolateBar %>%
  tabyl(country_of_bean_origin) %>%  
  arrange(desc(n)) %>% 
  top_frac(.25) %>%  
  gt()   
country_of_bean_origin n percent
Venezuela 246 0.10069587
Peru 231 0.09455587
Dominican Republic 220 0.09005321
Ecuador 201 0.08227589
Madagascar 171 0.06999591
Blend 144 0.05894392
Nicaragua 100 0.04093328
Bolivia 79 0.03233729
Colombia 78 0.03192796
Tanzania 78 0.03192796
Brazil 77 0.03151862
Belize 74 0.03029063
Vietnam 73 0.02988129
Guatemala 62 0.02537863
Mexico 54 0.02210397
ChocolateBar %>% 
  group_by(country_of_bean_origin) %>% 
  filter(n() >= 54) %>% 
  mutate(count = n()) %>% 
  ggplot(aes(x = reorder(country_of_bean_origin, count))) + 
  geom_bar(fill="lightblue") + 
  coord_flip() + 
  geom_text(stat = "count",aes(label = ..count..)) +
  labs(x = 'Country of bean origin', y = 'Count', title = 'Top 25% most frequently used country of bean origins') + 
  theme_minimal()

# company_manufacturer
ChocolateBar %>%
  tabyl(company_manufacturer) %>%  
  arrange(desc(n)) %>% 
  top_frac(.05) %>%  
  gt()   
company_manufacturer n percent
Soma 56 0.022922636
Fresco 39 0.015963979
Arete 32 0.013098649
Bonnat 29 0.011870651
A. Morin 26 0.010642652
Dandelion 25 0.010233320
Pralus 25 0.010233320
Domori 22 0.009005321
Guittard 22 0.009005321
Valrhona 22 0.009005321
Zotter 21 0.008595989
Castronovo 19 0.007777323
Dick Taylor 19 0.007777323
Coppeneur 18 0.007367990
Mast Brothers 18 0.007367990
Duffy's 17 0.006958657
Pierre Marcolini 17 0.006958657
Scharffen Berger 17 0.006958657
Smooth Chocolator, The 17 0.006958657
Rogue 16 0.006549325
Artisan du Chocolat 15 0.006139992
Letterpress 15 0.006139992
Palette de Bine 15 0.006139992
Szanto Tibor 15 0.006139992
Bittersweet Origins 14 0.005730659
Fruition 14 0.005730659
Hotel Chocolat (Coppeneur) 14 0.005730659
Map Chocolate 14 0.005730659
Sirene 14 0.005730659
Tejas 14 0.005730659
ChocolateBar %>%
  group_by(company_manufacturer) %>% 
  filter(n() >= 14) %>%
  summarize(mean_rating = mean(rating)) %>%
  arrange(desc(mean_rating)) %>%  
  gt()   
company_manufacturer mean_rating
Soma 3.589286
Bonnat 3.534483
Arete 3.531250
Domori 3.522727
Smooth Chocolator, The 3.514706
Dick Taylor 3.486842
Duffy's 3.441176
Letterpress 3.433333
Fruition 3.428571
A. Morin 3.423077
Rogue 3.421875
Fresco 3.416667
Szanto Tibor 3.400000
Pierre Marcolini 3.397059
Sirene 3.392857
Castronovo 3.381579
Zotter 3.321429
Valrhona 3.318182
Dandelion 3.300000
Bittersweet Origins 3.267857
Coppeneur 3.250000
Map Chocolate 3.250000
Palette de Bine 3.233333
Scharffen Berger 3.220588
Pralus 3.180000
Guittard 3.170455
Artisan du Chocolat 3.166667
Mast Brothers 3.125000
Hotel Chocolat (Coppeneur) 3.053571
Tejas 3.017857
ChocolateBar %>%
  group_by(company_manufacturer) %>% 
  filter(n() >= 14) %>% 
  mutate(mean_rating = mean(rating)) %>%
  ggplot(aes(reorder(company_manufacturer, mean_rating), rating, fill = mean_rating)) + 
  geom_boxplot() + 
  coord_flip()  + 
  labs(x = 'Company manufacturer', y = 'Rating', title = 'Top 5% most frequently rated company manufacturers') + 
  scale_fill_continuous(name = "Mean rating") + 
  theme_minimal()

What are your findings about the summary? Are they what you expected?

I have 62 different countries of beans origin and 542 chocolate company manufacturers. I choose the top 25% country of origins with most frequently used chocolate beans and top 5% chocolate company manufacturers with most frequently rated times. Based on above plot, it appears that Venezula is the most popular country of chocolate bean’s origin and Soma chocolate maker has the higest rating. The results are as expected. I’m going to explain the details in final summary section.

ChocolateBar %>% 
  tabyl(ingredients) %>%  
  gt()   
ingredients n percent
1 6 0.002455997
2 750 0.306999591
3 1023 0.418747442
4 469 0.191977077
5 191 0.078182562
6 4 0.001637331
# numbers of ingredients vs. the rating of the chocolate
ChocolateBar %>% 
  group_by(ingredients) %>% 
  ggplot(aes(x = ingredients, y = rating)) + 
  geom_boxplot() + 
  labs(title = "Rating vs. Groups with 1 - 6 ingredients", x = 'Numbers of ingredients', y = 'Rating') + 
  theme_minimal() 

# percentage of cocoa vs. the rating of the chocolate
ChocolateBar %>% 
  ggplot(aes(x = cocoa_percent, y = rating)) + 
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE, col = 'blue') + 
  labs(title = "Correlation of Cocoa percentage vs. Rating", x = 'Cocoa percentage', y = 'Rating') +
  theme_minimal()

What are your findings about the summary?

Three ingredients in chocolate appears the highest frequency. The above boxplot also shows that the distribution of rating values is different among the six levels of ingredients. Two and three ingredients have the highest median value while six ingredients has a lowest median value. In the second scatter plot, with increasing of cocoa percentage, the rating of chocolate bar slightly decreases.

# linear regression model 
model <- lm(rating ~ cocoa_percent, data = ChocolateBar)
summary(model)
## 
## Call:
## lm(formula = rating ~ cocoa_percent, data = ChocolateBar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.20045 -0.21968  0.03032  0.28673  0.89570 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.668419   0.120159  30.530  < 2e-16 ***
## cocoa_percent -0.006410   0.001676  -3.824 0.000134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4272 on 2441 degrees of freedom
## Multiple R-squared:  0.005956,   Adjusted R-squared:  0.005548 
## F-statistic: 14.63 on 1 and 2441 DF,  p-value: 0.0001344
model <- lm(rating ~ cocoa_percent + ingredients, data = ChocolateBar)
summary(model)
## 
## Call:
## lm(formula = rating ~ cocoa_percent + ingredients, data = ChocolateBar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04543 -0.28104  0.00846  0.27519  0.94620 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.795007   0.245475  15.460  < 2e-16 ***
## cocoa_percent -0.008367   0.001749  -4.782 1.84e-06 ***
## ingredients2   0.026235   0.179641   0.146    0.884    
## ingredients3   0.074040   0.179637   0.412    0.680    
## ingredients4  -0.075624   0.180720  -0.418    0.676    
## ingredients5  -0.138801   0.183154  -0.758    0.449    
## ingredients6  -0.250919   0.276482  -0.908    0.364    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4218 on 2436 degrees of freedom
## Multiple R-squared:  0.03287,    Adjusted R-squared:  0.03048 
## F-statistic:  13.8 on 6 and 2436 DF,  p-value: 1.772e-15

What are your findings about the summary?

According to the model summary, cocoa_percent is significantly related with rating. However, the \(R^2\) of 0.6% is a fairly small value. It represents only 0.6% of data fit this model. After adding variabale ingredients into the model, the \(R^2\) greatly increased to 3% but non of the levels of ingredients is significant.


Final Summary (10 points)

Summarize your research question and findings below.

In summary, regardless of the R-squared value, there is a significant relationship between cocoa percentage and the rating of the chocolate bar. Although ingredients could help to explain some of the variation in the rating variable around its mean, a statistically significant relationship is not found between ingredients and the rating scores. To further explore, to my way of thinking, we might need to include more variable in the model for a better prediction. The top three most popular beans are from Venezuela, Peru and Dominican Republic. The highest frequently rated with the highest rating score company manufacturers are Soma, Bonnat and Arete.

Are your findings what you expected? Why or Why not?

The findings generally meet my expectation. The linear regression models shows the evidence that cocoa percentage is significantly correlated to the rating score. But non of the level of ingredients was significantly correlated with the rating, which is a bit unexpected for me. Cacao trees must grow in a limited geographical zone, with hotter tropical climates and lower elevations. Hence, the top three of most frequently used country of bean origins: Venezuela, Peru and Dominican Republic, are all the habitat which meet the growing requirement. For the highest rating score company manufacturers, I googled their names and found their information on the website of Academy of Chocolate. I also tried to find the individual manufacturers information, for instance, Soma, it is a small chocolate factory which start their business in 2003, and Bonnat is a French chocolate maker. It makes sense to me that why I am not familiar with their company manufacturers names. The reasons could be they are foreign brands, small factory or establishing their brand very late.