Midterm

Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
See other data sources in the #data channel on Slack.
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
You may use another dataset or your own data, but please make sure it is de-identified.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (i.e. left_join())
pivot_*() (i.e. pivot_longer())

The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I am a chocolate lover and I enjoy all types of chocolate. Luckily I found the chocolatebar dataset from TidyTuesday on GitHub website so I could dedicate this project to chocolate. I’ve used this dataset for my function of the week, and I’d like to keep exploring this dataset in the midterm project.
The specific questions that I’m interested in is to see if there is a correlation between percentage of cocoa and the rating of the chocolate, and if adding different numbers of ingredients would correlate with the rating score as well. I also want to know where the most popular beans are among the countries of origin and who the company manufacturers are with relatively high rating.

Given your question, what is your expectation about the data?

Intuitively, customers do not like the chocolate with high cocoa percentage. More ingredients can maximize chocolate’s flavor, therefore the chocolate bars with more ingredients probably have better reviews scores. I’m expecting the most popular beans come from the countries where located in the southern hemisphere. I’ve never heard of the chocolate manufacturers in this datasets, so I’d like to know which of them have the best reviews.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

#tuesdata <- tt_load('2022-01-18')
#ChocolateBar <-  tuesdata$chocolate

ChocolateBar <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

glimpse(ChocolateBar)

## Rows: 2,530
## Columns: 10
## $ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2~
## $ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150~
## $ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.~
## $ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2~
## $ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma~
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat~
## $ cocoa_percent                    <chr> "76%", "76%", "76%", "68%", "72%", "8~
## $ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "~
## $ most_memorable_characteristics   <chr> "rich cocoa, fatty, bready", "cocoa, ~
## $ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3~

# Check all data frame columns and get a sum of NA values in them
cbind(lapply(lapply(ChocolateBar, is.na), sum))

##                                  [,1]
## ref                              0   
## company_manufacturer             0   
## company_location                 0   
## review_date                      0   
## country_of_bean_origin           0   
## specific_bean_origin_or_bar_name 0   
## cocoa_percent                    0   
## ingredients                      87  
## most_memorable_characteristics   0   
## rating                           0

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

After I had an general idea of the dataset, I made a list about what I wanted to do in the next step, transforming the data:

Remove the variables which are not relative to my research questions: ref, company_location, review_date, specific_bean_origin_or_bar_name, and most_memorable_characteristics.
Remove % sign for cocoa_percent variable, and then convert it into numeric variable.
Separate ingredients variable and only keep the number of ingredients in the chocolate, delete the information of the types of ingredients. Then convert it into factor variable(for model building purpose).
For 87 missing values “NA” in character variable ingredients, I will remove them from my dataset in the next section.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# Remove the variables that not needed
ChocolateBar <- ChocolateBar %>% 
  select(-ref, -company_location, -review_date, -specific_bean_origin_or_bar_name, -most_memorable_characteristics)
  
  
# Remove % sign 
ChocolateBar <- ChocolateBar %>% 
  mutate(cocoa_percent = as.numeric(gsub("%", "", cocoa_percent)))


# Separate ingredients variable
ChocolateBar <- ChocolateBar %>%
  separate(
    col = ingredients,
    into = c("ingredients", "type"),
    sep = "-") %>%
  mutate(ingredients = na_if(ingredients, "NA"),
         type = na_if(type, "NA")) %>% 
  select(-type) %>% 
  mutate(ingredients = as.factor(ingredients)) %>% 
  drop_na()

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

skim(ChocolateBar)

Data summary
Name	ChocolateBar
Number of rows	2443
Number of columns	5
_______________________
Column type frequency:
character	2
factor	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
company_manufacturer	0	1	2	39	0	542	0
country_of_bean_origin	0	1	4	21	0	62	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
ingredients	0	1	FALSE	6	3: 1023, 2: 750, 4: 469, 5: 191

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
cocoa_percent	0	1	71.50	5.16	42	70	70.00	74.0	100	▁▁▇▁▁
rating	0	1	3.21	0.43	1	3	3.25	3.5	4	▁▁▅▇▇

Are the values what you expected for the variables? Why or Why not?

After transformed my variables as I planed, I got two character variables(company_manufacturer and country_of_bean_origin), two numeric variables (cocoa_percent and rating) and one factor variable(ingredients). They are the variables with right formats as I expected, and I’m going to use them for answering my research questions.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ChocolateBar %>%
  summarise(across(where(is.character), n_distinct))

## # A tibble: 1 x 2
##   company_manufacturer country_of_bean_origin
##                  <int>                  <int>
## 1                  542                     62

# country_of_bean_origin
ChocolateBar %>%
  tabyl(country_of_bean_origin) %>%  
  arrange(desc(n)) %>% 
  top_frac(.25) %>%  
  gt()

country_of_bean_origin	n	percent
Venezuela	246	0.10069587
Peru	231	0.09455587
Dominican Republic	220	0.09005321
Ecuador	201	0.08227589
Madagascar	171	0.06999591
Blend	144	0.05894392
Nicaragua	100	0.04093328
Bolivia	79	0.03233729
Colombia	78	0.03192796
Tanzania	78	0.03192796
Brazil	77	0.03151862
Belize	74	0.03029063
Vietnam	73	0.02988129
Guatemala	62	0.02537863
Mexico	54	0.02210397

ChocolateBar %>% 
  group_by(country_of_bean_origin) %>% 
  filter(n() >= 54) %>% 
  mutate(count = n()) %>% 
  ggplot(aes(x = reorder(country_of_bean_origin, count))) + 
  geom_bar(fill="lightblue") + 
  coord_flip() + 
  geom_text(stat = "count",aes(label = ..count..)) +
  labs(x = 'Country of bean origin', y = 'Count', title = 'Top 25% most frequently used country of bean origins') + 
  theme_minimal()

# company_manufacturer
ChocolateBar %>%
  tabyl(company_manufacturer) %>%  
  arrange(desc(n)) %>% 
  top_frac(.05) %>%  
  gt()

company_manufacturer	n	percent
Soma	56	0.022922636
Fresco	39	0.015963979
Arete	32	0.013098649
Bonnat	29	0.011870651
A. Morin	26	0.010642652
Dandelion	25	0.010233320
Pralus	25	0.010233320
Domori	22	0.009005321
Guittard	22	0.009005321
Valrhona	22	0.009005321
Zotter	21	0.008595989
Castronovo	19	0.007777323
Dick Taylor	19	0.007777323
Coppeneur	18	0.007367990
Mast Brothers	18	0.007367990
Duffy's	17	0.006958657
Pierre Marcolini	17	0.006958657
Scharffen Berger	17	0.006958657
Smooth Chocolator, The	17	0.006958657
Rogue	16	0.006549325
Artisan du Chocolat	15	0.006139992
Letterpress	15	0.006139992
Palette de Bine	15	0.006139992
Szanto Tibor	15	0.006139992
Bittersweet Origins	14	0.005730659
Fruition	14	0.005730659
Hotel Chocolat (Coppeneur)	14	0.005730659
Map Chocolate	14	0.005730659
Sirene	14	0.005730659
Tejas	14	0.005730659

ChocolateBar %>%
  group_by(company_manufacturer) %>% 
  filter(n() >= 14) %>%
  summarize(mean_rating = mean(rating)) %>%
  arrange(desc(mean_rating)) %>%  
  gt()

company_manufacturer	mean_rating
Soma	3.589286
Bonnat	3.534483
Arete	3.531250
Domori	3.522727
Smooth Chocolator, The	3.514706
Dick Taylor	3.486842
Duffy's	3.441176
Letterpress	3.433333
Fruition	3.428571
A. Morin	3.423077
Rogue	3.421875
Fresco	3.416667
Szanto Tibor	3.400000
Pierre Marcolini	3.397059
Sirene	3.392857
Castronovo	3.381579
Zotter	3.321429
Valrhona	3.318182
Dandelion	3.300000
Bittersweet Origins	3.267857
Coppeneur	3.250000
Map Chocolate	3.250000
Palette de Bine	3.233333
Scharffen Berger	3.220588
Pralus	3.180000
Guittard	3.170455
Artisan du Chocolat	3.166667
Mast Brothers	3.125000
Hotel Chocolat (Coppeneur)	3.053571
Tejas	3.017857

ChocolateBar %>%
  group_by(company_manufacturer) %>% 
  filter(n() >= 14) %>% 
  mutate(mean_rating = mean(rating)) %>%
  ggplot(aes(reorder(company_manufacturer, mean_rating), rating, fill = mean_rating)) + 
  geom_boxplot() + 
  coord_flip()  + 
  labs(x = 'Company manufacturer', y = 'Rating', title = 'Top 5% most frequently rated company manufacturers') + 
  scale_fill_continuous(name = "Mean rating") + 
  theme_minimal()

What are your findings about the summary? Are they what you expected?

I have 62 different countries of beans origin and 542 chocolate company manufacturers. I choose the top 25% country of origins with most frequently used chocolate beans and top 5% chocolate company manufacturers with most frequently rated times. Based on above plot, it appears that Venezula is the most popular country of chocolate bean’s origin and Soma chocolate maker has the higest rating. The results are as expected. I’m going to explain the details in final summary section.

ChocolateBar %>% 
  tabyl(ingredients) %>%  
  gt()

ingredients	n	percent
1	6	0.002455997
2	750	0.306999591
3	1023	0.418747442
4	469	0.191977077
5	191	0.078182562
6	4	0.001637331

# numbers of ingredients vs. the rating of the chocolate
ChocolateBar %>% 
  group_by(ingredients) %>% 
  ggplot(aes(x = ingredients, y = rating)) + 
  geom_boxplot() + 
  labs(title = "Rating vs. Groups with 1 - 6 ingredients", x = 'Numbers of ingredients', y = 'Rating') + 
  theme_minimal()

# percentage of cocoa vs. the rating of the chocolate
ChocolateBar %>% 
  ggplot(aes(x = cocoa_percent, y = rating)) + 
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE, col = 'blue') + 
  labs(title = "Correlation of Cocoa percentage vs. Rating", x = 'Cocoa percentage', y = 'Rating') +
  theme_minimal()

What are your findings about the summary?

Three ingredients in chocolate appears the highest frequency. The above boxplot also shows that the distribution of rating values is different among the six levels of ingredients. Two and three ingredients have the highest median value while six ingredients has a lowest median value. In the second scatter plot, with increasing of cocoa percentage, the rating of chocolate bar slightly decreases.

# linear regression model 
model <- lm(rating ~ cocoa_percent, data = ChocolateBar)
summary(model)

## 
## Call:
## lm(formula = rating ~ cocoa_percent, data = ChocolateBar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.20045 -0.21968  0.03032  0.28673  0.89570 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.668419   0.120159  30.530  < 2e-16 ***
## cocoa_percent -0.006410   0.001676  -3.824 0.000134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4272 on 2441 degrees of freedom
## Multiple R-squared:  0.005956,   Adjusted R-squared:  0.005548 
## F-statistic: 14.63 on 1 and 2441 DF,  p-value: 0.0001344

model <- lm(rating ~ cocoa_percent + ingredients, data = ChocolateBar)
summary(model)

## 
## Call:
## lm(formula = rating ~ cocoa_percent + ingredients, data = ChocolateBar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04543 -0.28104  0.00846  0.27519  0.94620 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.795007   0.245475  15.460  < 2e-16 ***
## cocoa_percent -0.008367   0.001749  -4.782 1.84e-06 ***
## ingredients2   0.026235   0.179641   0.146    0.884    
## ingredients3   0.074040   0.179637   0.412    0.680    
## ingredients4  -0.075624   0.180720  -0.418    0.676    
## ingredients5  -0.138801   0.183154  -0.758    0.449    
## ingredients6  -0.250919   0.276482  -0.908    0.364    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4218 on 2436 degrees of freedom
## Multiple R-squared:  0.03287,    Adjusted R-squared:  0.03048 
## F-statistic:  13.8 on 6 and 2436 DF,  p-value: 1.772e-15

What are your findings about the summary?

According to the model summary, cocoa_percent is significantly related with rating. However, the of 0.6% is a fairly small value. It represents only 0.6% of data fit this model. After adding variabale ingredients into the model, the greatly increased to 3% but non of the levels of ingredients is significant.

Final Summary (10 points)

Summarize your research question and findings below.

In summary, regardless of the R-squared value, there is a significant relationship between cocoa percentage and the rating of the chocolate bar. Although ingredients could help to explain some of the variation in the rating variable around its mean, a statistically significant relationship is not found between ingredients and the rating scores. To further explore, to my way of thinking, we might need to include more variable in the model for a better prediction. The top three most popular beans are from Venezuela, Peru and Dominican Republic. The highest frequently rated with the highest rating score company manufacturers are Soma, Bonnat and Arete.

Are your findings what you expected? Why or Why not?

The findings generally meet my expectation. The linear regression models shows the evidence that cocoa percentage is significantly correlated to the rating score. But non of the level of ingredients was significantly correlated with the rating, which is a bit unexpected for me. Cacao trees must grow in a limited geographical zone, with hotter tropical climates and lower elevations. Hence, the top three of most frequently used country of bean origins: Venezuela, Peru and Dominican Republic, are all the habitat which meet the growing requirement. For the highest rating score company manufacturers, I googled their names and found their information on the website of Academy of Chocolate. I also tried to find the individual manufacturers information, for instance, Soma, it is a small chocolate factory which start their business in 2003, and Bonnat is a French chocolate maker. It makes sense to me that why I am not familiar with their company manufacturers names. The reasons could be they are foreign brands, small factory or establishing their brand very late.