Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
See other data sources in the #data channel on Slack.
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
You may use another dataset or your own data, but please make sure it is de-identified.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())

The code chunks below are guides; please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

My research question is which factors (company location, review date, bean origin, cocoa percentage, or number of ingredients) are related to chocolate bar rating. The data interests me because it contains over 2,500 reviews of chocolate bars collected over the years and documents different characteristics of these bars. Specifically, I want to find out whether there is any observable trend between the chocolate ratings and these characteristics.

Given your question, what is your expectation about the data?

I expect the data to include both the chocolate ratings (the outcome variable) and the chocolate bars’ characteristics.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

library(tidyverse)
library(skimr)
chocolate <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
head(chocolate)

##    ref company_manufacturer company_location review_date country_of_bean_origin
## 1 2454                 5150           U.S.A.        2019               Tanzania
## 2 2458                 5150           U.S.A.        2019     Dominican Republic
## 3 2454                 5150           U.S.A.        2019             Madagascar
## 4 2542                 5150           U.S.A.        2021                   Fiji
## 5 2546                 5150           U.S.A.        2021              Venezuela
## 6 2546                 5150           U.S.A.        2021                 Uganda
##   specific_bean_origin_or_bar_name cocoa_percent ingredients
## 1            Kokoa Kamili, batch 1           76%    3- B,S,C
## 2                  Zorzal, batch 1           76%    3- B,S,C
## 3           Bejofo Estate, batch 1           76%    3- B,S,C
## 4            Matasawalevu, batch 1           68%    3- B,S,C
## 5            Sur del Lago, batch 1           72%    3- B,S,C
## 6         Semuliki Forest, batch 1           80%    3- B,S,C
##      most_memorable_characteristics rating
## 1         rich cocoa, fatty, bready   3.25
## 2            cocoa, vegetal, savory   3.50
## 3      cocoa, blackberry, full body   3.75
## 4               chewy, off, rubbery   3.00
## 5 fatty, earthy, moss, nutty,chalky   3.00
## 6 mildly bitter, basic cocoa, fatty   3.25

# use skimr to look at the data structure
skim(chocolate)

Data summary
Name	chocolate
Number of rows	2530
Number of columns	10
_______________________
Column type frequency:
character	7
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
company_manufacturer	1	2	39	0	580
company_location	1	4	21	0	67
country_of_bean_origin	1	4	21	0	62
specific_bean_origin_or_bar_name	1	3	51	0	1605
cocoa_percent	1	3	6	0	46
ingredients	1	0	14	87	22
most_memorable_characteristics	1	3	37	0	2487

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ref	1	1429.80	757.65	5	802	1454.00	2079.0	2712	▆▇▇▇▇
review_date	1	2014.37	3.97	2006	2012	2015.00	2018.0	2021	▃▅▇▆▅
rating	1	3.20	0.45	1	3	3.25	3.5	4	▁▁▅▇▇

The data set I chose is loaded directly from the Tidy Tuesday repository on GitHub. It is a collection of chocolate reviews written over the years by Brady Brelinski from the Manhattan Chocolate Society. I didn’t upload the data into the data folder because it is more convenient for me to read it in directly from the URL.

If there are any quirks that you have to deal with, such as NA coded as something else or data split across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

I do notice that the cocoa_percent column is character, so I’m going to try to strip it of the “%” sign and convert it to numeric.

I also realize that there are 87 empty values in the ingredients column. Since I’m planning to use the ingredients in this column later, I’m going to filter out observations with empty values. However, this could potentially influence the trends we observe later on.

chocolate$cocoa_percent<-gsub("%","",as.character(chocolate$cocoa_percent))

chocolate$cocoa_percent <- as.numeric(chocolate$cocoa_percent)

#filter out empty observations in the ingredients column
chocolate <- chocolate %>%
  filter(ingredients != "")
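
As a sanity check on the two cleaning steps above, here is a minimal sketch on a small made-up data frame (the data frame toy and its values are illustrative, not rows from the real data). It checks that stripping the "%" sign yields numeric values without introducing NA, and that the filter drops only the rows with an empty ingredients string.

```r
library(dplyr)

# Toy data for illustration only; not rows from the chocolate data set
toy <- data.frame(
  cocoa_percent = c("76%", "68%", "72.5%"),
  ingredients   = c("3- B,S,C", "", "2- B,S"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  # fixed = TRUE treats "%" as a literal character rather than a regex
  mutate(cocoa_percent = as.numeric(gsub("%", "", cocoa_percent, fixed = TRUE))) %>%
  filter(ingredients != "")

# The conversion should not have introduced any NA values
stopifnot(!anyNA(toy_clean$cocoa_percent))

# Number of rows dropped by the filter
nrow(toy) - nrow(toy_clean)
```

Running the same two checks on the real chocolate data frame would confirm how many rows the filter removes.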

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

I want to observe the relation between chocolate ratings and the number of ingredients in a scatter plot. Since the ingredients column includes both the number of ingredients and the ingredients themselves, I’m going to create another variable (number_ing) that only includes the number of ingredients, using mutate() and case_when().

# check to see the maximum number of ingredients
unique(chocolate$ingredients) # maximum of 6 ingredients

##  [1] "3- B,S,C"       "4- B,S,C,L"     "2- B,S"         "4- B,S,C,V"    
##  [5] "5- B,S,C,V,L"   "6-B,S,C,V,L,Sa" "5-B,S,C,V,Sa"   "4- B,S,V,L"    
##  [9] "2- B,S*"        "1- B"           "3- B,S*,C"      "3- B,S,L"      
## [13] "3- B,S,V"       "4- B,S*,C,L"    "4- B,S*,C,Sa"   "3- B,S*,Sa"    
## [17] "4- B,S,C,Sa"    "4- B,S*,V,L"    "2- B,C"         "4- B,S*,C,V"   
## [21] "5- B,S,C,L,Sa"

# create a variable that only includes the number of ingredients;
# "^" anchors each pattern so grepl only matches the 1st character
chocolate <- chocolate %>%
  mutate(number_ing = case_when(
    grepl("^1", ingredients) ~ 1,
    grepl("^2", ingredients) ~ 2,
    grepl("^3", ingredients) ~ 3,
    grepl("^4", ingredients) ~ 4,
    grepl("^5", ingredients) ~ 5,
    TRUE ~ 6  # the only remaining leading digit is 6
  ))
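
Because every ingredients string listed above starts with the ingredient count, the same variable could also be derived by reading the leading digit directly. This is a sketch of an alternative, not the approach used in this analysis, and it assumes the count is always a single digit at the start of the string.

```r
# A few of the ingredient strings listed above
ingredients <- c("3- B,S,C", "6-B,S,C,V,L,Sa", "2- B,S*", "1- B")

# Take the first character of each string and convert it to an integer
number_ing <- as.integer(substr(ingredients, 1, 1))
number_ing
```

The case_when() version is more explicit about which values are expected, while this version is shorter but would silently return NA for a string that does not start with a digit.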

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

skim(chocolate)

Data summary
Name	chocolate
Number of rows	2443
Number of columns	11
_______________________
Column type frequency:
character	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
company_manufacturer	1	2	39	542
company_location	1	4	21	67
country_of_bean_origin	1	4	21	62
specific_bean_origin_or_bar_name	1	3	51	1567
ingredients	1	4	14	21
most_memorable_characteristics	1	3	37	2403

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ref	1	1451.52	755.52	5	833	1474.00	2100.0	2712	▆▆▇▇▇
review_date	1	2014.49	3.96	2006	2012	2015.00	2018.0	2021	▃▅▇▆▅
cocoa_percent	1	71.50	5.16	42	70	70.00	74.0	100	▁▁▇▁▁
rating	1	3.21	0.43	1	3	3.25	3.5	4	▁▁▅▇▇
number_ing	1	3.04	0.91	1	2	3.00	4.0	6	▆▇▃▂▁

Are the values what you expected for the variables? Why or Why not?

Yes, the values are what I expected for the variables. I successfully converted the variable cocoa_percent from character to numeric, and the newly created variable number_ing only includes the number of ingredients.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

# summarize the rating by company location
chocolate %>% 
  group_by(company_location) %>%
  summarize(mean_rating = mean(rating),
            min = quantile(rating, probs = 0),
            p25 = quantile(rating, probs = 0.25),
            median = quantile(rating, probs = 0.5),
            p75 = quantile(rating, probs = 0.75),
            max = quantile(rating, probs = 1))

## # A tibble: 67 x 7
##    company_location mean_rating   min   p25 median   p75   max
##    <chr>                  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1 Amsterdam               3.31  3     3.19   3.25  3.5   3.75
##  2 Argentina               3.5   3.5   3.5    3.5   3.5   3.5 
##  3 Australia               3.37  2.5   3.12   3.5   3.75  4   
##  4 Austria                 3.26  2.75  3      3.25  3.5   3.75
##  5 Belgium                 3.18  1     3      3.12  3.5   4   
##  6 Bolivia                 3.25  2.75  3      3.25  3.5   3.75
##  7 Brazil                  3.27  2.5   3      3.25  3.5   4   
##  8 Canada                  3.31  2     3      3.25  3.5   4   
##  9 Chile                   3.75  3.75  3.75   3.75  3.75  3.75
## 10 Colombia                3.22  2     3      3.25  3.5   3.75
## # ... with 57 more rows

# summarize the rating by bean origin
chocolate %>% 
  group_by(country_of_bean_origin) %>%
  summarize(mean_rating = mean(rating),
            min = quantile(rating, probs = 0),
            p25 = quantile(rating, probs = 0.25),
            median = quantile(rating, probs = 0.5),
            p75 = quantile(rating, probs = 0.75),
            max = quantile(rating, probs = 1))

## # A tibble: 62 x 7
##    country_of_bean_origin mean_rating   min   p25 median   p75   max
##    <chr>                        <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1 Australia                     3.25  2.75  3      3.25  3.5   3.75
##  2 Belize                        3.24  2.5   3      3.25  3.5   4   
##  3 Blend                         3.09  1     2.75   3     3.5   4   
##  4 Bolivia                       3.18  2     3      3.25  3.5   4   
##  5 Brazil                        3.26  1.75  3      3.25  3.5   4   
##  6 Burma                         3     3     3      3     3     3   
##  7 Cameroon                      3.08  3     3      3     3.12  3.25
##  8 China                         3.5   3.5   3.5    3.5   3.5   3.5 
##  9 Colombia                      3.21  2     3      3.25  3.5   4   
## 10 Congo                         3.32  2.75  3.12   3.25  3.5   3.75
## # ... with 52 more rows

# summarize the rating by number of ingredients
chocolate %>% 
  group_by(number_ing) %>%
  summarize(mean_rating = mean(rating),
            min = quantile(rating, probs = 0),
            p25 = quantile(rating, probs = 0.25),
            median = quantile(rating, probs = 0.5),
            p75 = quantile(rating, probs = 0.75),
            max = quantile(rating, probs = 1))

## # A tibble: 6 x 7
##   number_ing mean_rating   min   p25 median   p75   max
##        <dbl>       <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1          1        2.96  2.5   2.81   3     3      3.5
## 2          2        3.22  2     3      3.25  3.5    4  
## 3          3        3.27  1.5   3      3.25  3.5    4  
## 4          4        3.13  1.5   2.75   3     3.5    4  
## 5          5        3.08  1     2.75   3     3.5    4  
## 6          6        2.94  2.75  2.75   2.75  2.94   3.5

What are your findings about the summary? Are they what you expected?

After grouping the data by company location, the means and ranges of the ratings across these groups are pretty similar to each other. None of the means or medians is surprisingly low (1) or high (4). The same goes for the means and ranges of the ratings after grouping by cocoa bean origin. Overall, I don’t see any trend or relation between ratings and company location or bean origin. This is not different from what I expected.

After grouping the data by the number of ingredients, it is surprising to see that the mean ratings of these groups do not differ much from each other. The highest mean rating belongs to the group with three ingredients, while the lowest mean and median ratings belong to the group with six ingredients, so there isn’t any positive or negative trend in rating as the number of ingredients increases. This is different from what I expected, since I thought that increasing the number of ingredients could improve the flavor of chocolate and therefore increase the rating.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

# rating vs number of ingredients
p1 <- ggplot(chocolate,
             aes(y = rating,
                 x = number_ing)) +
  geom_point() +
  labs(title = "Rating vs Number of Ingredients",
       x = "Number of Ingredients",
       y = "Rating")

# rating vs review date
p2 <- ggplot(chocolate,
             aes(y = rating,
                 x = review_date,
                 color = number_ing)) +
  geom_point() +
  labs(title = "Rating vs Review Date",
       x = "Review Year",
       y = "Rating")

# rating vs cocoa percentage
p3 <- ggplot(chocolate,
             aes(y = rating,
                 x = cocoa_percent,
                 color = number_ing)) +
  geom_point() +
  labs(title = "Rating vs Cocoa Percentage",
       x = "Cocoa Percentage",
       y = "Rating")

p1

p2

p3

# standard deviations of ratings over time
table1 <- chocolate %>%
  group_by(review_date) %>%
  summarize(standard_deviations = sd(rating))
table1

## # A tibble: 16 x 2
##    review_date standard_deviations
##          <int>               <dbl>
##  1        2006               0.660
##  2        2007               0.598
##  3        2008               0.495
##  4        2009               0.439
##  5        2010               0.432
##  6        2011               0.476
##  7        2012               0.464
##  8        2013               0.432
##  9        2014               0.409
## 10        2015               0.382
## 11        2016               0.417
## 12        2017               0.346
## 13        2018               0.389
## 14        2019               0.377
## 15        2020               0.321
## 16        2021               0.344

# standard deviation of rating vs review year
p4 <- ggplot(table1,
             aes(y = standard_deviations,
                 x = review_date)) +
  geom_line() +
  labs(title = "Standard Deviation of Rating over Time",
       x = "Review Year",
       y = "Standard Deviation")
p4

From the plots above, there do not appear to be any negative or positive trends between chocolate bar rating (outcome variable) and the number of ingredients, review date, or cocoa percentage. However, I do notice that the variation in rating seems to decrease over time (see the standard deviation vs review year plot).
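
The ratings shown above take values such as 3.00, 3.25, and 3.50, and the number of ingredients is a whole number, so many points in a plain scatter plot can land on top of each other. One option, sketched here on simulated data (the data frame sim and its values are made up for illustration and are not the chocolate data), is to jitter the points and make them semi-transparent so that dense regions become visible.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the chocolate data; values are illustrative only
sim <- data.frame(
  number_ing = sample(1:6, 300, replace = TRUE),
  rating     = sample(seq(1, 4, by = 0.25), 300, replace = TRUE)
)

p_jitter <- ggplot(sim, aes(x = number_ing, y = rating)) +
  # small random offsets plus transparency reveal overlapping points
  geom_jitter(width = 0.2, height = 0.05, alpha = 0.4) +
  scale_x_continuous(breaks = 1:6) +
  labs(title = "Rating vs Number of Ingredients (jittered)",
       x = "Number of Ingredients",
       y = "Rating")
p_jitter
```

The jitter widths here are arbitrary choices; they only need to be small enough that points stay visually attached to their true values.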

Final Summary (10 points)

Summarize your research question and findings below.

To answer my research question of whether any of the mentioned factors (company location, bean origin, number of ingredients, review date, cocoa percentage) is related to chocolate ratings, I looked at summaries and plots of these factors in relation to chocolate rating.

After grouping the observations by company location, I found that the averages and ranges of the ratings are very similar to each other. There isn’t any particular group whose mean and/or median rating is out of the ordinary (too high or too low). Grouping the observations by bean origin and by the number of ingredients also resulted in no observable trend between the ratings and these factors. Therefore, I conclude that there aren’t any observable relations between chocolate ratings and company location, bean origin, or number of ingredients.

From the plot of rating vs number of ingredients, I found no observable positive or negative trend between the two variables. However, I notice that there are fewer observations for the one- and six-ingredient groups compared to the other groups. This could explain why these groups’ average ratings are lower than the others’, as shown in the summary.
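
Group sizes can be reported next to the group means by adding n() to the summary, which makes it easier to judge how much weight a group's mean deserves. Here is a minimal sketch on made-up data (the data frame toy is illustrative, not the chocolate data):

```r
library(dplyr)

# Illustrative data only: two small groups and one larger one
toy <- data.frame(
  number_ing = c(1, 1, 3, 3, 3, 3, 3, 3, 6),
  rating     = c(3, 2.75, 3.5, 3.25, 3, 3.75, 3.25, 3.5, 2.75)
)

group_summary <- toy %>%
  group_by(number_ing) %>%
  summarize(n = n(),                  # observations per group
            mean_rating = mean(rating),
            sd_rating = sd(rating))   # NA when a group has a single row
group_summary
```

A mean based on only a handful of reviews is much noisier than one based on hundreds, so small groups can look unusually high or low by chance.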

From the rating vs review date scatter plot, although I found no observable negative or positive trend between these variables, I found that the variation in rating decreases over time. This is confirmed by the plot of standard deviation over time. This is interesting, since the result could be due either to the quality of different chocolate brands getting closer to each other over time or to the chocolate tester becoming more consistent at reviewing chocolate.
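
One rough way to check the downward pattern is to correlate the yearly standard deviations reported in the table above with the review year. This is only a sketch and not part of the original analysis: it treats the 16 yearly values as independent points, which is an assumption, and a rank correlation says nothing about why the spread changes.

```r
# Yearly standard deviations of rating, copied from the table above
sd_by_year <- data.frame(
  review_date = 2006:2021,
  standard_deviations = c(0.660, 0.598, 0.495, 0.439, 0.432, 0.476, 0.464,
                          0.432, 0.409, 0.382, 0.417, 0.346, 0.389, 0.377,
                          0.321, 0.344)
)

# Spearman rank correlation: a negative value indicates a decreasing pattern
rho <- cor(sd_by_year$review_date, sd_by_year$standard_deviations,
           method = "spearman")
rho
```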

From the rating vs cocoa percentage plot, I found no observable positive or negative trend between these two variables.

From the summaries and plots above, I found no observable relations between chocolate rating and the mentioned factors. However, I did find that the variation in chocolate ratings decreases over time.

Keep in mind that I removed 87 observations from the data set because they had empty values for ingredients, and this may have influenced these findings. Furthermore, I only looked at summaries and scatter plots to reach these conclusions. To be more precise about the significance of these factors in relation to chocolate rating, we would have to choose an appropriate model and fit it using these factors to look at their influence on ratings.
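
As one possible next step, a linear regression could estimate how each numeric factor relates to rating while holding the others fixed. The sketch below uses simulated data (the data frame sim and the coefficients used to generate it are invented for illustration) and assumes a simple additive linear model, which may or may not suit the real data.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the chocolate data; values are illustrative only
sim <- data.frame(
  cocoa_percent = runif(n, 55, 100),
  number_ing    = sample(1:6, n, replace = TRUE),
  review_date   = sample(2006:2021, n, replace = TRUE)
)
sim$rating <- 3.2 - 0.005 * (sim$cocoa_percent - 70) + rnorm(n, sd = 0.4)

# Additive linear model of rating on the three numeric factors
fit <- lm(rating ~ cocoa_percent + number_ing + review_date, data = sim)

# Estimates, standard errors, t values, and p values for each term
summary(fit)$coefficients
```

On the real data, categorical factors such as company_location could be added to the formula as well, although with 67 locations many of the groups would be small.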

Are your findings what you expected? Why or Why not?

My findings are different from my original expectation. I expected some relationship or trend between chocolate ratings and cocoa percentage, number of ingredients, and bean origin, since these are the characteristics that consumers often look for when buying chocolate. Finding that these factors have no observable relation to chocolate ratings makes me question whether there are other factors that influence chocolate ratings.

Midterm

Ngoc Le

2022-02-13