BSTA 504 Midterm

Define Your Research Question (10 points)

Loading the Data (10 points)

Transforming the data (15 points)

Visualizing and Summarizing the Data (15 points)

Final Summary (10 points)

Define Your Research Question (10 points)

Loading the Data (10 points)

Transforming the data (15 points)

Visualizing and Summarizing the Data (15 points)

Final Summary (10 points)

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I am interested in policy surrounding obesity management and the data we collect to understand it. I want to know from this data if being older means you are more likely to be obese, and will do less vigorous exercise. This then may cause a higher BP. My dad has issues with BP and weight gain, and I see his health has gone down hill as his weight has gone up. I’m wondering if this also happens in younger people as well, or if their BP stays low despite higher BMI’s.

Given your question, what is your expectation about the data?

Most likely, older people will account for the higher percentage of obesity, and they will do less exercise. According to the e-magazine Harvard Health, systolic BP can be a good measure of overall health. This will be used in conjunction with BMI to ask, is health declining as people age due to weight gain?

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

library(aplore3)
summary(nhanes)

##        id          gender          age                   marstat    
##  Min.   :   1   Male  :3164   Min.   :16.00   Married        :3019  
##  1st Qu.:1621   Female:3318   1st Qu.:30.00   Widowed        : 505  
##  Median :3242                 Median :46.00   Divorced       : 643  
##  Mean   :3242                 Mean   :46.34   Separated      : 195  
##  3rd Qu.:4862                 3rd Qu.:62.00   Never Married  :1034  
##  Max.   :6482                 Max.   :80.00   Living Together: 457  
##                                               NA's           : 629  
##     samplewt           psu           strata           tchol      
##  Min.   :  4084   Min.   :1.00   Min.   : 1.000   Min.   : 90.0  
##  1st Qu.: 16460   1st Qu.:1.00   1st Qu.: 4.000   1st Qu.:162.0  
##  Median : 24217   Median :2.00   Median : 7.000   Median :188.0  
##  Mean   : 34546   Mean   :1.51   Mean   : 7.225   Mean   :192.1  
##  3rd Qu.: 50435   3rd Qu.:2.00   3rd Qu.:11.000   3rd Qu.:218.0  
##  Max.   :153810   Max.   :2.00   Max.   :15.000   Max.   :383.0  
##                                                   NA's   :395    
##       hdl             sysbp          dbp               wt        
##  Min.   : 11.00   Min.   : 90   Min.   : 40.00   Min.   : 33.20  
##  1st Qu.: 41.00   1st Qu.:110   1st Qu.: 62.00   1st Qu.: 65.90  
##  Median : 50.00   Median :120   Median : 70.00   Median : 77.60  
##  Mean   : 52.41   Mean   :123   Mean   : 69.59   Mean   : 80.38  
##  3rd Qu.: 61.00   3rd Qu.:132   3rd Qu.: 78.00   3rd Qu.: 91.80  
##  Max.   :144.00   Max.   :220   Max.   :134.00   Max.   :159.10  
##  NA's   :395      NA's   :553   NA's   :594      NA's   :37      
##        ht             bmi         vigwrk      modwrk      wlkbik    
##  Min.   :123.3   Min.   :13.18   Yes :1131   Yes :2190   Yes :1807  
##  1st Qu.:160.0   1st Qu.:23.99   No  :5350   No  :4291   No  :4674  
##  Median :166.9   Median :27.68   NA's:   1   NA's:   1   NA's:   1  
##  Mean   :167.4   Mean   :28.62                                      
##  3rd Qu.:175.0   3rd Qu.:32.15                                      
##  Max.   :202.7   Max.   :65.19                                      
##  NA's   :37      NA's   :37                                         
##  vigrecexr   modrecexr       sedmin       obese     
##  Yes :1402   Yes :2493   Min.   :  0.0   No  :5459  
##  No  :5079   No  :3987   1st Qu.:180.0   Yes : 986  
##  NA's:   1   NA's:   2   Median :300.0   NA's:  37  
##                          Mean   :321.1              
##                          3rd Qu.:480.0              
##                          Max.   :840.0              
##                          NA's   :79

write.csv(x = nhanes, file = "nhanes_new.csv", row.names = FALSE)
nhanes_new <- read_csv("~/Desktop/BSTA 504/nhanes_new.csv")

## Rows: 6482 Columns: 21

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): gender, marstat, vigwrk, modwrk, wlkbik, vigrecexr, modrecexr, obese
## dbl (13): id, age, samplewt, psu, strata, tchol, hdl, sysbp, dbp, wt, ht, bm...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nhanes_new<-na.omit(nhanes_new)
skimr::skim(nhanes_new)

Data summary

Name	nhanes_new
Number of rows	4915
Number of columns	21
_______________________
Column type frequency:
character	8
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
gender	1	4	6	2
marstat	1	7	15	6
vigwrk	1	2	3	2
modwrk	1	2	3	2
wlkbik	1	2	3	2
vigrecexr	1	2	3	2
modrecexr	1	2	3	2
obese	1	2	3	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	3252.09	1869.78	1.00	1641.50	3262.00	4871.50	6482.00	▇▇▇▇▇
age	1	49.27	17.73	20.00	34.00	48.00	63.00	80.00	▇▇▇▆▆
samplewt	1	36257.60	26599.99	4084.48	17990.12	25259.46	55242.77	153810.26	▇▂▂▁▁
psu	1	1.52	0.50	1.00	1.00	2.00	2.00	2.00	▇▁▁▁▇
strata	1	7.10	4.13	1.00	3.00	7.00	11.00	15.00	▇▇▆▆▅
tchol	1	195.74	40.94	92.00	166.00	192.00	221.00	383.00	▂▇▅▁▁
hdl	1	52.75	16.34	15.00	41.00	50.00	62.00	144.00	▃▇▂▁▁
sysbp	1	123.98	18.45	90.00	112.00	122.00	134.00	220.00	▇▇▂▁▁
dbp	1	70.44	11.80	40.00	62.00	70.00	78.00	120.00	▂▇▆▁▁
wt	1	81.19	19.83	33.20	67.00	78.50	92.60	159.10	▂▇▅▁▁
ht	1	167.46	10.23	135.40	160.10	167.20	175.00	202.70	▁▅▇▃▁
bmi	1	28.88	6.30	13.18	24.38	28.04	32.34	64.97	▂▇▂▁▁
sedmin	1	311.31	184.24	0.00	180.00	300.00	420.00	840.00	▅▇▆▂▁

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

I took out all the NA rows, since there are so many data points and so few of the are NA.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Age

I turned age into a categorical variable. Based on the hitogram below, it looks like good bins would be 10 year increments, and starting at 30 and ending at 70. It was noted that this variable is fairly normally distirbuted, but does look a bit strange since there are pretty equal amounts of subjects in each age group.

histogram(nhanes_new$age)

nhanes_new<-
nhanes_new %>%
  mutate(
    age_category = case_when(
      age <30 ~ "<30",
      (age >= 30 & age <40) ~ "[30,40)",
      (age >= 40 & age<50) ~ "[40,50)",
      (age >= 50 & age<60) ~ "[50,60)",
       (age >= 60 & age<70) ~ "[60,70)",
      age >= 70 ~ ">=70"
    )
  )


nha <-  nhanes_new %>% 
  tabyl(age_category)
nha

##  age_category   n   percent
##       [30,40) 813 0.1654120
##       [40,50) 898 0.1827060
##       [50,60) 761 0.1548321
##       [60,70) 767 0.1560529
##           <30 843 0.1715158
##          >=70 833 0.1694812

nhanes_new$age_category <- factor(nhanes_new$age_category)

levels(nhanes_new$age_category)

## [1] "[30,40)" "[40,50)" "[50,60)" "[60,70)" "<30"     ">=70"

nhanes_new$age_category <- factor(nhanes_new$age_category, 
                        levels = c("<30", "[30,40)", "[40,50)","[50,60)","[60,70)",">=70" ))

table(nhanes_new$age_category)

## 
##     <30 [30,40) [40,50) [50,60) [60,70)    >=70 
##     843     813     898     761     767     833

BMI

According to the CDC, the following BMI categories are good measurement categories:

Underweight: Under 18.5
Healthy: 18.5 - 25
Overweight: 25 - 30
Obese: Over 30

These categories were used here as well. This was done in order to group subjects into BMI bins, so that I could look at other variables at the same time. It was noted that this variable is normally distributed.

histogram(nhanes_new$bmi)

nhanes_new<-
nhanes_new %>%
  mutate(
    bmi_category = case_when(
       bmi <18.5 ~ "Underweight",
      ( bmi>= 18.5 &  bmi<25) ~ "Healthy",
      ( bmi >= 25 &  bmi<30) ~ "Overweight",
       bmi >= 30 ~ "Obese"
    )
  )


nhab <-  nhanes_new %>% 
  tabyl(bmi_category)
nhab

##  bmi_category    n    percent
##       Healthy 1327 0.26998983
##         Obese 1818 0.36988810
##    Overweight 1697 0.34526958
##   Underweight   73 0.01485249

nhanes_new$bmi_category <- factor(nhanes_new$bmi_category)

levels(nhanes_new$bmi_category)

## [1] "Healthy"     "Obese"       "Overweight"  "Underweight"

nhanes_new$bmi_category <- factor(nhanes_new$bmi_category, 
                        levels = c("Underweight", "Healthy", "Overweight","Obese" ))

table(nhanes_new$bmi_category)

## 
## Underweight     Healthy  Overweight       Obese 
##          73        1327        1697        1818

Are the values what you expected for the variables? Why or Why not?

I was surprised to see so many people in the obese category. About 80% of the subjects are either overweight or obese. I didn’t expect this, because I thought BMI would follow a normal distribution in which a healthy BMI was in the center.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

by_marstat <- nhanes_new %>% group_by(marstat)
by_marstat

## # A tibble: 4,915 × 23
## # Groups:   marstat [6]
##       id gender   age marstat      samplewt   psu strata tchol   hdl sysbp   dbp
##    <dbl> <chr>  <dbl> <chr>           <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1 Male      34 Married        80101.     1      9   135    50   114    88
##  2     3 Female    60 Widowed        20090.     2      1   202    45   154    70
##  3     4 Male      26 Married        22538.     1     14   160    45   102    50
##  4     5 Female    49 Living Toge…   74212.     2     11   259    45   118    82
##  5     6 Male      80 Married        11998.     1      3   182    75   142    62
##  6     7 Male      80 Widowed        21807.     1      5   148    49   126    62
##  7    11 Female    45 Married        86636.     2      4   225    82   106    62
##  8    12 Male      28 Never Marri…   23137.     1      2   299    51   108    62
##  9    14 Male      44 Divorced       90640.     2      5   197    37   118    90
## 10    15 Male      66 Married        56053.     2     12   193    26   124    64
## # … with 4,905 more rows, and 12 more variables: wt <dbl>, ht <dbl>, bmi <dbl>,
## #   vigwrk <chr>, modwrk <chr>, wlkbik <chr>, vigrecexr <chr>, modrecexr <chr>,
## #   sedmin <dbl>, obese <chr>, age_category <fct>, bmi_category <fct>

by_tab <- nhanes_new %>% group_by(bmi_category,vigwrk)
summarize(by_tab,avg= mean(bmi,na.rm=TRUE))

## `summarise()` has grouped output by 'bmi_category'. You can override using the `.groups` argument.

## # A tibble: 8 × 3
## # Groups:   bmi_category [4]
##   bmi_category vigwrk   avg
##   <fct>        <chr>  <dbl>
## 1 Underweight  No      17.4
## 2 Underweight  Yes     17.6
## 3 Healthy      No      22.4
## 4 Healthy      Yes     22.5
## 5 Overweight   No      27.4
## 6 Overweight   Yes     27.5
## 7 Obese        No      35.4
## 8 Obese        Yes     35.3

What are your findings about the summary? Are they what you expected?

These are what I expected yes. It looks like on average, the age of underweight people is under 18, healthy is about 23 years old and subjects overweight are over 27. In terms of age, the vigorous work routine doesn’t seem to change much however between the groups within each bmi category. I thought the age variation would be more, thinking perhaps younger people would have more vigorous activity.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

This first plot shows the difference between vigorous activity and no vigorous activity in subjects grouped by obesity scoring. Age is on the y axis. Observations that standout here are that across the age groups, more people are likely to be not engaging in vigorous activity than engaging in it. Less people who are older than 50 are engaging in vigorous exercise. Similar proportions of people engaging in vigorous exercise are obese and underweight. It’s possible exercise is not a good marker for whether or not someone is obese.

p<-ggplot(nhanes_new) +
  
  aes(x = bmi_category, 
      y = age, 
      fill = vigrecexr) +
  
  geom_boxplot()+

  labs(title = "Boxplot of Age and Obesity",
       x = "Obesity",
       y = "Age (Years)") 

p +
  theme_minimal() +
  scale_fill_viridis_d()

This plot shows that BMI and systolic BP are positively correlated, as expected. The positive trend is more striking with people in younger age groups. In subjects 60 and over, this trend flattens out somewhat, and overall BP is higher. Because of different health conditions, it is possible that systolic BP increasing as we age is something to be expected.

ggplot(nhanes_new, aes(x = sysbp,
                           y = bmi,
                           color = age_category)) + geom_point(size = 0.5) + labs(title="Plot of BMI vs Systolic BP",
        x ="Systolic BP", y = "BMI")+
    facet_wrap(~ age_category)

In the last plot, again we can see that vigorous activity is difficult to note trends in. It appears that activity is fairly equally distributed in all groups below, regardless of age, BMI, or BP. We can perhaps see that the last graph, of people over 70, has less subject’s attesting to vigorous activity that perhaps the other age groups. This is perhaps giving us a sense that it is less important what people do, and more important what they eat, in terms of who become obese.

ggplot(nhanes_new, aes(x = sysbp,
                           y = bmi,
                           color = vigwrk)) + geom_point(size = 0.5) + labs(title="Plot of BMI vs Systolic BP",
        x ="Systolic BP", y = "BMI")+
    facet_wrap(~ age_category)

Final Summary (10 points)

Summarize your research question and findings below.

It looks like as predicted, people get higher blood pressures as their BMI goes up, and both go up with age. This is what was predicted. Also, it looks like by eye people in all age groups do varying amount of “vigorous” activity. However, subjects who reported No seem to be more ubiquitous and seem to be the outliers in the graphs. Overall, it looks like health goes down as people age and gain more weight, and do less activity. According to Harvard Health, an e-magazine, blood pressure is a good measure of overall health. With this in mind, it looks like overall health declines as people gain weight, and people gain weight as they age.

Are your findings what you expected? Why or Why not?

It is not surprising that people lose their health as they age, caused in major part by gaining weight. I was surprised that young people didn’t have a higher rate of obesity, but this data is older. Perhaps newer data would shine more of a light on younger people and their health issues. It was very surprising that vigorous exercise didn’t seem to play a part in people having lower weights. But perhaps this is just because the variable I chose is “vigorous” exercise. Probably, this is going to be a smaller group of people than people who do less exercise, but perhaps excersise every day. It would be interesting to look at levels of exercise, and see if people that engage in less vigorous exercise actually have healthier weights than those who do vigorous exercise.