---
title: "Stat7350-Session4_Tidy-Data"
author: "Aleeza Gerstein"
date: '2019-03-13'
output: pdf_document
geometry: margin=2cm
---

<!-- This file was created by Aleeza Gerstein, adapted from http://girke.bioinformatics.ucr.edu/GEN242/mydoc_Rbasics_12.html, http://byrneslab.net/classes/data_science/lectures/12_data_tidy.html, as well as work by Amelia McNamara that is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and -->


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

#Learning Objectives

## Part I: Transform Data
* Extract variables with `select()`  
* Extract cases with `filter()`  
* Arrange cases, with `arrange()`  
* Make tables of summaries with `summarise()`  
* Make new variables, with `mutate()`  
* Do groupwise operations with `group_by()`  
* Use `_join()` to join and filter datasets

## Part II: Tidy Data
* Reshape data so that it is tidy

# Pre-analysis workflow: packages & data prep

## Packages
```{r}
suppressMessages(library(tidyverse))
suppressMessages(library(gridExtra))
suppressMessages(library(Hmisc)) #%nin%
suppressMessages(library(skimr))
```

### Load Data for Part I
We're going to use a dataset that is built-in to the tidyverse, the iris dataset.
We'll also load our old survey dataset to use for practice.


```{r}
head(iris)
iris_df <- as_tibble(iris)
sc <- read_csv("data_output/surveys_complete.csv")
```

It has 150 rows. Let's add a new vector with row number.

```{r}
row <- 1:nrow(iris_df)
bind_cols(iris_df, row = row)
```

## Part I: Transform Data

### Keep a subset of columns

Using base R:
```{r}
iris_df[iris_df[, "Sepal.Length"] > 7.5 & iris_df[, "Species"]=="virginica", ]
```

With the tidyverse:
```{r}
filter(iris_df, Sepal.Length > 7.5, Species=="virginica")
```

```{r}
filter(iris_df, Sepal.Length > 7.5 | Sepal.Length < 5.5, Species=="virginica")
```

### Subset by rows
```{r}
iris_df[1:2,]
```

```{r}
slice(iris_df, 1:2)
```

### Select specific columns
```{r}
select(iris_df, Species, Petal.Length, Sepal.Length)
```

```{r}
select(iris_df, Sepal.Length : Petal.Width)
```

```{r}
select(iris_df, starts_with("S"))
```

### Drop specific columns
```{r}
select(iris_df, -(Sepal.Length : Petal.Width))
```

### Ordering with arrange (row-wise ordering based on specific columns)
```{r}
arrange(iris_df, Species, Sepal.Length, Sepal.Width)
```

```{r}
arrange(iris_df, desc(Species), Sepal.Length, Sepal.Width)
```

Base R code equivalent
```{r}
iris_df[order(iris_df$Species, iris_df$Sepal.Length, iris_df$Sepal.Width), ]
```

***

#### CHALLENGE:
With the `sc` dataset, filter all of the individuals from the `genus` is Dipodomys and `hindfoot_length` equal to or less than 35.

***

All of this can also be accomplished using the pipe. e.g.,
```{r}
iris_df %>% 
  select(3:5) %>% 
  filter(Petal.Length < 2) %>% 
  arrange(Petal.Width)
```

***

#### CHALLENGE: Use `%>%` to write a sequence of functions that: 

1. Filter `sc` to keep just the male records from 1980
2. Select the `month`, `day`, 'species' and 'hindfoot_length', columns 
3. Arrange the results so that they are organized by calendar date

***

### Rename columns 
```{r}
rename(iris_df, new_col_name = Species)
```

### Obtain unique rows with distinct

```{r}
distinct(iris_df, Species, .keep_all=TRUE)
```

### Add columns with mutate

```{r}
mutate(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)
```

### Transmute
This does the same thing as `mutate` but drops existing columns.

```{r}
transmute(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)
```


### Summarize and group by

```{r}
summarise(iris_df, mean(Petal.Length))
```

Summarize on many columns
```{r}
summarise_all(iris_df[,1:4], mean)
```

Summarize by grouping column
```{r}
iris_df %>% 
  group_by(Species) %>% 
  summarise(mean = mean(Sepal.Length))
```

Note you can also use dplyr without the pipe and with a mix of base formatting
```{r}
summarise(group_by(iris_df, Species), mean(Sepal.Length))
```

Summarise all columns
```{r}
iris_df %>% 
  group_by(Species) %>% 
  summarise_all(mean)
```


```{r}
sc %>% 
  filter(weight < 150) %>% 
  summarise(mean = mean(weight), n = n())
```

```{r}
sc %>% 
  group_by(weight < 150) %>% 
  summarise(mean_weight = mean(weight), n = n())
```


***

#### CHALLENGE: 
For `sc` divide the dataset into records with `weight` less than 150 and higher than 150 and then use `summarise` to find out the mean of each group and the number of records in each group

***

### Merging tibbles
There are several join functions for merging tibbles by a common key column similar to the merge function in base R. These `_join()?` functions include:

* `inner_join()`: returns join only for rows matching among both tibbles   
* `full_join()`: returns join for all (matching and non-matching) rows of two tibbles  
* `left_join()`: returns join for all rows in first tibble  
* `right_join()`: returns join for all rows in second tibble  
* `anti_join()`: returns for first tibble only those rows that have no match in the second one  

Toy data:
```{r}
```{r}
band <- tribble(
   ~name,     ~band,
  "Mick",  "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

instrument <- tribble(
    ~name,   ~plays,
   "John", "guitar",
   "Paul",   "bass",
  "Keith", "guitar"
)

instrument2 <- tribble(
    ~artist,   ~plays,
   "John", "guitar",
   "Paul",   "bass",
  "Keith", "guitar"
)
```
### Mutating joins

```{r}
band %>% left_join(instrument, by = "name")
```
```

```{r}
band %>% right_join(instrument, by = "name")
```

```{r}
band %>% full_join(instrument, by = "name")
```

```{r}
band %>% inner_join(instrument, by = "name")
```

## Load Data for Part II

### Long versus wide
 Long data formats have one observation and one measurement per row. So, multiple rows constitude a single observation.

Wide data has a every measurement in a single observation in a single row.

#### Long to wide
Data is often recorded in a long format for efficiency. We're going to use a toy mammal dataset to illustrate. For every site a researcher visited, they wrote down the species of mammals they saw, and the density of those mammals. As this was just a running tally, the data ended up in a long format:

```{r}
mammals <- tibble(site = c(1,1,2,3,3,3), 
                      taxon = c('Suncus etruscus', 'Sorex cinereus', 
                                'Myotis nigricans', 'Notiosorex crawfordi', 
                                'Scuncus etruscus', 'Myotis nigricans'),
                      density = c(6.2, 5.2, 11.0, 1.2, 9.4, 9.6)
)
mammals
```

But, what if we wanted to easily compare abundances across sites? We can use `spread` to change the data into a wide format. 

`Spread` requires you tell it the name of the column which contains values that will be the column names in your new data set - a so called “key” column. You also tell it which column contains the relevant numbers - the “values” column.

```{r}
m_wide <- mammals %>%
  spread(taxon, density)

m_wide
```

You'll notice there are a lot of NA values. This is incredibly common in ecological and biological data. Sometimes, they are NA - they weren't recorded Other times, such as in this data set, they actually mean 0 observations. To add this information we use the `fill` argument in `spread`.

```{r}
m_wide_0 <- mammals %>%
  spread(taxon, density, fill=0)

m_wide_0
```

Note that we could also fill those in right into the long data format using the `complete` function. In that function, we specify which columns we want all combinations of, and then supply a list of how new values should be filled in for other columns. If we don’t give a column name in that list, it defaults to NA. 

```{r}
m_long_0 <- mammals %>%
  complete(site, taxon, fill=list(density=0))

m_long_0
```

#### Wide to long

In other cases the reverse is true - as some people recorde their data in a wide format. To go from wide to long we use the `gather` function, which "gathers up your wide data".
It's a little bit , as you specify what you want the name of the new key column to be, what you want the name of the new values column to be, and then you can either specify which columns are to be gathered up (which can be tedious if there are a lot or they are spread) or you can specify which columns you want to exclude. I actually do things by exclusion quite often (as in `%nin%`).

```{r}
m_long <- m_wide_0 %>%
  gather(Species_name, Density, -site)

m_long
```

```{r}
m_long <- m_wide_0 %>%
  gather(Species_name, Density, `Myotis nigricans`:`Suncus etruscus`)
```

### What if we have more than one value column

This gets tricky. We have to  `unite` those columns into something that’s later easy to separate. Here's an example. Let's create a new variable for height first.

```{r}
mamh <- mammals %>%
  mutate(height=rnorm(6,30,3))
```

Now we want to do everything above, but with density and height. First, we `unite` them, using a _ as our separator. Note that we We could have used anything, but _ is often used because it's easy to see andd used for so few other things. 

```{r}
mamh2 <- mamh %>%
  unite(measurement, density, height, sep="_")

mamh2
```

So now we keep going. Let's first fill iln all the empty species-site combos with zeroes. Here’s one workflow to do that.

```{r}
mamh_long <- mamh2 %>%
  spread(taxon, measurement, fill="0_0") %>%
  gather(taxon, measurement, -site)

mamh_long
```

Now we want to restore our old measurements. For that, we have `separate`, which takes the relevant column, the new column names in a vector, and the pattern you match to split them.

```{r}
mamh_long <- mamh_long %>%
  separate(measurement, into = c("density", "height"), sep="_") %>%
  mutate(density = as.numeric(density),
         height = as.numeric(height))

mamh_long
```

Note that because he new column was treated as a charcter(we just did a string split) we have to coerce the columns back into numeric.

***

#### CHALLENGE

Separate the taxon into two columns - Genus and species

***