class: center, middle, inverse, title-slide

# Sentiment Analysis of a CHILDES Corpus
## CS 631 Final Project, part 2
### Grace Lawley
### August 14th, 2018

---

# The Dataset
--

* Child Language Data Exchange System (CHILDES)

  + Online repository of language acquisition data
  
--
  
    + Used to study language development, second language acquisition, child directed speech
  
--
  
* Why is CHILDES special?
  
--

  + Corpora of children speaking North American English are **very** hard to come by
  
---
# The Dataset

* CHILDES → Eng-NA → **Kuczaj Corpus**

  + Longitudinal Case Study
    
    + 1 target child: **Abe**
    
    + ~2 - ~5 years old
    
    + 210 transcripts (810 words long on average)
  
---

# The Raw Data

* Pulled the raw utterance data from the CHILDES database with the `childesr` package

* Some raw utterances:

```
## [1] "okay that's a alligator he got a cigar"
## [2] "go away"                               
## [3] "camel pig and the donkey"              
## [4] "you go away"                           
## [5] "uhhuh eat"                             
## [6] "oh no"
```

- **Cleaned**, **processed**, & **tokenized** the data
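
The tokenizing step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual pipeline: it assumes `tidytext::unnest_tokens()` as the tokenizer, and the column names `id`, `utterance`, and `word` are made up for the example.

```r
library(dplyr)
library(tidytext)

# The six sample utterances shown above, in a toy data frame
# (the column names `id` and `utterance` are illustrative assumptions).
utterances <- tibble(
  id = 1:6,
  utterance = c(
    "okay that's a alligator he got a cigar",
    "go away",
    "camel pig and the donkey",
    "you go away",
    "uhhuh eat",
    "oh no"
  )
)

# Split each utterance into one lowercase word per row.
tokens <- utterances %>%
  unnest_tokens(output = word, input = utterance)

# Number of tokens per utterance.
tokens %>% count(id)
```

One word per row is the shape that a later join against a sentiment lexicon expects.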

---
# The Sentiment Analysis

* Used the `nrc` Word-Emotion Association Lexicon in the `tidytext` package

    + Classifies words into 10 different sentiment categories:

        - *anger*  
        - *disgust*    
        - *fear*  
        - *joy*  
        - *negative*  
        - *sadness*  
        - *anticipation*  
        - *surprise*  
        - *trust*  
        - *positive*

* Merged the lexicon with the tokens using `dplyr::inner_join()`

  + Only kept tokens that occurred in both data frames
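
A minimal sketch of that join, assuming both data frames share a `word` column. The `toy_lexicon` below is a made-up stand-in with the same `word` and `sentiment` layout as the real lexicon; its rows are illustrative and not copied from NRC.

```r
library(dplyr)

# Toy tokens, one row per word (not real corpus data).
tokens <- tibble(word = c("happy", "go", "away", "monster", "the"))

# Hypothetical stand-in for the lexicon: a word can carry several sentiments.
toy_lexicon <- tibble(
  word      = c("happy", "happy", "monster", "monster"),
  sentiment = c("joy", "positive", "fear", "negative")
)

# inner_join() keeps only the words that appear in both data frames.
matched <- inner_join(tokens, toy_lexicon, by = "word")

# Share of tokens with no lexicon entry; these are dropped by the join.
prop_lost <- mean(!tokens$word %in% toy_lexicon$word)
```

Function words such as "go" and "the" have no lexicon entry, so they drop out of `matched` entirely. Comparing the token count before and after the join shows how much of the data the lexicon actually covers.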

---

# Six sentiments

.pull-left[
## Positive

+ *trust*

+ *joy*

+ *anticipation*  
]

.pull-right[
## Negative

+ *sadness*

+ *fear*

+ *anger*
]

---
# The Original Plot
<img src="figures/orig_plot-1.png" width="80%" style="display: block; margin: auto;" />

---
# Problems

* Visualization is difficult to explain

* ~86% of tokens were lost when filtering against the NRC Word-Emotion Association Lexicon

* Transcript length varies a lot

* Distribution of transcripts across the ages varies a lot

---
# Problems

* Visualization is difficult to explain

* ~86% of tokens were lost when filtering against the NRC Word-Emotion Association Lexicon

* <span style="color:blue">**Transcript length varies a lot**</span>

* Distribution of transcripts across the ages varies a lot

---
# Normalization

* Binned age into months:

  + `30.13204`, `30.19775`, `30.32916`, ... `30.59200` → `30`

* For each age bin and each sentiment:

  + `n_percent = n_sentiment/n_tokens`
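
The binning and normalization above might look something like this. It is a sketch on toy data under two assumptions: ages are binned with `floor()`, and `n_tokens` counts every token in the age bin, matched or not. The column names are illustrative.

```r
library(dplyr)

# Toy data: one row per token, with the child's age in months and the
# matched sentiment (NA when the token has no lexicon entry).
tokens <- tibble(
  age       = c(30.13, 30.20, 30.33, 30.59, 31.02, 31.40),
  sentiment = c("joy", NA, "fear", "joy", NA, "trust")
)

# Bin age into whole months, e.g. 30.13 -> 30.
binned <- tokens %>%
  mutate(age_bin = floor(age))

# Total number of tokens in each age bin.
n_tokens <- binned %>%
  count(age_bin, name = "n_tokens")

# Tokens per sentiment in each bin, divided by the bin's total.
n_percent <- binned %>%
  filter(!is.na(sentiment)) %>%
  count(age_bin, sentiment, name = "n_sentiment") %>%
  left_join(n_tokens, by = "age_bin") %>%
  mutate(n_percent = n_sentiment / n_tokens)
```

Dividing by the bin total puts months with many long transcripts and months with few short ones on the same scale.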

--
<img src="figures/ages_binned-1.png" width="80%" style="display: block; margin: auto;" />

---
class: inverse, center, middle
# Iterate!

---
# Version 1
<img src="figures/plot1-1.png" width="80%" style="display: block; margin: auto;" />
  
---
# Version 2
<img src="figures/plot2-1.png" width="80%" style="display: block; margin: auto;" />

---
# Version 3
<img src="figures/plot3-1.png" width="80%" style="display: block; margin: auto;" />

---
# Version 4
<img src="figures/plot4-1.png" width="80%" style="display: block; margin: auto;" />

---
# Version 5
<img src="figures/plot5-1.png" width="80%" style="display: block; margin: auto;" />

---
# Version 5.1
<img src="figures/plot5_1-1.png" width="80%" style="display: block; margin: auto;" />

---
# The Final Version

---
class: inverse, center, middle

# Thank you!

*Github Repository:*   
[gracelawley/kuczaj-corpus](https://github.com/gracelawley/kuczaj-corpus)  
  
  
  
*Write up & code available at:*    
[grace.rbind.io/project/kuczaj_pt2/](/project/kuczaj_pt2/)
  
  
  
*Slides made with the R package* [xaringan](https://github.com/yihui/xaringan)  
These slides - [rendered](/slides/ruser_dataviz) & [raw](https://raw.githubusercontent.com/gracelawley/gracelawley/master/static/slides/ruser_dataviz.Rmd)    
  
*Based on my CS 631 Final Visualization Project*  
*Write up & code available at:*      
[grace.rbind.io/project/final_vis/](/project/final_vis/)