class: center, middle, inverse, title-slide

# Sentiment Analysis of a CHILDES Corpus
## CS 631 Final Project, part 2
### Grace Lawley
### August 14th, 2018

---

# The Dataset

--

* Child Language Data Exchange System (CHILDES)

--

  + Online repository of language acquisition data

--

  + Used to study language development, second language acquisition, and child-directed speech

--

* Why is CHILDES special?

--

  + Corpora of children speaking North American English are **very** hard to come by

---

# The Dataset

* CHILDES → Eng-NA → **Kuczaj Corpus**

--

  + Longitudinal case study
  + 1 target child: **Abe**
  + Ages ~2 to ~5 years
  + 210 transcripts (average length: 810 words)

---

# The Raw Data

* Pulled the raw utterance data from the CHILDES database with the `childesr` package

--

* Some raw utterances:

--

```
## [1] "okay that's a alligator he got a cigar"
## [2] "go away"
## [3] "camel pig and the donkey"
## [4] "you go away"
## [5] "uhhuh eat"
## [6] "oh no"
```

--

- **Cleaned**, **processed**, & **tokenized** the data

---

# The Sentiment Analysis

--

* Used the `nrc` Word-Emotion Association Lexicon in the `tidytext` package

--

  + Classifies words into 10 different sentiment categories:
    - *anger*
    - *disgust*
    - *fear*
    - *joy*
    - *negative*
    - *sadness*
    - *anticipation*
    - *surprise*
    - *trust*
    - *positive*

--

* Merged the lexicon with the tokens using `dplyr::inner_join()`

--

  + Only kept tokens that occurred in both data frames

---

# Six Sentiments

--

.pull-left[
## Positive
+ *trust*
+ *joy*
+ *anticipation*
]

--

.pull-right[
## Negative
+ *sadness*
+ *fear*
+ *anger*
]

---

# The Original Plot

<img src="figures/orig_plot-1.png" width="80%" style="display: block; margin: auto;" />

---

# Problems

--

* Visualization is difficult to explain

--

* ~86% of tokens were lost when filtering against the NRC Word-Emotion Association Lexicon

--

* Transcript length varies a lot

--

* Distribution of transcripts across the ages varies a lot

---

# Problems

* Visualization is difficult to explain
* ~86% of tokens were lost when filtering against the NRC Word-Emotion Association Lexicon
* <span style="color:blue">**Transcript length varies a lot**</span>
* Distribution of transcripts across the ages varies a lot

---

# Normalization

* Binned age into months:
  + `30.13204`, `30.19775`, `30.32916`, ..., `30.59200` → `30`

--

* For each age bin and each sentiment:

--

  * `n_percent = n_sentiment/n_tokens`

--

<img src="figures/ages_binned-1.png" width="80%" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# Iterate!

---

# Version 1

<img src="figures/plot1-1.png" width="80%" style="display: block; margin: auto;" />

---

# Version 2

<img src="figures/plot2-1.png" width="80%" style="display: block; margin: auto;" />

---

# Version 3

<img src="figures/plot3-1.png" width="80%" style="display: block; margin: auto;" />

---

# Version 4

<img src="figures/plot4-1.png" width="80%" style="display: block; margin: auto;" />

---

# Version 5

<img src="figures/plot5-1.png" width="80%" style="display: block; margin: auto;" />

---

# Version 5.1

<img src="figures/plot5_1-1.png" width="80%" style="display: block; margin: auto;" />

---

# The Final Version

--

<img src="figures/plot5_2-1.png" width="80%" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# Thank you!

*GitHub repository:* [gracelawley/kuczaj-corpus](https://github.com/gracelawley/kuczaj-corpus)

*Write-up & code available at:* [grace.rbind.io/project/kuczaj_pt2/](/project/kuczaj_pt2/)

*Slides made with the R package* [xaringan](https://github.com/yihui/xaringan)

These slides: [rendered](/slides/ruser_dataviz) & [raw](https://raw.githubusercontent.com/gracelawley/gracelawley/master/static/slides/ruser_dataviz.Rmd)

*Based on my CS 631 Final Visualization Project*

*Write-up & code available at:* [grace.rbind.io/project/final_vis/](/project/final_vis/)
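
---

# Appendix: Normalization Sketch

The lexicon join and the `n_percent = n_sentiment/n_tokens` step from the Normalization slide could look roughly like the sketch below. It is not the project's actual code: the `tokens` and `lexicon` tables are made-up toy data (the two-word lexicon is a hypothetical stand-in for `nrc`), and it assumes that ages were binned with `floor()` and that `n_tokens` counts all tokens in a bin before the lexicon filter.

```r
library(dplyr)

# Toy tokens: one row per token, with age in months (made-up values)
tokens <- tibble(
  age  = c(30.13, 30.20, 30.33, 30.59, 31.02, 31.40),
  word = c("happy", "go", "afraid", "pig", "happy", "away")
)

# Hypothetical two-word stand-in for the NRC lexicon
lexicon <- tibble(
  word      = c("happy", "afraid"),
  sentiment = c("joy", "fear")
)

# Assumption: binning truncates age to whole months
binned <- tokens %>% mutate(age_bin = floor(age))

# Assumption: the denominator is every token in the bin
n_tokens <- binned %>% count(age_bin, name = "n_tokens")

# Keep only tokens found in the lexicon, then normalize
# per age bin and sentiment
n_percent <- binned %>%
  inner_join(lexicon, by = "word") %>%
  count(age_bin, sentiment, name = "n_sentiment") %>%
  inner_join(n_tokens, by = "age_bin") %>%
  mutate(n_percent = n_sentiment / n_tokens)

print(n_percent)
```

Dividing by the bin's token count is what makes bins with long and short transcripts comparable; if the denominator were instead the number of lexicon-matched tokens, the proportions within each bin would sum differently.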