# Math Meets Data
## Transitioning from math proofs to natural language research
### Grace Lawley
### April 4th, 2019

---

### About me

Graduated from Lewis & Clark College in 2017

* Majored in Math

Computer Science & Engineering PhD student at OHSU
* September 2017 - now

* Halfway through my 2nd year!

.pull-left[
+ Computational Linguistics

+ Natural Language Processing

+ Speech and Language Disorders
]
.pull-right[
<br>
+ Discrete Math, Statistics

+ Data Science, Data Visualization

+ R, Python
]
---
class: middle

### About me

Graduated from Lewis & Clark College in 2017

* Majored in Math

Computer Science & Engineering PhD student at OHSU
* September 2017 - now

* Halfway through my 2nd year!

.pull-left[
+ .look[Computational Linguistics?]

+ .look[Natural Language Processing?]

+ Speech and Language Disorders
]
.pull-right[
<br>
+ Discrete Math, Statistics

+ .look[Data Science?], Data Visualization

+ R, Python
]

---
class: middle

* .emphasize[Natural Language]

> "Any human language that has evolved naturally in a community, usually in contrast to computer programming languages or to artificially constructed languages such as Esperanto." <br> .sidebar[Wiktionary]

* .emphasize[Natural Language Processing]

> "...a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular .emphasize[how to program computers to process and analyze large amounts of natural language data]." <br> .sidebar[Wikipedia]

---
class: middle

> "...the scientific study of language from a computational perspective. Computational linguists are interested in .emphasize[providing computational models of various kinds of linguistic phenomena]. These models may be "knowledge-based" ("hand-crafted") or "data-driven" ("statistical" or "empirical")." <br> .sidebar[Association for Computational Linguistics]

---
class: middle

> "...the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena. These models may be .emphasize["knowledge-based"] ("hand-crafted") or .emphasize["data-driven"] ("statistical" or "empirical")." <br> .sidebar[Association for Computational Linguistics]

---
class: middle

> "...a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data." <br> .sidebar[Wikipedia]

---
class: middle, center, inverse

## How did I get here?

![](images/ilana1.gif)

---

__2009 - 2013__

* High school in Oakland, CA

* Knew I liked math & language... could I combine them?

* At the end of my senior year, I found out about .fancy[computational linguistics] (thanks, Wikipedia!)

__2013 - 2014__
  + New York University
  
  + Took my first linguistics class

__2014 - 2015__

  + Transferred to Lewis & Clark College
  
  + Started as a computer science major...

---

__2014 - 2015__
  + ...took Calculus II, and then switched to math
  
--
  
  + Translation Theory & Practice
  
  + Calculus III, Linear Algebra, Discrete Math
  
--

__2015 - 2016__
  
  + Independent study: "Introduction to Computational Linguistics"
  
  + Math Colloquium Talk: "Atypical Language in Autism: Can we measure it?"
  
  + Met Jan van Santen - director of Center for Spoken Language Understanding (CSLU) & my future advisor
  
  + Summer internship at CSLU

__2016 - 2017__
  
  + Number Theory, Prob/Stats I & II

  + CSLU internship #2
  
  + Applied for the PhD program at CSLU

---
class: middle

### What I do

Center for Spoken Language Understanding, OHSU

+ Automatic speech recognition, image processing, augmentative and alternative communication devices, ...
  
  + Computer Science & Electrical Engineering Master's & PhD programs
  
--

Funded by NIH grant

+ _Automated Measurement of Language Outcomes for Neurodevelopmental Disorders, R01DC012033_
  
  + Autism Spectrum Disorder (ASD), Fragile X Syndrome (FXS), Down Syndrome (DS)

---
class: middle

### I'm also...

Finishing up coursework requirements. Here's what I've taken so far:

.pull-left[
+ Probability & Statistical Inference for Scientists and Engineers

+ Data Science Programming

+ Introduction to Linguistics & Communication Disorders

+ Principles & Practices of Data Visualization
]

.pull-right[
+ Algorithms

+ Natural Language Processing

+ Research Ethics in Computer Science

+ Machine Learning
]

<br>
<br>
.sidebar[If you want to talk about undergrad vs. grad school, ask me during the Q&A or come to the GemSTEM panel today at 6:30 pm!]
---
class: center, middle, inverse

## Research 🔎

---
class: middle

### My Current Research

Characteristics of Autism Spectrum Disorder (ASD)
    
  + Restricted, repetitive interests
  
  + Difficulties with social communication

+ Overly formal, adult-like speech
  
  + Using conventional words and phrases in unusual and peculiar ways
  
  + Jill Dolata's example:
    
    + "I ate shrimp for lunch"
    
    + "I ate crustaceans for lunch"
  
---
class: middle

### How is this currently measured?

#### Autism Diagnostic Observation Schedule (ADOS)
  
  + Standard ASD assessment tool
  
  + Series of semi-structured, examiner-led activities
  
  + Coding scheme for behaviors characteristic of ASD

--
  
  + Pedantic speech:
  
  > Use of words or phrases tends to be more repetitive or .look[formal] than that of most individuals at the same level of expressive language, but not obviously odd...

#### Limitations

* Subjective

* Inconsistent across examiners

---
class: middle

### Alternatives?

* Could try to develop an automated method to quantify this
    * ...but capturing something so subtle will be difficult

* How can you teach a computer to differentiate "I ate shrimp for lunch" from "I ate crustaceans for lunch"?

* Lots of different ways to be pedantic
  + Vocabulary choice
  + Style of speaking
  + Tone

* Usually want to smooth outliers, but in this case the outliers are the points we are interested in

.emphasize[Q: Can pedantic speech in children with ASD be measured and described by using natural language processing methods?]

---
class: inverse, center, middle

## Language `\(\rightarrow\)` Computer

![](images/life-size.gif)

---
class: middle

### Language is complicated...

Excerpts from a corpus of "confusing or misleading headlines" <sup>1</sup>

```
## [1] "2 SISTERS REUNITED AFTER 18 YEARS AT CHECKOUT COUNTER"
```

```
## [1] "ENRAGED COW INJURES FARMER WITH AX"
```

```
## [1] "EYE DROPS OFF SHELF"
```

```
## [1] "HOSPITALS ARE SUED BY 7 FOOT DOCTORS"
```

```
## [1] "INCLUDE YOUR CHILDREN WHEN BAKING COOKIES"
```

```
## [1] "KIDS MAKE NUTRITIOUS SNACKS"
```

```
## [1] "MAN EATING PIRANHA MISTAKENLY SOLD AS PET FISH"
```
.footnote[[1] corpus: https://github.com/dariusk/corpora]
--
<br>
"HOSPITALS ARE SUED BY 7 FOOT DOCTORS"     
* hospitals are sued by doctors who are 7 feet tall?
* hospitals are sued by seven podiatrists?  
<br>
<br>
<br>

---
class: middle

.emphasize[How do we teach a computer to understand language?]
* Need to represent language in a form that we can analyze computationally/mathematically/statistically

* Simplify language and represent it numerically

+ Written text is made up of .emphasize[sentences]

+ Speech is made up of .emphasize[utterances]

---
class: middle

* "A token is .emphasize[a meaningful unit of text], such as a word, that we are interested in using for analysis" - Text Mining with R

* "A rose is a rose" `\(\rightarrow\)` {a, rose, is, a, rose}

* "aren't" `\(\rightarrow\)` {are, n't}, {are, not}, {are, nt}, or {aren't}?

* Unique tokens (.emphasize[types]): "A rose is a rose" `\(\rightarrow\)` {a, rose, is}
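
The tokenizing step above can be sketched in a few lines of Python. This is only an illustration: the `tokenize` helper and its regular expression are assumptions made for this example (lowercase everything, keep apostrophes inside words), not a tokenizer used in the talk.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("A rose is a rose")
# dict.fromkeys drops repeats while keeping first-seen order
types = list(dict.fromkeys(tokens))

print(tokens)  # ['a', 'rose', 'is', 'a', 'rose']
print(types)   # ['a', 'rose', 'is']
```

With this particular rule, "aren't" stays a single token; a different rule could just as reasonably split it.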

---
class: middle

* .emphasize[unigrams]: "A rose is a rose" `\(\rightarrow\)` {a, rose, is, a, rose}

* .emphasize[bigrams]: "A rose is a rose" `\(\rightarrow\)` {(a, rose), (rose, is), (is, a), (a, rose)}

* .emphasize[trigrams]: "A rose is a rose" `\(\rightarrow\)` {(a, rose, is), (rose, is, a), (is, a, rose)}

* .emphasize[morphemes]: the smallest grammatical units

* "unbreakable" `\(\rightarrow\)` un - break - able

* "dogs" `\(\rightarrow\)` dog - s

* "buyer" `\(\rightarrow\)` buy - er
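
The unigram, bigram, and trigram examples above all follow one sliding-window pattern. A minimal sketch, assuming the text has already been tokenized (the `ngrams` helper is a name made up for this example):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["a", "rose", "is", "a", "rose"]

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # [('a', 'rose'), ('rose', 'is'), ('is', 'a'), ('a', 'rose')]
print(trigrams)  # [('a', 'rose', 'is'), ('rose', 'is', 'a'), ('is', 'a', 'rose')]
```

A text with k tokens has k - n + 1 n-grams, so five tokens give four bigrams and three trigrams.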
---
class: middle, inverse, center

## What can we use these for?

---
class: middle

### Q: How _much_ do they talk?

One way to measure this `\(\rightarrow\)` .fancy[Mean Length of Utterance (MLU)]

* MLU = average number of morphemes per utterance for a sample of 100 utterances

* As a typically developing child gets older, they will produce longer and more complex utterances

* Used as a measure of expressive language, language productivity, and language development

+ _Expressive language_ is language used to communicate, while _receptive language_ is the ability to understand language
  
  + All else being equal, a higher MLU reflects a higher level of language proficiency
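
As a rough sketch of the MLU calculation, assuming each utterance has already been segmented into morphemes by hand. The sample below is made up for illustration and is far shorter than the 100 utterances a real sample would use.

```python
def mean_length_of_utterance(utterances):
    """Average number of morphemes per utterance.

    Each utterance is a list of morphemes that has already been segmented.
    """
    if not utterances:
        raise ValueError("need at least one utterance")
    return sum(len(u) for u in utterances) / len(utterances)

# Hypothetical hand-segmented sample
sample = [
    ["that", "car"],                      # 2 morphemes
    ["more", "juice"],                    # 2 morphemes
    ["my", "car", "s"],                   # "cars" is car + s: 3 morphemes
    ["the", "puppy", "chew", "s", "it"],  # "chews" is chew + s: 5 morphemes
]

mlu = mean_length_of_utterance(sample)
print(mlu)  # 12 morphemes / 4 utterances = 3.0
```

The hard part in practice is the morpheme segmentation itself, which this sketch takes as given.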

---

### MLU in Language Development

__Brown's Stages of Syntactic and Morphological Development__ <sup>1,2</sup>

|Stage|Age (months)|MLU|Example Utterances|
|:---|:---|:---|:---|
|I|12 - 26|1.0 - 2.0| "that car" <br> "more juice"|
|II|27 - 30|2.0 - 2.5| "it going" <br> "in box" <br> "my cars"|
|III|31 - 34|2.5 - 3.0| "man's book" <br> "was it Alison?"|
|IV|35 - 40|3.0 - 3.75| "the puppy chews it" <br> "a ball on the book" |
|V|47 + |4.5 + | "were you hungry?" <br> "we're hiding" |

.footnote[[1] Brown, R. (1973). _A First Language: The Early Stages._ London: George Allen & Unwin. <br>
[2] https://www.speech-language-therapy.com]
---

### Q: How _repetitive_ are they being?

One way to measure this `\(\rightarrow\)` .fancy[Type-Token Ratio (TTR)]

* Can talk a lot while not really saying much

  + e.g. "row row row your boat..."

* `\(\text{TTR} = \frac{\text{# unique words}}{\text{total # words}}\)`

* Measures degree of lexical variation

  + A higher TTR reflects a more diverse vocabulary

  + If someone says a lot but is very repetitive, they will have a lower TTR
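
The ratio above takes only a couple of lines to compute. A minimal sketch, assuming the text is already tokenized (the `type_token_ratio` helper is a name chosen for this example):

```python
def type_token_ratio(tokens):
    """Number of unique tokens (types) divided by the total number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)

tokens = "row row row your boat".split()
ttr = type_token_ratio(tokens)
print(ttr)  # 3 types / 5 tokens = 0.6
```

Keep in mind that TTR tends to fall as samples get longer, so comparisons are most meaningful between samples of similar length.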

---

.pull-left[
Baa Baa Black Sheep
.smaller[
Baa, baa, black sheep,  
Have you any wool?  
Yes sir, yes sir,  
Three bags full.

One for the master,  
One for the dame,  
And one for the little boy  
Who lives down the lane.

Baa, baa, black sheep,  
Have you any wool?  
Yes sir, yes sir,  
Three bags full.

One to mend the jerseys  
one to mend the socks  
and one to mend the holes in  
the little girls' frocks.

Baa, baa, black sheep,  
Have you any wool?  
Yes sir, yes sir,  
Three bags full.  
]
]

.pull-right[
Shakespeare's Sonnet XVIII
.smaller[
Shall I compare thee to a summer’s day?  
Thou art more lovely and more temperate:  
Rough winds do shake the darling buds of May,  
And summer’s lease hath all too short a date;  
Sometime too hot the eye of heaven shines,  
And often is his gold complexion dimm'd;  
And every fair from fair sometime declines,  
By chance or nature’s changing course untrimm'd;  
But thy eternal summer shall not fade,  
Nor lose possession of that fair thou ow’st;  
Nor shall death brag thou wander’st in his shade,  
When in eternal lines to time thou grow’st:  
&nbsp;&nbsp;So long as men can breathe or eyes can see,  
&nbsp;&nbsp;So long lives this, and this gives life to thee.  
]
]

|title|# types|# tokens|type-token-ratio|
|:---|:---|:---|:---|
|Baa Baa Black Sheep|32|85|0.3764706|
|Sonnet XVIII|83|114|0.7280702|

---
class: middle

### Q: _What_ are they talking about?

One way to capture this `\(\rightarrow\)` .fancy[Term-Document Matrix]

* A matrix of frequencies for each term (i.e. type) in a collection of documents (i.e. samples, texts, etc.)

+ Frequencies can be raw counts, transformed counts, weighted counts, etc.

* Useful for comparing documents to one another

* Can see which words appear frequently and which words appear infrequently

---
class: middle

.pull-left[
"All that glitters is not gold "  
"All's well that ends well "
]
.pull-right[
<br>
<br>
"Be-all and the end-all "  
"Heart of gold "
]

```
##    all all's and be end ends glitters gold heart is not of that the well
## D1   1     0   0  0   0    0        1    1     0  1   1  0    1   0    0
## D2   2     0   1  1   1    0        0    0     0  0   0  0    0   1    0
## D3   0     0   0  0   0    0        0    1     1  0   0  1    0   0    0
## D4   0     1   0  0   0    1        0    0     0  0   0  0    1   0    2
```

* Term-Document Matrices are typically sparse and high-dimensional (we will circle back to this later)

* Counts for commonly used words like "and", "is", and "the" (a.k.a. _stop words_) will be inflated

+ Could remove these words before analysis,
  + ...or use a weighting scheme like .emphasize[tf-idf]
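
A matrix of raw counts like the one above can be built with the standard library alone. This is a sketch under assumptions: the `tokenize` rule (lowercase, split on anything that is not a letter or apostrophe) and the D1 to D4 labelling are choices made for this example.

```python
import re
from collections import Counter

docs = {
    "D1": "All that glitters is not gold",
    "D2": "Be-all and the end-all",
    "D3": "Heart of gold",
    "D4": "All's well that ends well",
}

def tokenize(text):
    """Lowercase and split into word tokens; hyphens split words, apostrophes do not."""
    return re.findall(r"[a-z']+", text.lower())

# One Counter of term frequencies per document
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# The columns are every term seen in any document, in alphabetical order
vocab = sorted(set().union(*counts.values()))

# One row per document; a Counter returns 0 for terms it has not seen
matrix = {name: [c[term] for term in vocab] for name, c in counts.items()}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Even with four short phrases most cells are zero, which is the sparsity mentioned above.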

---
class: middle

> "...tf-idf is intended to measure .emphasize[how important a word is to a document in a collection] (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites." <sup>1</sup>

* `\(\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)\)`
  
  + `\(\text{idf}(t) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing } t}}\right)\)`

  + Term frequency = frequency of term `\(t\)` in document `\(d\)`
  
  + Inverse document frequency = how rare (and so how "important") `\(t\)` is across all documents
  
--
  
+ Identifies the words that are important in a document and not too common overall

+ Words that are frequently used across all documents will have a _lower_ tf-idf value while words that are rarely used overall will have a _higher_ tf-idf value

### Back to the Shakespeare example

```
##     all all's  and   be  end ends glitters gold heart   is  not   of that
## D1 0.05   0.0 0.00 0.00 0.00  0.0     0.23 0.12  0.00 0.23 0.23 0.00 0.12
## D2 0.07   0.0 0.17 0.17 0.17  0.0     0.00 0.00  0.00 0.00 0.00 0.00 0.00
## D3 0.00   0.0 0.00 0.00 0.00  0.0     0.00 0.23  0.46 0.00 0.00 0.46 0.00
## D4 0.00   0.2 0.00 0.00 0.00  0.2     0.00 0.00  0.00 0.00 0.00 0.00 0.10
##     the well
## D1 0.00  0.0
## D2 0.17  0.0
## D3 0.00  0.0
## D4 0.00  0.2
```

* "all" in D1: raw count `\(1 \rightarrow\)` tf-idf `\(0.05\)`
* "glitters" in D1: raw count `\(1 \rightarrow\)` tf-idf `\(0.23\)`
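
The tf-idf definition can be sketched directly from the formulas. The assumptions here are that term frequency is the term's count divided by the document's length, and that the four phrases are tokenized as listed below. Exact values depend on those choices, so not every entry is expected to match the table above.

```python
import math

# Token lists for the four phrases, already lowercased and tokenized
docs = {
    "D1": ["all", "that", "glitters", "is", "not", "gold"],
    "D2": ["be", "all", "and", "the", "end", "all"],
    "D3": ["heart", "of", "gold"],
    "D4": ["all's", "well", "that", "ends", "well"],
}

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return docs[doc].count(term) / len(docs[doc])

def idf(term):
    """Natural log of (number of documents / number of documents containing the term)."""
    n_containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("glitters", "D1"), 2))  # (1/6) * ln(4/1) -> 0.23
print(round(tf_idf("gold", "D1"), 2))      # (1/6) * ln(4/2) -> 0.12
```

"glitters" appears in only one document, so it scores higher in D1 than "gold", which appears in two. Note that `idf` as written fails for a term that appears in no document.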

---

## How can we use these tools <br> to measure pedantic speech?

---
class: middle
### Measuring pedantic speech

* Lots of different ways to be pedantic

  + Vocabulary choice
  + Style of speaking
  + Tone

---
class: middle
### Measuring pedantic speech

* Lots of different ways to be pedantic

  + .look[Vocabulary choice]
  + Style of speaking
  + Tone
  
--

---
class: middle

### The data I have to work with

.pull-left[
* Transcribed ADOS sessions (n = 87)

* Two diagnosis groups
  1. Autism Spectrum Disorder (ASD)
  2. Typically Developing (TD)
  
* All participants:
  
  * Age range: 4 - 9 years old

  * Full-scale IQ > 90

  * MLU > 3.0

* Child speech only (examiner speech excluded)
]

.pull-right[
<br>
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> dx </th>
   <th style="text-align:right;"> n </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> TD </td>
   <td style="text-align:right;"> 44 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ASD </td>
   <td style="text-align:right;"> 43 </td>
  </tr>
</tbody>
</table>
]

---
class: middle

### First, text preprocessing

+ Convert all letters to lowercase
  
  + Remove all coded words - e.g. "xxx and I went to the park"
  
  + Remove all punctuation except apostrophes 
      
      + Keep contractions as is - e.g. "don't" vs. "do not"
   
  + Tokenize into unigrams
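
The preprocessing steps above might look roughly like this in Python. It is a sketch, not the pipeline used for the ADOS transcripts: the `preprocess` helper and the idea that coded words are marked as `xxx` are assumptions based on the example on this slide.

```python
import re

def preprocess(utterance, coded_words=("xxx",)):
    """Lowercase, strip punctuation except apostrophes, tokenize, drop coded words."""
    text = utterance.lower()
    # Replace anything that is not a word character, apostrophe, or whitespace
    text = re.sub(r"[^\w'\s]", " ", text)
    # Splitting on whitespace gives unigrams; contractions stay as one token
    tokens = text.split()
    return [t for t in tokens if t not in coded_words]

tokens = preprocess("XXX and I went to the park, but I don't know!")
print(tokens)
# ['and', 'i', 'went', 'to', 'the', 'park', 'but', 'i', "don't", 'know']
```

Coded words are dropped after tokenizing here, which gives the same result as removing them first as long as each code is its own word.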

---
class: center

---
class: middle, center

.pull-left[
TD group:
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> token </th>
   <th style="text-align:right;"> n </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> and </td>
   <td style="text-align:right;"> 4861 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> the </td>
   <td style="text-align:right;"> 3410 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> to </td>
   <td style="text-align:right;"> 2124 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yeah </td>
   <td style="text-align:right;"> 1800 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> like </td>
   <td style="text-align:right;"> 1685 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> it </td>
   <td style="text-align:right;"> 1680 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> then </td>
   <td style="text-align:right;"> 1251 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> that </td>
   <td style="text-align:right;"> 1168 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> he </td>
   <td style="text-align:right;"> 1158 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> you </td>
   <td style="text-align:right;"> 1107 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
ASD group:
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> token </th>
   <th style="text-align:right;"> n </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> the </td>
   <td style="text-align:right;"> 3564 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> and </td>
   <td style="text-align:right;"> 3393 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> to </td>
   <td style="text-align:right;"> 1976 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yeah </td>
   <td style="text-align:right;"> 1688 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> it </td>
   <td style="text-align:right;"> 1506 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> that </td>
   <td style="text-align:right;"> 1169 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> no </td>
   <td style="text-align:right;"> 1120 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> you </td>
   <td style="text-align:right;"> 983 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> in </td>
   <td style="text-align:right;"> 916 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> this </td>
   <td style="text-align:right;"> 856 </td>
  </tr>
</tbody>
</table>
]

---
class: middle, center

.emphasize[How many unique types are there <br> for each diagnosis group?]
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> dx </th>
   <th style="text-align:right;"> n_types </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> TD </td>
   <td style="text-align:right;"> 4412 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ASD </td>
   <td style="text-align:right;"> 4438 </td>
  </tr>
</tbody>
</table>

--
<br>
<br>

Across both groups combined: `n_types` = 6296

---
class: middle

Using infrequent/uncommon words `\(\rightarrow\)` pedantic speech

* Create a term-document matrix where a document corresponds to a participant
  
* Instead of raw counts or tf-idf values, use a transformed frequency value

  + `\(\log\left(\frac{n_{\text{word}}}{n_{\text{total words}}} + 1\right)\)` 
  
--

* Explore dimensionality reduction methods to reduce dimensions of term-document matrix from many `\(\rightarrow\)` 2

+ .emphasize[Non-Metric Multidimensional Scaling]

* Visualize results

+ Might not capture 2+ word phrases that are pedantic

+ 2 dimensions might not be enough
  
+ Exploratory - visualizations are just the first step
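
The transformed frequency used in place of raw counts can be sketched as follows. This assumes the natural logarithm and takes the total to be all of one participant's tokens; the token list is invented for illustration.

```python
import math
from collections import Counter

def log_relative_frequencies(tokens):
    """Map each word to log(n_word / n_total + 1) for one participant's tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: math.log(n / total + 1) for word, n in counts.items()}

# Hypothetical tokens from one participant
tokens = ["i", "ate", "crustaceans", "for", "lunch", "i", "like", "lunch"]
freqs = log_relative_frequencies(tokens)

print(freqs["i"])            # log(2/8 + 1)
print(freqs["crustaceans"])  # log(1/8 + 1)
```

Each participant's dictionary becomes one row of the matrix, with 0 for words that participant never used, since log(0 + 1) = 0.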

---
class: middle

* One of many dimensionality reduction methods

* Given a .emphasize[matrix of pairwise distances] in M-dimensional space

+ Find the projection of the points in N-dimensional space (where N < M) that .emphasize[preserves as much of the pairwise distances as possible]

* .look[Points that were close in the M-dimensional space should remain close in the N-dimensional space, while points that were far apart should remain far apart]

* The "best" projection is the one that minimizes the .emphasize[stress] value:

  + `\(\text{stress} = \frac{1}{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\delta_{ij}}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\frac{(\delta_{ij}-d_{ij})^2}{\delta_{ij}}\)`

  + `\(\delta_{ij}\)` is the distance between points `\(i\)` and `\(j\)` in the original M-dimensional space
  
  + `\(d_{ij}\)` is the distance between the projections of `\(i\)` and `\(j\)` in the N-dimensional space
  
* Stress value will be between 0 and 1, with 0 meaning that no distance information was lost.
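
A small sketch of the stress calculation, assuming the weights in the denominators are the original-space distances (as in Sammon's stress) and that both inputs are symmetric distance matrices given as nested lists. The `stress` helper and the toy distances are made up for this example.

```python
def stress(original, projected):
    """Sammon-style stress between two symmetric distance matrices."""
    n = len(original)
    # Each unordered pair (i, j) with i < j is counted once
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    total = sum(original[i][j] for i, j in pairs)
    error = sum(
        (original[i][j] - projected[i][j]) ** 2 / original[i][j]
        for i, j in pairs
    )
    return error / total

# Toy pairwise distances between three points
original = [[0, 1, 2],
            [1, 0, 2],
            [2, 2, 0]]
# A projection that keeps two distances and shrinks the third from 2 to 1
projected = [[0, 1, 2],
             [1, 0, 1],
             [2, 1, 0]]

print(stress(original, original))   # nothing lost, so 0
print(stress(original, projected))  # ((2 - 1)^2 / 2) / (1 + 2 + 2), about 0.1
```

An optimizer would move the projected points around to push this value down; the sketch only scores a given projection and assumes no two original points coincide.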

---
class: middle, center, inverse

## ✨ Plots! ✨

---
class: middle

---

---

---
class: middle

### Zooming in on one child

---
class: middle

.pull-left[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> token </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> iderman </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sp </td>
  </tr>
  <tr>
   <td style="text-align:left;"> iderman's </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ferris </td>
  </tr>
  <tr>
   <td style="text-align:left;"> locking </td>
  </tr>
  <tr>
   <td style="text-align:left;"> oppositional </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sips </td>
  </tr>
  <tr>
   <td style="text-align:left;"> poke </td>
  </tr>
  <tr>
   <td style="text-align:left;"> mil </td>
  </tr>
  <tr>
   <td style="text-align:left;"> weevie </td>
  </tr>
  <tr>
   <td style="text-align:left;"> woovie </td>
  </tr>
  <tr>
   <td style="text-align:left;"> symbol </td>
  </tr>
  <tr>
   <td style="text-align:left;"> grandp </td>
  </tr>
  <tr>
   <td style="text-align:left;"> belt </td>
  </tr>
  <tr>
   <td style="text-align:left;"> meow </td>
  </tr>
</tbody>
</table>
]
--

.pull-right[
]

---
class: inverse, middle, center

## "A Spiderman who lost his Iderman is a Spiderman who lost his sp"

<br>

---
class: middle

## What next?

* Explore bigrams, trigrams, etc.

* Look into examiner speech & frequency of conversational turns

  + Who is talking more? Who talks for longer? 
  
* Pronoun usage

  + Always using proper nouns and never pronouns -- pedantic?

* Is there a *source* for pedantic speech?

  + Modeling language from books? Movies? Parents?

---

## Thank you!

<br>

gracelawley@gmail.com

https://grace.rbind.io

<i class='fab fa-github'></i> @gracelawley

<br>
Slides created via <br>the R package [xaringan](https://github.com/yihui/xaringan)