
Hi, I'm the author. I appreciate the time you've taken to read my work and provide constructive criticism. Here's my full write-up (on GitHub, so it should continue to work): https://github.com/walkerkq/textmining_southpark/blob/master...

I was working under the assumption that we do not know ALL the words, since the show's been renewed through 2019. My analysis covers only the first 18 seasons.

Additionally, counting up each character's most frequent words produced results with very little semantic meaning, things like "just" and "dont", which can be seen in this (really boring) wordcloud: https://github.com/walkerkq/textmining_southpark/blob/master...

Looking at the log likelihood of each word for each speaker produced results that were much more intuitive and carried more meaning. As ppod said below: I think the idea is that what we are really trying to measure is something unobservable, like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking.
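For readers who want to see what a per-word, per-speaker log likelihood score looks like, here is a minimal sketch in Python. It uses the common two-corpus formulation (observed versus expected counts for one speaker against everyone else); whether this matches the write-up's calculation exactly is an assumption, and the function names and toy tokens are made up for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-corpus log likelihood (G2) for a single word.

    a: count of the word in the speaker's text
    b: count of the word in the remaining text
    c: total words in the speaker's text
    d: total words in the remaining text
    """
    total = a + b
    # Expected counts if the word were used at the same rate in both texts.
    e1 = c * total / (c + d)
    e2 = d * total / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def keyness(speaker_tokens, rest_tokens):
    """Score every word for one speaker against the remaining text."""
    s, r = Counter(speaker_tokens), Counter(rest_tokens)
    c, d = len(speaker_tokens), len(rest_tokens)
    return {w: log_likelihood(s[w], r[w], c, d) for w in set(s) | set(r)}


# Toy tokens, invented for illustration only (not taken from the scripts).
speaker = ["just", "respect", "respect", "respect", "dont"]
rest = ["just", "dont", "just", "dude", "dont", "just"]
scores = keyness(speaker, rest)
top = sorted(scores, key=scores.get, reverse=True)
```

Note that the score alone does not say whether a word is overused or underused by the speaker; comparing a/c with b/d gives the direction.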



The point I am making is simple: you can calculate whatever you want to calculate, but there is no room for statistical testing because you do not have a probability sample and there is no sampling variation.

Yes, there will be future episodes, but you are not claiming to predict what these characters will say in those future episodes (in which case your whole setup is rather inappropriate).

Also, I suggest you think very hard about this statement:

> The log likelihood value of 101.7 is significant far beyond even the 0.01% level, so we can reject the null hypothesis that Cartman and the remaining text are one and the same.

Even if the statistical test you employed were appropriate, this is not the conclusion you draw from it.

Also, are you confusing p = 0.01 with 1%, or did you really choose p = 0.0001 as the significance level for your test?
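Leaving aside whether a test is appropriate here at all, the arithmetic behind that question can be checked with a short Python sketch. It assumes the quoted value of 101.7 is compared against a chi-square distribution with 1 degree of freedom, which is the usual convention for this statistic but is not confirmed by the write-up; the function name is made up for illustration.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    With 1 df the variable is a squared standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# "0.01%" and "1%" written as proportions.
level_001_percent = 0.01 / 100  # p = 0.0001
level_1_percent = 1 / 100       # p = 0.01

# Tail probability for the quoted statistic, ASSUMING 1 degree of freedom.
p_value = chi2_sf_1df(101.7)
print(p_value < level_001_percent)
```

Under that assumption the statistic clears either threshold by a wide margin, so the choice between 1% and 0.01% does not change the outcome of the comparison; it only changes what the sentence claims.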


A simple tf-idf would get you similar results without a t-test.

I think that is what parent is implying.
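As a rough illustration of the tf-idf suggestion, here is a minimal Python sketch that treats each speaker's combined lines as one document. The function name, the plain tf times log(N/df) weighting, and the toy tokens are assumptions for illustration; real libraries often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf per speaker.

    docs: dict mapping speaker name -> list of tokens (one "document" each).
    Returns: dict mapping speaker name -> {word: score}.
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}

    # Document frequency: in how many speakers' texts each word appears.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            w: (cnt / total) * math.log(n_docs / df[w]) for w, cnt in c.items()
        }
    return scores


# Toy tokens, invented for illustration only (not taken from the scripts).
docs = {
    "speaker_a": ["just", "respect", "respect", "dont"],
    "speaker_b": ["just", "dude", "dont"],
    "speaker_c": ["just", "dont", "mkay", "mkay"],
}
result = tf_idf(docs)
```

Words that every speaker uses, such as "just" and "dont" in the toy data, score zero, which is the same filtering effect described for the log likelihood ranking, with no significance test involved.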



