wodenokoto on Feb 10, 2016 | on: Text Mining South Park

> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between Identifying Characteristic Words using Log Likelihood and using tf-idf?

minimaxir on Feb 10, 2016

Relevant line in code:

   # remove sparse terms
   all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215

I believe the 0.75 is a sparsity cutoff rather than a tf-idf factor: terms that are absent from more than 75% of the documents get dropped.
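
For reference, the tm documentation describes removeSparseTerms as a filter on how many documents a term is missing from, not as a tf-idf weight. Here is a rough base-R sketch of that idea on a toy matrix. The term names, the counts, and the helper drop_sparse_terms are all made up for illustration; this is not the article's code or tm's actual implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# All counts here are invented for illustration.
tdm <- matrix(
  c(1, 2, 0, 3,   # "kenny": present in 3 of 4 documents
    0, 0, 0, 1,   # "chinpokomon": present in 1 of 4
    1, 0, 0, 0,   # "tuba": present in 1 of 4
    2, 1, 0, 0),  # "cartman": present in 2 of 4
  nrow = 4, byrow = TRUE,
  dimnames = list(c("kenny", "chinpokomon", "tuba", "cartman"),
                  paste0("doc", 1:4))
)

# Hypothetical helper (not part of tm): keep only the terms that appear
# in more than a (1 - sparse) share of the documents.
drop_sparse_terms <- function(m, sparse) {
  docs_with_term <- rowSums(m > 0)
  m[docs_with_term > ncol(m) * (1 - sparse), , drop = FALSE]
}

kept <- drop_sparse_terms(tdm, 0.75)
print(rownames(kept))
```

So with 0.75, a word has to show up in more than a quarter of the documents to survive. That would cut words said only once or twice, but also any word concentrated in a handful of documents, however often it is said there.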
