
> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying characteristic words using log-likelihood and using tf-idf?



Relevant line in code:

   # remove sparse terms
   all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215
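I believe it is related to the document-frequency part of tf-idf: as far as I understand `removeSparseTerms`, a threshold of 0.75 drops terms that are missing from about 75% or more of the documents, so it filters on how many documents a term appears in, not on how often it is said overall.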
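
To make the sparsity rule concrete, here is a rough base R sketch on a made-up term-document matrix. It does not use the `tm` package: `drop_sparse_terms` is a hypothetical helper that mimics what I understand `removeSparseTerms` to do (drop terms whose share of empty documents reaches the threshold), and the toy counts are invented purely for illustration.

```r
# Toy term-document matrix (invented data): rows are terms, columns are documents.
tdm <- matrix(
  c(3, 1, 2, 4,   # "the":   present in 4 of 4 documents
    0, 2, 0, 1,   # "data":  present in 2 of 4 documents
    0, 0, 5, 0),  # "zebra": present in 1 of 4 documents
  nrow = 3, byrow = TRUE,
  dimnames = list(c("the", "data", "zebra"), paste0("doc", 1:4))
)

# Hypothetical helper (not part of tm): keep only terms whose fraction of
# documents with a zero count is strictly below `sparse`.
drop_sparse_terms <- function(m, sparse) {
  empty_fraction <- rowMeans(m == 0)
  m[empty_fraction < sparse, , drop = FALSE]
}

kept <- drop_sparse_terms(tdm, 0.75)
print(rownames(kept))  # "zebra" is empty in 75% of documents, so it is dropped

# For comparison, the inverse document frequency used by tf-idf.
doc_freq <- rowSums(tdm > 0)
idf <- log(ncol(tdm) / doc_freq)
print(idf)  # "the" gets idf 0 because it appears in every document
```

Note the difference: the sparsity filter is a hard cut-off that keeps common terms like "the", whereas tf-idf is a weight that pushes terms appearing in every document toward zero and favours rarer ones.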



