I did my master's thesis on authorship verification, which is exactly this problem (deciding if two texts are written by the same author).
I experimented with clustering, SVMs, neural nets, etc. for a long time, and got mainly disappointing results.
Even when "modern methods" give very high confidence scores, the problem is very messy and complicated, and usually the training data is different enough from the actual data (in supervised learning scenarios) as to bring the result into question.
I don't have access to the paper, so can't say much more, but I've seen a lot of very good-looking results that are in fact questionable.
> After compiling the ‘Jack the Ripper Corpus’ consisting of the 209 letters linked to the case, a cluster analysis of the letters is carried out using the Jaccard distance of word 2-grams. The quantitative results and the discovery of certain shared distinctive lexicogrammatical structures support the hypothesis that the two most iconic texts responsible for the creation of the persona of Jack the Ripper were written by the same person.
So it seems to be a combination of cluster analysis and manual linguistic reading of lexicogrammatical structures. It would be interesting if these studies were done as blind studies, with another letter from the period (or a convincing known fake) also used as a control. Do they find those shared distinctive lexicogrammatical structures in the control too, because they're looking for them now? Does the cluster analysis give significantly higher confidence to grouping the possibly-real letters, compared to grouping one possibly-real letter with the known fake?
Those would make this more interesting and less questionable, especially for a first study into this.
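For anyone unfamiliar with the measure in the abstract, here is a minimal sketch of a Jaccard distance over word 2-grams. It is not the paper's actual pipeline: the tokenization (lowercase, split on whitespace), the function names, and the handling of empty texts are all my own assumptions, and the example strings are made up rather than taken from the letters.

```python
def word_bigrams(text):
    """Return the set of adjacent word pairs in a text.

    Assumption: lowercase and split on whitespace; the paper may tokenize differently.
    """
    words = text.lower().split()
    return set(zip(words, words[1:]))


def jaccard_distance(text_a, text_b):
    """Compute 1 - |A & B| / |A | B| over the two texts' word-bigram sets."""
    a = word_bigrams(text_a)
    b = word_bigrams(text_b)
    union = a | b
    if not union:
        # Assumption: two texts with no bigrams at all are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Made-up strings, purely illustrative.
same = jaccard_distance("the cat sat down", "the cat sat down")
disjoint = jaccard_distance("the cat sat down", "a dog ran off")
```

Here `same` is 0.0 (identical bigram sets) and `disjoint` is 1.0 (no bigrams in common). The control I'm asking about would amount to checking whether the distance between the two possibly-real letters is clearly smaller than the distance between either of them and a known fake.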
Andrew Wakefield, the man who published the fraudulent "vaccines cause autism" paper, was a medical researcher. Do we then discount all other medical researchers?
There are absolutely huge problems in medical research regarding statistical strength and p-hacking. But, I get your point.
In this case, copycats of Jack the Ripper were trying to emulate his (her?) idiolect. So one expects the entire exercise to be hopeless and the conclusion holds no predictive value.
> In this case, copycats of Jack the Ripper were trying to emulate his (her?) idiolect. So one expects the entire exercise to be hopeless and the conclusion holds no predictive value.
The letters in question were sent before the first letter was published. How would a copycat know what to emulate before the first letter's publication? It was also found that the later letters were not very good at copying the original author's style. This is all in the article.
I think medical research should be met with great skepticism, even when it comes from very reputable members of the medical research community. The GP expresses doubts about the entire field, and we now know, for example, that the majority of past psychological studies are flawed.
I don't think it's fair to call it the "vaccines cause autism" paper. It didn't suggest a causal relationship. As usual, it was the fault of the media and public for interpreting it in that way.
> As usual, it was the fault of the media and public for interpreting it in that way.
Absolutely not. The paper was straight-up fraudulent, written purely to push an agenda and quite possibly funded by groups looking to sue the vaccine manufacturers. Wakefield spent the rest of his career actively defending and championing the causal relationship way beyond even what his fraudulent evidence might suggest, and it was the media calling bullshit on his so-called research that eventually got the paper retracted and him struck off the UK medical register.
Look, I love complaining about bad science reporting misrepresenting research as much as the next person, but this is very much not one of those cases.
From the paper (http://www.thelancet.com/journals/lancet/article/PIIS0140-67...): "We did not prove an association between measles, mumps, and rubella vaccine and the syndrome described. Virological studies are underway that may help to resolve this issue".
What the paper really claimed is that, based on parents' recollections, there may be a link between the MMR vaccine and autism, and further studies should be done on it, which is quite reasonable. Whether the paper is fraudulent (I don't think it is), and whether an author of the paper separately claimed a causal relationship, is orthogonal to whether the paper itself claimed that "vaccines cause autism", which is a creation of the media, encouraged by Wakefield.
It was so fraudulent that it was retracted by The Lancet, which is something that almost never happens. The study was fabricated. Data was chosen selectively to indicate a link. Ethical standards were violated. Funding by litigators was undisclosed. Everything about the study was flawed or outright fraudulent.
The fact that Wakefield et al. attempted to cover their asses by claiming they did not prove the link does not reduce the level of fraud or make the fabricated implication acceptable. Wakefield et al. never explicitly said in the paper that vaccines cause autism, but they fabricated the entire study to paint that picture.
The next episode in the saga was a short retraction of the interpretation of the original data by 10 of the 12 co-authors of the paper. According to the retraction, “no causal link was established between MMR vaccine and autism as the data were insufficient”.[5] This was accompanied by an admission by the Lancet that Wakefield et al.[1] had failed to disclose financial interests (e.g., Wakefield had been funded by lawyers who had been engaged by parents in lawsuits against vaccine-producing companies). However, the Lancet exonerated Wakefield and his colleagues from charges of ethical violations and scientific misconduct.[6]
The Lancet completely retracted the Wakefield et al.[1] paper in February 2010, admitting that several elements in the paper were incorrect, contrary to the findings of the earlier investigation.[7] Wakefield et al.[1] were held guilty of ethical violations (they had conducted invasive investigations on the children without obtaining the necessary ethical clearances) and scientific misrepresentation (they reported that their sampling was consecutive when, in fact, it was selective). This retraction was published as a small, anonymous paragraph in the journal, on behalf of the editors.[8]
The final episode in the saga is the revelation that Wakefield et al.[1] were guilty of deliberate fraud (they picked and chose data that suited their case; they falsified facts).[9] The British Medical Journal has published a series of articles on the exposure of the fraud, which appears to have taken place for financial gain.[10–13] It is a matter of concern that the exposé was a result of journalistic investigation, rather than academic vigilance followed by the institution of corrective measures. Readers may be interested to learn that the journalist on the Wakefield case, Brian Deer, had earlier reported on the false implication of thiomersal (in vaccines) in the etiology of autism.[14] However, Deer had not played an investigative role in that report.
That all came later as an attempt to punish Wakefield for the negative consequences to society due to people not getting vaccines. But people only stopped getting vaccines in the first place based on a misinterpretation of the paper.
They emphatically did not paint the picture that vaccines cause autism, which is exactly what "We did not prove an association between measles, mumps, and rubella vaccine and the syndrome described. Virological studies are underway that may help to resolve this issue" is cautioning against. That picture was painted by the media.
Wakefield AJ, Murch SH, Anthony A, Linnell J, Casson DM, Malik M, et al. Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. Lancet. 1998;351:637–41.
In 1998, Andrew Wakefield and 12 of his colleagues[1] published a case series in the Lancet, which suggested that the measles, mumps, and rubella (MMR) vaccine may predispose to behavioral regression and pervasive developmental disorder in children. Despite the small sample size (n=12), the uncontrolled design, and the speculative nature of the conclusions, the paper received wide publicity, and MMR vaccination rates began to drop because parents were concerned about the risk of autism after vaccination.[2]
In this article, an analysis of the texts sent during the Whitechapel murders case was presented. This analysis found linguistic evidence that supports the hypothesis that the two most iconic texts signed as ‘Jack the Ripper’, the ‘Dear Boss’ letter and the ‘Saucy Jacky’ postcard, have been written by the same person. Because of the number and the distinctiveness of the linguistic similarities, it is likely that an authorial link also exists between these two texts and a third letter sent to the same recipient, the ‘Moab and Midian’ letter. These results constitute new forensic evidence in the Jack the Ripper case after more than 100 years, even though they do not reveal information about the identity of the killer(s).
Besides the historical and forensic implications, the results presented in this article also have interesting consequences for modern research in authorship analysis, forensic linguistics, and research on idiolect. The results in this article present additional evidence that uniqueness in linguistic production can be found in the way meaning is encoded and that this encoding of meaning can be difficult to imitate.
"""
Had the original link made it easier to read the article, I might not have been as hasty to call it toilet paper.
Still a fascinating problem!