
>"Malamud says that he did have to get copies of the 107 million articles referenced in the index in order to create it; he declined to say how,"

It's clearly Sci-Hub, as this 2019 article all but confirms:

https://www.nature.com/articles/d41586-019-02142-1

>"And around the same time that he heard about the Rameshwari judgment, he had come into possession (he won’t say how) of eight hard drives containing millions of journal articles from Sci-Hub [...] Malamud began to wonder whether he could legally use the Sci-Hub drives to benefit Indian students [...] Asked directly whether some of the text-mining depot’s articles come from Sci-Hub, he said he wouldn’t comment"

(It'd be nice if there were coverage from a source other than this scientific publisher, whose biases are obvious.)




Out of interest, is a copy of Sci-Hub generally available to download, for... research purposes?


Taking you at your word that you're just asking out of interest, I hear (but haven't verified) that libgen mirrors scihub.

http://libgen.rs/dbdumps/


There are torrents available. I'm unsure if I can link to the site directly without getting banned, so you'll have to check the Gizmodo article for the link to the site containing the torrents. [0] It's missing some of the 2021 papers, though, judging by the dates.

[0] https://gizmodo.com/archivists-want-to-make-sci-hub-un-censo...


Also, I want a version that lets me browse papers. Unlike people in academia or other research fields, I don't yet know which paper I want to pirate.

In my experience, this request gets dismissed as absurd by many people, because their experience is always having feeds of papers and abstracts to start from. Or they point to a browser extension that doesn't do what I'm asking at all as if it were good enough (there are some that automatically give you the Sci-Hub link for whatever paper you're looking at, which still leaves me at square one: which paper?).

It's a pretty basic feature, though.


Unpaywall is a browser extension you might find useful. It provides links to open access papers, with the links appearing on PubMed pages, publisher pages, etc. A significant proportion of papers are now open access, so this tool is very often useful.
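
As far as I know, the extension is essentially doing a DOI lookup against Unpaywall's public database, which you can also query yourself. A minimal sketch in Python, assuming the v2 REST endpoint and field names as I remember them from the docs (check api.unpaywall.org for the current schema; the DOI and email below are placeholders):

    # Hedged sketch of the DOI lookup the Unpaywall extension does, via the
    # public Unpaywall REST API (v2). Endpoint shape and field names are from
    # memory of the published docs; verify against api.unpaywall.org.
    import json
    import urllib.request
    from typing import Optional

    def find_open_access_pdf(doi: str, email: str) -> Optional[str]:
        """Return a direct open-access link for a DOI, if Unpaywall knows of one."""
        url = f"https://api.unpaywall.org/v2/{doi}?email={email}"
        with urllib.request.urlopen(url) as resp:
            record = json.load(resp)
        best = record.get("best_oa_location") or {}
        return best.get("url_for_pdf") or best.get("url")

    if __name__ == "__main__":
        # Placeholder DOI and email, for illustration only.
        print(find_open_access_pdf("10.1000/example-doi", "you@example.org"))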


See this Reddit thread on the Sci-Hub rescue mission: https://www.reddit.com/r/DataHoarder/comments/nc27fv/rescue_...

There is a link there to 851 torrents, ~77 TB of data (compressed zip files), and also an index (an SQL DB dump).


77 TB of compressed publications.

That's pretty crazy to think about, even if you consider overhead and multimedia assets like the images often included in PDFs. I remember the old "the Library of Congress fits on this CD-ROM" analogies (which weren't entirely true), but this takes it to a whole new level.

At some point, it seems like in research it will be far easier to skip the lit review and just do the work, and then, if it turns out you did the same work someone else did, compare results for consistency. We may yet get past the reproducibility crisis thanks to this deluge of information. The underlying issue, though, is that if you couldn't find the related work in the first place (which is why you independently repeated it), you may also never find the duplicate efforts you'd need to compare against.


> 77 TB of compressed publications.

PDF is fairly inefficient compared to formats like DVI (and consider that so many papers are produced using TeX anyway, though figures may be in various image formats).

But 77 TB? You could host the whole thing in a shoebox with ten 8 TB flash drives, a 10-socket powered USB hub, and a Raspberry Pi.

Someone really needs to build Shoe-Hub.
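
For anyone who wants to sanity-check the shoebox math, here's a quick back-of-the-envelope sketch in Python (assuming decimal-terabyte drives and no redundancy, neither of which the thread specifies):

    # Back-of-envelope check of the "shoebox" capacity claim above.
    # Assumption (not from the thread): drives hold exactly their nominal
    # decimal capacity and the zips are stored with no redundancy.
    import math

    corpus_tb = 77   # approximate size of the torrent collection
    drive_tb = 8     # capacity of one flash drive

    drives_needed = math.ceil(corpus_tb / drive_tb)
    spare_tb = drives_needed * drive_tb - corpus_tb

    print(f"{drives_needed} x {drive_tb} TB drives -> {spare_tb} TB of headroom")
    # 10 x 8 TB drives -> 3 TB of headroom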


Sci-Hub has been a lifesaver for me since I started working on some of the more obscure areas of AI, such as signal/time-series processing. Anything off the beaten path is locked up behind paywalls, and I'm sorry, but I ain't paying $40 just to see whether someone's paper sucks or not (which 95% of them do in this particular niche, especially the ones shielded from scrutiny by paywalls).


95% of all papers I have read have sucked. Maybe they just weren't what I was looking for, but a lot of them I couldn't believe got published as anything novel.


Of the remaining five percent, at least in software, you can be sure at least 80% (4% of the total) don't actually work when tested. It is beyond frustrating to deal with research, to the point that these days, unless the algorithm is very well described or there is source code available, I have to assume the researchers are just lying.


> It is beyond frustrating to deal with research, to the point that these days (...) I have to assume the researchers are just lying.

This is not something new, or even from this century. The Royal Society, which was founded in 1660 and is a landmark in the history of science, adopted the motto "Nullius in verba": take nobody's word for it.

https://en.wikipedia.org/wiki/Royal_Society


Recent AI stuff from major labs on Arxiv is pretty good, but yeah, anything that's AI+some other field is usually pretty bad. It's usually written by someone in that other field who might be an expert in their own domain, but who knows very little about AI or even just numerical optimization in general. The fact that such "work" is accepted uncritically by publishers doesn't inspire a lot of confidence in the value they purportedly add. It's right on the surface: "awesome" results are easy to achieve in AI if you screw up your train/val split, or deliberately choose an extremely weak baseline.
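
To make the train/val-split point concrete, here's a minimal sketch (assuming numpy and scikit-learn are available; the code isn't from any particular paper) showing how leaking evaluation data into feature selection manufactures "awesome" accuracy on pure noise:

    # Sketch of a classic train/val-split screwup: features are selected
    # using the *entire* dataset before splitting, so test-set information
    # leaks into the model even though the data contains no real signal.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10_000))   # random features, no real signal
    y = rng.integers(0, 2, size=200)     # random labels

    # Leaky protocol: pick the 20 "best" features using all 200 samples,
    # then split. The selection step has already seen the test labels.
    X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
    Xtr, Xte, ytr, yte = train_test_split(X_sel, y, test_size=0.5, random_state=0)
    leaky_acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

    # Honest protocol: split first, keep selection inside the training fold.
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    clean = make_pipeline(SelectKBest(f_classif, k=20),
                          LogisticRegression(max_iter=1000)).fit(Xtr, ytr)
    clean_acc = clean.score(Xte, yte)

    print(f"leaky accuracy:  {leaky_acc:.2f}")   # typically well above chance
    print(f"honest accuracy: {clean_acc:.2f}")   # hovers around 0.5

The honest pipeline keeps every data-dependent step inside the training fold, which is exactly the discipline that's easiest to fudge when chasing a headline number. The weak-baseline trick is the same idea in reverse: the comparison is rigged before evaluation ever starts.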



