I use a little script [1] and a passive approach to quickly find a PDF I am looking for among a few thousand academic PDFs. The workflow (illustrated in the GIF [2]):
- as I read new PDFs in the browser, the PDFs are passively downloaded, typically into a Downloads/ folder.
- This results in thousands of papers lying in Downloads/ or elsewhere.
- The command p from the script [1] lets me instantaneously fuzzy-search over the first page of each PDF (the first page of each PDF is extracted using pdftotext, but cached so it's fast). The first page of an academic PDF usually contains the title, abstract, author names, institutions, and keywords of the paper, so typing any combination of those will quickly find the PDF.
What is particularly convenient is that no time is spent trying to organize the papers into folders, or importing them into software such as Zotero. The papers are passively downloaded, and if I remember ever downloading a paper, it's one fuzzy search away. Of course, it does not solve the problem of generating clean BibTeX files.
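The caching idea behind the script can be sketched in a few lines. This is a hypothetical simplification, not the actual fast-p code; `extract` stands in for a call like `pdftotext -l 1 file.pdf -`:

```python
import hashlib
import os

def cached_first_page(pdf_path, extract, cache_dir):
    """Return the first-page text of pdf_path, extracting it at most once.

    `extract` is any callable that pulls the first page's text (e.g. a
    wrapper around pdftotext limited to page 1). Results are cached on
    disk keyed by a hash of the file path, so repeated searches are fast.
    """
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.md5(pdf_path.encode("utf-8")).hexdigest()
    cached = os.path.join(cache_dir, key)
    if not os.path.exists(cached):
        # First time we see this file: run the (slow) extraction once.
        with open(cached, "w", encoding="utf-8") as f:
            f.write(extract(pdf_path))
    with open(cached, encoding="utf-8") as f:
        return f.read()
```

In the real script, these cached first-page snippets are what get piped into fzf for interactive fuzzy matching.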
I love the passive nature of your workflow. I’ve always thought that, as soon as I had consumed (read, viewed, heard) some content (text, audio, video), it would be nice to have a “shadow copy” of it stored in a personal, private knowledgebase, with a simple keyword or more complex semantic search on top.
Basically, a personal search engine with a passively gathered corpus of my experienced content - maybe even filtered at times as in your case where you limited it to academic PDFs (to keep the knowledgebase focused). Kind of like an extension of our human memory.
Consume -> add as extension of knowledgebase -> recall.
Thank you for sharing your workflow - simple and ingenious!
Have you had any issues or thoughts for future enhancements? I can think of a number of other helpful things you could do with the corpus you’ve built, for yourself.
It is not limited to academic PDFs, although I mostly use the script to find those. When typing academic keywords (author names, scientific jargon, etc.), the personal PDFs that also lie in ~/Downloads/ are filtered out.
I recently used the command with some combination of airport/city/airline, and the only match was the boarding pass I was looking for. It could probably be used for hotel receipts or whatnot, as long as pdftotext can retrieve the text. It should find tax returns and related PDFs by querying "IRS + SSN".
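A rough non-interactive approximation of that kind of query is a require-all-terms filter over the cached first pages. This is a hypothetical helper, not part of the script, and fzf's actual matching is fuzzier than this:

```python
def match_all(query, pages):
    """Return paths whose cached first-page text contains every query term,
    case-insensitively -- a crude stand-in for fzf's fuzzy matching.

    `pages` maps a PDF path to its cached first-page text.
    """
    terms = [t.lower() for t in query.split()]
    return [path for path, text in pages.items()
            if all(t in text.lower() for t in terms)]
```

For example, `match_all("delta jfk boarding", pages)` can narrow thousands of files down to the one boarding pass, because only that PDF's first page contains all three terms.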
A current issue I would like to fix is that the preview window does not always highlight the query in full if a single match was found before the full query was typed. It is linked to how fzf handles previewing. I do not have plans for any big enhancements.
edit: I created a public repo to replace the gist. Feel free to post your thoughts or suggestions in the issues!
I did something similar for search until I found Recoll. It has functions similar to what you describe (caching, fuzzy search), with a slick workflow that shows Google Scholar-like context previews and optional remote access to your library through a web UI. It also searches compressed archives and generally simplifies searching many unorganized files.
I agree with what others have mentioned here: I really like your elegant workflow (thanks for sharing!!). I also like that it is generally applicable to any collection of PDFs.
However, to be fair, you can follow a somewhat similar workflow with Zotero in combination with its Firefox plugin: download a PDF by adding it to the Zotero database in Firefox, and Zotero takes care of the indexing. Zotero misses the fancy interactive fuzzy searching your workflow has thanks to fzf, but I've added it as a feature request for Zotero [1].
You don't have to organize your papers into folders (or collections, in Zotero parlance, since a single item can appear in multiple collections). For most academic papers the Zotero plugin will also grab the PDF's metadata as a bonus, at no additional cost.
This is awesome! Interesting enhancement idea: slap an NLP topic model on top of it and explore by clusters. I'd pay for a product that did that (and if I ever find the time I might try to do it myself).
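A real version of that enhancement would likely use a proper topic model (e.g. LDA or NMF from scikit-learn), but the "explore by clusters" idea can be sketched with nothing but the standard library: compute TF-IDF vectors from the cached first pages and greedily group documents by cosine similarity. Everything here (function names, the threshold value) is a hypothetical illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {path: first-page text}. Returns {path: {term: tf-idf weight}}."""
    tokenized = {p: t.lower().split() for p, t in docs.items()}
    df = Counter()  # document frequency of each term
    for toks in tokenized.values():
        df.update(set(toks))
    n = len(docs)
    return {
        p: {w: c / len(toks) * math.log(n / df[w])
            for w, c in Counter(toks).items()}
        for p, toks in tokenized.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_clusters(docs, threshold=0.2):
    """Assign each doc to the first cluster whose seed is similar enough,
    otherwise start a new cluster seeded by that doc."""
    vecs = tfidf_vectors(docs)
    clusters = []  # list of (seed_path, member_paths)
    for path, vec in vecs.items():
        for seed, members in clusters:
            if cosine(vecs[seed], vec) >= threshold:
                members.append(path)
                break
        else:
            clusters.append((path, [path]))
    return [members for _, members in clusters]
```

Run over a Downloads/ folder, this would let you browse "piles" of related papers (or receipts) without ever filing anything by hand, which fits the passive spirit of the original workflow.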
Neat workflow. I use a conceptually similar system, though it’s proprietary (DEVONthink). Drop PDFs, websites, or bookmarks in the inbox, burn through them, and decide if I want to archive them in the database.
I’ve enjoyed having access to the database from my laptop as well as iPhone and iPad. It’s definitely been a workflow I’ve cobbled together. This seems to be working out better though.
I don't think full names of non-authors are very commonly mentioned on the first page of papers, so just including the name in the query should be a useful approximation?
Firefox, or any other browser that downloads the PDFs you browse. When searching/browsing PDFs, my Firefox is set up to download the file into ~/Downloads/.
[1]: the script: https://github.com/bellecp/fast-p
[2]: an illustration in GIF: https://user-images.githubusercontent.com/1019692/34446795-1...
edit: The script has been moved from the gist to the public repository https://github.com/bellecp/fast-p