I use a little script [1] and a passive approach to quickly find a PDF I am looking for among a few thousand academic PDFs. The workflow (illustrated in the GIF [2]):
- as I read new PDFs in the browser, the PDFs are passively downloaded, typically into a Downloads/ folder.
- This results in thousands of papers lying in Downloads/ or elsewhere.
- The command p from the script [1] lets me instantaneously fuzzy-search over the first page of each PDF (the first page of each PDF is extracted using pdftotext, but cached so it's fast). The first page of an academic PDF usually contains the title, abstract, author names, institutions, and keywords of the paper, so typing any combination of those will quickly find the PDF.
What is particularly convenient is that no time is spent trying to organize the papers into folders, or importing them into software such as Zotero. The papers are passively downloaded, and if I remember ever downloading a paper, it's one fuzzy search away. Of course, it does not solve the problem of generating clean BibTeX files.
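The caching idea behind the script can be sketched in a few lines. This is a hypothetical simplification, not the actual fast-p code; `extract` stands in for a call like `pdftotext -l 1 file.pdf -`:

```python
import hashlib
import os

def cached_first_page(pdf_path, extract, cache_dir):
    """Return the first-page text of pdf_path, extracting it at most once.

    `extract` is any callable that pulls the first page's text (e.g. a
    wrapper around pdftotext limited to page 1). Results are cached on
    disk keyed by a hash of the file path, so repeated searches are fast.
    """
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.md5(pdf_path.encode("utf-8")).hexdigest()
    cached = os.path.join(cache_dir, key)
    if not os.path.exists(cached):
        # First time we see this file: run the (slow) extraction once.
        with open(cached, "w", encoding="utf-8") as f:
            f.write(extract(pdf_path))
    with open(cached, encoding="utf-8") as f:
        return f.read()
```

In the real script, these cached first-page snippets are what get piped into fzf for interactive fuzzy matching.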
I love the passive nature of your workflow. I’ve always thought that, as soon as I had consumed (read, viewed, heard) some content (text, audio, video), it would be nice to have a “shadow copy” of it stored in a personal, private knowledgebase, with a simple keyword or more complex semantic search on top.
Basically, a personal search engine with a passively gathered corpus of my experienced content - maybe even filtered at times as in your case where you limited it to academic PDFs (to keep the knowledgebase focused). Kind of like an extension of our human memory.
Consume -> add as extension of knowledgebase -> recall.
Thank you for sharing your workflow - simple and ingenious!
Have you had any issues or thoughts for future enhancements? I can think of a number of other helpful things you could do with the corpus you’ve built, for yourself.
It is not limited to academic PDFs, although I mostly use the script to find those. When typing academic keywords (author names, scientific jargon, etc.), the personal PDFs that also lie in ~/Downloads/ are filtered out.
I recently used the command with some combination of airport/city/airline, and the only match was the boarding pass I was looking for. It could probably be used for hotel receipts or whatnot, as long as pdftotext can retrieve the text. It should find tax returns and related PDFs by querying "IRS + SSN".
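A rough non-interactive approximation of that kind of query is a require-all-terms filter over the cached first pages. This is a hypothetical helper, not part of the script, and fzf's actual matching is fuzzier than this:

```python
def match_all(query, pages):
    """Return paths whose cached first-page text contains every query term,
    case-insensitively -- a crude stand-in for fzf's fuzzy matching.

    `pages` maps a PDF path to its cached first-page text.
    """
    terms = [t.lower() for t in query.split()]
    return [path for path, text in pages.items()
            if all(t in text.lower() for t in terms)]
```

For example, `match_all("delta jfk boarding", pages)` can narrow thousands of files down to the one boarding pass, because only that PDF's first page contains all three terms.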
A current issue I would like to fix is that the preview window does not always highlight the query in full if a single match was found before the full query was typed. It is linked to how fzf handles previewing. I do not have plans for any big enhancements.
edit: I created a public repo to replace the gist. Feel free to post your thoughts or suggestions in the issues!
I did something similar for search until I found Recoll. It has functions similar to what you describe (caching, fuzzy search), with a slick workflow that shows Google Scholar-like context previews and optional remote access to your library through a web UI. It also searches compressed archives and generally simplifies searching many unorganized files.
I agree with what others have mentioned here: I really like your elegant workflow (thanks for sharing!!). I also like that it is generally applicable to any collection of PDFs.
However, to be fair, you can follow a somewhat similar workflow with Zotero in combination with its Firefox plugin: download a PDF by adding it to the Zotero database in Firefox, and Zotero takes care of the indexing. Zotero misses the fancy interactive fuzzy searching your workflow has thanks to fzf, but I've added it as a feature request for Zotero [1].
You don't have to organize your papers into folders (or collections, in Zotero parlance, since a single item can appear in multiple collections). For most academic papers the Zotero plugin will also grab the PDF's metadata as a bonus, at no additional cost.
This is awesome! Interesting enhancement idea: slap an NLP topic model on top of it and explore by clusters. I'd pay for a product that did that (and if I ever find the time I might try to do it myself).
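A real version of that enhancement would likely use a proper topic model (e.g. LDA or NMF from scikit-learn), but the "explore by clusters" idea can be sketched with nothing but the standard library: compute TF-IDF vectors from the cached first pages and greedily group documents by cosine similarity. Everything here (function names, the threshold value) is a hypothetical illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {path: first-page text}. Returns {path: {term: tf-idf weight}}."""
    tokenized = {p: t.lower().split() for p, t in docs.items()}
    df = Counter()  # document frequency of each term
    for toks in tokenized.values():
        df.update(set(toks))
    n = len(docs)
    return {
        p: {w: c / len(toks) * math.log(n / df[w])
            for w, c in Counter(toks).items()}
        for p, toks in tokenized.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_clusters(docs, threshold=0.2):
    """Assign each doc to the first cluster whose seed is similar enough,
    otherwise start a new cluster seeded by that doc."""
    vecs = tfidf_vectors(docs)
    clusters = []  # list of (seed_path, member_paths)
    for path, vec in vecs.items():
        for seed, members in clusters:
            if cosine(vecs[seed], vec) >= threshold:
                members.append(path)
                break
        else:
            clusters.append((path, [path]))
    return [members for _, members in clusters]
```

Run over a Downloads/ folder, this would let you browse "piles" of related papers (or receipts) without ever filing anything by hand, which fits the passive spirit of the original workflow.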
Neat workflow. I use a conceptually similar system, though it’s proprietary (DEVONthink). Drop PDFs, websites, or bookmarks in the inbox, burn through them, and decide if I want to archive them in the database.
I’ve enjoyed having access to the database from my laptop as well as iPhone and iPad. It’s definitely been a workflow I’ve cobbled together. This seems to be working out better though.
I don't think full names of non-authors are very commonly mentioned on the first page of papers, so just including the name in the query should be a useful approximation?
Firefox, or any other browser that downloads the PDFs you browse. When searching/browsing PDFs, my Firefox is set up to download the file into ~/Downloads/.
[1]: the script: https://github.com/bellecp/fast-p
[2]: an illustration in GIF: https://user-images.githubusercontent.com/1019692/34446795-1...
edit: The script has been moved from the gist to the public repository https://github.com/bellecp/fast-p