Tabula PDF: Extract table data from PDFs

scrollaway · on April 20, 2020

So I started working with MCCs (Merchant Category Codes AKA ISO 18245) and I needed some decent lookup tables for them.

I just spent several hours fighting with awful spec PDFs containing hundreds upon hundreds of tables of these codes. Well, Tabula made quick work of all of them:

https://github.com/jleclanche/python-iso18245

It extracted over 200 pages of tables with nearly no errors, and it took maybe 15 minutes of manual cleanup in total to make the data processable by the library.

Final CSVs: https://github.com/jleclanche/python-iso18245/tree/master/is...
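
For anyone curious what "processable" means in practice, here is a rough sketch of the kind of cleanup and lookup involved. It is not the library's actual code: the column names `mcc` and `description`, the helper `load_mcc_table`, and the sample rows are all made up for illustration, and the real CSVs may be laid out differently.

```python
import csv
import io

# Hypothetical sample data standing in for an extracted CSV.
# The headers "mcc" and "description" are assumptions, not the real file layout.
SAMPLE_CSV = """mcc,description
5411,"Grocery Stores,   Supermarkets"
5812,Eating Places and Restaurants
 742,Veterinary Services
Page 12,
"""


def load_mcc_table(fileobj):
    """Build a {code: description} dict from a CSV of MCC rows."""
    table = {}
    for row in csv.DictReader(fileobj):
        # MCCs are four digits; restore leading zeros lost during extraction.
        code = row["mcc"].strip().zfill(4)
        if not code.isdigit():
            continue  # skip stray page headers/footers picked up from the PDF
        # Collapse runs of whitespace left over from wrapped table cells.
        table[code] = " ".join(row["description"].split())
    return table


mcc_table = load_mcc_table(io.StringIO(SAMPLE_CSV))
print(mcc_table.get("0742", "unknown"))
```

Keeping the codes as zero-padded strings rather than ints avoids losing the leading zero on codes below 1000.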