Tabula PDF: Extract table data from PDFs (tabula.technology)
2 points by scrollaway on April 20, 2020 | 1 comment


So I started working with MCCs (Merchant Category Codes, AKA ISO 18245) and needed some decent lookup tables for them.

I just spent several hours fighting with awful spec PDFs containing hundreds upon hundreds of tables of these. Well, Tabula made quick work of all of them:

https://github.com/jleclanche/python-iso18245

It extracted over 200 pages of tables with almost no errors, and it took maybe ~15 mins of manual cleanup in total to make the data processable by the library.

Final CSVs: https://github.com/jleclanche/python-iso18245/tree/master/is...
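
For anyone wondering what that kind of cleanup can look like, here is a minimal sketch. It is not the code from the repo; it assumes a two-column export (code, description), and the function name `clean_mcc_rows` and the sample rows are made up for illustration. It drops header rows repeated on every page, joins descriptions that wrapped onto a second line, and zero-pads codes to four digits.

```python
import csv
import io


def clean_mcc_rows(raw_csv_text):
    """Build a {code: description} dict from CSV text exported from a PDF.

    Assumes (hypothetically) that column 0 is the code and column 1 is
    the description; a real export may need different column indexes.
    """
    lookup = {}
    for row in csv.reader(io.StringIO(raw_csv_text)):
        cells = [cell.strip() for cell in row]
        if len(cells) < 2:
            continue
        code, description = cells[0], cells[1]
        # Repeated page headers and empty rows have no numeric code.
        if not code.isdigit():
            continue
        # Collapse line breaks and extra spaces left by wrapped table cells.
        description = " ".join(description.split())
        # Codes are four digits; restore leading zeros lost in extraction.
        lookup[code.zfill(4)] = description
    return lookup


# Illustrative input only, not taken from the actual spec PDFs.
sample = (
    "MCC,Description\n"
    "742,Veterinary Services\n"
    "MCC,Description\n"
    '5411,"Grocery Stores,\nSupermarkets"\n'
    ",\n"
)

if __name__ == "__main__":
    for code, description in sorted(clean_mcc_rows(sample).items()):
        print(code, description)
```

Keeping the codes as strings rather than ints is deliberate here, since a leading zero is part of the code.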



