PDF Is the World's Most Important File Format (vice.com)
201 points by jbegley on May 3, 2019 | hide | past | favorite | 212 comments


I didn't "get" PDFs until I started doing academic research.

But the ability to collect papers, books, documents, etc. all in a single format that I can read on any device, and mark up with highlights and notes, has been a game-changer.

Yes it's a lowest-common-denominator format. And it's designed for human reading and manual office tasks, not computer processing of data. But it works. And it's supported everywhere.

Doesn't matter if I'm on the Apple or Google or Microsoft or Adobe stack. Doesn't matter if the PDF is 20 years old. It just works.


Some features, not all of them. Forms in particular are poorly supported in the Apple stack; specifically, it can't generate QR codes from form fields in documents like Reader can. Fonts are usually inconsistent and wrong in forms. Letter spacing is all wrong. I'm sure there are tons of other things missing; the spec is ginormous and includes its own version of JavaScript that's similar to -- but not quite -- ECMAScript, which is a huge attack vector [0].

For instance, check out the Canadian passport simplified renewal form [1]. The upper-right corner on the first page is "$FORM$054(06-2018)$V$1.4$CS$0$C$0" on the Apple stack and a proper QR code which changes as you fill in the application in Reader. The big blue "Read Instructions" buttons don't work on the Apple stack either.

It may be important, but it's a waking nightmare of a spec.

[0] http://mariomalwareanalysis.blogspot.com/2012/02/how-to-embe...

[1] https://www.canada.ca/content/dam/ircc/migration/ircc/englis...


Originally there were AcroForms, part of the PDF spec. That's supported by pretty much everyone.

Then, Adobe saw an opportunity to make some enterprise money and introduced XFA (XML Forms Architecture). The XFA spec is associated with the PDF spec, and you can put XFA content inside of a PDF, but the XFA spec is actually larger than the PDF spec. It's utterly insane.

I know that Adobe of course supports XFA, and there are various enterprisey things that support it to varying degrees, but I don't think it's well-supported by anyone outside of Adobe's implementation. Not only is it huge and complicated, but it also requires pulling in a JavaScript interpreter, which is a big ask for a feature that only exists because at one time Adobe thought they could turn it into another revenue stream.

It's noteworthy that the PDF 2.0 spec specifically says that XFA is not only deprecated, but that any PDF which contains XFA content is considered out of spec for PDF 2.0. 2.0 goes back to AcroForms for all of that stuff and ditches XFA entirely. Likewise for JavaScript in general. Anything with JavaScript dependencies in PDF 1.7 is verboten in 2.0.

In general, if you get stuck with a PDF with XFA content (not AcroForms), your best bet is to just use Adobe Reader to fill it out. Hopefully PDF 2.0 will take over the world eventually and everybody will be 100% back on AcroForms.


I'm surprised to hear they're getting rid of JavaScript (and I couldn't find any references to that, but I wish it to be true) -- but don't worry, they're adding a whole new pile of complexity in PRC. "PDF 2.0 adds support for “PRC”, a rich 3D modeling language, originally developed by Adobe." [1]

[1] https://theblog.adobe.com/taking-documents-to-the-next-level...


> I'm surprised to hear they're getting rid of JavaScript (and I couldn't find any references to that, but I wish it to be true)

I won't believe it until I see hard evidence. Acrobat and Reader have supported JavaScript for something like two decades. (I wrote a bunch of JavaScript multimedia APIs when I worked there around 2002.)

Consider the large number of interactive PDFs on the IRS website that use JavaScript for form calculations. I don't think Adobe is highly motivated to break all of those.


I've seen some embedded SVGs get rendered differently by different PDF readers.


Not that you're wrong about the PDF spec being monstrously complex and not universally supported, but why would someone want to generate a QR code from form fields? Who actually wants to use QR codes?


Anyone in Asia. They're hugely popular and super useful.

Adding friends on WeChat, making payments to vendors, getting discounts, installing an app, etc


Yep. In my case add friends on Line, buy subway tix, pay at 7-11 and many other places. You can also receive payments to your own QR code.


If I understand right, they took off pretty much everywhere other than NA for public use. The general vibe I get here is that nobody in the country uses them, and using them is seen as a bit of an anomaly; even scanning one in public makes you a minor spectacle.


A machine that scans forms.


Haha, fair, that was just the first pain point that sprang to mind. They're fairly common on visa and passport applications, and some other government forms. I guess if you're accepting paper submissions for whatever reason, not having to transcribe the contents is a big win.


QR codes are also super useful in warehouses


Like people said, QR codes are huge in Asia. I got one just yesterday to get my movie ticket.


> Forms in particular are poorly supported

I understand this is a problem, as it is one of the most important features of PDFs, but from a consumer perspective (I've never worked in a big office) I do not think I have ever seen a PDF form.

Maybe the real success of PDFs is that it managed to hide all the inner complexity of the format and missing features from the common use cases.


re: forms, I find it easier to just convert a pdf to an image or screenshot it programmatically, and process it with a few custom OCR APIs that identify key/value form structures. Unfortunately, I would say, but it works pretty well.


I wish it really Just Worked everywhere. Unfortunately, the fixed layout and tendency to target 8.5x11 means they’re basically unreadable on smaller screens.


Print or screen: pick one.

If you want something to print beautifully, you take the time to put the figures and pictures exactly where you think they fit. If you want something to be read with arbitrary text sizes, you hope they'll show up somewhere near the relevant text. Treating a book as an ebook, or vice versa, will always be a compromise.


Easy choice: screen.


Horses for courses. For novels? Sure. For reference books with a lot of photos and diagrams, I still prefer print to whatever would result from a "reactive" layout reacting to my current screen and font size. Maybe they could be rewritten as interactive "apps," but that's a lot more work, opening a whole new can of worms.


The new ACM format is attempting to fix this for academic publications. I believe it generates HTML and should look better across devices (and be more accessible).

I haven't seen that happen yet though...


I haven't heard of this "new ACM format", but for years I've wished for a PDF2 that incorporated lessons learned from PDF and web pages for print, ebooks, web on multiple screen sizes, and non-visual readers (for the blind but also for anyone who would like to "read" a book or paper while driving or jogging).

A "multi-view" format that could be viewed as full page layout (both scrolling and paged), small screen linear, audio-only linear, or machine-readable (full, deterministic text sequence, tabular data, alt-text and tags for images and charts, etc.) All pages, images, fonts, etc., would remain encapsulated within a single file.

The closest I've seen is the latest ebook format, but that doesn't have anything close to the power of PDF to lay out a proper textbook or magazine article page.


Isn't that just HTML?


Markdown comes to mind


That's true, but not unique to PDF. Any longer document is "basically unreadable on smaller screens".


No way. A Kindle is great for reading books, but text in a typical PDF is too small to read. Reading on a phone is a bit more annoying but totally reasonable with dynamic text layout.


Sadly PDFs are usually just a series of graphics instructions for drawing stuff (including text) at specific sizes and positions, on a page with a defined size. It's an inherently hostile format for reflowing text, although I think some PDF viewers can do that to an extent. The PDF accessibility spec defines an optional, more DOM-like document structure that a compliant viewer could use to reflow text, but in my experience even the apps that produce that structure don't populate it with enough data to make reflowing reliable or easy to implement. Pretty much every format is way better for reflowing text than PDF (Kindle, ePub, HTML, etc.).


+1 also, kindle != web.


Koreader[0] can be sideloaded on Kindles, Kobos and other devices, and is good at making PDFs readable on such devices.

[0] http://koreader.rocks/


Thanks, I’ll have to check this out.


> and mark up with highlights and notes, has been a game-changer.

What are you using for that? In my experience annotation support is very spotty, not user-friendly and not cross-platform, so I rarely use it for collaboration. Even fill-in forms one gets for reimbursements often seem badly done with too small or too large boxes or missing boxes. Thus, pdf seems fine for read-only, but making changes afterwards is not a good experience.


I could say all the same things about text files or JPGs.

PDF is not the "lowest common denominator".

That would be the format that can be most easily converted to other formats.

With text and images, I can easily create PDFs and documents in myriad other formats.

Alas, PDFs do not convert as easily to other formats. To this day, no one has a PDF-to-text converter that is 100% reliable in preserving the proper line lengths. Not even a company with Google's resources.


Doesn't HTML also provide this, while being vastly more powerful, with the only downside being that it cannot be printed accurately?


It would be much better if the DjVu format, wholly superior to PDF for large files, weren't locked behind a license.


PDF has extremely developed vector graphics support as well as adequate compressed image support (e.g. you can compress using a bunch of methods as good as what DJVU uses for your purpose, including PNG and JPEG2000.)

AFAIK DJVU is a much simpler format primarily for scanned images, not vector graphics (which includes LaTeX-generated documents, mind you.) Could anyone with more knowledge about DJVU comment?


Yes, DjVu has raster images under the hood.


What do you mean by "locked behind a license"?


There are a few patents involved: http://djvu.sourceforge.net/licensing.html


HTML (sans JavaScript) would have been so much nicer though. All of the machine-readable structure and separation of style from content would have been just absolutely wonderful.

But we'll never have that (not in the sense we have PDF.) Either it's an image or an application, no one really understands how to handle anything in between.


And you can get better accessibility with HTML: reflow, adjustable font size, adjustable colors, etc. I've only found a few PDF readers that can display content well on a mobile phone, and it's still a worse experience compared to the web.


[flagged]


Have you set up a bot that copy-pastes this every time 'PDF' is a substring of a post's title? Seriously, stop spamming.


> and spaced repetition via Anki.

Now that is a potentially killer feature. Anki is great once you get information in, but in my mind it always lagged the proprietary one... SuperMemo?.. in terms of getting stuff into cards and your brain in the first place.


Man, this is so annoying.


He's not spamming, he made it himself, and I could really see it helping people.


Then why would they hijack the top comment about widespread PDF support instead of making a separate top-level post when the whole submission is about PDF?


In what way? It seems interesting to me.


Yes, it's neat and interesting, but burtonator posts basically the same comment in almost every thread about PDFs, academic papers, etc.


Except for many, it doesn't. Why glorify a closed format that we spent decades trying to avoid just because it's common now? The inherent issues persist.


PDF has been released as an open standard since 2008.


198 CHF to obtain ISO 32000-2:2017 from ISO, however.


That's the same price as C++ (ISO/IEC 14882:2017) and much cheaper than the entire language codes standard (ISO 639-1:2002, ISO 639-2:1998, ISO 639-3:2007, ISO 639-4:2010 and ISO 639-5:2008).

I get your point, though.


Since ISO 32000-2 there is no proprietary technology included [1]

[1] https://www.iso.org/standard/63534.html


Not all PDFs are standards compliant; plenty are filled with Acrobat-only JS, embedded Flash and form-filling crap.


As e.g. the fine men and women of Int. J. POC||GTFO have demonstrated, it's possible to embed all sorts of weird and wonderful things in a PDF. I think a format where it's possible to include as much as possible is better.

Just look at HTML. Would you rather have it be impossible to extend beyond text, images and stylesheets?


I would. The modern web is a rich-get-richer hellscape of surveillance and advertising.


Form filling is part of the standard, specifically, the process of applying an XFDF to a PDF, which even allows you to do things like present a remote web form and then inject the data into the PDF.
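For reference, XFDF itself is just a small XML vocabulary of field name/value pairs. A rough sketch of generating one (the field names and filename here are made up for illustration, not taken from any real form):

```python
# Sketch: build an XFDF document that a conforming viewer can merge
# into a PDF's AcroForm fields. Field names below are hypothetical.
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"

def make_xfdf(pdf_href, values):
    ET.register_namespace("", XFDF_NS)
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    # /f points at the PDF this data belongs to
    ET.SubElement(root, f"{{{XFDF_NS}}}f", href=pdf_href)
    fields = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in values.items():
        field = ET.SubElement(fields, f"{{{XFDF_NS}}}field", name=name)
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = value
    return ET.tostring(root, encoding="unicode", xml_declaration=True)

doc = make_xfdf("application.pdf", {"Surname": "Doe", "GivenName": "Jane"})
```

This is also what makes the "remote web form, then inject into the PDF" workflow possible: the server only has to emit this XML, not touch the PDF itself.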


HTML can include browser-only JS and embedded Flash and form-filling crap too.


PDFs for reading is one thing. Fillable PDFs are evil incarnate. Apparently there are multiple ways to do it, and only Acrobat Reader is capable of making them all work.

Then you have abominations like embedded flash...

As a standalone file representing the format of a book, PDFs are a good format. But then PDFs can unfortunately (sometimes) store much more, and then can be a security minefield.


I used to have good luck using pdftk for filling out forms and the IRS seemed to encode them well, but it looks like pdftk is mostly dead. Nowadays, evince seems to work OK, sometimes, for filling out forms. And, yes, Acrobat seems to be the only piece of software that does all of them.

That said, my personal favorite for a complete screw you to the format comes from Texas and form 2382:

https://hhs.texas.gov/laws-regulations/forms/2000-2999/form-...

If you decompress it, you'll find that they encoded an HTML webpage into the PDF and, as far as I can tell, this can only be viewed on Windows with Acrobat. It's almost like Texas doesn't want people to apply for medicaid funds.


They used Livecycle (XFA) forms. It's not part of the latest PDF standard.

https://en.wikipedia.org/wiki/XFA

Here is a plain Acroform version :-)

https://send.firefox.com/download/3d2da3ccd62d4ce9/#CJRCa5XO...


Ha! Finally an answer to what these forms are. Thanks! Speaking of which, what software did you use to make the conversion? It might be nice to pass this information along to the agency to see if they'd be willing to convert all their forms.


Save file as Static PDF Form from Livecycle/AEM. (That means advanced capabilities like tables that can "grow" are removed.) Open in Acrobat Pro. Extract the pages (Shift-rightclick the thumbnails on the left side) to a new PDF.

If you asked Adobe, they would say this is not possible. :-)


>It might be nice to pass this information along to the agency to see if they'd be willing to convert all their forms.

If you see a way to, make them pay for it; there is money to be made.


I filled out my taxes with evince a couple years ago, it felt pretty neat using only open source software to do it (and seeing the whole form at once was way less stressful for me compared to how turbo tax does it, I know some people are the opposite though.)


Is there a reasonable, standard, lightweight way of making fillable PDFs? In theory it does seem like a reasonable extension of PDF (people send me xls spreadsheets that have been abused as forms, and I want to stab them) -- is there any non-toxic way of doing this? Perhaps with LaTeX?


I'm not going to call it good, but I used to use FDF (forms data format) along with pdftk to fill in PDF files on the command line using a combination of Makefiles, pdfs, and fdfs. Here's the best reference I could find for the "standard":

https://www.iso.org/obp/ui/#iso:std:iso:19444:-1:ed-1:v1:en

That said, the workflow was pretty terrible. Basically, I had to use pdftk to output from the pdf what fields were writable. Some forms, like those from the IRS, were pretty good at labeling things. Most other forms were terrible and there was a lot of guessing and checking. As such, this doesn't exactly answer your question, but more to comment that there was sort of a way, kind of.
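For the curious, the FDF files that pdftk's fill_form operation consumes are tiny PDF-dialect documents. A minimal generator might look like this (the field name is hypothetical -- in practice you'd discover the real names with pdftk's dump_data_fields, and a real writer would also escape parentheses and backslashes in values):

```python
# Minimal FDF (Forms Data Format) writer, of the kind
# `pdftk form.pdf fill_form data.fdf output filled.pdf` accepts.
# Caveat: no escaping of ( ) \ in values -- a sketch only.

def make_fdf(fields):
    entries = "".join(
        f"<< /T ({name}) /V ({value}) >>\n" for name, value in fields.items()
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        f"<< /FDF << /Fields [\n{entries}] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )

fdf = make_fdf({"f1_01": "Jane Doe"})  # hypothetical field name
```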


OSX file preview lets you fill in PDF forms with no problems, and no additional software to install.


Well, OS X Preview saves your changes even if you quit, and it has some serious issues with radio buttons in documents. It really caused quite the problem with people filling in repetitive forms. The standard "open document, add changes, print, quit" workflow that works in Office products goes very badly on OS X.


You can turn off that auto-saving feature in system preferences. It’s called something like “Ask before saving changes.”


Depends on what it's actually doing. Preview can do actual forms, but it can also "fill" in forms by trying to detect lines and boxes.


PDFs can even contain JavaScript...


Indeed they can: https://github.com/osnr/horrifying-pdf-experiments

(Note, breakout game only works in Chrome's PDF reader)


PDF is the EMACS of file formats.


PDF can contain almost everything


Even as a book format it is clunky, as it has no indexing. Take text-to-speech tools: the bigger the book, the slower they get near the end of the book.

I would rather have a plain ole html file with formatting flags than a PDF book.


It does have indexing, which I frequently use.

If your PDF book does not have it, there are some readers that can add it, e.g. https://helpx.adobe.com/acrobat/using/creating-pdf-indexes.h...


After spending time extracting data and text from PDF I would also say it’s the worst file format. We have perfectly fine structured documents, convert them to PDF to lose half of the information and then we spend insane effort to get the data back somehow.

It also doesn’t render well on different resolutions.

PDF is a perfect case study how inferior solutions can become standards.


PDF isn't and was never meant as a data storage format. It is meant as a presentation format. You write a file in LaTeX or docx, then convert to PDF for sharing. This means the recipient only needs a PDF reader, not whatever tools you used to create the file. This is similar to writing a file in C and compiling to an executable. You can distribute the source, but most (outside of programmers) just want the executable so they don't have to worry about the tooling.


> outside programmers

This is not true. Ask a lawyer, accountant or journalist wading through govt PDFs.

The reason cases and investigations take forever is the time spent manually reassembling related data residing in different orgs. The same data across all orgs will show up as 25 differently structured tables. This is changing slowly, but putting public tabular data in PDFs has probably cost the economy billions.


“This is changing slowly but the putting public tabular data in pdfs has probably cost the economy billions.”

Or you could say it has created a ton of jobs :)


That’s the problem. It was never meant as what it’s used for now. It’s bad for manuals, it’s bad for data, it’s bad for mobile devices, it’s bad for versioning. It’s only good for printing. But somehow it’s used for all of these.


Formatting matters in many cases, and PDF respects that better than any other common format.

I have policy and legal documents from 2005 in PDF/A that can be rendered identically in 2019, and likely in 2105. That isn't the case for HTML, Word or almost any non-plaintext format. If for no other reason than the US Federal Courts require use of PDF, the format will exist and be somewhat vibrant for many decades to come.

I wholeheartedly agree that Adobe Reader sucks, but the format solves lots of problems that are difficult to solve otherwise.


How exactly do you think your HTML files from 2005 would render differently today? Sure, your modern browser engine would probably make slightly different decisions with fonts and margins, but would that negatively impact the usability of a legal document, research paper, etc.? Meanwhile, you would have gotten things like robust search and real compatibility with mobile phones and e-readers for free.

A heavily restricted subset of HTML to replace PDF as the 'archival' format would make the world so much better of a place.


Formatting is important — all of the arguments made in favor of HTML can be made for plaintext. If you don't care about format and want to read the document 30, 50, 100 or more years from now, you should use text.

Will HTML formatting on a 4K display look the same as on an 800x600 monitor?

Will all of the ancient display elements display the same? Will IE4 specific artifacts display?

Search works fine in PDF. Mobile is not optimal, but platforms optimized for mobile require different design considerations. Few webpages look or function identically on mobile.


How is formatting important though? Semantic formatting, sure, but again, a 20-year-old HTML file still has that preserved. The em-perfect formatting inherent in PDF seems entirely purposeless unless you for some reason need to be able to reproduce a printed copy of the document exactly. Formulas are a special case, but the great majority of documents I see distributed in PDF would be more useful as HTML, and I don't buy the argument that it would risk archivability. At the end of the day, HTML is a plain-text format which is quite human-readable even if all the world's HTML renderers were somehow lost to the ages.

The only use case I can conceive for perfectly reproducible layout, if you are not a print publisher, is in fields where it is convention to reference text by page+paragraph number; in those cases, the page number is actually semantic information, so it could quite trivially be encoded in the content of the document to maintain that referent.

> Search works fine in PDF.

Search works fine in PDFs that don't use hyphenation or which properly implement it, and when they don't it breaks silently in ways that are potentially disastrous.

> Mobile is not optimal, but platforms optimized for mobile require different design considerations.

And my point about mobile was you don't need to optimize for it.

>Few webpages look or function identically on mobile.

Again, it doesn't need to look identical. That is a false requirement. There is no value in that. The only web pages that don't function on mobile are ones which have been optimized for desktop. We're not talking about general-purpose web pages here, we're talking about textual documents.

This is not some hypothetical scenario, BTW. The UK has been using HTML over PDF for public-facing documents for a couple years now, it seems to be working out for them[1].

1. https://gds.blog.gov.uk/2018/07/16/why-gov-uk-content-should...


The only web pages that don't function on mobile are ones which have been optimized for desktop.

Although it's rare to find them nowadays, this is false for pages with no CSS at all. By "no CSS" I mean not even the infamous proprietary viewport meta tag... which is a posteriori being made part of CSS. Access, for instance, the basically unstyled page [1] with a $1000+ smartphone of our day... and you're likely to find an unreadably small font.

Now, you could argue [1] functions on mobile, but let's agree it's stretching the meaning of that term. But that's not the main point: HTML elements come and go (for instance MENU; find others in [2]), so it's clear that archival reliability is not a big priority.

We all agree that ideally the source should always be made available (mandated if tax payers' money is involved if you ask me), but that doesn't invalidate the value of a universal presentational format.

[1] http://www.qrg.northwestern.edu/papers/files/simhobby-local....

[2] https://meiert.com/en/indices/html-elements/


It’s good for manuals, which often have complex formatting and layout that necessitates precise control. It’s not appreciably worse for mobile devices than the incredibly bloated HTML that you see nowadays. It’s completely orthogonal to versioning, and PDF diff tools are quite competent. It’s not only spectacular for printing, but completely unmatched for paper <-> digital round trips.


Good perspective... and it made me think of a parallel with Democracy. Could the atrocity that is pdf be the "least bad" way to do all of the above in one file format?


No. It’s the worst bad :). It’s only good for printing.


True, PDFs should probably be thought of more as "image" files than editable documents. As a binary format, you need to use visual tools in order to view them. I have a little experience with this, having built a tool to diff them: https://www.parepdf.com.


PDF is based on PostScript and designed for print-perfect output. It's essentially a bunch of rendering commands for a printer. It's not supposed to be a word-processing format that can be edited later; it's the final output from a word processor or whatever.
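You can see the "bunch of rendering commands" character by assembling a tiny PDF by hand. This is a bare-bones sketch (uncompressed, one page, a built-in font, no escaping of the text string): BT/ET bracket a text object, Tf sets font and size, Td positions the cursor, Tj paints the string.

```python
# Hand-assemble a minimal one-page PDF: header, five objects,
# an xref table of byte offsets, and a trailer.

def minimal_pdf(text="Hello, PDF"):
    # The page's content stream is literally drawing commands:
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET"
    objects = [
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        f"<< /Length {len(content)} >>\nstream\n{content}\nendstream",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = "%PDF-1.4\n"
    offsets = []
    for i, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset for the xref table
        out += f"{i} 0 obj\n{body}\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n0000000000 65535 f \n"
    for off in offsets:
        out += f"{off:010d} 00000 n \n"     # each xref entry is 20 bytes
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n")
    return out

pdf = minimal_pdf()   # write this string to disk and a viewer will open it
```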


Right, but somehow it gets used for all sorts of things that are never printed in offices everywhere.


It gets printed to the screen for reading and still retains the text for searching. The ability to easily edit can actually be a liability in many situations.

It's literally a replacement for paper documents, except you can search and view them on a computer and still hit print and get a great paper reproduction.

Word processing/editing has so much more going on. PDF actually removes features from PostScript for security purposes; nobody wants to be sending Word docs around just for viewing.


Recovering coherent text strings from PDF is non-trivial. Paragraphs are merely implicit and hyphenation must be removed, among many other problems. Anything unusual like sub/superscripts or inline math will potentially end up as gibberish or missing.


I have spent weeks on an algorithm that would recreate words and paragraphs from letter rectangles and the space in between. Fun, fun, fun. Tables are even better. Page headers and footers then complete the insanity nicely.
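A toy version of that reconstruction step, just to show the shape of the problem: cluster per-glyph boxes into words by thresholding the horizontal gap. (Real PDFs add rotation, kerning, hyphenation, headers/footers and out-of-order draw calls on top of this; the coordinates below are invented.)

```python
# Cluster glyph rectangles (x0, x1, baseline_y, char) into words:
# a new word starts on a line change or a large horizontal gap.

def glyphs_to_words(glyphs, gap=2.0, line_tol=1.0):
    # sort top-to-bottom (larger y first in PDF space), then left-to-right
    glyphs = sorted(glyphs, key=lambda g: (-g[2], g[0]))
    words, current, prev = [], "", None
    for x0, x1, y, ch in glyphs:
        new_line = prev is not None and abs(y - prev[2]) > line_tol
        big_gap = prev is not None and (x0 - prev[1]) > gap
        if current and (new_line or big_gap):
            words.append(current)
            current = ""
        current += ch
        prev = (x0, x1, y)
    if current:
        words.append(current)
    return words

boxes = [(0, 5, 100, "H"), (5.5, 9, 100, "i"),
         (15, 20, 100, "P"), (20.5, 24, 100, "D"), (24.5, 28, 100, "F")]
# glyphs_to_words(boxes) -> ["Hi", "PDF"]
```

Picking the gap threshold is where the weeks go: it varies with font size, justification and the quirks of whatever produced the file.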


That has a lot to do with the program that created it. HTML can be bad to search too -- I mean, try to index a React page.

It's a better situation than OCRing a paper doc, though.


The alternative doesn't have to be a Word doc; I'd prefer OpenOffice XML docs, Google Docs, Markdown, or just plain old rich text half the time. For the other half of use cases, graphically rich report-like things, HTML or a PowerPoint would have been better. I don't want my content to be constrained to a paper format to be zoomed and scrolled around on a not-paper-sized screen with no support for reflows.


Agreed. PDF is intended to accurately render a specific presentation format - often desktop. There is nothing quite like the melange of frustration you get when you are trying to fix something and realizing you have to open up a PDF product manual on your phone. Dragging and zooming all over the place like it's Google fucking Maps. Bonus points when the text isn't even searchable.


It's important to differentiate between the file format and software that renders that file format. Admittedly there are better and worse PDF viewers, but that shouldn't be the final (or even most important) determiner of how important it is.


Yes, I often use SumatraPDF, but there's always those super-complicated Turing-complete documents which use web-based-authenticated-signed God-knows-what, that won't open in anything other than Acrobat.


PDF/A-1a is a form of PDF where the text is embedded in a structured way. It is a 'tagged PDF'. This is meant to make PDF files accessible. It makes it possible to extract the text well.

Some governments, e.g. Dutch government, mandate that PDFs be tagged PDFs.


I think .doc(x) is clearly the worst file format - they're not as portable as people seem to think they are, and they're not long-lasting. PDF, for all of its failings, doesn't suffer from these issues.

A better solution, perhaps, for preserving structure, would be TeX, markdown or Org files, but PDF has the advantage of not having to be compiled and being ready for presentation/consumption (possessing platform invariance).


I loved it when pdf started gaining traction.

Back then there was only one format in the world: .doc.

PDF became standard because it was a SUPERIOR solution for the use cases that most people wanted.


I have to agree. PDF is the biggest pain in the rear I've had to deal with in my development career. I am also not crazy about the fact that such an important standard for storing information is so heavily influenced by one company (though PDF is an open standard at least).


I can't really say it's an open standard either since proprietary binary data can be embedded that may not be usable by the 'standard' pdf viewer.

Yes, I have had to deal with it on a development slant as well. The open source tools are rare.

Paper records are the reason systems went to databases to store information.

PDF is NOT the swiss army knife of IT-data and never will be.


PDF, What is it FOR?

Here's a Computerphile video that explains it:

https://www.youtube.com/watch?v=48tFB_sjHgY


I'm not disagreeing with you but that does not take away from its importance. There are lots of crappy important things :)


Can’t argue about that!


"Most important" and "best" are descriptors that rarely apply to the same product. The most important beer in America is probably Bud Light.


PDF is based on PostScript, which is a Page Description Language (PDL) also from Adobe. The first PostScript printer, the Apple LaserWriter, launched the desktop publishing revolution. The interesting thing about PostScript is that it is a full programming language, with loops and conditionals and so on. When Adobe designed PDF, they kept the same imaging model as PostScript, but stripped out the programming language. Thus they ended up with a dynamic page description language (PostScript) for media (printers) that cannot fully take advantage of a dynamic PDL, and a static PDL (PDF) for media (i.e., computer screens) that could have benefited from a full programming language!


PostScript is a pretty interesting stack-based language on its own. Its size is intimidating though. If I remember correctly, there was a toy web server written in PostScript just to show that it is possible.


> When Adobe designed PDF, they kept the same imaging model as PostScript, but stripped out the programming language.

And then they added JavaScript to it:

    http://www.adobe.com/devnet/acrobat/javascript.html
The small subset that displays text and images is fine, but PDF is a nightmare.


Someone made a submission to the ICFP contest which was a self-drawing PostScript solution.


I don't know. I personally think plain text is a more important file format. It's readable and writable by a million programs. Any first-year programming student can easily build their own program to read and write plain text. It is fairly easily parseable. Etc.

I just opened a random PDF on my computer in a text editor and it starts off with "xÕ\ko‹∆˝Œ_1@ÉbD 9|ÌËáƒZçÌDJÇ¢)UZ[nıÚÆÏƒA˛Pˇeœπ"


Plain text can be a really problematic format for data preservation.

- While it's fairly easy to read and write plain text, it's also fairly easy to inadvertently introduce unintended artifacts in the process.

- The more frequently a file gets passed around and read and written to, the more likely mojibake[1] will get introduced. This concern rises exponentially when you move to non-US audiences and introduce locale-specific encodings. File storage settings, client operating system settings, server configuration settings, database settings, programming languages that touch it along the way: all of them introduce assumptions about a file's encoding, and many failure cases can be subtle and easily go unnoticed at a glance while causing some irreversible damage to downstream recipients.

- Even if you solve for the encoding, you still have structural issues with tabular data. Different parsers treat escaping and quoting policies differently. This can result in data shifts as things get mis-parsed, data corruption if literal values get interpreted as escape characters or vice versa, etc.

For preserving data, generic plain text tends to get worse and worse over time because it's such a non-opinionated format. Even if you document the specifics on encodings and parsing details, it's easy for those to get lost as things change hands, or for intermediaries to corrupt the plain text because they relied on defaults instead of the documented parsing details.

For better or worse, PDF tends to solve the preservation issue while introducing potential barriers on the parsing/processing side.

[1] https://en.wikipedia.org/wiki/Mojibake
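The mechanism behind mojibake can be shown in a couple of lines (a minimal sketch; no file I/O needed, since the damage happens purely at the encode/decode boundary):

```python
# Bytes written under one encoding, read back under another: classic mojibake.
original = "café"
stored = original.encode("utf-8")    # what actually lands on disk: b'caf\xc3\xa9'
misread = stored.decode("latin-1")   # a downstream reader assumes the wrong charset
print(misread)                       # 'cafÃ©' -- silent, and it survives further copies
```

If the misread string then gets re-encoded and saved, the corruption compounds with each hand-off, which is exactly the "subtle and easily unnoticed" failure mode described above.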


>mojibake

These days, those that are promoting the idea of plain text as a long term archive format are assuming UTF-8 by default.


Using your goalposts I'd argue a paper napkin is the most important file format. It's readable & writeable by 7 billion wetware programs, including the drunk ones, and even a first grader can make use of it.


I don’t know what is worse, extracting text from a napkin or a PDF. At least with the napkin you know it is going to be lossy going in.


No, it doesn’t. PDF files start with “%PDF” (https://en.wikipedia.org/wiki/Magic_number_(programming)#Mag...)
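For what it's worth, checking that magic number is a one-liner (a sketch; note that lenient readers also tolerate a little junk before the header, which this ignores):

```python
# Sniff a PDF by its magic number rather than its file extension.
def looks_like_pdf(data: bytes) -> bool:
    # Per the spec, a PDF begins with a header like "%PDF-1.7".
    return data.startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.4\n"))  # True
print(looks_like_pdf(b"hello"))       # False
```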


It's also very inefficient, especially if you want things like images embedded in it.

It's not like you can read the text without the right program; it's still binary. There just happens to be a mostly agreed-upon standard and a lot of programs that can decode that standard and render it to screen.

PDF viewers are built into most browsers now and allows rich page perfect print ready results.

If I had it my way we would settle on a binary data serialization format, I don't care whether it's MsgPack, Protobuf, heck, maybe even SQLite, and then everyone could have a viewer to snoop around in it. You would still have to understand what's being encoded, but you could always "view source", so to speak.


It's plenty efficient. It's just a container for an image. It's really, really efficient if you use vector graphics.


Are you talking about text? No, base64 is not an efficient way to store raster images, and neither is vector markup.


PDFs don't need to use base64 (nor base85). They can embed pure binary just fine; a PDF with a JPEG is just a few more bytes than the raw JPEG.

Same for text. Text can be pure text, but to save space it will usually be deflate-compressed.
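That deflate compression is plain zlib, which is easy to demonstrate (a sketch; in a real PDF the stream object would declare /Filter /FlateDecode in its readable dictionary, with only the payload compressed):

```python
import zlib

# Round-trip a repetitive text "content stream" through deflate,
# the same compression a PDF writer applies to its streams.
text = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 50
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)   # lossless round trip
print(len(text), "bytes ->", len(compressed), "bytes")
```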


Right, so you're talking about PDF and not text. I agree PDFs are efficient, as they're a binary format.


PDF files begin with %PDF, not sure you have a valid PDF there


True. I just skipped the header lines and grabbed the start of the lines after the headers.


I'm not sure how you got that string, since there's no charset except full-Unicode ones that would contain both ∆ and œ, and compressed binary text isn't going to appear to be UTF-16 or UTF-8, so I don't see any obvious charset you could be using.

Hmm. Maybe your charset is Windows-1253, and unassigned characters are being supplanted with the equivalent codepoints from Windows-1252?


Actually, PDF is primarily a text format, it's just usually compressed, and often contains embedded binary content like fonts and images.

You can see the text code by doing something like this:

  pdftk input.pdf output input_uncompressed.pdf uncompress
You can also edit it in that state, in an editor that preserves binary content, but there's a hard-coded offset table at the end, so if you change the length of anything, that table needs updating (very fiddly to attempt by hand, but automatable, and some PDF tools automatically fix broken offset tables).
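To make the offset-table point concrete, here's a hedged sketch that assembles a minimal, content-free PDF and computes the cross-reference entries programmatically; every entry is the absolute byte offset of an object, which is exactly why hand-editing lengths breaks things:

```python
# Build a tiny PDF (catalog -> pages -> one empty page) and derive its
# xref table from the actual byte positions of each object.
objects = [
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n",
]

body = bytearray(b"%PDF-1.4\n")
offsets = []
for obj in objects:
    offsets.append(len(body))            # byte offset where this object starts
    body.extend(obj)

xref_pos = len(body)
xref = b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
for off in offsets:
    xref += b"%010d 00000 n \n" % off    # 10-digit zero-padded offsets
trailer = b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
    len(objects) + 1, xref_pos)
pdf = bytes(body) + xref + trailer
print(len(pdf), "bytes; first object at offset", offsets[0])
```

Change a single byte inside object 1 by inserting or deleting, and every offset after it (plus startxref) is wrong, which is the bookkeeping that tools automate.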


There's really no such thing as "plain-text".

Maybe you mean ASCII, which is indeed simple and also not useful for several billion people. (I live in one of the many countries that cannot use ASCII -- the American Standard Code for Information Interchange -- because we don't speak American here.)

Or maybe you mean Unicode, which is an extremely long spec and absolutely cannot be handled by a first-year programming student.


Arguably it actually is plain text, but I think you can build a case for at least a dozen file formats that they’re the most important file format in the world, so arguably, none of them are.


The solution to that is postscript. I still see quite a few papers being distributed in postscript form.


PDF files start with %PDF. You’re probably looking at DEFLATEd or binary data. What do you expect?


My gripe with PDF is that I don't understand why a standard format which is almost 30 years old, requires seemingly weekly updates of the Acrobat Reader, which in turn requires reboot of my work laptop. I upgrade the reader far more often than I actually use it.


Nowadays you typically don't need Acrobat Reader; Chrome, Edge, and Firefox can all view/print PDFs. If you're on macOS/iOS the viewers are built in, as the OS's Quartz graphics layer uses an imaging model closely based on PDF (its NeXT ancestor used Display PostScript).


Chrome/Firefox don't handle complex PDFs well. They render slowly and sometimes inaccurately.

It's not uncommon to print things from Chrome/Firefox and have the margins/cropping be wrong. Then again, you're using a web browser to print a PDF file; you get what you deserve.


I very rarely run into this in Chrome; they licensed Foxit's engine and open-sourced it (https://opensource.google.com/projects/pdfium). It's native and fast.

Firefox, I think, still uses https://github.com/mozilla/pdf.js, which was not bad last time I used it, but it's all JavaScript, so performance isn't up to par with native. Also, printing is not great, and I don't think they have implemented an SVG backend yet for better printing. On the plus side, you can embed it in your web app if you want and have more control over the viewer.


Chrome is not too bad these days. They 'forked' the Foxit codebase, and unless you need some weird features, Chrome should render the page accurately. The margins are specified in the page data in the document.


Sounds like your gripe is with Adobe, not PDF.


Technically, you are correct, but in a corporate environment of managed software, I can't separate the format from the company. When an update is scheduled I have no choice but to install and reboot.


Couldn't you open it in a browser? Or does it want to install regardless of whether you open Adobe?


Probably because Adobe's PDF reader is full of vulnerabilities and constantly used to spread viruses and compromise machines. I wouldn't be surprised if there were more malware infested PDF files on the internet than legit ones.


That is Acrobat's fault, not PDF's.

There is churn in all major applications these days, regardless of whether it's needed. Reader is just trying to regain mindshare from people viewing PDFs in browsers.



SumatraPDF works fine.


I run a PDF generation service [1], so this is nice to read. When I first launched the service, I was worried that PDFs and paper forms might become obsolete in the near future, when everyone starts to go paperless and digital. Now that I'm more familiar with the market, this is no longer a concern. (I have banks who are using the software to modernize their operations.)

I also realized that there might be some pressure to turn into the next TurboTax, where the company eventually lobbies against improvements just so we can stay in business. I made a resolution that I'll never do anything like that. But I guess the founders of TurboTax never intended to do that either.

[1] https://formapi.io


I love little profitable businesses like this. Seems like you’re living the dream


I love PDFs. The first time I used one was Acrobat 2.0. I bought a shitty SAMS computer book that came with a bonus shitty SAMS computer book on the CD. As a PDF of course.

It was cumbersome on a Pentium 75MHz with 640x480x8 graphics. I'd be blown away 20 years later viewing those same files on a Macbook Pro with Retina display.

However, it was easily the best way to distribute printed documents EXACTLY the way they were meant to be seen.

They weren't meant to be edited, modified, or have data extracted from them... And Adobe went from a quick, minimal viewer to a bloated security nightmare by adding 'features'.

Luckily, 3rd party and open-source projects came to the rescue.


PDF is the perfect tool for maintaining document formatting but the worst tool for maintaining document data. The format is not concerned with the actual data of the document. Programmatically extracting data from even the simplest PDF is an exercise in patience. I find it kind of odd that this was chosen as the standard, all things considered.


I sometimes read papers, but I find them an absolute pain to read on my phone because the two aren't really designed for each other. So I end up with this frustrating experience of zooming in/out to read the bits I'm interested in, which is impossible to do one-handed!

I know there's that arxiv vanity thing, which is cool, but most of the time I get the "sorry we can't render this" error message.


PDF is fine if you are using a laptop/desktop/tablet. But it doesn't fit well onto phone screens. Usually I have to turn my phone horizontally and then zoom in a little bit. And when I reach the end of the page I have trouble turning it. And sometimes page turning messes up my perfect zoom fit, etc. A responsive PDF is what I need...


That’s not a PDF issue though: it’s just an issue of most content being the same dimensions as real world paper sizes. No idea how responsive PDF would work but I agree that would be cool.


People in these comments keep mentioning a responsive PDF, but PDF is supposed to be a presentation format. Wouldn’t HTML and CSS serve the purpose fine?


You can't represent vector graphics, complex math, or anything non-standard in HTML+CSS. Even if you could, you couldn't rely on the user seeing it correctly.


Aren’t SVG and MathML standard in HTML5?


Interesting; I'm not sure if it counts as "in HTML5" but I see they are referenced in the spec.

It's still not something you can rely upon.



It's also actively harming medical and scientific progress as I allude to here in this talk https://www.youtube.com/watch?v=EM61rn9Gxl4&list=PLjzcwcP9P2...


Can you summarise your point about that here?


This would've been a lot more convincing before mobile happened a decade ago. Reading PDFs on my phone is a pain, and it hasn't improved at all in the last ten years. Looking at the web on mobile used to be painful too, but it's ok now.


What phone? I'm still using an iPhone 6S and they are seamless. Some textbooks that are 300+ MB start having hiccups, though. Probably due to my oldish phone.


They're fast, but I have to scroll and zoom around a lot.


Possibly also the most hated file format. Just the fact that no two PDF editors behave the same way goes to show how bad of a format it is.


It is not open data.

You still can have proprietary blocks of info inside the file.

The lack of open source tools to manipulate the format is a major hindrance IMHO.

It is also very space-wasting when people just dump bitmaps into the file for scans. Forms are another area poorly covered by open source tools.

It has so many hacks and kludges, it would be better if it were trashed and we started over with PostScript.


What happened to HTML? Nobody here could read or discuss PDF without it.


The minor HTML variant standardized as EPUB is alive and well, and ubiquitous for ebooks, wherever exact page layout (as in scientific textbooks) and page-number references don't matter.

With very narrow exceptions, which are even narrower than PDF publishers think, EPUB is and should be preferred even for technical or academic writing. (Kindle's KF8 format is just a repackaged variant of EPUB.)


Let me know when browsers support the full CSS print spec... no, Prince doesn't count.


Tell me when PDF can reasonably reflow in all environments/devices.


PDFs are not meant for reflow; layout has already been done by the program that made them, and pagination is complete.

Again, I would be happy with HTML if the major browsers actually implemented the full CSS print spec, which exists to cover what PDF does with print layout.


The whole point of a pdf is to display _exactly_ the same across all devices. For something like a scientific research paper, you want all figures, etc., to remain in the same place to avoid formatting oddities.


That’s an important feature, just like reflow. But does the former make PDF World’s most important file format?


>> The whole point of a pdf is to display _exactly_ the same across all devices.

> That’s an important feature, just like reflow. But does the former make PDF World’s most important file format?

It's not just that PDFs are displayed the same across devices, it's that it's displayed the same over time. I can open a PDF generated in 1995 and it will look identical today as it did then. The same can't be said about HTML or Word documents generated in 1995.


The thing I find craziest about PDF/A is that it isn't really a format in itself, just a vague promise not to use certain features in the resulting file. Whether any reader holds the file to that promise is something I'm quite doubtful of. Instead I suspect most readers will do their best to display anything they're handed, happily passing it through any of the hundred-or-so, possibly legacy-code-powered sub-format decoders the file author wishes, leading to a massive attack surface.

From a developer's point of view, when trying to enforce that submitted files are strictly in PDF/A format, from what I can tell there isn't much more you can do than dissect the file looking for umpteen disallowed features.

Is there an ISO-compliance validator to anyone's knowledge?



Thank you!


I remember when PDF came out, I didn't get it. I thought the problem was already solved by compressed postscript files! I was already used to downloading and sometimes printing paper documentation in this format.

It was natural to view PostScript files on NeXT and UNIX machines, and Ghostscript was already a thing. What could be better than just using the "native" language of the printer? I didn't realize that this was not a common view, or even possible for most personal computers at the time.

I was also misinformed for quite some time about the internal format of PDF, assuming it was just PS wrapped up in a container. In a sense, this is true, but there's a lot more (embedded fonts, transparency, forms are just a few that come to mind).


> I was also misinformed for quite some time about the internal format of PDF, assuming it was just PS wrapped up in a container. In a sense, this is true, but there's a lot more (embedded fonts, transparency, forms are just a few that come to mind).

There's also a lot less, as PDF is not a full programming language like PS.


PDFs are an accessibility nightmare, and most production pipelines are terrible at preserving the semantic structure of the document or, in many cases, even preserving the text. A PDF is in many cases a series of page images that aren't usable for anything other than human viewing or printing. "Export as PDF" generally produces much better results than "Print to PDF", though not from every application. Many days I wish there were an alternative solution based on SVG. While not perfect, it certainly would avoid many of the problems of PDF while having all of the important capabilities.


I really hate PDF for many reasons. It seems it's only a partially open format, and a lot of features are implementation-dependent (how can a file be 'locked' to prevent printing or editing?). There are very few free and FOSS clients that handle forms, highlighting, etc. Some clients do highlighting and annotations but don't save them to the PDF itself.

The failure of EPUB and other HTML-based formats in this use case, IMO, is that their focus on reflowing to support any display and device makes them inconsistent and therefore impractical for replacing PDF-based content.


I've been working on an open source PDF library for C# [1] and it's given me an immense appreciation for the PDF format.

Sure, it's horribly long and complex, comes with vulnerabilities, and different consumers behave differently. But given the constraints of machines at the time it was created, and the wide range of requirements and usages, it's pretty damn good and has stood the test of time.

[1]: https://github.com/UglyToad/PdfPig


Could this be used to remove potentially-malicious content from a PDF, e.g. anything executable?


It's different software entirely, but QubesOS has an interesting way of removing malicious content from a PDF. It basically opens the document in a throwaway VM, then the host takes a high-resolution screenshot and makes a new PDF from scratch.

The digital equivalent of physically printing and re-scanning the document.


Unfortunately not at the moment sorry, the editing and creation capabilities are extremely limited at present.

The library is mainly focused on text extraction but editing is on the road map.


Even among non-technical people this isn't true. Those people use Excel and Word far more than PDF. For devs, we're all about various text formats.


PDF is great! For human reading. But it's garbage for everything else, and, as I found out from a bunch of frustrating job application efforts, absolute shit for resumés. Literally not parseable, even with OCR, apparently. I don't even know WHY any job application accepts a PDF for a resumé. It shouldn't be allowed to if there's any sort of post-processing done to extract information. A bunch of applications didn't even let me see what was extracted, which, depending on the PDF-to-text application they're using, may spit out absolutely nothing from an entire page of text. That's right, a whole page converted to a single empty line. Marvelous. Truly a technological marvel that's helping the "it's hard to get a job" feeling go away.


PDF is the one and only format any resume should ever be submitted in.

It's the only way to reasonably guarantee someone will be able to open and read a resume in its intended form.


From my view, once it's automatically parsed, nobody is going to look at it: either it'll be automatically thrown out based on the parsed data (in which case a blank resume is an obvious throwaway), or it'll do some keyword parsing and determine you haven't copy/pasted enough. The second one is a problem with all of those systems, though. After that point it'll pass it on to the hiring manager, at which point they'll look at it regardless of format. PDF, then, is simply not the best choice if it carries the highest chance of failure at the first step. A Word document is a much better choice.


> A Word document is a much better choice.

It really truly isn't. Anyone on any device can reliably open a PDF. That is not true for Word files.

Funny enough, this thread made me want to look at my old resumes. I have a .doc resume from 2009. You know what happens when I double-click it? Nothing. I don't have anything that can view it! Windows 10 doesn't come with a preview tool for doc files. Chrome/Firefox can't preview it either.


I feel you're missing my point here. A PDF is bad for the specific case of inputting a resumé into an applicant tracking system, due to the myriad of things that could go wrong without notifying the applicant of any error, discarding their application without a second glance by a human. A Word document makes the likelihood of system discard due to error much smaller. A docx file is just a zipped file, which contains a document.xml that can be easily read by any regular text editor. It's got all the style information in it, but it's at least more easily computer-parseable than a PDF. A doc file isn't a zip, though; it's an OLE2 compound file with a WordDocument stream inside. You can still dig the text out with a regular text editor, but it's surrounded by binary garbage. Still more easily parseable than a PDF.
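The docx-is-just-a-zip point is easy to verify with nothing but the standard library. This sketch builds a toy archive in memory using the real OOXML filename layout (the XML body is simplified for illustration) and reads the text back out:

```python
import io
import zipfile

# A .docx is an ordinary ZIP archive; its main text lives in word/document.xml.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:document><w:t>Hello</w:t></w:document>")

with zipfile.ZipFile(buf) as z:
    xml = z.read("word/document.xml").decode("utf-8")
print(xml)
```

Point the same two `zipfile` lines at a real .docx and you get the genuine document.xml, style soup and all.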


> I don't even know WHY any job application even accepts a PDF for a resumé.

I tried to send a resume in LaTeX. They said they want a Word document...


Try the API made by Textkernel. It works quite well.


.xls runs half of the world economy.


Not more so than .txt


And at the same time PDF Forms[1], animation[2] and 3D extensions are poorly supported in FOSS implementations - poppler, mupdf, etc.

[1] https://gitlab.freedesktop.org/poppler/poppler/issues/463

[2] https://gitlab.freedesktop.org/poppler/poppler/issues/683


The author must not realize how many businesses, government organizations, etc. have operations that are completely dependent on .xls files that someone first created in the late 90s.



For more info on the PDF format: [VIDEO] Programming data for display, the PDF Story by Chas Emerick - https://www.youtube.com/watch?v=MAki8C6qFHY&list=PLGRqfvsPiR...


Are there any people going around harvesting the data of all the PDFs in the world, using machine learning to clean, collate, and localize it?


Surely database file formats are at least as important to hold whatever data the World needs to go round.


Microsoft tried to compete with XPS (XML Paper Specification) around the mid-2000s but did not succeed.


Adobe Acrobat Reader has been atrocious. I'm so glad I switched to Sumatra this year.


I couldn't disagree more. I almost never use PDFs. I read a lot of books on my mobile phone and PDFs are basically unusable there due to their fixed layout.

To say in 2019 that a file format without mobile support is the most important is moronic.


Worse is better strikes again? I'm not talking about SGML


If they said "Most Important DOCUMENT File Format", maybe, though even then you have stuff like JPEG.

But arguably executable file formats (.exe, .so, ...) are more important overall.


Having some executable file is a prerequisite for opening any other!


To me, a PDF is only a digital version of a paper document. Just something to look at, present... nothing more. Data duplication and extraction always had to be done manually. I could never extract text correctly. So I don't anymore. I just accepted it. As I did with a lot of other things.


I guess if you wanna claim that, but it's a shit format that causes way more pain than convenience for most people. Just today I had to upload sensitive docs to a server to edit it because I can't locally and refuse to pay to do so. Fuck pdf.


It’s not designed to be edited.


Oh, so all those fillable PDFs are just hacks then, right?


Archiving without extracting is kind of one-way to me.

I prefer a long JSON file, which I can read from or write out to any file format I want, over a hard-to-extract format such as PDF.


> a hard-to-extract format such as PDF

What's hard about it? It's an open standard with libraries in every language known to man.

If you use it as a dumb wrapper for scanned images, that's going to suck. But as a way to store and faithfully reproduce nicely typeset documents with images - ie, to make an archival copy - I don't think it can possibly be beat.


I think the comment was about bulk extraction of data. For instance, it is pretty difficult to do bibliographic studies if all the articles you have are provided only as PDFs (even if it is text pdf). For starters, I do not think PDFs have a notion resembling a "paragraph of text", because every symbol is placed separately with its own unique coordinates.

Extracting tables of numbers from PDFs is also a pain.
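A toy illustration of the per-glyph-coordinates point: a content stream positions runs of text with Td/Tj operators and carries no notion of words or paragraphs. This sketch pulls (x, y, text) triples out of a tiny, already-uncompressed stream; a real extractor must also handle TJ arrays, font encodings, CID fonts, transformation matrices, and much more, which is where the pain comes from.

```python
import re

# A hand-written, uncompressed content stream fragment: two positioned text runs.
stream = b"""BT
/F1 12 Tf
72 720 Td (The quick) Tj
72 706 Td (brown fox) Tj
ET"""

# Recover the positioned runs; nothing says whether they form one sentence,
# two paragraphs, or a table -- the reader has to guess from geometry.
runs = re.findall(rb"([\d.]+) ([\d.]+) Td \((.*?)\) Tj", stream)
for x, y, text in runs:
    print(float(x), float(y), text.decode("latin-1"))
```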


Exactly. I have written software for legal document search. Most of what they get is in PDF and it’s a major PITA to get data out of them. Forget about tables. Just try to extract text without some garbled characters and you will lose your mind.



