So, does this mean there is no longer a good reason for fixed-width text fields ...

mytherin · on May 3, 2019

Indeed, that is the reason character fields have traditionally had a length in RDBMS. However, many of these optimizations get broken by unicode support.

Consider a CHAR(1) field. Without unicode support this field is always 1 byte, and therefore easy to store. With unicode support, however, the value that fits in this field is actually between 1-4 bytes. However, in many cases users will only store 1 byte in this field. Storing this field as a fixed length 4-byte field will in many cases result in a 2-4X increase in size and hence waste a lot of space. Thus even with CHAR(1) it is usually better to store the strings in a variable-length manner to avoid wasting space; and CHAR(1) is the absolute best case for this type of optimisation. This gets way worse with larger CHAR lengths.

It's also worth noting that even if we could store a fixed one byte that that is likely still not the best way of storing a CHAR(1) field. A CHAR(1) field is typically used to store types or categories. Typically, a column like this would have few unique values (~2-8 values), which means we can reduce the size of these values to 1-4 bits using dictionary compression instead of a single byte, resulting in a size savings of factor 2-8X over storing a fixed one byte length.

bunderbunder · on May 3, 2019

FWIW, I am guessing it's a rare case that you can be confident that all strings you'll want to store are of a specific length, but cannot predict the set of characters they may use.

The use cases for CHAR instead of VARCHAR that I'm thinking of would be things like serial numbers. CHAR(1) seems even more specific; the only time I've ever seen that used is for some version of enumerations with mnemonic keys.

derefr · on May 3, 2019

> things like serial numbers

In Postgres, if I had a truly massive number of alphabetic-textual serial numbers (such that I was worried about their impact on storage/memory size), I’d define a new TYPE, similar to the UUID type, where such serial numbers would be parsed into regular uint64s (BIGINTs) on input, and generated back to their textual serial-number representation on output.

shawnz · on May 3, 2019

I don't think the problem is so much about predicting the encoding that will be used. Let's say you KNOW that non-ascii characters will be used in the field, but only very occasionally. That means the field has to be 4x as wide as it would have been otherwise even if non-ascii characters will almost never be used.

Retra · on May 3, 2019

If you knew the characters were ASCII, you wouldn't want unicode support in the first place. It's pure overhead.

kbenson · on May 3, 2019

But you also wouldn't want char(n), because as the article notes:

char(n) doesn't reject values that are too short, it just silently pads them with spaces. So there's no actual benefit over using text with a constraint that checks for the exact length.

So even if you're using pure ASCII, you're supposedly better off just setting the character encoding appropriately and using a text or non-limited varchar field.

That said, I thought the main benefit of a char field (and its padding) was that you could use it in select instances where is was useful to have specific length records. E.g. if all the other fields of the table are fixed length, every record can be fixed length if you use a char(n). I'm not sure if that's a benefit or not to modern databases anymore though.

bunderbunder · on May 3, 2019

Well, it depends.

Some RDBMS's don't store the string's length with each value for fixed-width character fields. So a fixed-length field may still be preferable if the space lost to padding amounts to less than the space lost to overhead.

That distinction doesn't apply to PostgreSQL, because it stores a length along with the value even for fixed-with columns.

mhd · on May 3, 2019

The problem is that even with serial numbers things aren't as fixed. You might have a different type of serial number, zip code, etc.. A check constraint can be expanded to do the appropriate length and formatting checks for most types. A strict length restriction just does one simple thing.

Same with name and address fields. I've seen databases where they set it at an seemingly arbitrary number, because that's the maximum length their delivery service could handle (UPS etc.), but then they added another one with different constraints…

Arnt · on May 3, 2019

The only performance cost is a worst-case cost. If your idea of performance is something like how many rows you can read/write/process per second, then the average length matters much more than the maximum observed length or the maximum permitted length.

hobs · on May 3, 2019

As far as I know - it depends on how the optimizer/planner works as well (at least in SQL Server.)

If you have a known fixed width you can be certain about your memory grants required for a specific query (width * rows estimate) - if you don't know that ahead of time you might find you need to allocate more memory than is actually required.

This can be really bad for concurrency.

derefr · on May 3, 2019

Postgres uses TOAST tables, an abstraction which allows PG rows to have unbounded lengths. Essentially, this works like a filesystem that supports extended attributes in both “small=intrusive” and “large=external extent” modes. When you have unbounded-length fields in your row-type definition, you end up with

• one “regular” PG table which has a row record type defined with a fixed length (like a filesystem inode), where all the static-length fields are held in the record itself, and any unbounded-length fields have a certain amount of space reserved for them in case they fit into that small space (if they’re even smaller, this space is just wasted);

• and one TOAST table—essentially a table of extents, that the unbounded-length fields of the previous record can, instead of holding data directly, hold pointers into.

TOAST for such columns only gets used when the column’s reserved size within its record gets overflowed; which I believe, for current PG versions, is 127 bytes. So, any row-type where all the fields are bounded to a length of at most 127 (either by literally using VARCHAR(127), or CHAR(127), or by using a TEXT column with a CHECK(length(col) < 128) constraint)—will never use TOAST, thus guaranteeing your stride width and memory bounds.

And, as well, any query that doesn’t try to “unwrap” the value of those unbounded-length columns, won’t have to join against the table’s TOAST table. So you can optimize for the memory bounds of specific queries on a table (by bounding the length of certain fields, keeping them out of TOAST) while leaving other fields as unbounded-length, if no hot-path query uses them.

(There’s also an extra optimization where values that are just barely too large to fit into the 127 bytes of reserved in-record space, are transparently compressed, to see if that’ll make them fit. Thus, although you need the length-check to guarantee that you won’t be using TOAST, it may turn out that your table “gets away with” having some values a bit larger than the bound without ever using TOAST.)

hobs · on May 3, 2019

Excellent answer - thank you. I did some googling, and if you have any specific resources I would love to read more (I was just reading https://www.postgresql.org/docs/11/planner-optimizer.html)

A question - if the planner knows its joining to the TOAST table, and it has some predicate to evaulate, how does it know all the widths of all the rows so it doesnt over-grant(or are memory grants not a thing?)

Generally with statistics/histograms you get a sample and a max width (or something like that) - so I am just confused how it can be precise.

Thanks!

hobs · on May 10, 2019

After many days of googling, it looks like postgres doesnt have this problem because it also cant use dynamic memory grants to get better performance on a per-query basis like SQL Server does.

So while you never suffer from "under" resourced queries, you generally dont get to see what they would look like if they had more memory.

In the SQL Server world that memory grant is part of the query plan, so that's why its a question that occurred to me.

Arnt · on May 3, 2019

Happily, the PG planner isn't thrown off by such things, at least in my experience. It has sample rows and uses them. (I really like the postgres planner, it's the best I've used.)

I've seen it be FAR wrong about memory, though, despite being right about column widths: When you join a few tables in a big select, the planner will look at some sample rows and guess the number of rows in each step of the possible plan, and if the sample is too small, that estimate can be wrong by a LARGE factor.

colanderman · on May 3, 2019

Use a CHECK statement to limit the length. Changing varchar length doesn't play well with dependent views.