Don't name strings (in Python)

Daniel Pope

A recurring conversation with my collegues has been around whether to extract string literals to the top of a Python module as global constants. I often see people writing Python modules that include hundreds of lines of constant string definitions, like this:

...
DASHBOARD_COLUMN_NAME = 'name'
DASHBOARD_COLUMN_PRICE = 'price'

SUMMARY_BREAK_LABEL = 'breakLabel'
BREAK_LABEL_BREAK = 'Break'
BREAK_LABEL_MATCH = 'Match'
...

Naturally we want to avoid repetition and improve maintainability, but I've found myself at loggerheads with people who claim that defining literals as constants is invariably best practice. I see it as potentially damaging to the readability of Python code. The problem is that the "maintainability" afforded by these literals being defined in one place comes at a cost of indirection that actually does make the code harder to read, understand and debug.

The older chunks of our application are written in code that looks like this:

breakPrice = abs(sourceRow[DASHBOARD_COLUMN_PRICE] - targetRow[DASHBOARD_COLUMN_PRICE]) > BREAK_THRESHOLD
breakDict[SUMMARY_BREAK_LABEL] = BREAK_LABEL_BREAK if breakPrice else BREAK_LABEL_MATCH

(This is what our code typically looks like; please don't pay too much attention to other questionable style choices like naming conventions).

When I joined the team I advocated for inlining the constants, and in some cases I've written scripts to automatically do so:

breakPrice = abs(sourceRow['price'] - targetRow['price']) > BREAK_THRESHOLD
valueMap['breakLabel'] = 'Break' if breakPrice else 'Match'

Most people seem to agree this reads better. The code is shorter; you get richer syntax highlighting that helps you visually parse the code; and without the indirection of having to look up names somewhere else, the semantics stand out immediately.

It's also easier to debug: runtime values will match what the code says.

Finally, it will also run faster, because no name lookups are required. In Python, if you use a (non-local) variable name, it's a string, just like a literal - but it's a string that has to be looked up in some namespace dictionary to find the actual value.

There are a couple of argument that are often raised:

What if we need to change the values?

Sure, there's a bit more work, but code is read more often than it is written. Anyway, do you really want to use names that don't reflect their contents (eg. BREAK_LABEL_BREAK = 'Mismatch')?

Pyflakes will warn if a constant is misspelled / We prefer to fail fast at runtime with a ``NameError`` if a constant is mispelled

There's some weight to this, but we test our code properly, and typos are trivial to fix compared to semantic problems due to misreading of code.

In practice, the names are shorter so there are fewer characters to mistype. I can honestly say it just doesn't catch me out all that often.

On Naming

To take a step back a bit, we should realise that in human semantic terms, when we define a global constant, we are not just moving a definition around in our codebase, we are assigning a name to a value.

For non-string values, that's an incredibly important thing to do. If I see an integer literal in a function (say 30), I can't immediately infer its semantics without analysing the ways in which it is used. If I see POLL_INTERVAL, I can immediately start to guess what the code I'm reading is doing. Textual strings, on the other hand, already include semantics: they are already words.

If we use a literal, the value is not named and therefore remains anonymous. You might want to do this because the value is not important enough to deserve a name, or it is self-descriptive enough that an additional name would add rather than remove complexity.

Once we have this view of naming, we can understand a bit better when it might be valuable to name a value. For example, we might want to name a value to indicate additional semantics. I might even mix and match the literal version and the symbolic name on the same line, because the semantics of the two are different:

JOIN_COLUMN = 'breakLabel'
...
    report.select(['breakLabel', 'name', 'price']).join(JOIN_COLUMN, prev_report)

Rules of thumb

Define a global string constant...

To convey additional semantics.
If it's longer than a couple of words and is used more than once.
If it will be typed often and it's easy to typo (eg. it contains symbols).
For values that are likely to change (say, a reasonable chance of such a requirement arising within the next 6 months).

...and try to minimise the distance between declaring a constant and using it.

Use inline string literals...

If they are unlikely to change (say, little chance of a change within 12 months).
If they are only used once.
If they are used in multiple places, but if they were to change they would probably change in different ways (for example, you have a couple of different dialog boxes that happen to share the same title).
If they are already names for something (a dictionary key, a database column, etc).
If they are already self-descriptive in another way. For example, we can see the comma character ',' is a comma; it doesn't need to be named COMMA.

...and try to minimise repetition through good code structure.

On Naming

Rules of thumb

Comments