String concatenation in Python

I had always been under the impression that since strings are immutable in Python, it would be inefficient to build a string incrementally with + or +=; I assumed that each concatenation would require the allocation of a new string that copies both sides of the addition. As of November 2019, this is even mentioned as a justification for avoiding this style in Google's Python style guide:

Avoid using the + and += operators to accumulate a string within a loop. Since strings are immutable, this creates unnecessary temporary objects and results in quadratic rather than linear running time. Instead, add each substring to a list and ''.join the list after the loop terminates (or, write each substring to an io.BytesIO buffer).

You can imagine my surprise when as part of a refactor of parts of dateutil's parser, one of my micro-benchmarks found that for my task, incrementally building a string was faster than joining a list of strings on all supported versions of Python!

My benchmarking results are reproducible with this minimal working example: [1]

import io

def string_join(n_components):
    """Allocate a list of strings and concatenate them with join."""
    str_components = []
    for i in range(n_components):
        str_components.append("%d" % i)

    return "".join(str_components)

def string_buffer(n_components):
    """Build a string by incrementally writing to a StringIO."""
    buffer = io.StringIO()    # Replaced with BytesIO in Python 2

    for i in range(n_components):
        buffer.write("%d" % i)

    return buffer.getvalue()

def string_concat(n_components):
    """Incrementally build a string with +="""
    str_out = ""
    for i in range(n_components):
        str_out += "%d" % i
    return str_out

Benchmarking this on my local machine with Python 2.7.15, Python 3.8.0, PyPy 2.7-7.1.1 and PyPy 3.6-7.1.1 [2], I get:

Python 2.7.15:

      n   join (μs)        buffer (μs)      concat (μs)
     10   4.20 (± 0.04)    4.84 (± 0.03)    3.70 (± 0.02)
    100   38.53 (± 0.08)   42.2 (± 0.3)     35.2 (± 0.1)
   1000   383 (± 3)        414 (± 2)        368 (± 4)
  10000   3980 (± 30)      4230 (± 30)      3720 (± 30)

Python 3.8.0:

      n   join (μs)        buffer (μs)      concat (μs)
     10   2.05 (± 0.03)    2.48 (± 0.03)    1.76 (± 0.02)
    100   16.8 (± 0.1)     18.5 (± 0.2)     16 (± 2)
   1000   172.7 (± 0.8)    184.2 (± 0.9)    165.9 (± 0.9)
  10000   1900 (± 20)      2110 (± 50)      1830 (± 20)

PyPy 2.7-7.1.1:

      n   join (μs)        buffer (μs)      concat (μs)
     10   0.46 (± 0.05)    0.9 (± 0.6)      0.4 (± 0.2)
    100   3.5 (± 0.3)      3.2 (± 0.4)      3.5 (± 0.5)
   1000   35.6 (± 0.3)     30.5 (± 0.6)     70 (± 1)
  10000   430 (± 90)       303 (± 4)        6700 (± 100)

PyPy 3.6-7.1.1:

      n   join (μs)        buffer (μs)      concat (μs)
     10   0.4 (± 0.1)      0.7 (± 0.2)      0.35 (± 0.09)
    100   4.1 (± 0.6)      8.5 (± 0.7)      3.7 (± 0.6)
   1000   46 (± 4)         120 (± 20)       78 (± 3)
  10000   500 (± 200)      2700 (± 200)     7100 (± 500)

As you can see, in this particular benchmark there isn't usually a huge difference in the performance of these three methods, and where there is, string concatenation is usually the fastest. At least on CPython, it is certainly not the case that building a string incrementally has quadratic running time, so what's going on?

Optimizing string concatenation

The naïve approach to concatenating immutable strings, which I assumed was the actual approach taken by the Python interpreter, would be to allocate a string of length len(s1) + len(s2) and then fill it with the characters from s1 and s2, leaving s1 and s2 alone. Using this method, repeatedly concatenating multiple strings to a single base would indeed have quadratic runtime, since every concatenation would start with copying over all the stuff you've already copied over.
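To make the quadratic behaviour concrete, here is a small sketch (my own illustration, not part of the benchmark) that counts how many characters the naïve strategy would copy, assuming every concatenation copies both operands into a fresh allocation:

```python
def naive_copy_cost(n_components):
    """Characters copied if every += allocated a brand-new string."""
    total_copied = 0
    length = 0  # length of the accumulated string so far
    for i in range(n_components):
        piece = len("%d" % i)
        # Copy both the accumulated string and the new piece
        # into the freshly allocated result.
        total_copied += length + piece
        length += piece
    return total_copied

print(naive_copy_cost(100), naive_copy_cost(1000))
```

Going from 100 to 1000 components multiplies the copy count by roughly 150, not 10: under this model the total work grows with the square of the final string length.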

However, the CPython interpreter is able to take advantage of reference counting to work a bit smarter than this. When Python performs a concatenation between two strings, it calls unicode_concatenate, which checks whether the assignment you're about to perform would allow the left-hand operand to be freed (i.e. nothing else references it); if so, it simply resizes the existing string in place rather than allocating a whole new string and copying the old one over. This is actually a fairly robust optimization: even if something holds a reference to the original string that you're appending to, only the first append operation will cause a new string to be allocated, and all other intermediate forms will be avoided. It is very unlikely that you would accidentally run into a corner case that causes many string allocations. [3]
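You can glimpse this from Python itself, with the caveat that object identity across a resize is an implementation detail (a realloc can still move the buffer), so treat this as an illustrative sketch rather than a guarantee:

```python
s = "spam"
s += "!"          # nothing else references s, so CPython may grow it in place
before = id(s)
s += "eggs"       # still only one reference; eligible for in-place resize
maybe_in_place = (id(s) == before)   # frequently True on CPython, not on PyPy

alias = s         # a second reference now exists...
s += "bacon"      # ...so this += must allocate a new string
print(alias)      # the alias keeps the old value: "spam!eggs"
```

The last two lines show why the optimization is safe: the moment a second reference exists, CPython falls back to allocating a fresh string, so no one can ever observe a string mutating.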

PyPy does not use reference counting, so it cannot implement this particular optimization; [4] [5] however, for strings of up to a few thousand characters, the differences in performance are minor; for very large strings (millions of characters, or tens of thousands of concatenations), joining a large list does indeed seem much faster. I will note that in my original thread I had assumed that PyPy implemented a similar optimization, because even with the additional string allocations, simple concatenation was still faster for my benchmarks.


As usual in discussions of performance, I think the lesson here is that one should not prematurely optimize. Write the code that is most readable, and if you are worried about performance, measure it. Operating under the misapprehension that concatenation operations would be slow, I refactored my application into something slower.

I think that in many cases, the concatenation-based code is more readable, since it's obvious at each step that your intention is to build up a string, but that is likely a matter of taste; despite my preference, I would definitely still consider it idiomatic Python to build a string by joining a list of strings.

There are situations where you might want to avoid string concatenation (e.g. using PyPy to incrementally build up huge strings), but I would contend that they are rare, and if performance is causing you a problem, it should be easy enough to pinpoint it with cProfile or another profiling tool.


[1] I have moved the full code for the benchmark into a separate page to avoid cluttering this post with how I did the timings. Note that the functions and the timing code are designed to be backwards compatible with Python 2.7, to make it easier to use the same script on all versions.
[2] I did not see any significant differences between 3.6, 3.7 and 3.8, so I have reported a single "Python 3" number as 3.8.
[3] The most plausible scenario I've been able to come up with that would prevent this optimization from occurring would be if someone were building a string with recursive function calls, since the earlier frames of the call stack would hold references to the intermediate forms. Needless to say, you should probably not write your Python string concatenations with recursive function calls.
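For the curious, a sketch of what such a (contrived) recursive builder might look like — each caller's frame keeps a reference to the intermediate string, so its reference count is greater than one and CPython cannot grow it in place:

```python
def build_recursive(i, n, acc=""):
    # The caller's frame still references acc, so every concatenation
    # below must allocate a fresh string rather than resizing in place.
    if i == n:
        return acc
    return build_recursive(i + 1, n, acc + ("%d" % i))

print(build_recursive(0, 5))  # → "01234"
```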
[4] Hat tip to Alex Gaynor for pointing this out.
[5] Though I'm sure that if someone can figure out a way to teach their compiler to optimize away the additional string allocations, they would be happy to accept such a contribution.