What is the max length of a Python string?


Python Problem Overview


If it is environment-independent, what is the theoretical maximum number of characters in a Python string?

Python Solutions


Solution 1 - Python

With a 64-bit Python installation, and (say) 64 GB of memory, a Python 2 string of around 63 GB should be quite feasible (if not maximally fast). If you can upgrade your memory much beyond that (which will cost you an arm and a leg, of course), your maximum feasible strings should get proportionally longer. (I don't recommend relying on virtual memory to extend that by much, or your runtimes will get simply ridiculous;-).

With a typical 32-bit Python installation, of course, the total memory you can use in your application is limited to something like 2 or 3 GB (depending on OS and configuration), so the longest strings you can use will be much smaller than in 64-bit installations with ridiculously high amounts of RAM.
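As a quick, POSIX-only sketch (my naming, not part of this answer), you can estimate the physical RAM on the machine, and therefore roughly how long a feasible string could get:

```python
import os

# POSIX-only: total physical memory = page size * number of pages.
ram_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
print("~%.1f GiB of RAM" % (ram_bytes / 2.0**30))
# On a 64-bit build, a 1-byte-per-character (ASCII) str could approach
# this length, minus whatever the OS and the interpreter itself use.
```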

Solution 2 - Python

I ran this code on an EC2 instance.

def create1k():
    s = ""
    for i in range(1024):
        s += '*'
    return s

def create1m():
    s = ""
    x = create1k()
    for i in range(1024):
        s += x
    return s

def create1g():
    s = ""
    x = create1m()
    for i in range(1024):
        s += x
    return s

print("begin")
s = ""
x = create1g()
for i in range(1024):
    s += x
    print(str(i) + "g ok")
    print(str(len(s)) + ' bytes')

and this is the output:

[ec2-user@ip-10-0-0-168 ~]$ time python hog.py 
begin
0g ok
1073741824 bytes
1g ok
2147483648 bytes
2g ok
3221225472 bytes
3g ok
4294967296 bytes
4g ok
5368709120 bytes
5g ok
6442450944 bytes
6g ok
7516192768 bytes
7g ok
8589934592 bytes
8g ok
9663676416 bytes
9g ok
10737418240 bytes
10g ok
11811160064 bytes
11g ok
12884901888 bytes
12g ok
13958643712 bytes
13g ok
15032385536 bytes
14g ok
16106127360 bytes
15g ok
17179869184 bytes
16g ok
18253611008 bytes
17g ok
19327352832 bytes
18g ok
20401094656 bytes
19g ok
21474836480 bytes
20g ok
22548578304 bytes
21g ok
23622320128 bytes
22g ok
24696061952 bytes
23g ok
25769803776 bytes
24g ok
26843545600 bytes
25g ok
27917287424 bytes
26g ok
28991029248 bytes
27g ok
30064771072 bytes
28g ok
31138512896 bytes
29g ok
32212254720 bytes
30g ok
33285996544 bytes
31g ok
34359738368 bytes
32g ok
35433480192 bytes
33g ok
36507222016 bytes
34g ok
37580963840 bytes
35g ok
38654705664 bytes
36g ok
39728447488 bytes
37g ok
40802189312 bytes
38g ok
41875931136 bytes
39g ok
42949672960 bytes
40g ok
44023414784 bytes
41g ok
45097156608 bytes
42g ok
46170898432 bytes
43g ok
47244640256 bytes
44g ok
48318382080 bytes
45g ok
49392123904 bytes
46g ok
50465865728 bytes
47g ok
51539607552 bytes
48g ok
52613349376 bytes
49g ok
53687091200 bytes
50g ok
54760833024 bytes
51g ok
55834574848 bytes
52g ok
56908316672 bytes
53g ok
57982058496 bytes
54g ok
59055800320 bytes
55g ok
60129542144 bytes
56g ok
61203283968 bytes
57g ok
62277025792 bytes
58g ok
63350767616 bytes
59g ok
64424509440 bytes
60g ok
65498251264 bytes
61g ok
66571993088 bytes
62g ok
67645734912 bytes
63g ok
68719476736 bytes
64g ok
69793218560 bytes
65g ok
70866960384 bytes
66g ok
71940702208 bytes
67g ok
73014444032 bytes
68g ok
74088185856 bytes
69g ok
75161927680 bytes
70g ok
76235669504 bytes
71g ok
77309411328 bytes
72g ok
78383153152 bytes
73g ok
79456894976 bytes
74g ok
80530636800 bytes
75g ok
81604378624 bytes
76g ok
82678120448 bytes
77g ok
83751862272 bytes
78g ok
84825604096 bytes
79g ok
85899345920 bytes
80g ok
86973087744 bytes
81g ok
88046829568 bytes
82g ok
89120571392 bytes
83g ok
90194313216 bytes
84g ok
91268055040 bytes
85g ok
92341796864 bytes
86g ok
93415538688 bytes
87g ok
94489280512 bytes
88g ok
95563022336 bytes
89g ok
96636764160 bytes
90g ok
97710505984 bytes
91g ok
98784247808 bytes
92g ok
99857989632 bytes
93g ok
100931731456 bytes
94g ok
102005473280 bytes
95g ok
103079215104 bytes
96g ok
104152956928 bytes
97g ok
105226698752 bytes
98g ok
106300440576 bytes
99g ok
107374182400 bytes
100g ok
108447924224 bytes
101g ok
109521666048 bytes
102g ok
110595407872 bytes
103g ok
111669149696 bytes
104g ok
112742891520 bytes
105g ok
113816633344 bytes
106g ok
114890375168 bytes
107g ok
115964116992 bytes
108g ok
117037858816 bytes
109g ok
118111600640 bytes
110g ok
119185342464 bytes
111g ok
120259084288 bytes
112g ok
121332826112 bytes
113g ok
122406567936 bytes
114g ok
123480309760 bytes
115g ok
124554051584 bytes
116g ok
125627793408 bytes
Traceback (most recent call last):
  File "hog.py", line 25, in <module>
    s += x
MemoryError

real	1m10.509s
user	0m16.184s
sys	0m54.320s

MemoryError after 116 GB.

[ec2-user@ip-10-0-0-168 ~]$ python --version
Python 2.7.12

[ec2-user@ip-10-0-0-168 ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:        122953        430     122522          0         11        113
-/+ buffers/cache:        304     122648
Swap:            0          0          0

Tested on an EC2 r3.4xlarge instance running the 64-bit Amazon Linux AMI 2016.09.

Short answer: if you have over 100 GB of RAM, a single Python string can use up that much memory.
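As an aside (my own sketch, not part of the original answer): building the blocks with repeated `+=` works, but string repetition allocates each result in a single step and is much faster:

```python
# Equivalent to create1k()/create1m() above, without the loops.
one_k = '*' * 1024        # 1 KiB in a single allocation
one_m = one_k * 1024      # 1 MiB
print(len(one_m))         # 1048576
# one_g = one_m * 1024 would likewise give 1 GiB in one step.
```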

Solution 3 - Python

About 9 quintillion characters, on a 64-bit system, on CPython 3.10.

That's only if your string is made up of only ASCII characters. The max length can be smaller depending on what characters the string contains due to the way CPython implements strings:

  • 9,223,372,036,854,775,758 characters if your string only has ASCII characters (U+00 to U+7F) or
  • 9,223,372,036,854,775,734 characters if your string only has ASCII characters and characters from the Latin-1 Supplement Unicode block (U+80 to U+FF) or
  • 4,611,686,018,427,387,866 characters if your string only contains characters in the Basic Multilingual Plane (for example if it contains Cyrillic letters but no emojis, i.e. U+0100 to U+FFFF) or
  • 2,305,843,009,213,693,932 characters if your string might contain at least one emoji (more formally, if it can contain a character outside the Basic Multilingual Plane, i.e. U+10000 and above)

On a 32-bit system it's around 2 billion or 500 million characters. If you don't know whether you're using a 64-bit or a 32-bit system, or what that means, you're probably using a 64-bit system.
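If you're unsure, `sys.maxsize` tells you directly (a standard check, not from the original answer): it is 2⁶³ - 1 on a 64-bit build and 2³¹ - 1 on a 32-bit one.

```python
import sys

# sys.maxsize is the largest Py_ssize_t value, so it also reveals
# whether this interpreter was built for 32-bit or 64-bit pointers.
print("64-bit" if sys.maxsize > 2**32 else "32-bit")
print(sys.maxsize)  # 9223372036854775807 on a 64-bit build
```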


Python strings are length-prefixed, so their length is limited by the size of the integer holding their length and the amount of memory available on your system. Since PEP 353, Python uses Py_ssize_t as the data type for storing container length. Py_ssize_t is defined as the same size as the compiler's size_t but signed. On a 64-bit system, size_t is 64 bits wide. One bit for the sign leaves 63 bits for the actual quantity, so CPython strings cannot be larger than 2⁶³ - 1 bytes, around 9 million TB (8 EiB). That much RAM would cost around 40 billion dollars, multiplying today's price of roughly $4/GB by the roughly 9 billion GB involved. On 32-bit systems (which are rare these days), the limit is 2³¹ - 1 bytes, just under 2 GiB.

CPython will use 1, 2 or 4 bytes per character, depending on how many bytes it needs to encode the "longest" character in your string. So for example if you have a string like 'aaaaaaaaa', the a's each take 1 byte to store, but if you have a string like 'aaaaaaaaa😀' then all the a's will now take 4 bytes each. 1-byte-per-character strings also use either 48 or 72 bytes of metadata, and 2- or 4-byte-per-character strings take 72 bytes of metadata. Each string also has an extra character at the end for a terminating null, so the empty string is actually 49 bytes.
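A quick way to see the width switch in action is `sys.getsizeof` (exact byte counts vary between CPython versions, so only the relative sizes are asserted here):

```python
import sys

ascii_s = 'a' * 10
emoji_s = 'a' * 10 + '😀'  # one non-BMP character forces 4 bytes/char

# emoji_s is only one character longer, yet every one of its
# characters now occupies 4 bytes instead of 1.
print(sys.getsizeof(ascii_s))
print(sys.getsizeof(emoji_s))
```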

When you allocate a string with PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar) (see docs) in CPython, it performs this check:

    /* Ensure we won't overflow the size. */
    // [...]
    if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
        return PyErr_NoMemory();

Where PY_SSIZE_T_MAX is

/* Largest positive value of type Py_ssize_t. */
#define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))

which casts -1 to size_t (a type defined by the C compiler; a 64-bit unsigned integer on a 64-bit system), causing it to wrap around to its largest possible value, 2⁶⁴ - 1. It then right-shifts that by 1 (so the sign bit is 0), giving 2⁶³ - 1, and casts the result to Py_ssize_t.
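The same wrap-and-shift can be reproduced with Python integers (a sketch of the C arithmetic, not actual CPython code):

```python
# (size_t)-1 on a 64-bit system: -1 wraps around to the all-ones value.
all_ones = (-1) % 2**64           # 2**64 - 1
py_ssize_t_max = all_ones >> 1    # drop the sign bit: 2**63 - 1
print(py_ssize_t_max)             # 9223372036854775807
```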

struct_size is just the overhead of the str object's metadata, either 48 or 72 bytes; it's set earlier in the function:

    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        // [...]
        struct_size = sizeof(PyASCIIObject);
    }

and char_size is either 1, 2 or 4, so we have:

>>> ((2**63 - 1) - 72) // 4 - 1
2305843009213693932

There's of course the possibility that Python strings are practically limited by some other part of Python that I don't know about, but you should be able to at least allocate a new string of that size, assuming you can get your hands on 9 exabytes of RAM.
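Plugging both struct sizes and all three character widths into the same formula reproduces every limit in the list at the top of this answer (plain Python arithmetic mirroring the C check; the 48- and 72-byte overheads are as described above and may differ in other CPython versions):

```python
PY_SSIZE_T_MAX = 2**63 - 1  # 64-bit build

def max_chars(struct_size, char_size):
    # Mirrors CPython's overflow check in PyUnicode_New().
    return (PY_SSIZE_T_MAX - struct_size) // char_size - 1

print(max_chars(48, 1))  # ASCII only:          9223372036854775758
print(max_chars(72, 1))  # Latin-1 Supplement:  9223372036854775734
print(max_chars(72, 2))  # BMP:                 4611686018427387866
print(max_chars(72, 4))  # beyond the BMP:      2305843009213693932
```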

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: blippy (View Question on Stackoverflow)
Solution 1 - Python: Alex Martelli (View Answer on Stackoverflow)
Solution 2 - Python: Titi Wangsa bin Damhore (View Answer on Stackoverflow)
Solution 3 - Python: Boris Verkhovskiy (View Answer on Stackoverflow)