Best way to convert string to bytes in Python 3?
PythonStringCharacter EncodingPython 3.xPython Problem Overview
https://stackoverflow.com/questions/5471158/typeerror-str-does-not-support-the-buffer-interface suggests two possible methods to convert a string to bytes:
b = bytes(mystring, 'utf-8')
b = mystring.encode('utf-8')
Which method is more Pythonic?
Python Solutions
Solution 1 - Python
If you look at the docs for bytes
, it points you to bytearray
:
> bytearray([source[, encoding[, errors]]]) > >Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods. > >The optional source parameter can be used to initialize the array in a few different ways: > >If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode(). > >If it is an integer, the array will have that size and will be initialized with null bytes. > >If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array. > >If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array. > > Without an argument, an array of size 0 is created.
So bytes
can do much more than just encode a string. It's Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.
For encoding a string, I think that some_string.encode(encoding)
is more Pythonic than using the constructor, because it is the most self documenting -- "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding)
-- there is no explicit verb when you use the constructor.
I checked the Python source. If you pass a unicode string to bytes
using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode
; so you're just skipping a level of indirection if you call encode
yourself.
Also, see Serdalis' comment -- unicode_string.encode(encoding)
is also more Pythonic because its inverse is byte_string.decode(encoding)
and symmetry is nice.
Solution 2 - Python
It's easier than it is thought:
my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation
Solution 3 - Python
The absolutely best way is neither of the 2, but the 3rd. The first parameter to encode
defaults to 'utf-8'
ever since Python 3.0. Thus the best way is
b = mystring.encode()
This will also be faster, because the default argument results not in the string "utf-8"
in the C code, but NULL
, which is much faster to check!
Here be some timings:
In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop
In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop
Despite the warning the times were very stable after repeated runs - the deviation was just ~2 per cent.
Using encode()
without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.
>>> 'äöä'.encode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Solution 4 - Python
Answer for a slightly different problem:
You have a sequence of raw unicode that was saved into a str variable:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You need to be able to get the byte literal of that unicode (for struct.unpack(), etc.)
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Solution:
s_new: bytes = bytes(s, encoding="raw_unicode_escape")
Reference (scroll up for standard encodings):
Solution 5 - Python
How about the Python 3 'memoryview' way.
Memoryview is a sort of mishmash of the byte/bytearray and struct modules, with several benefits.
- Not limited to just text and bytes, handles 16 and 32 bit words too
- Copes with endianness
- Provides a very low overhead interface to linked C/C++ functions and data
Simplest example, for a byte array:
memoryview(b"some bytes").tolist()
[115, 111, 109, 101, 32, 98, 121, 116, 101, 115]
Or for a unicode string, (which is converted to a byte array)
memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).tolist()
[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]
#Another way to do the same
memoryview("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020".encode("UTF-16")).tolist()
[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]
Perhaps you need words rather than bytes?
memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).cast("H").tolist()
[65279, 117, 110, 105, 99, 111, 100, 101, 32]
memoryview(b"some more data").cast("L").tolist()
[1701670771, 1869422624, 538994034, 1635017060]
Word of caution. Be careful of multiple interpretations of byte order with data of more than one byte:
txt = "\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020"
for order in ("", "BE", "LE"):
mv = memoryview(bytes(txt, f"UTF-16{order}"))
print(mv.cast("H").tolist())
[65279, 117, 110, 105, 99, 111, 100, 101, 32]
[29952, 28160, 26880, 25344, 28416, 25600, 25856, 8192]
[117, 110, 105, 99, 111, 100, 101, 32]
Not sure if that's intentional or a bug but it caught me out!!
The example used UTF-16, for a full list of codecs see Codec registry in Python 3.10