Weird behaviour initializing a numpy array of string data


Python Problem Overview


I am having some seemingly trivial trouble with numpy when the array contains string data. I have the following code:

import numpy

my_array = numpy.empty([1, 2], dtype=str)
my_array[0, 0] = "Cat"
my_array[0, 1] = "Apple"

Now, when I print it with print my_array[0, :], the response I get is ['C', 'A'], which is clearly not the expected output of Cat and Apple. Why is that, and how can I get the right output?

Thanks!

Python Solutions


Solution 1 - Python

Numpy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, it sets this maximum length to 1 by default. You can see this by checking my_array.dtype: it shows "|S1" (or "<U1" on Python 3), meaning "one-character string". Subsequent assignments into the array are truncated to fit this structure.
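As a small illustration of the truncation described above (a sketch assuming Python 3 and a reasonably recent numpy), the default width can be inspected directly:

```python
import numpy

# dtype=str with no explicit length yields a one-character string type.
my_array = numpy.empty([1, 2], dtype=str)
my_array[0, 0] = "Cat"    # silently truncated to "C"
my_array[0, 1] = "Apple"  # silently truncated to "A"

print(my_array.dtype)     # the one-character string dtype
stored = my_array[0].tolist()
print(stored)             # ['C', 'A']
```

No warning or error is raised on assignment, which is why the problem is easy to miss.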

You can pass an explicit datatype with your maximum length by doing, e.g.:

my_array = numpy.empty([1, 2], dtype="S10")

The "S10" creates an array of strings that are at most 10 characters long. You have to decide in advance how long is long enough to hold all the data you want to store.

Solution 2 - Python

I got a "codec error" when I tried to store a non-ASCII character with dtype="S10".

You also get an array of byte strings rather than text strings, which confused me.

I think it is better to use:

my_array = numpy.empty([1, 2], dtype="<U10")

Here '<U10' translates to "Unicode string of length 10, little-endian format".
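To make the difference between the two dtypes concrete, here is a minimal sketch, assuming Python 3. The variable names byte_array and text_array are only illustrative:

```python
import numpy

# "S10": fixed-width byte strings; ASCII text is stored as bytes.
byte_array = numpy.empty([1, 2], dtype="S10")
byte_array[0, 0] = "Cat"
byte_array[0, 1] = "Apple"

# "<U10": fixed-width Unicode strings; non-ASCII text is fine.
text_array = numpy.empty([1, 2], dtype="<U10")
text_array[0, 0] = "Cat"
text_array[0, 1] = "Äpfel"

byte_items = byte_array[0].tolist()
text_items = text_array[0].tolist()
print(byte_items)  # [b'Cat', b'Apple']
print(text_items)  # ['Cat', 'Äpfel']
```

Both dtypes still truncate anything longer than 10 characters.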

Solution 3 - Python

The numpy string array is limited by its fixed length (length 1 by default). If you're unsure what length you'll need for your strings in advance, you can use dtype=object and get arbitrary length strings for your data elements:

my_array = numpy.empty([1, 2], dtype=object)

I understand there may be efficiency drawbacks to this approach, but I don't have a good reference to support that.
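A short sketch of the object-dtype approach, showing that strings of different lengths are kept intact (the sample strings are made up for illustration):

```python
import numpy

# Each element holds a reference to an ordinary Python object,
# so there is no fixed string width to truncate against.
my_array = numpy.empty([1, 2], dtype=object)
my_array[0, 0] = "Cat"
my_array[0, 1] = "A considerably longer string than Cat"

lengths = [len(s) for s in my_array[0]]
print(my_array[0, 1])
```

The elements are plain str objects, so the usual string methods work on them one element at a time.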

Solution 4 - Python

Another alternative is to initialize as follows:

import numpy as np
my_array = np.array([["CAT", "APPLE"], ["", ""]], dtype=str)

In other words, first you write a regular nested list with the contents you want, then you turn it into a numpy array. However, this fixes the maximum string length to the length of the longest string at initialization. So if you were to add

my_array[1,0] = 'PINEAPPLE'

then the string stored would be 'PINEA'.
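The inferred width and the resulting truncation can be checked as follows. The astype call at the end is not part of the answer above; it is just one possible way to widen the array afterwards, shown as an assumption-laden sketch:

```python
import numpy as np

my_array = np.array([["CAT", "APPLE"], ["", ""]], dtype=str)

# Characters per element, derived from the byte size of the dtype.
width = my_array.dtype.itemsize // np.dtype("U1").itemsize
print(width)                   # 5, the length of "APPLE"

my_array[1, 0] = "PINEAPPLE"
truncated = my_array[1, 0]
print(truncated)               # PINEA

# astype returns a copy with a wider string dtype (9 characters here).
wider = my_array.astype("U9")
wider[1, 0] = "PINEAPPLE"
restored = wider[1, 0]
print(restored)                # PINEAPPLE
```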

Solution 5 - Python

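If you are building the values in a for loop, what works best is to collect them in a list first (for example with a list comprehension) and then convert the result, so that numpy can allocate the right amount of memory for the strings.

data = ['CAT', 'APPLE', 'CARROT']
my_array = numpy.array([name for name in data])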
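Note that a list comprehension on its own yields a plain Python list. Passing the collected values to numpy.array is one way to end up with an array whose string width fits the longest entry. A minimal sketch, where the lower() call merely stands in for whatever per-item work the loop does:

```python
import numpy

data = ['CAT', 'APPLE', 'CARROT']

# Collect results in an ordinary list while looping...
names = []
for name in data:
    names.append(name.lower())  # placeholder for real per-item work

# ...then convert once; numpy sizes the dtype to the longest string.
my_array = numpy.array(names)
print(my_array.dtype)           # a 6-character Unicode dtype
print(my_array.tolist())        # ['cat', 'apple', 'carrot']
```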

Solution 6 - Python

In case anyone new lands here, there is another way to do this; it just needs a little more work:

my_array = np.full([1, 2], "", dtype=object)

Use np.full instead of np.empty, and fill the array with an empty string (the dtype is object). Note that the np.object alias was removed in NumPy 1.24, so use the builtin object instead.
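The practical difference between the two constructors for an object array can be sketched like this (the names empty_obj and filled are illustrative only):

```python
import numpy as np

# With dtype=object, np.empty initializes every element to None...
empty_obj = np.empty([1, 2], dtype=object)
starts_as_none = empty_obj[0, 0] is None

# ...whereas np.full lets every element start out as an empty string.
filled = np.full([1, 2], "", dtype=object)

# String operations therefore work immediately on each element.
filled[0, 0] += "Cat"
filled[0, 1] += "Apple"
print(filled.tolist())  # [['Cat', 'Apple']]
```

Starting from empty strings avoids a TypeError when code concatenates onto elements that have not been assigned yet.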

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Jim | View Question on Stackoverflow
Solution 1 - Python | BrenBarn | View Answer on Stackoverflow
Solution 2 - Python | Johny White | View Answer on Stackoverflow
Solution 3 - Python | spinup | View Answer on Stackoverflow
Solution 4 - Python | Plamen | View Answer on Stackoverflow
Solution 5 - Python | KanDan | View Answer on Stackoverflow
Solution 6 - Python | ayiis | View Answer on Stackoverflow