Get first letter of a string from column

PythonPandas

Python Problem Overview


I'm fighting with pandas and for now I'm loosing. I have source table similar to this:

import pandas as pd

a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']

I would like to add new column to this data frame with first digit from values in column 'First': a) change number to string from column 'First' b) extracting first character from newly created string c) Results from b save as new column in data frame

I don't know how to apply this to the pandas data frame object. I would be grateful for helping me with that.

Python Solutions


Solution 1 - Python

Cast the dtype of the col to str and you can perform vectorised slicing calling str:

In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df

Out[29]:
   First  Second new_col
0    123     234       1
1     22    4353       2
2     32     355       3
3    453     453       4
4     45     345       4
5    453     453       4
6     56      56       5

if you need to you can cast the dtype back again calling astype(int) on the column

Solution 2 - Python

.str.get

This is the simplest to specify string methods

# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df

        A    B
0     xyz  123
1     abc  456
2  foobar  789

df.dtypes

A    object
B     int64
dtype: object

For string (read:object) type columns, use

df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)

.str handles NaNs by returning NaN as the output.

For non-numeric columns, an .astype conversion is required beforehand, as shown in @Ed Chum's answer.

# Note that this won't work well if the data has NaNs. 
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]

df
        A    B  C  D
0     xyz  123  x  1
1     abc  456  a  4
2  foobar  789  f  7

List Comprehension and Indexing

There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.

# For string columns
df['C'] = [x[0] for x in df['A']]

# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]

df
        A    B  C  D
0     xyz  123  x  1
1     abc  456  a  4
2  foobar  789  f  7

If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,

df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2

        A      B
0     xyz  123.0
1     NaN  456.0
2  foobar    NaN

# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]

# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]

        A      B    C    D
0     xyz  123.0    x    1
1     NaN  456.0  NaN    4
2  foobar    NaN    f  NaN

Let's do some timeit tests on some larger data.

df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True) 

%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])

%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])

12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

List comprehensions are 4x faster.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionmichalkView Question on Stackoverflow
Solution 1 - PythonEdChumView Answer on Stackoverflow
Solution 2 - Pythoncs95View Answer on Stackoverflow