String concatenation of two pandas columns

Python Problem Overview

I have the following DataFrame:

from pandas import *
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

It looks like this:

    bar foo
0    1   a
1    2   b
2    3   c

Now I want to have something like:

     bar
0    1 is a
1    2 is b
2    3 is c

How can I achieve this? I tried the following:

df['foo'] = '%s is %s' % (df['bar'], df['foo'])

but it gives me a wrong result:

>>>print df.ix[0]

bar                                                    a
foo    0    a
1    b
2    c
Name: bar is 0    1
1    2
2
Name: 0

Sorry for a dumb question, but this one https://stackoverflow.com/questions/10972410/pandas-combine-two-columns-in-a-dataframe wasn't helpful for me.

Python Solutions

Solution 1 - Python

df['bar'] = df.bar.map(str) + " is " + df.foo

Solution 2 - Python

This question has already been answered, but I believe it would be good to throw some useful methods not previously discussed into the mix, and compare all methods proposed thus far in terms of performance.

Here are some useful solutions to this problem, in increasing order of performance.

`DataFrame.agg`

This is a simple str.format-based approach.

df['baz'] = df.agg('{0[bar]} is {0[foo]}'.format, axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

You can also use f-string formatting here:

df['baz'] = df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

`char.array`-based Concatenation

Convert the columns to chararrays, then add them together (this assumes numpy is imported as np).

a = np.char.array(df['bar'].values)
b = np.char.array(df['foo'].values)

df['baz'] = (a + b' is ' + b).astype(str)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

List Comprehension with `zip`

I cannot overstate how underrated list comprehensions are in pandas.

df['baz'] = [str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])]

Alternatively, use str.join to concatenate (this will also scale better):

df['baz'] = [
    ' '.join([str(x), 'is', y]) for x, y in zip(df['bar'], df['foo'])]

df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

List comprehensions excel in string manipulation, because string operations are inherently hard to vectorize, and most pandas "vectorised" functions are basically wrappers around loops. I have written extensively about this topic in https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care. In general, if you don't have to worry about index alignment, use a list comprehension when dealing with string and regex operations.
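The index-alignment caveat is easy to miss, so here is a minimal sketch of the difference. The two Series and their deliberately mismatched indexes are made up for illustration; within a single DataFrame the columns share an index and both approaches agree.

```python
import pandas as pd

# Two hypothetical Series whose indexes are in opposite order.
s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series(['a', 'b', 'c'], index=[2, 1, 0])

# Series arithmetic aligns on index labels before concatenating.
aligned = s1.map(str) + ' is ' + s2

# zip pairs values purely by position and ignores the index.
positional = [str(x) + ' is ' + y for x, y in zip(s1, s2)]

print(aligned.loc[0])   # label 0 pairs 1 with 'c'
print(positional[0])    # position 0 pairs 1 with 'a'
```

If the labels matter, stick with the Series operators; if only position matters, the list comprehension is safe.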

By default, the list comprehension above does not handle NaNs. However, you could always write a function wrapping a try-except if you needed to handle them.

def try_concat(x, y):
    try:
        return str(x) + ' is ' + y
    except (ValueError, TypeError):
        return np.nan


df['baz'] = [try_concat(x, y) for x, y in zip(df['bar'], df['foo'])]
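As a quick check of how try_concat behaves, here is a small sketch on a frame with a missing value in 'foo'. The sample data is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd


def try_concat(x, y):
    try:
        return str(x) + ' is ' + y
    except (ValueError, TypeError):
        # str + float('nan') raises TypeError, so the row becomes NaN.
        return np.nan


# Hypothetical data: the second 'foo' value is missing.
df = pd.DataFrame({'foo': ['a', np.nan, 'c'], 'bar': [1, 2, 3]})
df['baz'] = [try_concat(x, y) for x, y in zip(df['bar'], df['foo'])]
print(df)
```

Note that this only catches a missing value on the right-hand side. A NaN in 'bar' would go through str() and end up as the literal text 'nan', so it may need an explicit check.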

`perfplot` Performance Measurements

Graph generated using perfplot. Here's the complete code listing.

Functions

def brenbarn(df):
    return df.assign(baz=df.bar.map(str) + " is " + df.foo)

def danielvelkov(df):
    return df.assign(baz=df.apply(
        lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1))

def chrimuelle(df):
    return df.assign(
        baz=df['bar'].astype(str).str.cat(df['foo'].values, sep=' is '))

def vladimiryashin(df):
    return df.assign(baz=df.astype(str).apply(lambda x: ' is '.join(x), axis=1))

def erickfis(df):
    return df.assign(
        baz=df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs1_format(df):
    return df.assign(baz=df.agg('{0[bar]} is {0[foo]}'.format, axis=1))

def cs1_fstrings(df):
    return df.assign(baz=df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs2(df):
    a = np.char.array(df['bar'].values)
    b = np.char.array(df['foo'].values)

    return df.assign(baz=(a + b' is ' + b).astype(str))

def cs3(df):
    return df.assign(
        baz=[str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])])
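The perfplot chart is not reproduced here. As a rough stand-in, the sketch below uses the standard-library timeit module instead of perfplot, checks that three of the listed functions agree, and times them. The 3,000-row frame and the repeat count are arbitrary assumptions, and timings will vary by machine and pandas version, so none are quoted.

```python
import timeit

import pandas as pd


def brenbarn(df):
    return df.assign(baz=df.bar.map(str) + " is " + df.foo)


def chrimuelle(df):
    return df.assign(
        baz=df['bar'].astype(str).str.cat(df['foo'].values, sep=' is '))


def cs3(df):
    return df.assign(
        baz=[str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])])


base = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
# Repeat the sample rows to get a frame large enough to time.
big = pd.concat([base] * 1000, ignore_index=True)

expected = cs3(big)['baz'].tolist()
repeats = 5
for func in (brenbarn, chrimuelle, cs3):
    # Make sure each candidate yields the same strings before timing it.
    assert func(big)['baz'].tolist() == expected
    seconds = timeit.timeit(lambda: func(big), number=repeats)
    print(f"{func.__name__}: {seconds / repeats:.6f} s per call")
```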

Solution 3 - Python

The problem in your code is that you want to apply the operation on every row. The way you've written it though takes the whole 'bar' and 'foo' columns, converts them to strings and gives you back one big string. You can write it like:

df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)

It's longer than the other answer but is more generic (can be used with values that are not strings).
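To make the difference concrete, here is a small sketch contrasting the two forms on the question's sample data. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3]})

# Formatting the Series objects themselves yields ONE Python string
# containing the printed form of both columns.
whole = '%s is %s' % (df['bar'], df['foo'])
print(type(whole))

# Applying the format row by row yields one string per row.
row_wise = df.apply(lambda x: '%s is %s' % (x['bar'], x['foo']), axis=1)
print(row_wise)
```

Assigning the single string to a column broadcasts it to every row, which is why the output in the question looks so odd.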

Solution 4 - Python

You could also use

# 'bar' holds integers, so convert it to str before using the .str accessor
df['bar'] = df['bar'].astype(str).str.cat(df['foo'].values.astype(str), sep=' is ')

Solution 5 - Python

df.astype(str).apply(lambda x: ' is '.join(x), axis=1)

0    1 is a
1    2 is b
2    3 is c
dtype: object

Solution 6 - Python

Series.str.cat is the most flexible way to approach this problem:

For df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

df.foo.str.cat(df.bar.astype(str), sep=' is ')

>>>  0    a is 1
     1    b is 2
     2    c is 3
     Name: foo, dtype: object

df.bar.astype(str).str.cat(df.foo, sep=' is ')

>>>  0    1 is a
     1    2 is b
     2    3 is c
     Name: bar, dtype: object

Unlike .join() (which joins the lists contained in a single Series), this method joins two Series together. It also allows you to ignore or replace NaN values as desired.
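The NaN handling mentioned above is controlled by the na_rep parameter of Series.str.cat. A minimal sketch, assuming a missing value in 'foo' and an arbitrary '?' placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second 'foo' value is missing.
df = pd.DataFrame({'foo': ['a', np.nan, 'c'], 'bar': [1, 2, 3]})
left = df['bar'].astype(str)

# Without na_rep, a missing value on either side makes the whole row missing.
propagated = left.str.cat(df['foo'], sep=' is ')

# With na_rep, missing values are replaced by the given text first.
replaced = left.str.cat(df['foo'], sep=' is ', na_rep='?')

print(propagated)
print(replaced)
```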

Solution 7 - Python

@DanielVelkov's answer is the proper one, but using f-strings (formatted string literals) is faster:

# Daniel's
%timeit df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)
## 963 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# f-strings - Python 3.6+
%timeit df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
## 849 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Solution 8 - Python

I ran into a specific case with 10^11 rows in my DataFrame, and in that case none of the proposed solutions is appropriate. I used categories, which should work fine whenever the number of unique strings is not too large. This is easily done in R with XxY on factors, but I could not find another way to do it in Python (I'm new to Python). If anyone knows a place where this is implemented, I'd be glad to know.

import itertools

import pandas as pd


def Create_Interaction_var(df,Varnames):
    '''
    :df data frame
    :list of 2 column names, say "X" and "Y". 
    The two columns should be strings or categories
    convert strings columns to categories
    Add a column with the "interaction of X and Y" : X x Y, with name 
    "Interaction-X_Y"
    '''
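    # Assign with df[col] so the column dtype really becomes "category";
    # df.loc[:, col] = ... can set values in place and keep the old dtype.
    df[Varnames[0]] = df[Varnames[0]].astype("category")
    df[Varnames[1]] = df[Varnames[1]].astype("category")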
    CatVar = "Interaction-" + "-".join(Varnames)
    Var0Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[0]].cat.categories)).rename(columns={0 : "code0",1 : "name0"})
    Var1Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[1]].cat.categories)).rename(columns={0 : "code1",1 : "name1"})
    NbLevels=len(Var0Levels)

    names = pd.DataFrame(list(itertools.product(dict(enumerate(df.loc[:,Varnames[0]].cat.categories)),
                                                dict(enumerate(df.loc[:,Varnames[1]].cat.categories)))),
                         columns=['code0', 'code1']).merge(Var0Levels,on="code0").merge(Var1Levels,on="code1")
    names=names.assign(Interaction=[str(x) + '_' + y for x, y in zip(names["name0"], names["name1"])])
    names["code01"]=names["code0"] + NbLevels*names["code1"]
    df.loc[:,CatVar]=df.loc[:,Varnames[0]].cat.codes+NbLevels*df.loc[:,Varnames[1]].cat.codes
    df.loc[:, CatVar]=  df[[CatVar]].replace(names.set_index("code01")[["Interaction"]].to_dict()['Interaction'])[CatVar]
    df[CatVar] = df[CatVar].astype("category")
    return df
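The idea behind the function above (combine the two category codes into one integer code, and build the label strings only once per category pair rather than once per row) can be sketched more compactly. This is not the author's implementation: interaction_categorical is a hypothetical helper, and it assumes there are no missing values and that the separator does not make two different pairs produce the same label.

```python
import pandas as pd


def interaction_categorical(df, x, y, sep="_"):
    """Hypothetical helper: categorical 'x sep y' column built from codes."""
    cat_x = df[x].astype("category")
    cat_y = df[y].astype("category")
    n_x = len(cat_x.cat.categories)

    # Widen the codes to int64; cat.codes may be int8 and could overflow.
    codes = (cat_x.cat.codes.to_numpy().astype("int64")
             + n_x * cat_y.cat.codes.to_numpy().astype("int64"))

    # Label i corresponds to combined code i = code_x + n_x * code_y,
    # so the y categories form the outer loop.
    labels = [f"{a}{sep}{b}"
              for b in cat_y.cat.categories
              for a in cat_x.cat.categories]
    return pd.Categorical.from_codes(codes, categories=labels)


# Made-up sample data for illustration.
df = pd.DataFrame({"X": ["a", "b", "a"], "Y": ["u", "u", "v"]})
df["Interaction-X-Y"] = interaction_categorical(df, "X", "Y")
print(df)
```

String work here is proportional to the number of category pairs, not the number of rows, which is the point of the approach when unique values are few.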

Solution 9 - Python

I think the most concise solution for arbitrary numbers of columns is a short-form version of Solution 5:

df.astype(str).apply(' is '.join, axis=1)

You can shave off two more characters with df.agg(), but it's slower:

df.astype(str).agg(' is '.join, axis=1)
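Because the join runs across each row, the result follows the DataFrame's column order. A small sketch, assuming an extra hypothetical 'qux' column, that selects the columns explicitly so the order is under your control:

```python
import pandas as pd

# 'qux' is a made-up third column to show more than two inputs.
df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'qux': ['x', 'y', 'z']})

# Pick the columns in the order they should appear in the result.
cols = ['bar', 'foo', 'qux']
joined = df[cols].astype(str).apply(' is '.join, axis=1)
print(joined)
```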

Content Type	Original Author	Original Content on Stackoverflow
Question	nat	View Question on Stackoverflow
Solution 1 - Python	BrenBarn	View Answer on Stackoverflow
Solution 2 - Python	cs95	View Answer on Stackoverflow
Solution 3 - Python	Daniel	View Answer on Stackoverflow
Solution 4 - Python	chriad	View Answer on Stackoverflow
Solution 5 - Python	Vladimir Iashin	View Answer on Stackoverflow
Solution 6 - Python	johnDanger	View Answer on Stackoverflow
Solution 7 - Python	erickfis	View Answer on Stackoverflow
Solution 8 - Python	robin girard	View Answer on Stackoverflow
Solution 9 - Python	1''	View Answer on Stackoverflow
