Splitting a pandas dataframe column by delimiter

PythonPandas

Python Problem Overview


i have a small sample data:

import pandas as pd

df = {'ID': [3009, 129, 119, 120, 121, 122, 130, 3014, 266, 849, 174, 844],
  'V': ['IGHV7-B*01', 'IGHV7-B*01', 'IGHV6-A*01', 'GHV6-A*01', 'IGHV6-A*01',
        'IGHV6-A*01', 'IGHV4-L*03', 'IGHV4-L*03', 'IGHV5-A*01', 'IGHV5-A*04',
        'IGHV6-A*02','IGHV6-A*02'],
  'Prob': [1, 1, 0.8, 0.8056, 0.9, 0.805, 1, 1, 0.997, 0.401, 1, 1]}

df = pd.DataFrame(df)

looks like

df    

Out[25]: 
      ID    Prob           V
0    3009  1.0000  IGHV7-B*01
1     129  1.0000  IGHV7-B*01
2     119  0.8000  IGHV6-A*01
3     120  0.8056  IGHV6-A*01
4     121  0.9000  IGHV6-A*01
5     122  0.8050  IGHV6-A*01
6     130  1.0000  IGHV4-L*03
7    3014  1.0000  IGHV4-L*03
8     266  0.9970  IGHV5-A*01
9     849  0.4010  IGHV5-A*04
10    174  1.0000  IGHV6-A*02
11    844  1.0000  IGHV6-A*02

I want to split the column 'V' by the '-' delimiter and move it to another column named 'allele'

    Out[25]: 
      ID    Prob      V    allele
0    3009  1.0000  IGHV7    B*01
1     129  1.0000  IGHV7    B*01
2     119  0.8000  IGHV6    A*01
3     120  0.8056  IGHV6    A*01
4     121  0.9000  IGHV6    A*01
5     122  0.8050  IGHV6    A*01
6     130  1.0000  IGHV4    L*03
7    3014  1.0000  IGHV4    L*03
8     266  0.9970  IGHV5    A*01
9     849  0.4010  IGHV5    A*04
10    174  1.0000  IGHV6    A*02
11    844  1.0000  IGHV6    A*02

the code i have tried so far is incomplete and didn't work:

df1 = pd.DataFrame()
df1[['V']] = pd.DataFrame([ x.split('-') for x in df['V'].tolist() ])

or

df.add(Series, axis='columns', level = None, fill_value = None)
newdata = df.DataFrame({'V':df['V'].iloc[::2].values, 
                        'Allele': df['V'].iloc[1::2].values})

Python Solutions


Solution 1 - Python

Use vectoried str.split with expand=True:

In [42]:
df[['V','allele']] = df['V'].str.split('-',expand=True)
df

Out[42]:
      ID    Prob      V allele
0   3009  1.0000  IGHV7   B*01
1    129  1.0000  IGHV7   B*01
2    119  0.8000  IGHV6   A*01
3    120  0.8056   GHV6   A*01
4    121  0.9000  IGHV6   A*01
5    122  0.8050  IGHV6   A*01
6    130  1.0000  IGHV4   L*03
7   3014  1.0000  IGHV4   L*03
8    266  0.9970  IGHV5   A*01
9    849  0.4010  IGHV5   A*04
10   174  1.0000  IGHV6   A*02
11   844  1.0000  IGHV6   A*02

Solution 2 - Python

For storing data into a new dataframe use the same approach, just with the new dataframe:

tmpDF = pd.DataFrame(columns=['A','B'])
tmpDF[['A','B']] = df['V'].str.split('-', expand=True)

Eventually (and more usefull for my purposes) if you would need get only a part of the string value (i.e. text before '-'), you could use .str.split(...).str[idx] like:

df['V'] = df['V'].str.split('-').str[0]
df
	ID	    V	    Prob
0	3009	IGHV7	1.0000
1	129	    IGHV7	1.0000
2	119	    IGHV6	0.8000
3	120	    GHV6	0.8056
  • splits 'V' values into list according to separator '-' and stores 1st item back to the column

Solution 3 - Python

Use the below:

df['allele'] = [x.split('-')[-1] for x in df['V']]

The above first part retains any values after the '-' sign

df['V'] = [x.split('-')[-0] for x in df['V']]

The above second part retains any values before the '-' sign and automatically replaces the main column

df.head(3)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionJessicaView Question on Stackoverflow
Solution 1 - PythonEdChumView Answer on Stackoverflow
Solution 2 - PythonLukasView Answer on Stackoverflow
Solution 3 - PythonAriel JumbaView Answer on Stackoverflow