How to create a DataFrame while preserving order of the columns?
PythonPandasPython Problem Overview
How can I create a DataFrame from multiple numpy
arrays, Pandas
Series, or Pandas
DataFrame's while preserving the order of the columns?
For example, I have these two numpy
arrays and I want to combine them as a Pandas
DataFrame.
foo = np.array( [ 1, 2, 3 ] )
bar = np.array( [ 4, 5, 6 ] )
If I do this, the bar
column would come first because dict
doesn't preserve order.
pd.DataFrame( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) } )
bar foo
0 4 1
1 5 2
2 6 3
I can do this, but it gets tedious when I need to combine many variables.
pd.DataFrame( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) }, columns = [ 'foo', 'bar' ] )
EDIT: Is there a way to specify the variables to be joined and to organize the column order in one operation? That is, I don't mind using multiple lines to complete the entire operation, but I'd rather not having to specify the variables to be joined multiple times (since I will be changing the code a lot and this is pretty error prone).
EDIT2: One more point. If I want to add or remove one of the variables to be joined, I only want to add/remove in one place.
Python Solutions
Solution 1 - Python
collections.OrderedDict
Original Solution: Incorrect Usage of In my original solution, I proposed to use OrderedDict
from the collections
package in python's standard library.
>>> import numpy as np
>>> import pandas as pd
>>> from collections import OrderedDict
>>>
>>> foo = np.array( [ 1, 2, 3 ] )
>>> bar = np.array( [ 4, 5, 6 ] )
>>>
>>> pd.DataFrame( OrderedDict( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) } ) )
foo bar
0 1 4
1 2 5
2 3 6
Right Solution: Passing Key-Value Tuple Pairs for Order Preservation
However, as noted, if a normal dictionary is passed to OrderedDict
, the order may still not be preserved since the order is randomized when constructing the dictionary. However, a work around is to convert a list of key-value tuple pairs into an OrderedDict
, as suggested from this SO post:
>>> import numpy as np
>>> import pandas as pd
>>> from collections import OrderedDict
>>>
>>> a = np.array( [ 1, 2, 3 ] )
>>> b = np.array( [ 4, 5, 6 ] )
>>> c = np.array( [ 7, 8, 9 ] )
>>>
>>> pd.DataFrame( OrderedDict( { 'a': pd.Series(a), 'b': pd.Series(b), 'c': pd.Series(c) } ) )
a c b
0 1 7 4
1 2 8 5
2 3 9 6
>>> pd.DataFrame( OrderedDict( (('a', pd.Series(a)), ('b', pd.Series(b)), ('c', pd.Series(c))) ) )
a b c
0 1 4 7
1 2 5 8
2 3 6 9
Solution 2 - Python
Use the columns
keyword when creating the DataFrame
:
pd.DataFrame({'foo': foo, 'bar': bar}, columns=['foo', 'bar'])
Also, note that you don't need to create the Series.
Solution 3 - Python
To preserve column order pass in your numpy arrays as a list of tuples to DataFrame.from_items
:
>>> df = pd.DataFrame.from_items([('foo', foo), ('bar', bar)])
foo bar
0 1 4
1 2 5
2 3 6
Update
From pandas 0.23 from_items
is deprecated and will be removed. So pass the numpy
arrays using from_dict
. To use from_dict
you need to pass the items as a dictionary:
>>> from collections import OrderedDict as OrderedDict
>>> df = pd.DataFrame.from_dict(OrderedDict(zip(['foo', 'bar'], [foo, bar])))
From python 3.7 you can depend on insertion order being preserved (see https://mail.python.org/pipermail/python-dev/2017-December/151283.html) so:
>>> df = pd.DataFrame.from_dict(dict(zip(['foo', 'bar'], [foo, bar])))
or simply:
>>> df = pd.DataFrame(dict(zip(['foo', 'bar'], [foo, bar])))
Solution 4 - Python
After having created your dataframe, you can simply reorder the columns the way you want by using
df= df[['foo','bar']]
Solution 5 - Python
I couldn't comment to ask, but how will you specify the order of the columns in the first place (since you can't with a regular dictionary)?
If you want to maintain an ordered dictionary:
from collections import OrderedDict
import numpy as np
import pandas as pd
data = OrderedDict()
data['foo'] = np.array([1, 2, 3])
data['bar'] = np.array([4, 5, 6])
df = pd.DataFrame(data)
If you just have a list of keys for order:
data = {key: value for key, value in data.iteritems()}
df = pd.concat(data.values(), keys=['foo', 'bar'], axis=1)
@tfv's answer is likely the most concise way to do what you want.
Solution 6 - Python
>>> pd.concat([pd.Series(eval(col), name=col) for col in ['foo', 'bar']], axis=1)
foo bar
0 1 4
1 2 5
2 3 6
This works using eval
. Your list of column names must match the corresponding variable name.
>>> eval('foo')
array([1, 2, 3])
Solution 7 - Python
This may be an other way to approach it:
foo = np.array( [ 1, 2, 3 ] )
bar = np.array( [ 4, 5, 6 ] )
stacked = np.vstack((x,y)).T
stacked
array([[1, 4],
[2, 5],
[3, 6]])
new_df = pd.DataFrame(stacked, columns = ['foo', 'bar'] )
new_df
foo bar
0 1 4
1 2 5
2 3 6
Solution 8 - Python
Make the dataframe with just the data in it, and transpose it.
Then add the columns.
>>> foo = np.array( [ 1, 2, 3 ] )
>>> bar = np.array( [ 4, 5, 6 ] )
>>>
>>> df = pd.DataFrame([foo, bar]).T
>>> df.columns = ['foo','bar']
>>> df
> foo bar > 0 1 4 > 1 2 5 > 2 3 6
Solution 9 - Python
Another sketchy solution might be to pass a X_ to the title of the column where X is the number of the order of the column:
pd.DataFrame( { '2_foo': pd.Series(foo), '1_bar': pd.Series(bar) } )
And after that you can use columns or something to rename the columns again! The less pythonic code in the world!!!
Good luck mates!
Solution 10 - Python
What I did is as follow:
# Creating list of dict
list_of_dicts = ({'key1':'valueA', 'key2':'valueB},{'key1':'valueC', 'key2':'valueD}')
#getting list of keys from the dict
keys_list = list(list_of_dicts.keys())
# and finally
df = pd.DataFrame(list_of_dicts, columns = keys_list)
Worked perfectly for me.
Solution 11 - Python
The question is very old but wish to provide my solution to the question of "Preserving the order of columns while reading a csv file into pandas data frame":
import numpy as np
import pandas as pd
# Get column count as a list
cols = np.arange(0, hmprice.shape[1])
df = pd.read_csv('train.csv', usecols=cols)
df.head()
Default Order of dataframe:
Preserved order of dataframe:
NOTE: The attribute usecols can take either column names or column indices; but pandas doesn't honor "any other order" of column names or column indices.
For example,
df = pd.read_csv('train.csv', usecols=[1, 2, 3])<br/>
or
df = pd.read_csv('train.csv', usecols=[3, 2, 1])<br/>
gives the same result.