Create Empty Dataframe in Pandas specifying column types


Python Problem Overview


I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:

df = pd.DataFrame(index=['pbp'],
                  columns=['contract',
                           'state_and_county_code',
                           'state',
                           'county',
                           'starting_membership',
                           'starting_raw_raf',
                           'enrollment_trend',
                           'projected_membership',
                           'projected_raf'],
                  dtype=['str', 'str', 'str', 'str',
                         'int', 'float', 'float',
                         'int', 'float'])

However, I get the following error:

TypeError: data type not understood

What does this mean?

Python Solutions


Solution 1 - Python

You can use the following:

df = pd.DataFrame({'a': pd.Series(dtype='int'),
                   'b': pd.Series(dtype='str'),
                   'c': pd.Series(dtype='float')})

or more abstractly:

df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in {'a': 'int', 'b': 'str', 'c': 'float'}.items()})

Then, if you evaluate df, you get:

>>> df 
Empty DataFrame 
Columns: [a, b, c]
Index: []

and if you check its types:

>>> df.dtypes
a      int32
b     object
c    float64
dtype: object
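Applied to the columns from the question, the same dictionary-of-empty-Series idea might look like the sketch below. The schema dictionary is a name introduced here for illustration, and the frame is created without the 'pbp' index row.

```python
import pandas as pd

# Column name -> dtype string, taken from the lists in the question.
# "schema" is a hypothetical helper name used only in this sketch.
schema = {
    'contract': 'str',
    'state_and_county_code': 'str',
    'state': 'str',
    'county': 'str',
    'starting_membership': 'int',
    'starting_raw_raf': 'float',
    'enrollment_trend': 'float',
    'projected_membership': 'int',
    'projected_raf': 'float',
}

# One empty, typed Series per column; the DataFrame keeps each dtype.
df = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in schema.items()})

print(df.dtypes)
```

The exact integer width reported for 'int' depends on the platform and NumPy version, which is likely why the output above shows int32.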

Solution 2 - Python

One way to do it:

import numpy
import pandas

dtypes = numpy.dtype(
    [
        ("a", str),
        ("b", int),
        ("c", float),
        ("d", numpy.datetime64),
    ]
)
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))
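To check what the structured-array approach produces, a variant with explicit widths and units can be inspected as below. The explicit dtype strings ('U10', 'datetime64[ns]') are an assumption of this sketch, chosen so the result does not depend on how bare Python types are interpreted.

```python
import numpy as np
import pandas as pd

# Structured dtype with explicit string width and datetime unit
# (assumed here; the answer above passes bare Python types instead).
record_dtype = np.dtype([
    ('a', 'U10'),
    ('b', 'int64'),
    ('c', 'float64'),
    ('d', 'datetime64[ns]'),
])

# A zero-length structured array becomes a zero-row frame,
# with one column per field.
df = pd.DataFrame(np.empty(0, dtype=record_dtype))

print(df.dtypes)
```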

Solution 3 - Python

This is an old question, but I don't see a solid answer (although @eric_g was super close).

You just need to create an empty dataframe from a dictionary of key:value pairs, where each key is a column name and each value is an empty instance of the desired data type.

So in your example dataset, it would look as follows (pandas 0.25 and python 3.7):

variables = {'contract':'',
             'state_and_county_code':'',
             'state':'',
             'county':'',
             'starting_membership':int(),
             'starting_raw_raf':float(),
             'enrollment_trend':float(),
             'projected_membership':int(),
             'projected_raf':float()}

df = pd.DataFrame(variables, index=[])

In old pandas versions, one may have to do:

df = pd.DataFrame(columns=[variables])
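A shortened version of the dictionary approach can be used to confirm that the scalar placeholders determine the column types. This is a minimal sketch with only three of the question's columns; the dtypes are inferred by pandas from the placeholder values.

```python
import pandas as pd

# Each value is an "empty" instance of the desired type.
variables = {'contract': '',
             'starting_membership': int(),
             'starting_raw_raf': float()}

# index=[] broadcasts every scalar to a zero-length column.
df = pd.DataFrame(variables, index=[])

print(df.dtypes)
```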

Solution 4 - Python

This really smells like a bug.

Here's another (simpler) solution.

import pandas as pd
import numpy as np

def df_empty(columns, dtypes, index=None):
    assert len(columns)==len(dtypes)
    df = pd.DataFrame(index=index)
    for c,d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df

df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes))  # [dtype('int64'), dtype('int64')]

Solution 5 - Python

Just a remark.

You can get around the TypeError using np.dtype:

pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))

but you get instead:

NotImplementedError: compound dtypes are not implementedin the DataFrame constructor

Solution 6 - Python

My solution (without setting an index) is to initialize a dataframe with the column names and then specify the data types using the astype() method.

df = pd.DataFrame(columns=['contract',
                     'state_and_county_code',
                     'state',
                     'county',
                     'starting_membership',
                     'starting_raw_raf',
                     'enrollment_trend',
                     'projected_membership',
                     'projected_raf'])
df = df.astype( dtype={'contract' : str, 
                 'state_and_county_code': str,
                 'state': str,
                 'county': str,
                 'starting_membership': int,
                 'starting_raw_raf': float,
                 'enrollment_trend': float,
                 'projected_membership': int,
                 'projected_raf': float})

Solution 7 - Python

I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.

import numpy as np
import pandas as pd

def make_empty_typed_df(dtype):
    tdict = np.sctypeDict  # np.typeDict was a deprecated alias of this, removed in NumPy 1.24
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]

Testing this out ...

from itertools import chain

# np.sctypeDict replaces np.typeDict, which was removed in NumPy 1.24
dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.sctypeDict, set(np.sctypeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.sctypeDict.get(t, t) != np.void) and not isinstance(t, int)]

print(make_empty_typed_df(dtype))

Out:

Empty DataFrame

Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []

[0 rows x 146 columns]

And the datatypes ...

print(make_empty_typed_df(dtype).dtypes)

Out:

col0      timedelta64[ns]
col6               uint16
col16              uint64
col23                int8
col24     timedelta64[ns]
col25                bool
col26           complex64
col27               int64
col29             float64
col30                int8
col31             float16
col32              uint64
col33               uint8
col34              object
col35          complex128
col36               int64
col37               int16
col38               int32
col39               int32
col40             float16
col41              object
col42              uint64
col43              object
col44               int16
col45              object
col46               int64
col47               int16
col48              uint32
col49              object
col50              uint64
               ...       
col144              int32
col145               bool
col146            float64
col147     datetime64[ns]
col148             object
col149             object
col150         complex128
col151    timedelta64[ns]
col152              int32
col153              uint8
col154            float64
col156              int64
col157             uint32
col158             object
col159               int8
col160              int32
col161             uint64
col162              int16
col163             uint32
col164             object
col165     datetime64[ns]
col166            float32
col167               bool
col168            float64
col169         complex128
col170            float16
col171             object
col172             uint16
col173          complex64
col174         complex128
dtype: object

Adding an index gets tricky because there isn't a true missing value for most data types so they end up getting cast to some other type with a native missing value (e.g., ints are cast to floats or objects), but if you have complete data of the types you've specified, then you can always insert rows as needed, and your types will be respected. This can be accomplished with:

df.loc[index, :] = new_row

Again, as @Hun pointed out, this is NOT how Pandas is intended to be used.
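The casting described above, where integer columns change type once an index introduces missing values, can be seen in a small sketch. The column names are made up for illustration, and the nullable 'Int64' dtype shown at the end is a pandas extension type that is not part of the original answer.

```python
import pandas as pd

empty = pd.DataFrame({'members': pd.Series(dtype='int64'),
                      'raf': pd.Series(dtype='float64')})

# Adding an index label creates a row of missing values, so the plain
# integer column is upcast to float64 in order to hold NaN.
with_index = empty.reindex(['pbp'])
print(with_index.dtypes)

# The nullable integer extension dtype ('Int64', capital I) can hold a
# missing value without leaving the integer type.
nullable = pd.DataFrame({'members': pd.Series(dtype='Int64')}).reindex(['pbp'])
print(nullable.dtypes)
```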

Solution 8 - Python

Taking the lists columns and dtype from your example, you can do the following:

cdt = dict(zip(columns, dtype))          # make column -> type dict
pdf = pd.DataFrame(columns=list(cdt))    # create empty dataframe
pdf = pdf.astype(cdt)                    # set desired column types

The DataFrame documentation says that only a single dtype is allowed in the constructor call.
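A self-contained sketch of the difference between the constructor's single dtype and a per-column astype() mapping follows. It uses a shortened, three-column version of the question's lists, which is an assumption made to keep the example small.

```python
import pandas as pd

columns = ['contract', 'starting_membership', 'starting_raw_raf']
dtype = ['str', 'int', 'float']

# The constructor's dtype argument is one dtype applied to every column.
uniform = pd.DataFrame(columns=columns, dtype='float64')
print(uniform.dtypes)

# Per-column types: build the empty frame first, then cast it with a
# column -> dtype mapping.
cdt = dict(zip(columns, dtype))
pdf = pd.DataFrame(columns=columns).astype(cdt)
print(pdf.dtypes)
```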

Solution 9 - Python

I found the easiest workaround for me was to simply concatenate a list of empty series for each individual column:

import pandas as pd

columns = ['contract',
           'state_and_county_code',
           'state',
           'county',
           'starting_membership',
           'starting_raw_raf',
           'enrollment_trend',
           'projected_membership',
           'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract                 0 non-null object
# state_and_county_code    0 non-null object
# state                    0 non-null object
# county                   0 non-null object
# starting_membership      0 non-null int32
# starting_raw_raf         0 non-null float64
# enrollment_trend         0 non-null float64
# projected_membership     0 non-null int32
# projected_raf            0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes

Solution 10 - Python

A plain pandas integer column cannot hold missing values, so an integer column that needs NaN ends up stored as float or object. You can either use a float column and convert it to integer as needed, or treat it as an object column. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this:

df1 =  pd.DataFrame(index=['pbp'], columns=['str1','str2','str2'], dtype=str)
df2 =  pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 =  pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)

    str1 str2 str2 int1 int2  flt1  flt2
pbp  NaN  NaN  NaN  NaN  NaN   NaN   NaN

You can rearrange the col order as you like. But again, this is not the way pandas was supposed to be used.

 df.dtypes
str1     object
str2     object
str2     object
int1     object
int2     object
flt1    float64
flt2    float64
dtype: object

Note that int is treated as object.

Solution 11 - Python

You can do this by passing a dictionary into the DataFrame constructor:

df = pd.DataFrame(index=['pbp'],
                  data={'contract' : np.full(1, "", dtype=str),
                        'starting_membership' : np.full(1, np.nan, dtype=int),
                        'projected_membership' : np.full(1, np.nan, dtype=float)
                       }
                 )

This will correctly give you a dataframe that looks like:

    contract  projected_membership   starting_membership
pbp       ""                   NaN  -9223372036854775808

With dtypes:

contract                 object
projected_membership    float64
starting_membership       int64

That said, there are two things to note:

  1. str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case object. It'll still work properly.

  2. Why don't you see NaN under starting_membership? NaN is only defined for floats; there is no "None" value for integers, so np.nan is cast to an integer. If you want a different default value, you can change that in the np.full call.
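Following the second note, one way to avoid the odd integer produced by casting NaN is to give the integer column an explicit default. The sketch below assumes 0 is an acceptable placeholder and uses a reduced set of the question's columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['pbp'],
                  data={'contract': np.full(1, '', dtype=str),
                        # Explicit integer default instead of a cast NaN.
                        'starting_membership': np.full(1, 0, dtype=int),
                        # Floats can hold NaN directly.
                        'projected_raf': np.full(1, np.nan, dtype=float)})

print(df)
print(df.dtypes)
```

Whether 0 is a sensible stand-in for "no value yet" depends on the data; a sentinel that can be confused with a real measurement may be worse than a float column with NaN.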

Solution 12 - Python

Create empty dataframe in Pandas specifying column types:

import pandas as pd

c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')

df = pd.concat([c1, c2, c3, c4], axis=1)

df.info(verbose=True)

We create the columns as Series, give each one the correct dtype, and then concatenate the Series into a DataFrame.

In effect, we have a DataFrame constructor with per-column dtypes:

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      0 non-null      string 
 1   c2      0 non-null      bool   
 2   c3      0 non-null      float64
 3   c4      0 non-null      int32  
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes

Solution 13 - Python

I recommend this:

columns = ["a", "b"]
types = ['float32', 'str']
predefined_size = 10

df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t) 
                   for c,t in zip(columns, types)})

Advantages

  • supports old pandas versions (e.g. 0.19.2)
  • can initialize both the types and the size
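Because this variant pre-allocates rows, the result is not an empty frame: every cell starts out as a missing value. A short check of the shape and of the float32 column, repeating the snippet above so it runs on its own:

```python
import pandas as pd

columns = ['a', 'b']
types = ['float32', 'str']
predefined_size = 10

df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t)
                   for c, t in zip(columns, types)})

print(df.shape)             # number of rows equals predefined_size
print(df.dtypes)
print(df['a'].isna().all())  # the pre-allocated float cells are all NaN
```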

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type            Original Author    Original Content on Stackoverflow
Question                Vincent            View Question on Stackoverflow
Solution 1 - Python     Alberto            View Answer on Stackoverflow
Solution 2 - Python     ryanjdillon        View Answer on Stackoverflow
Solution 3 - Python     SummerEla          View Answer on Stackoverflow
Solution 4 - Python     user48956          View Answer on Stackoverflow
Solution 5 - Python     ptrj               View Answer on Stackoverflow
Solution 6 - Python     Korhan             View Answer on Stackoverflow
Solution 7 - Python     JaminSore          View Answer on Stackoverflow
Solution 8 - Python     Jacek Błocki       View Answer on Stackoverflow
Solution 9 - Python     jdehesa            View Answer on Stackoverflow
Solution 10 - Python    Hun                View Answer on Stackoverflow
Solution 11 - Python    Eric G.            View Answer on Stackoverflow
Solution 12 - Python    Paul               View Answer on Stackoverflow
Solution 13 - Python    Kato               View Answer on Stackoverflow