Adding meta-information/metadata to pandas DataFrame


Python Problem Overview

Is it possible to add some meta-information/metadata to a pandas DataFrame?

For example, the name of the instrument used to measure the data, the instrument responsible, etc.

One workaround would be to create a column with that information, but it seems wasteful to store a single piece of information in every row!

Python Solutions

Solution 1 - Python

Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:

import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'

Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
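The caveat above can be checked directly. The following is a minimal sketch, assuming a reasonably recent pandas; the `readings` column and its values are made up for illustration. It attaches an ad hoc attribute and then looks for it on a copy of the frame.

```python
import pandas as pd

# Illustrative data; the column name and values are assumptions.
df = pd.DataFrame({"readings": [1.0, 2.5, 4.0]})
df.instrument_name = "Binky"

# The attribute lives only on this particular Python object.
assert df.instrument_name == "Binky"

# copy() builds a new DataFrame; ad hoc attributes are not carried over.
copied = df.copy()
attribute_survived = hasattr(copied, "instrument_name")
print(attribute_survived)
```

If the metadata has to follow the data through operations, one of the other approaches on this page is likely a better fit.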

Preserving the metadata in a file is possible. You can find an example of how to store metadata in an HDF5 file here.

Solution 2 - Python

As of pandas 1.0, possibly earlier, there is now a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future. For example:

import pandas as pd
df = pd.DataFrame([])
df.attrs['instrument_name'] = 'Binky'

Find it in the docs here.

Trying this out with to_parquet and then read_parquet, the attrs don't seem to persist, so be sure to check this against your use case.
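As a small sketch of how attrs behaves in memory (assuming pandas 1.0 or later; the `readings` column and the `operator` key are made up for illustration), the snippet below sets two entries and checks them on a copy. It does not exercise any file format.

```python
import pandas as pd

df = pd.DataFrame({"readings": [1.0, 2.5, 4.0]})
df.attrs["instrument_name"] = "Binky"

# attrs is an ordinary dict, so any dict operation works on it.
# The "operator" key is a hypothetical extra entry.
df.attrs.update({"operator": "unknown"})

# Unlike ad hoc attributes, attrs are carried over to a copy.
copied = df.copy()
print(copied.attrs)
```

Because the feature is experimental, it is worth repeating a check like this for whichever operations and I/O formats you rely on.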

Solution 3 - Python

Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute on them that does persist through functions that return new DataFrames. Also seems to survive serialization just fine (I've only tried json, but I imagine hdf is covered as well).

Solution 4 - Python

Not really. Although you could add attributes containing metadata to the DataFrame class as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your dataframe, then the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub:

There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.
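The wrapping idea could look roughly like the sketch below. `FrameWithMeta` and `map_frame` are hypothetical names chosen for this illustration, not part of pandas or of the pull request mentioned above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

import pandas as pd


# eq=False avoids a generated __eq__ that would try to compare DataFrames.
@dataclass(eq=False)
class FrameWithMeta:
    """Hypothetical container pairing a DataFrame with its metadata."""

    frame: pd.DataFrame
    meta: Dict[str, Any] = field(default_factory=dict)

    def map_frame(
        self, func: Callable[[pd.DataFrame], pd.DataFrame]
    ) -> "FrameWithMeta":
        """Apply func to the DataFrame and keep a copy of the metadata."""
        return FrameWithMeta(frame=func(self.frame), meta=dict(self.meta))


wrapped = FrameWithMeta(
    frame=pd.DataFrame({"val": [1, 2, 2, 3]}),
    meta={"instrument_name": "Binky"},
)

# The filter returns a brand-new DataFrame, but the wrapper keeps the metadata.
filtered = wrapped.map_frame(lambda frame: frame[frame["val"] > 1])
print(len(filtered.frame), filtered.meta)
```

The trade-off is that every operation has to go through the wrapper (or be re-wrapped by hand), which is the price for not depending on pandas internals.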

Solution 5 - Python

The top answer of attaching arbitrary attributes to the DataFrame object is good, but if the value is a dictionary, list, or tuple, pandas will emit the warning "Pandas doesn't allow columns to be created via a new attribute name". The following solution works for storing arbitrary attributes.

import pandas as pd
from types import SimpleNamespace
df = pd.DataFrame()
# Hold list-like values on a namespace so pandas does not treat them as columns
df.meta = SimpleNamespace()
df.meta.foo = [1, 2, 3]  # "foo" is an illustrative attribute name

Solution 6 - Python

As mentioned in other answers and comments, _metadata is not part of the public API, so it's definitely not a good idea to use it in a production environment. But you may still want to use it for research prototyping and replace it if it stops working. Right now it works with groupby/apply, which is helpful. This is an example (which I couldn't find in other answers):

import pandas as pd
df = pd.DataFrame([1, 2, 2, 3, 3], columns=['val'])
df.my_attribute = "my_value"
# Register the attribute name so pandas propagates it to derived frames
df._metadata.append('my_attribute')
df.groupby('val').apply(lambda group: group.my_attribute)


1    my_value
2    my_value
3    my_value
dtype: object

Solution 7 - Python

As mentioned by @choldgraf I have found xarray to be an excellent tool for attaching metadata when comparing data and plotting results between several dataframes.

In my work, we are often comparing the results of several firmware revisions and different test scenarios. Adding this information is as simple as this:

import pandas as pd
import xarray as xr

# meaningless_test, foo, bar and sc_01 are placeholders for your own values
df = pd.read_csv(meaningless_test)
metadata = {'fw': foo, 'test_name': bar, 'scenario': sc_01}
ds = xr.Dataset.from_dataframe(df)
ds.attrs = metadata

Solution 8 - Python

Coming pretty late to this, I thought this might be helpful if you need metadata to persist over I/O. There's a relatively new package called h5io that I've been using to accomplish this.

It should let you do a quick read/write from HDF5 for a few common formats, one of them being a dataframe. So you can, for example, put a dataframe in a dictionary and include metadata as fields in the dictionary. E.g.:

import h5io

# my_df is an existing DataFrame; the other fields are the metadata
save_dict = dict(data=my_df, name='chris', record_date='1/1/2016')
h5io.write_hdf5('path/to/file.hdf5', save_dict)
in_data = h5io.read_hdf5('path/to/file.hdf5')
df = in_data['data']
name = in_data['name']

Another option would be to look into a project like xarray (formerly named xray), which is more complex in some ways, but I think it does let you use metadata and is pretty easy to convert to a DataFrame.

Solution 9 - Python

I have been looking for a solution and found that a pandas DataFrame has the attrs property:

df = pd.DataFrame()
df.attrs.update({'your_attribute' : 'value'})

This attribute stays attached to your frame wherever you pass it!
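To illustrate what passing the frame around means in practice, here is a minimal sketch. `describe_source` is a hypothetical helper, and the `readings` column is made up for the example.

```python
import pandas as pd


def describe_source(frame: pd.DataFrame) -> str:
    """Hypothetical helper that reads metadata from the frame it receives."""
    instrument = frame.attrs.get("instrument_name", "unknown instrument")
    return f"{len(frame)} rows measured with {instrument}"


df = pd.DataFrame({"readings": [1.0, 2.5, 4.0]})
df.attrs.update({"instrument_name": "Binky"})

# The function receives the same object, so its attrs are available too.
summary = describe_source(df)
print(summary)
```

Using `attrs.get` with a default keeps the helper usable on frames that carry no metadata.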

Solution 10 - Python

I was having the same issue and used a workaround of creating a new, smaller DF from a dictionary with the metadata:

    meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
    dfMeta = pd.DataFrame.from_dict(meta, orient='index')

This dfMeta can then be saved alongside your original DF, for example in a pickle file.

See Lutz's answer for an excellent explanation of saving and retrieving multiple dataframes using pickle.
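One way to keep the two frames together is sketched below, assuming the standard-library pickle module and a temporary directory. The file name `frames.pkl`, the dict keys and the `readings` column are illustrative choices, not something prescribed by the answer.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"readings": [1.0, 2.5, 4.0]})
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient="index")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "frames.pkl"  # hypothetical file name

    # Store both frames together in a single pickle file.
    with path.open("wb") as handle:
        pickle.dump({"data": df, "meta": dfMeta}, handle)

    # Load them back as a dict with the same keys.
    with path.open("rb") as handle:
        restored = pickle.load(handle)

# from_dict(..., orient="index") puts the values in a single column labelled 0.
restored_name = restored["meta"].loc["name", 0]
print(restored_name)
```

Only unpickle files from sources you trust, since pickle can execute arbitrary code on load.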

Solution 11 - Python

Referring to the section "Define original properties" of the official pandas documentation, and if subclassing pandas.DataFrame is an option, note that:

> To let original data structures have additional properties, you should let pandas know what properties are added.

Thus, something you can do - where the name MetaedDataFrame is arbitrarily chosen - is

import pandas as pd

class MetaedDataFrame(pd.DataFrame):
    _metadata = ['instrument_name']

    # pandas expects _constructor to be a property returning the class
    @property
    def _constructor(self):
        return self.__class__

    # Define the following if providing attribute(s) at instantiation
    # is a requirement, otherwise, if YAGNI, don't.
    def __init__(
        self, *args, instrument_name: str = None, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.instrument_name = instrument_name

And then instantiate your dataframe with your (_metadata-prespecified) attribute(s)

>>> mdf = MetaedDataFrame(instrument_name='Binky')
>>> mdf.instrument_name

Or even after instantiation

>>> mdf = MetaedDataFrame()
>>> mdf.instrument_name = 'Binky'

This works without any kind of warning (as of 2021/06/15): serialization and ~.copy work like a charm. This approach also lets you enrich your API, e.g. by adding some instrument_name-based members to MetaedDataFrame, such as properties (or methods):

    @property
    def lower_instrument_name(self) -> str:
        if self.instrument_name is not None:
            return self.instrument_name.lower()

>>> mdf.lower_instrument_name

... but this is rather beyond the scope of this question ...
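The claim about copies can be checked with a short self-contained sketch. It repeats the subclass so that the block runs on its own, and the `val` column is made up for illustration. Only copy() is exercised here, and serialization is not tested.

```python
import pandas as pd


class MetaedDataFrame(pd.DataFrame):
    # Names listed here are propagated by pandas to derived objects.
    _metadata = ["instrument_name"]

    @property
    def _constructor(self):
        return self.__class__

    def __init__(self, *args, instrument_name=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.instrument_name = instrument_name


mdf = MetaedDataFrame({"val": [1, 2, 3]}, instrument_name="Binky")

# copy() returns a new MetaedDataFrame with the metadata carried over.
copied = mdf.copy()
print(type(copied).__name__, copied.instrument_name)
```

Other operations may or may not propagate the attribute depending on the pandas version, so it is worth testing the ones you depend on.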

Solution 12 - Python

Adding raw attributes with pandas (e.g. df.my_metadata = "source.csv") is not a good idea.

Even on version 1.2.4 (the latest at the time of writing, on Python 3.8), doing this will randomly cause segfaults during very simple operations with things like read_csv. It is hard to debug, because read_csv will work fine, but later on (seemingly at random) you will find that the dataframe has been freed from memory.

The CPython extensions involved with pandas seem to make very explicit assumptions about the data layout of the dataframe.

attrs is the only safe way to use metadata properties currently:


df.attrs.update({'my_metadata' : "source.csv"})

How attrs should behave in all scenarios is not fully fleshed out. You can help provide feedback on the expected behaviors of attrs in this issue:

Solution 13 - Python

For those looking to store the DataFrame in an HDFStore, the recommended approach is:

import pandas as pd

df = pd.DataFrame(dict(keys=['a', 'b', 'c'], values=['1', '2', '3']))
df.to_hdf('/tmp/temp_df.h5', key='temp_df')
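store = pd.HDFStore('/tmp/temp_df.h5')
# Attach the metadata to the stored object's attributes
store.get_storer('temp_df').attrs.attr_key = 'attr_value'
store.close()  # close the file so the attribute is written to disk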


All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type           Original Author      Original Content on Stackoverflow
Question               P3trus               View Question on Stackoverflow
Solution 1 - Python    unutbu               View Answer on Stackoverflow
Solution 2 - Python    ryanjdillon          View Answer on Stackoverflow
Solution 3 - Python    follyroof            View Answer on Stackoverflow
Solution 4 - Python    Matti John           View Answer on Stackoverflow
Solution 5 - Python    bscan                View Answer on Stackoverflow
Solution 6 - Python    Dennis Golomazov     View Answer on Stackoverflow
Solution 7 - Python    jtwilson             View Answer on Stackoverflow
Solution 8 - Python    choldgraf            View Answer on Stackoverflow
Solution 9 - Python    DisplayName          View Answer on Stackoverflow
Solution 10 - Python   SenAnan              View Answer on Stackoverflow
Solution 11 - Python   keepAlive            View Answer on Stackoverflow
Solution 12 - Python   Jon                  View Answer on Stackoverflow
Solution 13 - Python   Olshansk             View Answer on Stackoverflow