What are Python pandas equivalents for R functions like str(), summary(), and head()?

Python, R, Pandas

Python Problem Overview


I'm only aware of the describe() function. Are there any other functions similar to str(), summary(), and head()?

Python Solutions


Solution 1 - Python

In pandas, the info() method produces output very similar to R's str():

> str(train)
'data.frame':   891 obs. of  13 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
 $ Child      : num  0 0 0 0 0 NA 0 1 0 1 ...


train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
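
If you want to try info() without the Titanic data, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration only). It also shows that info() prints rather than returns its text, so the buf parameter is used to capture it:

```python
import io
import pandas as pd

# Hypothetical toy data; not the Titanic training set shown above.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "age": [22.0, None, 35.0],
    "survived": [0, 1, 1],
})

# info() writes to stdout by default; buf redirects the text to a writable buffer.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

The missing age shows up as a lower non-null count for that column, which is the closest info() gets to the NA values visible in str().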

Solution 2 - Python

This provides output similar to R's str(). It presents unique values instead of initial values.

def rstr(df): return df.shape, df.apply(lambda x: [x.unique()])

print(rstr(iris))

((150, 5), sepal_length    [[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.4, 4.8, 4.3,...
sepal_width     [[3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 2.9, 3.7,...
petal_length    [[1.4, 1.3, 1.5, 1.7, 1.6, 1.1, 1.2, 1.0, 1.9,...
petal_width     [[0.2, 0.4, 0.3, 0.1, 0.5, 0.6, 1.4, 1.5, 1.3,...
class            [[Iris-setosa, Iris-versicolor, Iris-virginica]]
dtype: object)
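
A variant of the same idea that builds the per-column unique values explicitly, rather than relying on how apply() packages list results, could look like the sketch below. The name rstr_explicit and the toy data are assumptions, not part of the original answer:

```python
import pandas as pd

def rstr_explicit(df):
    """Return the shape plus a Series mapping each column to its unique values."""
    uniques = pd.Series({col: list(df[col].unique()) for col in df.columns})
    return df.shape, uniques

# Hypothetical stand-in for the iris data used above.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "setosa"],
    "petal_width": [0.2, 0.2, 1.8, 0.2],
})

shape, uniques = rstr_explicit(toy)
print(shape)
print(uniques)
```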

Solution 3 - Python

  • summary() ~ describe()
  • head() ~ head()

I'm not sure about the str() equivalent.
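
One difference worth knowing: by default describe() summarizes only the numeric columns of a mixed DataFrame, whereas R's summary() also reports on factors. Passing include="all" gets closer to that. A small sketch on made-up data (the column names are assumptions):

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

numeric_summary = toy.describe()            # numeric columns only by default
full_summary = toy.describe(include="all")  # adds count/unique/top/freq for non-numeric columns
first_rows = toy.head(2)                    # first two rows, like head(toy, 2) in R

print(numeric_summary)
print(full_summary)
print(first_rows)
```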

Solution 4 - Python

The pandas documentation includes an extensive "Comparison with R / R libraries" page. The most obvious difference is that R favors a functional style while pandas is object-oriented, with the DataFrame as the key object. Another difference between R and Python is that Python indexes from 0, whereas R indexes from 1.

R               | Pandas
-------------------------------
summary(df)     | df.describe()
head(df)        | df.head()
dim(df)         | df.shape
slice(df, 1:10) | df.iloc[:10]
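
The indexing difference is easy to trip over, so here is a minimal sketch on a made-up 20-row frame (the column name x is an assumption). Note that the stop position of an iloc slice is exclusive, so R's rows 1:10 correspond to positions 0 through 9:

```python
import pandas as pd

# Hypothetical frame with 20 rows holding the values 1..20.
df = pd.DataFrame({"x": range(1, 21)})

n_rows, n_cols = df.shape   # dim(df) in R
first_ten = df.iloc[:10]    # positions 0..9, i.e. R's rows 1:10
third_row = df.iloc[2]      # R would write df[3, ]

print(n_rows, n_cols)
print(first_ten)
print(third_row)
```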

Solution 5 - Python

For a Python equivalent to the str() function in R, I use the method dtypes. This will provide the data types for each column.

In [22]: df2.dtypes
Out[22]: 
Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

Solution 6 - Python

I still prefer str() because it lists some example values. A confusing aspect of info() is that its behavior depends on display settings such as pandas.options.display.max_info_columns.

I think the best alternative is to call info() with explicit parameters that force a fixed behavior (show_counts replaced the older null_counts argument, which was removed in pandas 2.0):

df.info(show_counts=True, verbose=True)

And for your other functions:

summary(df)     | df.describe()
head(df)        | df.head()
dim(df)         | df.shape
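
To see the effect of that display setting, here is a small sketch; the five-column frame and the info_text helper are made up for illustration. It captures info() output through the buf parameter and compares the default call with the explicit one while display.max_info_columns is temporarily set to 2. It uses show_counts, the argument name accepted by current pandas releases:

```python
import io
import pandas as pd

# Hypothetical frame with more columns than the temporary max_info_columns limit.
wide = pd.DataFrame({f"col_{i}": [1, 2, 3] for i in range(5)})

def info_text(df, **kwargs):
    """Return the text that df.info() would print."""
    buf = io.StringIO()
    df.info(buf=buf, **kwargs)
    return buf.getvalue()

with pd.option_context("display.max_info_columns", 2):
    default_text = info_text(wide)                                 # truncated column summary
    forced_text = info_text(wide, verbose=True, show_counts=True)  # full per-column listing

print(default_text)
print(forced_text)
```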

Solution 7 - Python

I don't know much about R, but here are some leads:

str => 

This is a difficult one. For functions you can use dir(), but dir() on a dataset lists all of its methods, so that's probably not what you want.

summary => describe. 

See the parameters to customize the results.

head => you can use head(), or use slices. 

Use head() as you already do. To get the first 10 rows of a dataset called ds, use ds[:10]; for the last 10 rows (like tail), use ds[-10:].
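
A quick sketch of how slices line up with head() and tail(), using a made-up 100-row frame (the name ds follows the answer; the column n is an assumption). Note that a negative stop such as ds[:-10] drops the last ten rows rather than selecting them:

```python
import pandas as pd

# Hypothetical dataset with 100 rows.
ds = pd.DataFrame({"n": range(100)})

first_ten = ds[:10]          # same rows as ds.head(10)
last_ten = ds[-10:]          # same rows as ds.tail(10)
all_but_last_ten = ds[:-10]  # everything except the last ten rows

print(first_ten)
print(last_ten)
print(len(all_but_last_ten))
```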

Solution 8 - Python

I don't think there is a direct equivalent to the str() function (or glimpse() from dplyr) in Pandas that gives the same information. I think an equivalent function would have to display the following:

  1. The number of rows and columns in the data frame
  2. The names of all the columns
  3. The data type stored in each column
  4. A quick look at the first few values in each column

Building on @jjurach's answer, I wrote a helper function that works as a stand-in for the R str or glimpse function to quickly get an overview of my DataFrames. Here's the code with an example:

import pandas as pd
import random

# an example dataframe to test the helper function
example_df = pd.DataFrame({
    "var_a": [random.choice(["foo","bar"]) for i in range(20)],
    "var_b": [random.randint(0, 1) for i in range(20)],
    "var_c": [random.random() for i in range(20)]
})

# helper function for viewing pandas dataframes
def glimpse_pd(df, max_width=76):

    # find the max string lengths of the column names and dtypes for formatting
    _max_len = max([len(col) for col in df])
    _max_dtype_label_len = max([len(str(df[col].dtype)) for col in df])

    # print the dimensions of the dataframe
    print(f"{type(df)}:  {df.shape[0]} rows of {df.shape[1]} columns")

    # print the name, dtype and first few values of each column
    for _column in df:

        _col_vals = df[_column].head(max_width).to_list()
        _col_type = str(df[_column].dtype)

        output_col = f"{_column}:".ljust(_max_len+1, ' ')
        output_dtype = f" {_col_type}".ljust(_max_dtype_label_len+3, ' ')

        output_combined = f"{output_col} {output_dtype} {_col_vals}"

        # trim the output if too long
        if len(output_combined) > max_width:
            output_combined = output_combined[0:(max_width-4)] + " ..."

        print(output_combined)

Running the function returns the following output:

glimpse_pd(example_df)
<class 'pandas.core.frame.DataFrame'>:  20 rows of 3 columns
var_a:  object    ['foo', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'bar ...
var_b:  int64     [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, ...
var_c:  float64   [0.7346545694885085, 0.7776711488732364, 0.49558114902 ...

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        | Original Author | Original Content on Stackoverflow
-------------------------------------------------------------------------
Question            | megashigger     | View Question on Stackoverflow
Solution 1 - Python | reedcourty      | View Answer on Stackoverflow
Solution 2 - Python | jjurach         | View Answer on Stackoverflow
Solution 3 - Python | omer sagi       | View Answer on Stackoverflow
Solution 4 - Python | Martin Thoma    | View Answer on Stackoverflow
Solution 5 - Python | fubar2021       | View Answer on Stackoverflow
Solution 6 - Python | neves           | View Answer on Stackoverflow
Solution 7 - Python | Wakaru44        | View Answer on Stackoverflow
Solution 8 - Python | Cameron Raynor  | View Answer on Stackoverflow