Skip to content

Pandas


Google Colaboratory


Data Exploration

Paint cell by value:

df.head(10).style.background_gradient()

It will look like :

Untitled

DataFrame info (e.g. memory) :

Such as memory used

all_data.info()
"""e.g. 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12970 entries, 0 to 4276
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   12970 non-null  object 
 1   HomePlanet    12682 non-null  object
.......
dtypes: float64(6), object(8)
memory usage: 1.5+ MB
"""

Display null value count :

all_data.isnull().sum()

Display some statistic summary of data :

all_data.describe()

Display unique values in a column :

all_data['HomePlanet'].unique()
# array(['Europa', 'Earth', 'Mars', nan], dtype=object)

Display counting of unique values in a column :

all_data['HomePlanet'].value_counts()
"""e.g.
Earth     6865
Europa    3133
Mars      2684
Name: HomePlanet, dtype: int64
"""

The number of unique values in each column :

all_data.nunique(
    # dropna=False
)
"""e.g.
PassengerId     12970
HomePlanet          3
CryoSleep           2
Cabin            9825
Destination         3
Age                80
VIP                 2
RoomService      1578
FoodCourt        1953
ShoppingMall     1367
Spa              1679
VRDeck           1642
Name            12629
Transported         2
dtype: int64
"""

Display some unique values in each column :

Note: only show the columns with < 20 unique values

all_uniq = { i: all_data[i].unique() for i in all_data.columns if len(all_data[i].unique()) < 20 }
all_uniq

Helper func - display multiple DataFrame together

from IPython.core.display import HTML
def multi_table_vertical(lt) :
    return HTML(" <hr> ".join(table._repr_html_() for table in lt))
# Usage:
multi_table_vertical([all_st, train_st, test_st])

Data Transform

To list of dict :

df.to_dict('records')
"""
[{'customer': 1L, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2L, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3L, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]
"""

Concat multiple pandas DF :

all_data = pd.concat([train, test], axis=0)

Comments