data manipulation pandas
Data selection and manipulation
loc
and iloc
The most useful was is to loc by callable to select rows/columns. As long as the callable returns a list of true/false, it works.
df.reindex
returns data when index and/or columns presents but returns NaN or custom fillin when not present.
s = pd.Series([1,2,3])
s.reindex([1,2,3]) #this returns [2,3,NaN]
use index.intersection(other_df.index)
to find common columns with other df
use iat
/at
for selecting and setting
isin
takes both list and dict as arguments to specify columns for df
where
and mask
. mask
is the inverse selection of where
df.where(df>5, '?') # return value if > 5, otherwise fill with '?'. fill with NaN if second argument not given
query
df[(df['a'] < df['b']) & (df['b'] < df['c'])]
is same asdf.query('(a < b) & (b < c)')
df.query('index < 2')
selects data from row 3. ‘index’ is a special namedf.query('a in b')
same asdf.loc[df.a.isin(df.b),:]
.df.quer('a not in b')
is also valid- use case of
query
. when operations are the same but datasets are differentexpr = '0.0 <= a <= c <= 0.5' map(lambda frame: frame.query(expr), [df, df2])
lookup
query by a pair of rows and columns list. return a list of these values. raise error when one of the labels not found.Data transformation
Table transpose
of course we are not talking about normal mx transpose….
pivot
vsmelt
pivot
transforms data in different columns into one volumns of label and one column of value.melt
is the reverse operation ofpivot
pivot_table
allows us to have a aggregate function when there are multiple values of that label.unstack
vsstack
. which level of index can be selectedstack
is a special case ofpivot
, put column names not in column but into another level of index, forming Multi-index table.
unstack
otherwise, is the reverse ofstack
puting one level of index into column name
group by mapping
groupby({'a': 1, 'b': 2, 'c': 1})
so that ‘a’ and ‘c’ are considered as the same group
Iterate through groups to have some operations…
groupby().aggregate()
- apply different oeprations on all columns. ```groupby().aggregate([np.mean, ‘min’, max])
- apply different operation on different columns by mapping. ```groupby().aggregate({‘col1’: func1, ‘col2: func2})
groupby().transform()
must return a same dimensioned data as the original
groupby().apply()
- axis argument indicates how function goes forward. At each step, the function applies to the other axis.
- apply method can return ‘aggregate’ like or ‘transform’ like results. It automatically interprets the results but we can also specify the format in result_type arguent
groupby().filtering()
def func(): return boolean #this has to be a boolean value groupby().filter(func)
slicing data
for multi-index
- selection
1.1. normal/simple selection# df is 3x3 and multi indexed with 3 levels in index and column df.loc[(slice(None),['a', 'b'],), (,,,)]
1.2. pd.IndexSlice for loc like selection. can use range and no more slice()
idx = pd.IndexSlice() df.loc[idx['a':'c', :, :], idx[:,:,:]]
2.
Multi-index
axis means both column and index are possible
DF methods to set;reset/drop/rename axis.
- set;reset
1.1. reindex(length of axis and data can differ. if axis longer, add nan value, otherwise truncate data likeloc
)
1.2. set_index <-> reset_index(consumes a column)
1.3. set_axis: length of axis must be same as data dimension - rename
- droplevel
index manipulation
Series has the most of the methods like axis. If axis needs manipulation, firstly extract the axis, then transform and finally set them as new axis.
- to_flat_index; transform axis to list of tuples
- to_frame; transform axis to dataframe
- reorder_levels