data manipulation pandas
Data selection and manipulation
loc and iloc
The most useful was is to loc by callable to select rows/columns. As long as the callable returns a list of true/false, it works.
df.reindex returns data when index and/or columns presents but returns NaN or custom fillin when not present.
s = pd.Series([1,2,3])
s.reindex([1,2,3]) #this returns [2,3,NaN]
use index.intersection(other_df.index) to find common columns with other df
use iat/at for selecting and setting
isin takes both list and dict as arguments to specify columns for df
where and mask. mask is the inverse selection of where
df.where(df>5, '?') # return value if > 5, otherwise fill with '?'. fill with NaN if second argument not given
query
df[(df['a'] < df['b']) & (df['b'] < df['c'])]is same asdf.query('(a < b) & (b < c)')df.query('index < 2')selects data from row 3. ‘index’ is a special namedf.query('a in b')same asdf.loc[df.a.isin(df.b),:].df.quer('a not in b')is also valid- use case of
query. when operations are the same but datasets are differentexpr = '0.0 <= a <= c <= 0.5' map(lambda frame: frame.query(expr), [df, df2])lookupquery by a pair of rows and columns list. return a list of these values. raise error when one of the labels not found.Data transformation
Table transpose
of course we are not talking about normal mx transpose….
pivotvsmelt pivottransforms data in different columns into one volumns of label and one column of value.meltis the reverse operation ofpivotpivot_tableallows us to have a aggregate function when there are multiple values of that label.unstackvsstack. which level of index can be selectedstackis a special case ofpivot, put column names not in column but into another level of index, forming Multi-index table.
unstackotherwise, is the reverse ofstackputing one level of index into column name
group by mapping
groupby({'a': 1, 'b': 2, 'c': 1}) so that ‘a’ and ‘c’ are considered as the same group
Iterate through groups to have some operations…
groupby().aggregate()
- apply different oeprations on all columns. ```groupby().aggregate([np.mean, ‘min’, max])
- apply different operation on different columns by mapping. ```groupby().aggregate({‘col1’: func1, ‘col2: func2})
groupby().transform()must return a same dimensioned data as the original
groupby().apply() - axis argument indicates how function goes forward. At each step, the function applies to the other axis.
- apply method can return ‘aggregate’ like or ‘transform’ like results. It automatically interprets the results but we can also specify the format in result_type arguent
groupby().filtering()def func(): return boolean #this has to be a boolean value groupby().filter(func)slicing data
for multi-index
- selection
1.1. normal/simple selection# df is 3x3 and multi indexed with 3 levels in index and column df.loc[(slice(None),['a', 'b'],), (,,,)]1.2. pd.IndexSlice for loc like selection. can use range and no more slice()
idx = pd.IndexSlice() df.loc[idx['a':'c', :, :], idx[:,:,:]]2.
Multi-index
axis means both column and index are possible
DF methods to set;reset/drop/rename axis.
- set;reset
1.1. reindex(length of axis and data can differ. if axis longer, add nan value, otherwise truncate data likeloc)
1.2. set_index <-> reset_index(consumes a column)
1.3. set_axis: length of axis must be same as data dimension - rename
- droplevel
index manipulation
Series has the most of the methods like axis. If axis needs manipulation, firstly extract the axis, then transform and finally set them as new axis.
- to_flat_index; transform axis to list of tuples
- to_frame; transform axis to dataframe
- reorder_levels