Data frames with pandas

A data frame is like a table in SQL: it can be loaded from a file and manipulated in memory. Cleaning and reshaping it is part of data preprocessing, which can be very tedious work.

Luckily, pandas is a powerful tool that takes care of most of it.

0.import packages

import pandas as pd
import numpy as np  # needed for np.nan in step 5

1.create a data frame

df = pd.DataFrame()
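
An empty frame is rarely useful on its own. As a minimal sketch, a frame can also be built directly from a dict of columns; the column names and values here are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: each dict key becomes a column.
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'score': [1.5, 2.0, 3.5],
})

print(df.shape)  # (3, 2): three rows, two columns
```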

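Or load one from a (CSV) file:

header : if not set, the first line is used as the header; header=None means the file has no header row

delimiter : if not set, the delimiter is ','

malformed lines : on_bad_lines='warn' skips lines with too many fields and prints a warning (pandas 1.3 and later; older versions used error_bad_lines=False, warn_bad_lines=True)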

df = pd.read_csv(filepath_or_buffer=file_root, header=None, delimiter='\t')
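
To try these options without a file on disk, the sketch below reads from an in-memory string. The sample content is invented, and on_bad_lines='skip' assumes pandas 1.3 or later.

```python
import io
import pandas as pd

# Hypothetical tab-separated content with no header row.
# The third line has an extra field, so it counts as malformed.
raw = "1\talice\n2\tbob\n3\tcarol\textra\n4\tdave\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=None,          # the data has no header row
    delimiter='\t',
    on_bad_lines='skip',  # silently drop malformed lines (pandas 1.3+)
)

print(df.shape)  # (3, 2): the malformed line was dropped
```

With header=None the columns are numbered 0, 1, ... which is why a rename step usually follows.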

2.rename columns

df.rename(columns={'old_column_name_1': 'new_column_name_1', 'old_column_name_2': 'new_column_name_2'}, inplace=True)
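
A small sketch of the same call on a frame loaded without a header, using assignment instead of inplace=True. The data and the new names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame loaded without a header, so columns are numbered 0 and 1.
df = pd.DataFrame([[1, 'alice'], [2, 'bob']])

# Keys that match no existing column are ignored by default.
df = df.rename(columns={0: 'id', 1: 'name', 'missing': 'unused'})

print(list(df.columns))  # ['id', 'name']
```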

3.add a new column

df['new_column_name'] = 0
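
The assigned value does not have to be a constant. As a sketch with made-up column names, a new column can also be computed from existing ones:

```python
import pandas as pd

df = pd.DataFrame({'price': [2.0, 3.0], 'quantity': [5, 4]})

df['discount'] = 0                          # a scalar is broadcast to every row
df['total'] = df['price'] * df['quantity']  # element-wise, row by row

print(df['total'].tolist())  # [10.0, 12.0]
```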

4.concat two data frames

add_df = pd.concat([one_part_df, two_part_df], axis=1)
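
Note that axis=1 places the frames side by side, matching rows on the index, while the default axis=0 stacks them on top of each other. A minimal sketch with invented frames:

```python
import pandas as pd

one_part_df = pd.DataFrame({'a': [1, 2]})
two_part_df = pd.DataFrame({'b': [3, 4]})

# axis=1: side by side, rows matched on the index.
wide_df = pd.concat([one_part_df, two_part_df], axis=1)

# axis=0 (default): stacked; ignore_index=True renumbers the rows.
more_rows_df = pd.DataFrame({'a': [5, 6]})
long_df = pd.concat([one_part_df, more_rows_df], axis=0, ignore_index=True)

print(wide_df.shape)  # (2, 2)
print(long_df.shape)  # (4, 1)
```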

5.replace values

df = df.replace('None', np.nan)
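
Called on the whole frame, replace affects every column and only matches cells whose entire value equals 'None'. A short sketch on invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['1', 'None', '3'], 'y': ['None', 'b', 'c']})

# Turn the placeholder string into a real missing value.
df = df.replace('None', np.nan)

print(df.isna().sum().tolist())  # [1, 1]: one missing value per column
```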

6.transform a column

# trans_float and col are not part of pandas; they are defined elsewhere by the user
df['column_name'] = df['column_name'].apply(lambda x: trans_float(x, col))
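
Since the converter is not shown here, the sketch below uses a hypothetical stand-in named to_float. Its name, signature and fallback value are assumptions, not the original helper.

```python
import pandas as pd

def to_float(value, default=0.0):
    """Hypothetical converter: return value as a float, or default on failure."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

df = pd.DataFrame({'column_name': ['1.5', 'abc', '4']})

# apply calls the function once per cell and collects the results.
df['column_name'] = df['column_name'].apply(lambda x: to_float(x, -1.0))

print(df['column_name'].tolist())  # [1.5, -1.0, 4.0]
```

For plain numeric parsing, pd.to_numeric(df['column_name'], errors='coerce') is a vectorized alternative that yields NaN for unparseable values.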

7.fill null values

df['column_name'] = df['column_name'].fillna(0)
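
When several columns need different defaults, fillna also accepts a dict that maps column names to fill values. A sketch with made-up columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'count': [1.0, np.nan, 3.0], 'label': ['a', None, 'c']})

# A dict gives each column its own fill value.
df = df.fillna({'count': 0, 'label': 'unknown'})

print(df['count'].tolist())  # [1.0, 0.0, 3.0]
print(df['label'].tolist())  # ['a', 'unknown', 'c']
```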

8.remove duplicate columns

df = df.loc[:, ~df.columns.duplicated()]
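
Duplicate column names typically appear after a side-by-side concat of frames that share a key column. The sketch below, on invented data, shows that only the first occurrence of each name is kept:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2], 'x': [10, 20]})
right = pd.DataFrame({'id': [1, 2], 'y': [30, 40]})

df = pd.concat([left, right], axis=1)     # columns: id, x, id, y
df = df.loc[:, ~df.columns.duplicated()]  # keeps the first 'id' only

print(list(df.columns))  # ['id', 'x', 'y']
```

The check is by column name only; two columns with different names but identical values are both kept.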

9.drop rows with an invalid value

df = df.drop(df[df['column_name'] == -1.0].index)
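
An alternative sketch keeps the valid rows with a boolean mask instead of dropping by index label. Unlike drop, this does not remove extra rows when the index contains repeated labels. The sample data is invented.

```python
import pandas as pd

df = pd.DataFrame({'column_name': [0.5, -1.0, 2.0, -1.0]})

# Keep only the rows whose value is not the -1.0 sentinel,
# then renumber the remaining rows from 0.
df = df[df['column_name'] != -1.0].reset_index(drop=True)

print(df['column_name'].tolist())  # [0.5, 2.0]
```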

10.read in chunks when the file is too large

for chunk in pd.read_csv(file_name, chunksize=chunk_size):
    pass  # each chunk is a DataFrame with at most chunk_size rows; process it here
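
As a sketch of what the loop body might do, the example below sums a column chunk by chunk so that the full file never has to fit in memory. The in-memory string stands in for a large file, and the column name is an assumption.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
raw = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += chunk['value'].sum()  # aggregate each chunk separately
    n_chunks += 1

print(total, n_chunks)  # 45 3: ten rows arrive as chunks of 4, 4 and 2
```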

