Data frame with pandas
Read this in "about 2 minutes".
Hi!
A data frame is like a table in SQL: it can be loaded from a file and manipulated in memory. Cleaning it up is part of data preprocessing, which can be very tedious work.
Luckily, pandas is a powerful tool that helps with this.
0.import package
import pandas as pd
import numpy as np  # needed for np.nan in step 5
1.create a data frame
df = pd.DataFrame()
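Or load it from a (csv) file
header : if not set, the first line is used as the header
delimiter : if not set, the delimiter is ‘,’
bad lines : error_bad_lines=False, warn_bad_lines=True skips malformed lines and warns about them (both options were removed in pandas 2.0; use on_bad_lines='warn' there)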
df = pd.read_csv(filepath_or_buffer=file_root, header=None, delimiter='\t')
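To see what `header` and `delimiter` actually change, here is a minimal sketch that parses a tab-separated string through `io.StringIO` instead of a real file. The sample data and the names `raw`, `with_header` and `no_header` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample; a real call would pass a file path instead.
raw = "name\tscore\nalice\t1.5\nbob\t2.0\n"

# Default header: the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(raw), delimiter="\t")

# header=None: the first line is kept as data and columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None, delimiter="\t")

print(list(with_header.columns))  # ['name', 'score']
print(no_header.shape)            # (3, 2)
```

So `header=None` is the right choice only when the file has no title row; otherwise the titles end up as the first data row.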
2.rename columns
df.rename(columns={'old_column_name_1': 'new_column_name_1', 'old_column_name_2': 'new_column_name_2'}, inplace=True)
3.add a new column
df['new_column_name'] = 0
4.concat two data frames
add_df = pd.concat([one_part_df, two_part_df], axis=1)
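The `axis` argument decides the direction of the concat. A small sketch with two toy frames (the column names `a` and `b` are assumptions for the example):

```python
import pandas as pd

one_part_df = pd.DataFrame({"a": [1, 2]})
two_part_df = pd.DataFrame({"b": [3, 4]})

# axis=1 puts the frames side by side, aligning rows on the index.
side_by_side = pd.concat([one_part_df, two_part_df], axis=1)

# axis=0 (the default) stacks rows; ignore_index=True renumbers them from 0.
stacked = pd.concat([one_part_df, one_part_df], axis=0, ignore_index=True)

print(side_by_side.shape)  # (2, 2)
print(stacked.shape)       # (4, 1)
```

With `axis=1`, rows are matched by index label, so frames with different indexes produce NaN where a label is missing on one side.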
5.column replace
df = df.replace('None', np.nan)
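A quick sketch of what the replace does, assuming a column where missing values were written as the literal string 'None' (the column name `x` is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": ["1.0", "None", "3.5"]})

# Turn the placeholder string into a real missing value.
df = df.replace("None", np.nan)

print(df["x"].isna().tolist())  # [False, True, False]
```

After this, pandas functions such as `isna()` and `fillna()` treat those cells as missing.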
6.column change
df['column_name'] = df['column_name'].apply(lambda x: trans_float(x, col))
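`trans_float` and `col` are not defined in this post. The following is only one guess at what such a helper might look like: it converts a value to float and falls back to NaN when parsing fails. The body of the function and the sample column `price` are assumptions.

```python
import math
import pandas as pd

def trans_float(value, col):
    """Hypothetical helper: parse value as float, NaN if it cannot be parsed.

    col is accepted only to match the call above; this sketch does not use it.
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

col = "price"  # assumed column name
df = pd.DataFrame({col: ["1.5", "abc", "3"]})
df[col] = df[col].apply(lambda x: trans_float(x, col))

print(df[col].isna().tolist())  # [False, True, False]
```

If plain numeric conversion is all you need, `pd.to_numeric(df[col], errors="coerce")` does the same job without a custom function.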
7.fill null
df['column_name'] = df['column_name'].fillna(0)
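A small sketch of filling nulls, first for one column and then with a different value per column by passing a dict. The column names `count` and `label` are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"count": [1.0, np.nan], "label": ["x", None]})

# Fill a single column.
df["count"] = df["count"].fillna(0)

# Fill several columns at once: the dict maps column name to fill value.
df = df.fillna({"label": "unknown"})

print(df["count"].tolist())  # [1.0, 0.0]
print(df["label"].tolist())  # ['x', 'unknown']
```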
8.remove duplicate columns
df = df.loc[:, ~df.columns.duplicated()]
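To see why this works: `df.columns.duplicated()` marks every column name that has already appeared, and `~` inverts the mask so only first occurrences are kept. A toy example with a repeated column `a`:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

mask = df.columns.duplicated()  # True for the second 'a'
df = df.loc[:, ~mask]

print(list(mask))        # [False, False, True]
print(list(df.columns))  # ['a', 'b']
```

Note that this compares column names only, not the values inside the columns.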
9.drop some rows according to invalid value
df = df.drop(df[df['column_name'] == -1.0].index)
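The same rows can also be removed by keeping the complement with a boolean mask. A sketch comparing the two on toy data, where -1.0 is assumed to be the invalid marker:

```python
import pandas as pd

df = pd.DataFrame({"column_name": [0.5, -1.0, 2.0, -1.0]})

# Drop by index labels of the invalid rows.
dropped = df.drop(df[df["column_name"] == -1.0].index)

# Keep only the valid rows with a boolean mask.
filtered = df[df["column_name"] != -1.0]

print(dropped.equals(filtered))  # True
```

The mask version is usually the safer habit: `drop` works on index labels, so with a non-unique index it removes every row that shares a label with an invalid one.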
10.split into chunk when df is too large
for chunk in pd.read_csv(file_name, chunksize=chunk_size):
    do_something(chunk)
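Each `chunk` is an ordinary data frame with at most `chunksize` rows, so results have to be accumulated across iterations. A minimal sketch that sums one column chunk by chunk, reading from an in-memory string instead of a large file (the data and the column name `value` are made up):

```python
import io
import pandas as pd

# Ten rows holding the values 0..9 under a single 'value' column.
raw = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += chunk["value"].sum()  # work on one piece at a time
    n_chunks += 1

print(n_chunks, total)  # 3 45
```

Ten rows with `chunksize=4` give chunks of 4, 4 and 2 rows, so only one small piece is in memory at any time.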
Goodbye!