Data frame with pandas

Read this in "about 2 minutes".

Hi!

A data frame is like a table in SQL: it can be loaded from a file and manipulated in memory. Cleaning and reshaping it is part of data preprocessing, which can be very tedious work.

Luckily, Pandas is a powerful tool to help us.

0.import package

import pandas as pd

1.create a data frame

df = pd.DataFrame()
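An empty frame is rarely what you want for long, so here is a small sketch of building one from a dict of columns. The column names and values are made up for illustration.

```python
import pandas as pd

# An empty data frame, as in the line above.
empty = pd.DataFrame()

# A frame built from a dict: keys become column names, lists become columns.
# "name" and "age" are hypothetical example columns.
df = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]})

print(empty.empty)  # True
print(df.shape)     # (2, 2)
```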

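Or load it from a (CSV) file:

header : if not set, the first line is used as the header; pass header=None when the file has no header row

delimiter : if not set, the delimiter is ','

bad lines : older pandas versions used error_bad_lines=False, warn_bad_lines=True to skip malformed rows with a warning; both were deprecated in pandas 1.3 and removed in 2.0 in favor of on_bad_lines='warn'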

df = pd.read_csv(filepath_or_buffer=file_root, header=None, delimiter='\t')
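To try the call above without a real file, here is a minimal sketch that reads tab-separated text from an in-memory buffer. The sample data is invented; `io.StringIO` simply stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical tab-separated data with no header row.
raw = "1\talice\t3.5\n2\tbob\t4.0\n"

# header=None: no line is treated as a header, so pandas
# assigns integer column labels 0, 1, 2.
df = pd.read_csv(io.StringIO(raw), header=None, delimiter="\t")

print(df.shape)          # (2, 3)
print(list(df.columns))  # [0, 1, 2]
```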

2.rename columns

df.rename(columns={'old_column_name_1': 'new_column_name_1', 'old_column_name_2': 'new_column_name_2'}, inplace=True)
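A small sketch of the same rename on toy data. It uses the non-inplace form, which returns a new frame and leaves the original untouched; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Without inplace=True, rename returns a new data frame.
renamed = df.rename(columns={"a": "id", "b": "score"})

print(list(renamed.columns))  # ['id', 'score']
print(list(df.columns))       # ['a', 'b']
```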

3.add a new column

df['new_column_name'] = 0
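Assigning a scalar fills every row with that value. A new column can also be computed from existing ones, as in this sketch with made-up "price" and "qty" columns.

```python
import pandas as pd

df = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4]})

# A scalar is broadcast to every row.
df["flag"] = 0

# A column derived element-wise from two existing columns.
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist())  # [10.0, 12.0]
```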

4.concat two data frames

# axis=1 places the frames side by side (column-wise); use axis=0 to stack rows
add_df = pd.concat([one_part_df, two_part_df], axis=1)
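The axis argument decides the direction of the concat. This sketch shows both directions on tiny invented frames.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})

# axis=1: columns of both frames end up next to each other,
# with rows aligned on the index.
side_by_side = pd.concat([left, right], axis=1)

top = pd.DataFrame({"a": [1, 2]})
bottom = pd.DataFrame({"a": [3, 4]})

# axis=0: rows are stacked; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

print(side_by_side.shape)     # (2, 2)
print(stacked["a"].tolist())  # [1, 2, 3, 4]
```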

5.replace values

import numpy as np

df = df.replace('None', np.nan)
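Here is a runnable sketch of that replacement, assuming the file stored missing values as the literal string 'None'. The column "x" is invented.

```python
import numpy as np
import pandas as pd

# The string 'None' is ordinary text, so pandas does not treat it as missing.
df = pd.DataFrame({"x": ["1.5", "None", "2.0"]})

# Swap the placeholder string for a real missing value.
df = df.replace("None", np.nan)

print(df["x"].isna().tolist())  # [False, True, False]
```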

6.transform a column

# trans_float is a user-defined helper (not part of pandas)
df['column_name'] = df['column_name'].apply(lambda x: trans_float(x, col))
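The post does not show trans_float, so the version below is only a guess at what such a helper might look like: it parses a value as float and falls back to NaN, using the column name just for the message. For plain numeric parsing, `pd.to_numeric(series, errors="coerce")` is a built-in alternative.

```python
import math
import pandas as pd

def trans_float(value, col):
    """Hypothetical helper: parse value as a float, return NaN on failure."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # col is only used to report where parsing failed.
        print(f"could not parse {value!r} in column {col!r}")
        return float("nan")

df = pd.DataFrame({"price": ["1.5", "abc", "3"]})
col = "price"

# apply calls the function once per element of the column.
df[col] = df[col].apply(lambda x: trans_float(x, col))
```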

7.fill null

df['column_name'] = df['column_name'].fillna(0)
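A quick sketch of fillna on a toy column with one missing value; "score" is a made-up name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

# Every missing value in the column is replaced with 0.
df["score"] = df["score"].fillna(0)

print(df["score"].tolist())  # [1.0, 0.0, 3.0]
```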

8.remove duplicate columns

df = df.loc[:, ~df.columns.duplicated()]
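To see why the line above works: `columns.duplicated()` marks every repeated column name after its first occurrence, and `~` inverts that mask so only first occurrences are kept. A sketch with an invented frame that has two columns named "a":

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# True for each column name that already appeared further left.
mask = df.columns.duplicated()
print(mask.tolist())  # [False, False, True]

# Keep only the columns whose name is seen for the first time.
df = df.loc[:, ~mask]
print(list(df.columns))  # ['a', 'b']
```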

9.drop rows with an invalid value

df = df.drop(df[df['column_name'] == -1.0].index)
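A sketch of the drop on toy data, assuming -1.0 marks an invalid value as in the line above. Filtering with a boolean mask gives the same result and is often easier to read.

```python
import pandas as pd

df = pd.DataFrame({"value": [0.5, -1.0, 2.0, -1.0]})

# Look up the index labels of the invalid rows, then drop them.
dropped = df.drop(df[df["value"] == -1.0].index)

# Equivalent: keep only the rows that are not invalid.
filtered = df[df["value"] != -1.0]

print(dropped["value"].tolist())  # [0.5, 2.0]
print(dropped.equals(filtered))   # True
```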

10.read in chunks when the file is too large

for chunk in pd.read_csv(file_name, chunksize=chunk_size):
    do_something(chunk)
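With chunksize set, read_csv returns an iterator of smaller data frames instead of one big frame. In this sketch the "do something" step is a running sum, and an in-memory buffer with invented data stands in for the large file.

```python
import io
import pandas as pd

# Hypothetical CSV with a header line and ten rows (values 0..9).
raw = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
n_chunks = 0

# Each chunk is a regular data frame with at most 4 rows.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += chunk["value"].sum()
    n_chunks += 1

print(n_chunks)  # 3
print(total)     # 45
```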

Goodbye! :wink:

