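Python Pandas Performance Improvement
Using NumPy and a few pandas methods appropriately can speed up computations many times over. This article introduces some commonly used approaches and compares their running times.
First, read in the data.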
import pandas as pd
import numpy as np
data = pd.read_csv("gun_deaths_in_america.csv",header=0)
data.head()
|   | year | month | intent | police | sex | age | race | hispanic | place | education |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012 | 1 | Suicide | 0 | M | 34 | Asian/Pacific Islander | 100 | Home | 4 |
| 1 | 2012 | 1 | Suicide | 0 | F | 21 | White | 100 | Street | 3 |
| 2 | 2012 | 1 | Suicide | 0 | M | 60 | White | 100 | Other specified | 4 |
| 3 | 2012 | 2 | Suicide | 0 | M | 64 | White | 100 | Home | 4 |
| 4 | 2012 | 2 | Suicide | 0 | M | 31 | White | 100 | Other specified | 2 |
The ordinary apply() method
def judge_edu(row):
    # Label education levels above 3 as 'high'; keep the original value otherwise.
    if row['education'] > 3:
        return 'high'
    else:
        return row['education']

%timeit data['judge_edu'] = data.apply(judge_edu,axis=1)
out:
1.81 s ± 38.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using np.where()
np.where(condition, [x, y]) works like an element-wise if...else: it returns x where the condition holds and y otherwise. Calls can be nested.
%timeit data['judge_edu'] = np.where(data['education']>3,'high',data['education'])
out:
58.2 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
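To make the element-wise if...else behaviour concrete, here is a minimal sketch on a tiny array. The sample values and the labels 'low' and 'mid' are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical sample values standing in for the 'education' column.
education = np.array([1, 4, 3, 5, 2])

# Single condition: 'high' where education > 3, otherwise 'low'.
level = np.where(education > 3, "high", "low")

# Nested call: the inner np.where supplies the values for the "else" branch.
level3 = np.where(education > 3, "high",
                  np.where(education > 1, "mid", "low"))

print(level.tolist())
print(level3.tolist())
```

Note that np.where is not lazy: both x and y are computed for every element first, and the condition only decides which one is kept.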
Using np.vectorize()
def judge_edu_2(col):
    # Same rule as judge_edu, but applied to a single value instead of a row.
    if col > 3:
        return 'high'
    else:
        return col

vectfunc = np.vectorize(judge_edu_2)
%timeit data['judge_edu'] = vectfunc(data['education'])
out:
52.3 ms ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Multiple conditions: np.select()
apply()
def judge_age(row):
    # Map an age to a group; the first branch that matches wins.
    if row['age'] > 60:
        return 'old'
    elif row['age'] > 40:
        return 'mid'
    elif row['age'] > 20:
        return 'young'
    elif row['age'] > 10:
        return 'teen'
    else:
        return 'child'

%timeit data['judge_age'] = data.apply(judge_age,axis=1)
out:
2.26 s ± 72.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.where()
%timeit data['judge_age_2'] = np.where(data['age']>60,'old',\
    np.where(data['age']>40,'mid',\
    np.where(data['age']>20,'young',\
    np.where(data['age']>10,'teen','child'))))
out:
17.9 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.select()
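np.select(condlist, choicelist, default=0) is loosely similar to the CHOOSE function in Excel: for each element it returns the choice paired with the first condition that is true, or default when no condition is true.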
conditions = [data['age']>60,
data['age']>40,
data['age']>20,
data['age']>10]
choices = ['old','mid','young','teen']
%timeit data['judge_age_3'] = np.select(conditions,choices,default='child')
out:
13.4 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
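Because np.select takes the first matching condition, the order of the condition list matters. The sketch below illustrates this on a handful of made-up ages (not rows from the dataset), using the same thresholds as above.

```python
import numpy as np

# Hypothetical ages, one per group.
age = np.array([72, 45, 33, 15, 6])

conditions = [age > 60, age > 40, age > 20, age > 10]
choices = ["old", "mid", "young", "teen"]

# 72 satisfies all four conditions, but only the first match is used.
group = np.select(conditions, choices, default="child")

# With the lists reversed, 'age > 10' is tested first and shadows the rest.
reversed_group = np.select(conditions[::-1], choices[::-1], default="child")

print(group.tolist())
print(reversed_group.tolist())
```

Listing the strictest condition first, as the article does, reproduces the if/elif chain in judge_age.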
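Summary: for this kind of conditional computation, the vectorized NumPy functions are far faster than the pandas apply method. In the multi-condition test above, np.select (13.4 ms) beat nested np.where (17.9 ms), and both were more than 100 times faster than apply (2.26 s), so the speed ranking is np.select > np.where > apply.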
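For readers without the CSV file, the comparison can be roughly reproduced outside IPython with the standard timeit module. This is only a sketch under stated assumptions: the DataFrame is synthetic (random ages, 2,000 rows), the helper names with_apply, with_where and with_select are hypothetical, and the timings will differ from the figures above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: 2,000 random ages from 0 to 99.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(0, 100, size=2_000)})


def judge_age(row):
    if row["age"] > 60:
        return "old"
    elif row["age"] > 40:
        return "mid"
    elif row["age"] > 20:
        return "young"
    elif row["age"] > 10:
        return "teen"
    return "child"


def with_apply():
    return df.apply(judge_age, axis=1).tolist()


def with_where():
    age = df["age"]
    return np.where(age > 60, "old",
           np.where(age > 40, "mid",
           np.where(age > 20, "young",
           np.where(age > 10, "teen", "child")))).tolist()


def with_select():
    age = df["age"]
    conditions = [age > 60, age > 40, age > 20, age > 10]
    choices = ["old", "mid", "young", "teen"]
    return np.select(conditions, choices, default="child").tolist()


# All three approaches should label every row identically.
baseline = with_apply()
same_where = with_where() == baseline
same_select = with_select() == baseline
print("results match:", same_where and same_select)

# Average over a few calls; absolute numbers depend on the machine.
for name, func in [("apply", with_apply),
                   ("np.where", with_where),
                   ("np.select", with_select)]:
    seconds = timeit.timeit(func, number=3) / 3
    print(f"{name}: {seconds * 1000:.2f} ms per call")
```

Checking that the outputs match before timing guards against comparing the speed of functions that do not actually compute the same thing.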