美文网首页
使用Pandas和Python探索数据集3

使用Pandas和Python探索数据集3

作者: python测试开发 | 来源:发表于2020-02-25 15:29 被阅读0次

数据清洗

您可能会惊讶地发现本节太晚了!通常,在继续进行更复杂的分析之前,您会仔细查看数据集以解决所有问题。但是,在本教程中,您将依靠在上一节中学到的技术来清理数据集。

缺失值

图片.png

当您使用检查nba数据集时nba.info(),您会发现它非常整洁。只有列notes的大多数行都包含空值。

>>> rows_without_missing_data = nba.dropna()
>>> rows_without_missing_data.shape
(5424, 23)

>>> data_without_missing_columns = nba.dropna(axis=1)
>>> data_without_missing_columns.shape
(126314, 22)

>>> data_with_default_notes = nba.copy()
>>> data_with_default_notes["notes"].fillna(
...     value="no notes at all",
...     inplace=True
... )
>>> data_with_default_notes["notes"].describe()
count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object

>>> nba.describe()
>>> nba[nba["pts"] == 0]

>>> nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
True
>>> nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty
True

合并多个数据集

>>> further_city_data = pd.DataFrame(
...     {"revenue": [7000, 3400], "employee_count":[2, 2]},
...     index=["New York", "Barcelona"]
... )

>>> all_city_data = pd.concat([city_data, further_city_data], sort=False)
>>> all_city_data
Amsterdam   4200    5.0
Tokyo       6500    8.0
Toronto     8000    NaN
New York    7000    2.0
Barcelona   3400    2.0

>>> city_countries = pd.DataFrame({
...     "country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
...     "capital": [1, 1, 0, 0, 0]},
...     index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
... )
>>> cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
>>> cities
           revenue  employee_count  country  capital
Amsterdam   4200.0             5.0  Holland      1.0
Tokyo       6500.0             8.0    Japan      1.0
Toronto     8000.0             NaN   Canada      0.0
New York    7000.0             2.0      NaN      NaN
Barcelona   3400.0             2.0    Spain      0.0
Rotterdam      NaN             NaN  Holland      0.0

>>> pd.concat([all_city_data, city_countries], axis=1, join="inner")
           revenue  employee_count  country  capital
Amsterdam     4200             5.0  Holland        1
Tokyo         6500             8.0    Japan        1
Toronto       8000             NaN   Canada        0
Barcelona     3400             2.0    Spain        0

>>> countries = pd.DataFrame({
...     "population_millions": [17, 127, 37],
...     "continent": ["Europe", "Asia", "North America"]
... }, index= ["Holland", "Japan", "Canada"])
>>> pd.merge(cities, countries, left_on="country", right_index=True)

>>> pd.merge(
...     cities,
...     countries,
...     left_on="country",
...     right_index=True,
...     how="left"
... )

数据可视化

>>> nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()
>>> nba["fran_id"].value_counts().head(10).plot(kind="bar")
>>> nba[
...     (nba["fran_id"] == "Heat") &
...     (nba["year_id"] == 2013)
... ]["game_result"].value_counts().plot(kind="pie")

https://www.somebits.com/~nelson/pandas-multiindex-slice-demo.html

相关文章

网友评论

      本文标题:使用Pandas和Python探索数据集3

      本文链接:https://www.haomeiwen.com/subject/ymltchtx.html