<div class="tip inlineBlock error"> ipynb format: [6.pandas 数据统计函数.html](http://type.zimopy.com/usr/uploads/2022/12/1714419953.html) </div>

The Markdown version follows.

# pandas statistical functions

1. Summary statistics (numeric columns)
2. Unique values and value counts
3. Correlation coefficient and covariance

```python
import pandas as pd
```

```python
fpath = "../datas/beijing_tianqi/beijing_tianqi_2018.csv"
df = pd.read_csv(fpath)
df.head(3)
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>ymd</th> <th>bWendu</th> <th>yWendu</th> <th>tianqi</th> <th>fengxiang</th> <th>fengli</th> <th>aqi</th> <th>aqiInfo</th> <th>aqiLevel</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>2018-01-01</td> <td>3℃</td> <td>-6℃</td> <td>晴~多云</td> <td>东北风</td> <td>1-2级</td> <td>59</td> <td>良</td> <td>2</td> </tr>
    <tr> <th>1</th> <td>2018-01-02</td> <td>2℃</td> <td>-5℃</td> <td>阴~多云</td> <td>东北风</td> <td>1-2级</td> <td>49</td> <td>优</td> <td>1</td> </tr>
    <tr> <th>2</th> <td>2018-01-03</td> <td>2℃</td> <td>-5℃</td> <td>多云</td> <td>北风</td> <td>1-2级</td> <td>28</td> <td>优</td> <td>1</td> </tr>
  </tbody>
</table>
</div>

```python
# Strip the "℃" suffix and convert to integers. Plain column assignment is used
# because df.loc[:, col] = ... may keep the old object dtype in recent pandas versions.
df["bWendu"] = df["bWendu"].str.replace("℃", "").astype("int32")
df["yWendu"] = df["yWendu"].str.replace("℃", "").astype("int32")
```

```python
df.head(3)
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>ymd</th> <th>bWendu</th> <th>yWendu</th> <th>tianqi</th> <th>fengxiang</th> <th>fengli</th> <th>aqi</th> <th>aqiInfo</th> <th>aqiLevel</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>2018-01-01</td> <td>3</td> <td>-6</td> <td>晴~多云</td> <td>东北风</td> <td>1-2级</td> <td>59</td> <td>良</td> <td>2</td> </tr>
    <tr>
<th>1</th> <td>2018-01-02</td> <td>2</td> <td>-5</td> <td>阴~多云</td> <td>东北风</td> <td>1-2级</td> <td>49</td> <td>优</td> <td>1</td> </tr>
    <tr> <th>2</th> <td>2018-01-03</td> <td>2</td> <td>-5</td> <td>多云</td> <td>北风</td> <td>1-2级</td> <td>28</td> <td>优</td> <td>1</td> </tr>
  </tbody>
</table>
</div>

# 1. Summary statistics

## Summary statistics for all numeric columns

```python
# To round the result to a fixed number of decimal places:
# df.describe().round(2)
df.describe()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>bWendu</th> <th>yWendu</th> <th>aqi</th> <th>aqiLevel</th> </tr>
  </thead>
  <tbody>
    <tr> <th>count</th> <td>365.000000</td> <td>365.000000</td> <td>365.000000</td> <td>365.000000</td> </tr>
    <tr> <th>mean</th> <td>18.665753</td> <td>8.358904</td> <td>82.183562</td> <td>2.090411</td> </tr>
    <tr> <th>std</th> <td>11.858046</td> <td>11.755053</td> <td>51.936159</td> <td>1.029798</td> </tr>
    <tr> <th>min</th> <td>-5.000000</td> <td>-12.000000</td> <td>21.000000</td> <td>1.000000</td> </tr>
    <tr> <th>25%</th> <td>8.000000</td> <td>-3.000000</td> <td>46.000000</td> <td>1.000000</td> </tr>
    <tr> <th>50%</th> <td>21.000000</td> <td>8.000000</td> <td>69.000000</td> <td>2.000000</td> </tr>
    <tr> <th>75%</th> <td>29.000000</td> <td>19.000000</td> <td>104.000000</td> <td>3.000000</td> </tr>
    <tr> <th>max</th> <td>38.000000</td> <td>27.000000</td> <td>387.000000</td> <td>6.000000</td> </tr>
  </tbody>
</table>
</div>

## Statistics for a single Series

```python
df["bWendu"].mean()
```

    18.665753424657535

## Maximum of the daily high temperature

```python
df["bWendu"].max()
```

    38

## Minimum of the daily high temperature

```python
df["bWendu"].min()
```

    -5

# 2. Unique values and value counts

## Unique values

**Returns the distinct values in a column.** This is generally used on **enumerated** or categorical columns rather than on numeric ones.

```python
df["fengxiang"].unique()
```

    array(['东北风', '北风', '西北风', '西南风', '南风', '东南风', '东风', '西风'], dtype=object)

```python
df["tianqi"].unique()
```

    array(['晴~多云', '阴~多云', '多云', '阴', '多云~晴', '多云~阴', '晴', '阴~小雪',
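           '小雪~多云', '小雨~阴', '小雨~雨夹雪', '多云~小雨', '小雨~多云', '大雨~小雨', '小雨',
           '阴~小雨', '多云~雷阵雨', '雷阵雨~多云', '阴~雷阵雨', '雷阵雨', '雷阵雨~大雨',
           '中雨~雷阵雨', '小雨~大雨', '暴雨~雷阵雨', '雷阵雨~中雨', '小雨~雷阵雨', '雷阵雨~阴',
           '中雨~小雨', '小雨~中雨', '雾~多云', '霾'], dtype=object)

```python
df["fengli"].unique()
```

    array(['1-2级', '4-5级', '3-4级', '2级', '1级', '3级'], dtype=object)

## Value counts

Finds the distinct values and counts how often each one occurs, sorted in descending order of count.

```python
df["fengxiang"].value_counts()
```

    南风     92
    西南风    64
    北风     54
    西北风    51
    东南风    46
    东北风    38
    东风     14
    西风      6
    Name: fengxiang, dtype: int64

```python
df["tianqi"].value_counts()
```

    晴         101
    多云         95
    多云~晴       40
    晴~多云       34
    多云~雷阵雨     14
    多云~阴       10
    雷阵雨         8
    小雨~多云       8
    阴~多云        8
    雷阵雨~多云      7
    小雨          6
    多云~小雨       5
    阴           4
    雷阵雨~中雨      4
    中雨~小雨       2
    中雨~雷阵雨      2
    阴~小雨        2
    霾           2
    大雨~小雨       1
    阴~雷阵雨       1
    小雨~雨夹雪      1
    雷阵雨~大雨      1
    小雨~阴        1
    小雨~大雨       1
    暴雨~雷阵雨      1
    小雪~多云       1
    小雨~雷阵雨      1
    雷阵雨~阴       1
    阴~小雪        1
    小雨~中雨       1
    雾~多云        1
    Name: tianqi, dtype: int64

```python
df["fengli"].value_counts()
```

    1-2级    236
    3-4级     68
    1级      21
    4-5级     20
    2级      13
    3级       7
    Name: fengli, dtype: int64

# 3. Correlation coefficient and covariance

Uses:

1. Do two stocks rise and fall together? How strongly? Is the relationship positive or negative?
2. Which factors are positively or negatively correlated with fluctuations in a product's sales, and how strongly?

From Zhihu, for two variables X and Y:

1. Covariance: **measures how far two variables move in the same or opposite direction**. A positive covariance means X and Y change in the same direction, and the larger it is, the stronger that co-movement. A negative covariance means X and Y move in opposite directions, and the smaller (more negative) it is, the stronger the opposite movement.
2. 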
   Correlation coefficient: **measures the degree of similarity**. A correlation coefficient of 1 means the two variables change with the greatest possible positive similarity, and a coefficient of -1 means they change with the greatest possible opposite similarity. Unlike covariance, it is normalized to the range from -1 to 1, so it does not depend on the units of X and Y.

## Covariance matrix

```python
# Restrict to numeric columns; recent pandas versions raise an error on text columns.
df.select_dtypes(include="number").cov()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>bWendu</th> <th>yWendu</th> <th>aqi</th> <th>aqiLevel</th> </tr>
  </thead>
  <tbody>
    <tr> <th>bWendu</th> <td>140.613247</td> <td>135.529633</td> <td>47.462622</td> <td>0.879204</td> </tr>
    <tr> <th>yWendu</th> <td>135.529633</td> <td>138.181274</td> <td>16.186685</td> <td>0.264165</td> </tr>
    <tr> <th>aqi</th> <td>47.462622</td> <td>16.186685</td> <td>2697.364564</td> <td>50.749842</td> </tr>
    <tr> <th>aqiLevel</th> <td>0.879204</td> <td>0.264165</td> <td>50.749842</td> <td>1.060485</td> </tr>
  </tbody>
</table>
</div>

## Correlation matrix

```python
# Restrict to numeric columns; recent pandas versions raise an error on text columns.
df.select_dtypes(include="number").corr()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>bWendu</th> <th>yWendu</th> <th>aqi</th> <th>aqiLevel</th> </tr>
  </thead>
  <tbody>
    <tr> <th>bWendu</th> <td>1.000000</td> <td>0.972292</td> <td>0.077067</td> <td>0.071999</td> </tr>
    <tr> <th>yWendu</th> <td>0.972292</td> <td>1.000000</td> <td>0.026513</td> <td>0.021822</td> </tr>
    <tr> <th>aqi</th> <td>0.077067</td> <td>0.026513</td> <td>1.000000</td> <td>0.948883</td> </tr>
    <tr> <th>aqiLevel</th> <td>0.071999</td> <td>0.021822</td> <td>0.948883</td> <td>1.000000</td> </tr>
  </tbody>
</table>
</div>

## Correlation between air quality and the daily high temperature

```python
df["aqi"].corr(df["bWendu"])
```

    0.07706705916811069

```python
df["aqi"].corr(df["yWendu"])
```

    0.02651328267296889

## Correlation between air quality and the daily temperature range

```python
df["aqi"].corr(df["bWendu"]-df["yWendu"])
```

    0.2165225757638205

The dataset used here is provided at the end of part 4 of this series.

Last modified: December 15, 2022

© Reposting permitted with proper attribution
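
To make the link between the two measures in section 3 concrete, here is a minimal sketch on a small made-up data set. The values and the column names `x` and `y` are assumptions for illustration only and are not taken from the weather data. It checks that the Pearson correlation coefficient equals the covariance divided by the product of the two standard deviations, and that rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
import pandas as pd

# Made-up sample values for illustration only (not from the weather data set).
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.5, 5.0, 8.5, 10.0],
})

# Sample covariance and sample standard deviations (pandas divides by n - 1 by default).
cov_xy = demo["x"].cov(demo["y"])
std_x = demo["x"].std()
std_y = demo["y"].std()

# Pearson correlation computed by hand from the covariance.
r_manual = cov_xy / (std_x * std_y)

# The same quantity computed directly by pandas (Pearson is the default method).
r_pandas = demo["x"].corr(demo["y"])

# Rescaling x (for example, a change of units) scales the covariance
# but should leave the correlation unchanged.
cov_scaled = (demo["x"] * 100).cov(demo["y"])
r_scaled = (demo["x"] * 100).corr(demo["y"])

print("manual r:", r_manual, "pandas r:", r_pandas)
print("covariance:", cov_xy, "covariance after rescaling x:", cov_scaled)
print("r after rescaling x:", r_scaled)
```

This is why the correlation matrix is usually easier to read than the covariance matrix when the columns are on very different scales, as `aqi` and `aqiLevel` are here.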
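
As a supplement to section 2, the sketch below uses a small made-up Series to show how `unique()` and `value_counts()` relate. The wind-direction values are illustrative and are not the real data. The number of distinct values matches the number of rows returned by `value_counts()`, and the counts add up to the length of the Series. `nunique()` and `value_counts(normalize=True)` are standard pandas methods that return the distinct count and the relative frequencies.

```python
import pandas as pd

# Made-up wind directions for illustration only (not the real data set).
wind = pd.Series(["南风", "北风", "南风", "西南风", "南风", "北风"])

# Distinct values, in order of first appearance.
distinct = wind.unique()

# Number of distinct values.
n_distinct = wind.nunique()

# Occurrences of each value, sorted in descending order of count.
counts = wind.value_counts()

# Relative frequencies instead of raw counts.
shares = wind.value_counts(normalize=True)

print(distinct)
print(n_distinct)
print(counts)
print(shares)
```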