Python Pandas – Aggregation Function & Groupby

Test environment: CentOS 8 (virtual machine)

Reference – https://medium.com/@b89202027_37759/%E5%AF%A6%E7%94%A8%E4%BD%86%E5%B8%B8%E5%BF%98%E8%A8%98%E7%9A%84pandas-dataframe%E5%B8%B8%E7%94%A8%E6%8C%87%E4%BB%A4-1-976f48eb2bd5

To group rows by a specific column and apply an aggregation function to each group, use the groupby() function.

Install the required module

[root@localhost ~]# pip install pandas

Import the module

[root@localhost ~]# python3
Python 3.6.8 (default, Sep 10 2021, 09:13:53)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd

Most data fetched from the web is in JSON format, so the example below uses a JSON-style nested dict.

df = pd.DataFrame({
    "Class":
    {
        "Ben": 'A1',
        "Alex": 'A1',
        "Jeff": 'B1',
        "Dexter": 'B1'
    },
    "Chinese":
    {
        "Ben": 68,
        "Alex": 86,
        "Jeff": 57,
        "Dexter": 95
    },
    "English":
    {
        "Ben": 63,
        "Alex": 92,
        "Jeff": 83,
        "Dexter": 89
    },
    "Math":
    {
        "Ben": 65,
        "Alex": 89,
        "Jeff": 77,
        "Dexter": 100
    }
})

>>> df
       Class  Chinese  English  Math
Ben       A1       68       63    65
Alex      A1       86       92    89
Jeff      B1       57       83    77
Dexter    B1       95       89   100
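If the data arrives as actual JSON text rather than a Python dict, one way to get the same kind of DataFrame is to parse it first. This is a minimal sketch: the `raw` string is made up for illustration and shortened to two students and two columns.

```python
import json

import pandas as pd

# A JSON string in the same {column: {row label: value}} layout as the dict above.
# The string is a made-up, shortened sample, not data from a real API.
raw = '{"Class": {"Ben": "A1", "Alex": "A1"}, "Math": {"Ben": 65, "Alex": 89}}'

parsed = json.loads(raw)        # JSON text -> nested Python dict
df_json = pd.DataFrame(parsed)  # outer keys become columns, inner keys become the index

print(df_json)
```

The outer keys become the column names and the inner keys become the row labels, just as in the dict literal above.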

Aggregation Function

Pandas aggregation functions

count()
Returns the number of non-null values for each group

>>> df[['Chinese','English','Math']].count()
Chinese    4
English    4
Math       4
dtype: int64

size()
Returns the number of rows in each group, including rows with missing values
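The difference between size() and count() only shows up when values are missing. The sketch below assumes, purely for illustration, that Jeff's Math score is missing; `df_missing` is a hypothetical frame, not the `df` used elsewhere in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample data: Jeff's Math score is assumed missing.
df_missing = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        "Math": [65, 89, np.nan, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)

group_sizes = df_missing.groupby("Class").size()            # rows per group, NaN included
group_counts = df_missing.groupby("Class")["Math"].count()  # non-null values per group

print(group_sizes)
print(group_counts)
```

Here group B1 has a size of 2 but a Math count of 1, because count() skips the missing value.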

sum()
Returns total sum for each group

>>> df[['Chinese','English','Math']].sum()
Chinese    306
English    327
Math       331
dtype: int64

mean()
Returns the mean for each group

>>> df[['Chinese','English','Math']].mean()
Chinese    76.50
English    81.75
Math       82.75
dtype: float64

average()
Not a pandas method. Use mean() instead, or numpy.average() when a weighted average is needed.

std()
Returns standard deviation for each group

var()
Returns variance for each group

sem()
Returns standard error of the mean for each group
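std(), var() and sem() have no example above, so here is a minimal sketch on the same scores. The frame is rebuilt with plain lists so the block runs on its own; the comments describe the pandas defaults (sample statistics, ddof=1).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        "Chinese": [68, 86, 57, 95],
        "English": [63, 92, 83, 89],
        "Math": [65, 89, 77, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)
scores = df[["Chinese", "English", "Math"]]

std = scores.std()  # sample standard deviation (ddof=1 by default)
var = scores.var()  # sample variance, equal to std squared
sem = scores.sem()  # standard error of the mean, equal to std / sqrt(n)

print(std)
print(var)
print(sem)
```

With four rows per column, sem() is simply std() divided by 2.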

describe()
Returns summary statistics for the numeric columns.
```
>>> df.describe()
         Chinese    English        Math
count   4.000000   4.000000    4.000000
mean   76.500000  81.750000   82.750000
std    17.175564  13.047988   15.107945
min    57.000000  63.000000   65.000000
25%    65.250000  78.000000   74.000000
50%    77.000000  86.000000   83.000000
75%    88.250000  89.750000   91.750000
max    95.000000  92.000000  100.000000
```
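Where:
- mean: the average of the values
- count: the number of values
- min: the smallest value
- max: the largest value
- std: the sample standard deviation, a measure of how far the values spread around the mean
- 25%, 50%, 75%: percentiles. A common textbook method locates them with the following formula; it does not reproduce the values above because pandas interpolates linearly by default.
```
L = n (sample size) * P% (25%, 50%, 75%)
```
  Case 1: if L is an integer, take the average of the L-th and (L+1)-th values.
  Case 2: if L is not an integer, round L up to the next integer and take that value.

pandas instead takes the 0-based position (n - 1) * P in the sorted values and interpolates between the two neighbouring values. For Chinese (57, 68, 86, 95) at 25%, the position is 3 * 0.25 = 0.75, so the result is 57 + 0.75 * (68 - 57) = 65.25.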
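As a cross-check on the percentile rows of describe(), the sketch below recomputes them for the Chinese column by linear interpolation and compares them with Series.quantile(). `linear_percentile` is a hypothetical helper written for this post, not a pandas function.

```python
import pandas as pd

chinese = pd.Series([68, 86, 57, 95], index=["Ben", "Alex", "Jeff", "Dexter"])


def linear_percentile(values, p):
    """Percentile by linear interpolation at the 0-based position (n - 1) * p."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p
    lower = int(position)                     # index of the value at or below the position
    upper = min(lower + 1, len(ordered) - 1)  # next value, clamped to the last index
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


for p in (0.25, 0.50, 0.75):
    # The hand-rolled value and the pandas value should agree.
    print(p, linear_percentile(chinese, p), chinese.quantile(p))
```

Both columns of the printout should match the 25%, 50% and 75% rows that describe() reports for Chinese.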

min()
Returns minimum value for each group

>>> df[['Chinese','English','Math']].min()
Chinese    57
English    63
Math       65
dtype: int64

max()
Returns maximum value for each group

>>> df[['Chinese','English','Math']].max()
Chinese     95
English     92
Math       100
dtype: int64

first()
Returns first value for each group
last()
Returns last value for each group
nth()
Returns nth value for each group
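first(), last() and nth() are most useful on grouped data. A minimal sketch on the same scores follows (the frame is rebuilt so the block runs on its own). Note that the shape of the nth() result depends on the pandas version: older releases index it by the group key, while pandas 2.x returns the matching original rows.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        "Chinese": [68, 86, 57, 95],
        "English": [63, 92, 83, 89],
        "Math": [65, 89, 77, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)
grouped = df.groupby("Class")

first_rows = grouped.first()  # first non-null value of each column per group
last_rows = grouped.last()    # last non-null value of each column per group
second_rows = grouped.nth(1)  # second row of each group (positions are 0-based)

print(first_rows)
print(last_rows)
print(second_rows)
```

In this data each class has exactly two students, so last() and nth(1) pick the same scores.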

Groupby + Aggregation Function

Aggregation functions are usually combined with groupby, which splits the rows into groups by a given column.

>>> df.groupby('Class')[['Chinese','English','Math']].mean()
       Chinese  English  Math
Class
A1        77.0     77.5  77.0
B1        76.0     86.0  88.5

To group by multiple columns, pass a list of column names:

groupby(['col1','col2','col3'])
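The sample data has only one categorical column, so the sketch below derives a second one to show a multi-column groupby. The `MathResult` column and the pass mark of 70 are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        "Chinese": [68, 86, 57, 95],
        "English": [63, 92, 83, 89],
        "Math": [65, 89, 77, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)

# Hypothetical second grouping column: "pass" if Math >= 70 (assumed threshold), else "fail".
math_result = df["Math"].map(lambda score: "pass" if score >= 70 else "fail")
with_result = df.assign(MathResult=math_result)

# Grouping by two columns yields one row per (Class, MathResult) combination present in the data.
means = with_result.groupby(["Class", "MathResult"])[["Chinese", "English"]].mean()

print(means)
```

The result is indexed by a MultiIndex, so a single group can be read with a tuple key such as `means.loc[("B1", "pass")]`.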

You can also chain .sort_values to sort the result in ascending or descending order.

>>> df.groupby('Class')[['Chinese','English','Math']].mean().sort_values(by=['Chinese', 'English'] , ascending=False)
       Chinese  English  Math
Class
A1        77.0     77.5  77.0
B1        76.0     86.0  88.5
>>> df.groupby('Class')[['Chinese','English','Math']].mean().sort_values(by=['Chinese', 'English'] , ascending=True)
       Chinese  English  Math
Class
B1        76.0     86.0  88.5
A1        77.0     77.5  77.0
