Test environment: CentOS 8 (virtual machine)
Reference: https://medium.com/@b89202027_37759/%E5%AF%A6%E7%94%A8%E4%BD%86%E5%B8%B8%E5%BF%98%E8%A8%98%E7%9A%84pandas-dataframe%E5%B8%B8%E7%94%A8%E6%8C%87%E4%BB%A4-1-976f48eb2bd5
To group rows by a specific column and apply an aggregation function to each group, use the groupby() method.
Install the required module
[root@localhost ~]# pip install pandas
Import the module
[root@localhost ~]# python3
Python 3.6.8 (default, Sep 10 2021, 09:13:53)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
Data fetched from the web is mostly in JSON format, so the example below uses a JSON-like nested dictionary.
df = pd.DataFrame({
    "Class":   {"Ben": 'A1', "Alex": 'A1', "Jeff": 'B1', "Dexter": 'B1'},
    "Chinese": {"Ben": 68, "Alex": 86, "Jeff": 57, "Dexter": 95},
    "English": {"Ben": 63, "Alex": 92, "Jeff": 83, "Dexter": 89},
    "Math":    {"Ben": 65, "Alex": 89, "Jeff": 77, "Dexter": 100}
})
>>> df
       Class  Chinese  English  Math
Ben       A1       68       63    65
Alex      A1       86       92    89
Jeff      B1       57       83    77
Dexter    B1       95       89   100
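When the data really does arrive as JSON text, one possible way to get the same kind of DataFrame is to parse it with the standard json module first. The sketch below is only an illustration: the raw string is a hypothetical payload (two students only, to keep it short) and json_df is a made-up variable name.

```python
import json

import pandas as pd

# Hypothetical JSON payload, shaped like the dictionary used above.
raw = """
{
  "Class":   {"Ben": "A1", "Jeff": "B1"},
  "Chinese": {"Ben": 68, "Jeff": 57},
  "English": {"Ben": 63, "Jeff": 83},
  "Math":    {"Ben": 65, "Jeff": 77}
}
"""

# json.loads turns the JSON text into nested dicts;
# outer keys become columns and inner keys become the row index.
json_df = pd.DataFrame(json.loads(raw))
print(json_df)
```

The nested layout matters here: because the outer keys are column names, no extra orientation argument is needed.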
Aggregation Function
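Aggregation functions in pandas. The examples below call them directly on the DataFrame, so each returns one value per column; called after groupby(), they return one value per group.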
- count()
Returns count for each group
>>> df[['Chinese','English','Math']].count()
Chinese    4
English    4
Math       4
dtype: int64
- size()
Returns size for each group
- sum()
Returns total sum for each group
>>> df[['Chinese','English','Math']].sum()
Chinese    306
English    327
Math       331
- mean()
Returns mean for each group.
>>> df[['Chinese','English','Math']].mean()
Chinese    76.50
English    81.75
Math       82.75
dtype: float64
- average()
pandas DataFrame and GroupBy objects have no average() method; use mean() instead. For a weighted average, numpy.average() is available.
- std()
Returns standard deviation for each group
- var()
Returns variance for each group
- sem()
Returns standard error of the mean for each group
- describe()
Returns summary statistics for the numeric columns.
>>> df.describe()
         Chinese    English        Math
count   4.000000   4.000000    4.000000
mean   76.500000  81.750000   82.750000
std    17.175564  13.047988   15.107945
min    57.000000  63.000000   65.000000
25%    65.250000  78.000000   74.000000
50%    77.000000  86.000000   83.000000
75%    88.250000  89.750000   91.750000
max    95.000000  92.000000  100.000000
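In this output:
- mean: the average of the data
- count: the number of values
- min: the smallest value in the data
- max: the largest value in the data
- std: the standard deviation, which measures how widely the values are spread around the mean (pandas computes the sample standard deviation, ddof=1)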
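- 25%, 50%, 75%: the percentiles. A common textbook rule computes them as follows:
L = n (sample size) * P% (25%, 50%, 75%)
Case 1: if L is an integer, take the average of the L-th and (L+1)-th sorted values.
Case 2: if L is not an integer, round L up to the next integer and take the sorted value at that position.
This rule does not reproduce the describe() output, because pandas by default interpolates linearly between the two nearest sorted values at the 0-based position (n - 1) * P%. For Chinese (57, 68, 86, 95), the 25% position is 3 * 0.25 = 0.75, which gives 57 + 0.75 * (68 - 57) = 65.25, whereas the textbook rule gives (57 + 68) / 2 = 62.5.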
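The mismatch between the rule above and describe() can be reproduced by hand. The sketch below uses only the standard library; textbook_percentile and linear_percentile are hypothetical helper names written for this illustration, not pandas functions. The second one assumes the default linear interpolation, which agrees with the describe() values shown earlier.

```python
import math


def textbook_percentile(values, p):
    """Rule from the text: L = n * p, using 1-based positions (0 < p < 1)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    data = sorted(values)
    pos = len(data) * p
    if pos == int(pos):
        k = int(pos)
        # Case 1: average the L-th and (L+1)-th sorted values.
        return (data[k - 1] + data[k]) / 2
    # Case 2: round L up and take the value at that position.
    return data[math.ceil(pos) - 1]


def linear_percentile(values, p):
    """Linear interpolation at the 0-based position (n - 1) * p."""
    data = sorted(values)
    pos = (len(data) - 1) * p
    lower = math.floor(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)


chinese = [68, 86, 57, 95]
for p in (0.25, 0.5, 0.75):
    print(p, textbook_percentile(chinese, p), linear_percentile(chinese, p))
# 0.25 62.5 65.25
# 0.5 77.0 77.0
# 0.75 90.5 88.25
```

Only the interpolated column matches the 25%, 50% and 75% rows for Chinese in the describe() output.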
- min()
Returns minimum value for each group
>>> df[['Chinese','English','Math']].min()
Chinese    57
English    63
Math       65
dtype: int64
- max()
Returns maximum value for each group
>>> df[['Chinese','English','Math']].max()
Chinese     95
English     92
Math       100
dtype: int64
- first()
Returns first value for each group
- last()
Returns last value for each group
- nth()
Returns nth value for each group
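Several functions in the list above (size, std, var, sem, first, last, nth) come without an example. The following is a minimal sketch of how they might be called per group, assuming the same student data; the DataFrame is rebuilt here so the block runs on its own, and the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        "Chinese": [68, 86, 57, 95],
        "English": [63, 92, 83, 89],
        "Math": [65, 89, 77, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)

# Select the numeric columns once and reuse the grouped object.
scores = df.groupby("Class")[["Chinese", "English", "Math"]]

group_size = df.groupby("Class").size()  # number of rows in each group
group_std = scores.std()                 # sample standard deviation (ddof=1)
group_var = scores.var()                 # sample variance (ddof=1)
group_sem = scores.sem()                 # std divided by sqrt(group size)
first_rows = scores.first()              # first value per column in each group
last_rows = scores.last()                # last value per column in each group
second_rows = scores.nth(1)              # second row of each group (0-based)

print(group_size)
print(group_std)
print(first_rows)
print(second_rows)
```

The index of the nth() result depends on the pandas version: older releases label the rows by group key, while pandas 2.x keeps the original row labels.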
Groupby + Aggregation Function
Aggregation functions are usually applied after groupby(), which splits the rows into groups by the specified column.
>>> df.groupby('Class')[['Chinese','English','Math']].mean()
       Chinese  English  Math
Class
A1        77.0     77.5  77.0
B1        76.0     86.0  88.5
To group by several columns, pass a list of column names:
groupby(['col1','col2','col3'])
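As a rough illustration of grouping by more than one column, the sketch below adds a hypothetical "Elective" column, which is not part of the original data, to the student table. The values in that column are invented purely so there is a second key to group on.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        # Hypothetical column, added only for this example.
        "Elective": ["Art", "Music", "Art", "Art"],
        "Chinese": [68, 86, 57, 95],
        "English": [63, 92, 83, 89],
        "Math": [65, 89, 77, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)

# Each distinct (Class, Elective) pair becomes one group;
# the result is indexed by a two-level MultiIndex.
by_class_and_elective = (
    df.groupby(["Class", "Elective"])[["Chinese", "English", "Math"]].mean()
)
print(by_class_and_elective)
```

A single row of the result can then be looked up with a tuple key, for example by_class_and_elective.loc[("B1", "Art")].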
You can also chain .sort_values() to sort the result in ascending or descending order.
>>> df.groupby('Class')[['Chinese','English','Math']].mean().sort_values(by=['Chinese', 'English'], ascending=False)
       Chinese  English  Math
Class
A1        77.0     77.5  77.0
B1        76.0     86.0  88.5
>>> df.groupby('Class')[['Chinese','English','Math']].mean().sort_values(by=['Chinese', 'English'], ascending=True)
       Chinese  English  Math
Class
B1        76.0     86.0  88.5
A1        77.0     77.5  77.0