Test environment: CentOS 8 (virtual machine)
Reference: https://medium.com/@b89202027_37759/%E5%AF%A6%E7%94%A8%E4%BD%86%E5%B8%B8%E5%BF%98%E8%A8%98%E7%9A%84pandas-dataframe%E5%B8%B8%E7%94%A8%E6%8C%87%E4%BB%A4-1-976f48eb2bd5
To group rows by a specific column and apply an aggregation function to each group, use the groupby() function.
Install the required module
[root@localhost ~]# pip install pandas
Import the module
[root@localhost ~]# python3
Python 3.6.8 (default, Sep 10 2021, 09:13:53)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
Data fetched from the web is mostly in JSON format, so the example below uses a JSON-like nested dictionary.
df = pd.DataFrame({
    "Class": {"Ben": 'A1', "Alex": 'A1', "Jeff": 'B1', "Dexter": 'B1'},
    "Chinese": {"Ben": 68, "Alex": 86, "Jeff": 57, "Dexter": 95},
    "English": {"Ben": 63, "Alex": 92, "Jeff": 83, "Dexter": 89},
    "Math": {"Ben": 65, "Alex": 89, "Jeff": 77, "Dexter": 100}
})
>>> df
       Class  Chinese  English  Math
Ben       A1       68       63    65
Alex      A1       86       92    89
Jeff      B1       57       83    77
Dexter    B1       95       89   100
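Since web data usually arrives as JSON text rather than as a Python dict, the following is a minimal sketch of building the same table from a JSON string. The name raw_json and the inline string are assumptions for illustration; real data would come from a file or an HTTP response.

```python
import json

import pandas as pd

# Hypothetical JSON payload with the same content as the dict above.
raw_json = """
{
  "Class":   {"Ben": "A1", "Alex": "A1", "Jeff": "B1", "Dexter": "B1"},
  "Chinese": {"Ben": 68, "Alex": 86, "Jeff": 57, "Dexter": 95},
  "English": {"Ben": 63, "Alex": 92, "Jeff": 83, "Dexter": 89},
  "Math":    {"Ben": 65, "Alex": 89, "Jeff": 77, "Dexter": 100}
}
"""

# json.loads turns the text into nested dicts: the outer keys become
# columns and the inner keys become the row index.
df = pd.DataFrame(json.loads(raw_json))
print(df)
```

The outer keys map to columns because pandas treats a dict of dicts as column name to {row label: value}.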
Aggregation Function
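Pandas aggregation functions. The examples below call them on the whole DataFrame, so each returns one value per column; after groupby() they return one value per group.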
- count()
Returns count for each group
>>> df[['Chinese', 'English', 'Math']].count()
Chinese    4
English    4
Math       4
dtype: int64
- size()
Returns size for each group
- sum()
Returns total sum for each group
>>> df[['Chinese', 'English', 'Math']].sum()
Chinese    306
English    327
Math       331
dtype: int64
- mean()
Returns mean (arithmetic average) for each group
>>> df[['Chinese', 'English', 'Math']].mean()
Chinese    76.50
English    81.75
Math       82.75
dtype: float64
- average()
Not a pandas method: DataFrame and GroupBy objects have no average(), so use mean() instead (NumPy's np.average() is the function to reach for when a weighted average is needed)
- std()
Returns standard deviation for each group
- var()
Returns variance for each group
- sem()
Returns standard error of the mean for each group
- describe()
Returns summary statistics for the numeric columns.
>>> df.describe()
         Chinese    English        Math
count   4.000000   4.000000    4.000000
mean   76.500000  81.750000   82.750000
std    17.175564  13.047988   15.107945
min    57.000000  63.000000   65.000000
25%    65.250000  78.000000   74.000000
50%    77.000000  86.000000   83.000000
75%    88.250000  89.750000   91.750000
max    95.000000  92.000000  100.000000
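In this output:
- mean: the average of the values
- count: the number of values
- min: the smallest value
- max: the largest value
- std: the standard deviation, which measures how widely the values spread around the mean (pandas reports the sample standard deviation, dividing by n - 1)
- 25%, 50%, 75%: the percentiles. A common textbook rule is given below, but it does not reproduce the numbers above, because pandas by default interpolates linearly between the two nearest sorted values instead.
L = n × P%, where n is the sample size and P is 25, 50 or 75
Case 1: if L is an integer, take the average of the L-th and (L+1)-th sorted values.
Case 2: if L is not an integer, round L up to the next integer and take the sorted value at that position.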
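To make the mismatch concrete, here is a small sketch that computes the 25th percentile of the Chinese scores both ways. The helper names textbook_percentile and linear_percentile are made up for this illustration; the second one assumes the default linear interpolation at position (n - 1) × p, which is what Series.quantile uses unless told otherwise.

```python
import math

import pandas as pd

chinese = pd.Series([68, 86, 57, 95], name="Chinese")


def textbook_percentile(values, p):
    """Hypothetical helper: the L = n * p rule, for 0 < p < 1."""
    ordered = sorted(values)
    L = len(ordered) * p
    if L == int(L):
        L = int(L)
        # Case 1: average the L-th and (L+1)-th values (1-based ranks).
        return (ordered[L - 1] + ordered[L]) / 2
    # Case 2: round up and take the value at that 1-based rank.
    return ordered[math.ceil(L) - 1]


def linear_percentile(values, p):
    """Hypothetical helper: linear interpolation at position (n - 1) * p."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p          # 0-based fractional position
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


print(textbook_percentile(chinese, 0.25))  # 62.5
print(linear_percentile(chinese, 0.25))    # 65.25
print(chinese.quantile(0.25))              # 65.25, the value describe() shows
```

With only four values the two rules disagree noticeably; on larger samples the gap is usually much smaller.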
- min()
Returns minimum value for each group
>>> df[['Chinese', 'English', 'Math']].min()
Chinese    57
English    63
Math       65
dtype: int64
- max()
Returns maximum value for each group
>>> df[['Chinese', 'English', 'Math']].max()
Chinese     95
English     92
Math       100
dtype: int64
- first()
Returns first value for each group
- last()
Returns last value for each group
- nth()
Returns nth value for each group
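Several functions in the list above come without an example. The sketch below tries them on the same scores, grouped by Class so that "for each group" has a visible meaning. The table is rebuilt from lists with an explicit index purely to keep the block self-contained; the variable names are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        "Chinese": [68, 86, 57, 95],
        "English": [63, 92, 83, 89],
        "Math": [65, 89, 77, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)

grouped = df.groupby("Class")
scores = grouped[["Chinese", "English", "Math"]]

sizes = grouped.size()         # number of rows in each group
std = scores.std()             # sample standard deviation (ddof=1)
var = scores.var()             # sample variance (ddof=1)
sem = scores.sem()             # standard error of the mean: std / sqrt(n)
first_rows = scores.first()    # first value of each column per group
last_rows = scores.last()      # last value of each column per group
second_rows = scores.nth(1)    # second row (0-based position 1) per group

for name, result in [("size", sizes), ("std", std), ("var", var),
                     ("sem", sem), ("first", first_rows),
                     ("last", last_rows), ("nth(1)", second_rows)]:
    print(name)
    print(result)
    print()
```

The index of the nth() result may differ between pandas versions, so it is safer not to rely on its exact shape.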
Groupby + Aggregation Function
Aggregation functions are usually applied after groupby(), which splits the rows into groups by the specified column.
>>> df.groupby('Class')[['Chinese', 'English', 'Math']].mean()
       Chinese  English  Math
Class
A1        77.0     77.5  77.0
B1        76.0     86.0  88.5
To group by several columns, pass a list of column names:
groupby(['col1', 'col2', 'col3'])
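As a concrete instance of grouping by more than one column, the sketch below adds an "Elective" column to the sample scores. That column and its values are hypothetical and are not part of the original data; they exist only so there is a second key to group on.

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "Class": ["A1", "A1", "B1", "B1"],
        # Hypothetical second grouping key, invented for this example.
        "Elective": ["Art", "Music", "Art", "Art"],
        "Chinese": [68, 86, 57, 95],
        "English": [63, 92, 83, 89],
        "Math": [65, 89, 77, 100],
    },
    index=["Ben", "Alex", "Jeff", "Dexter"],
)

# Each distinct (Class, Elective) pair becomes one group, and the
# result is indexed by a MultiIndex made of both keys.
by_class_elective = (
    scores.groupby(["Class", "Elective"])[["Chinese", "English", "Math"]].mean()
)
print(by_class_elective)
```

Only key combinations that actually occur in the data appear in the result, so this example produces three groups rather than four.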
You can also chain .sort_values() to sort the result in ascending or descending order.
>>> df.groupby('Class')[['Chinese', 'English', 'Math']].mean().sort_values(by=['Chinese', 'English'], ascending=False)
       Chinese  English  Math
Class
A1        77.0     77.5  77.0
B1        76.0     86.0  88.5
>>> df.groupby('Class')[['Chinese', 'English', 'Math']].mean().sort_values(by=['Chinese', 'English'], ascending=True)
       Chinese  English  Math
Class
B1        76.0     86.0  88.5
A1        77.0     77.5  77.0