The test environment is CentOS 8 x86_64.
- About rasdaemon: https://github.com/mchehab/rasdaemon
- About EDAC: https://www.kernel.org/doc/html/latest/driver-api/edac.html
Rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors using the EDAC tracing events. EDAC is a Linux kernel subsystem which handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as arm, also exist.
Put simply, rasdaemon stands for RAS (Reliability, Availability, Serviceability) Daemon. It mainly relies on the Linux kernel's EDAC (Error Detection and Correction) facility to detect, collect, and tally hardware error messages. The events it covers are mc_event (MC: memory controller), aer_event (PCIe AER: Advanced Error Reporting), mce_record, and extlog_mem_event.
Required package:
[root@localhost ~]# dnf install rasdaemon
CentOS 7 additionally provides the commands edac-util (EDAC error reporting utility) and edac-ctl (EDAC admin utility). CentOS 8 uses ras-mc-ctl (RAS memory controller admin utility) instead.
[root@localhost ~]# yum install rasdaemon
[root@localhost ~]# yum install edac-utils
Once rasdaemon is installed, you can start the service.
[root@localhost ~]# rasdaemon --enable
rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: mce:mce_record event enabled
rasdaemon: ras:extlog_mem_event event enabled
[root@localhost ~]# systemctl enable rasdaemon
Created symlink /etc/systemd/system/multi-user.target.wants/rasdaemon.service → /usr/lib/systemd/system/rasdaemon.service.
[root@localhost ~]# systemctl start rasdaemon
[root@localhost ~]# systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
   Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-03-05 23:36:29 CST; 5s ago
  Process: 2514 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
 Main PID: 2513 (rasdaemon)
    Tasks: 1 (limit: 23493)
   Memory: 612.0K
   CGroup: /system.slice/rasdaemon.service
           └─2513 /usr/sbin/rasdaemon -f -r

Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Enabled event ras:aer_event
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Family 6 Model 8e CPU: only decoding architectural errors
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: mce:mce_record event enabled
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Enabled event mce:mce_record
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: ras:extlog_mem_event event enabled
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Enabled event ras:extlog_mem_event
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording mc_event events
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording aer_event events
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording extlog_event events
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording mce_record events
Use the command ras-mc-ctl --summary to check whether the system hardware has any problems. The errors shown below are taken from https://www.setphaserstostun.org/posts/monitoring-ecc-memory-on-linux-with-rasdaemon/ .
[root@localhost ~]# ras-mc-ctl --summary
Memory controller events summary:
        Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5
PCIe AER events summary:
        1 Uncorrected (Non-Fatal) errors: BIT21
No Extlog errors.
No devlink errors.
Disk errors summary:
        0:0 has 6646 errors
No MCE errors.
Or:
[root@localhost ~]# ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
The difference between the two is as follows:
- --summary : Presents a summary of the logged errors.
- --errors : Shows the errors stored at the error database.
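As a rough sketch of how the --summary output could be checked from a script: in the sample outputs shown here, every category with nothing logged prints a line of the form "No ... errors.", so any other non-blank line suggests that something was recorded. The helper name report_ras_problems is hypothetical, and the filter assumes the output format matches the samples in this post; other rasdaemon versions may print something different.

```shell
#!/bin/sh
# Sketch only: assumes `ras-mc-ctl --summary` prints "No <category> errors."
# for every clean category, as in the sample output in this post.
# report_ras_problems is a hypothetical helper name, not part of rasdaemon.
report_ras_problems() {
    # Read summary text on stdin, print every non-blank line that is not
    # a "No ... errors." line, and exit 1 if at least one was printed.
    awk '
        /^[[:space:]]*$/   { next }
        /^No .* errors\.$/ { next }
        { print; found = 1 }
        END { exit (found ? 1 : 0) }
    '
}

# Typical use (needs root and a running rasdaemon):
#   ras-mc-ctl --summary | report_ras_problems || echo "hardware errors were logged"
```

Reading from stdin keeps the filter separate from the ras-mc-ctl call, so it can be tried against saved output first.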
The other options available for ras-mc-ctl are as follows:
[root@localhost ~]# ras-mc-ctl
Usage: ras-mc-ctl [OPTIONS...]
 --quiet            Quiet operation.
 --mainboard        Print mainboard vendor and model for this hardware.
 --status           Print status of EDAC drivers.
 --print-labels     Print Motherboard DIMM labels to stdout.
 --guess-labels     Print DMI labels, when bank locator is available.
 --register-labels  Load Motherboard DIMM labels into EDAC driver.
 --delay=N          Delay N seconds before writing DIMM labels.
 --labeldb=DB       Load label database from file DB.
 --layout           Display the memory layout.
 --summary          Presents a summary of the logged errors.
 --errors           Shows the errors stored at the error database.
 --error-count      Shows the corrected and uncorrected error counts using sysfs.
 --help             This help message.
Most of the options deal with DIMMs. Let's take a look at the --error-count option:
[root@localhost ~]# ras-mc-ctl --error-count
Label                   CE      UE
mc#0csrow#2channel#0    0       0
mc#0csrow#2channel#1    0       0
mc#0csrow#3channel#1    0       0
mc#0csrow#3channel#0    0       0
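The CE and UE counters are only available with Error-Correcting Code (ECC) DIMMs; this kind of memory is commonly used in servers.
- CE : the number of Correctable Errors that have occurred.
- UE : the number of Uncorrectable Errors that have occurred.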
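To get one total per counter instead of reading each row, the --error-count table can be summed with awk. This is a minimal sketch that assumes the layout shown in the sample output (a "Label CE UE" header followed by one row per label, with CE and UE as the last two columns). The helper name sum_error_counts is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the `ras-mc-ctl --error-count` layout shown in this
# post, with CE and UE as the last two columns of each row.
# sum_error_counts is a hypothetical helper name, not part of rasdaemon.
sum_error_counts() {
    awk '
        NR == 1 && $1 == "Label" { next }    # skip the header row
        NF >= 3 { ce += $(NF - 1); ue += $NF }
        END { printf "CE=%d UE=%d\n", ce, ue }
    '
}

# Typical use (needs root):
#   ras-mc-ctl --error-count | sum_error_counts
```

Taking the last two fields instead of fields 2 and 3 keeps the sum correct even if a label happens to contain spaces.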
What the parts of a Label mean:
- mc : Memory Controller
- csrow : Chip-Select Row
- channel : Memory Channel
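As a small illustration of that naming scheme, a label such as mc#0csrow#2channel#1 can be split into its three indices with sed. The helper name split_edac_label is hypothetical, and the pattern only covers labels of the mc#N csrow#N channel#N form shown in the sample output; for any other label (for example, a registered DIMM label) it prints nothing.

```shell
#!/bin/sh
# Sketch only: handles labels of the form mc#<n>csrow#<n>channel#<n>,
# as seen in the --error-count sample output in this post.
# split_edac_label is a hypothetical helper name, not part of rasdaemon.
split_edac_label() {
    # Print "mc=<n> csrow=<n> channel=<n>", or nothing if $1 does not match.
    printf '%s\n' "$1" |
        sed -n 's/^mc#\([0-9][0-9]*\)csrow#\([0-9][0-9]*\)channel#\([0-9][0-9]*\)$/mc=\1 csrow=\2 channel=\3/p'
}

# Example:
#   split_edac_label 'mc#0csrow#2channel#1'
```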
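We usually use the journalctl command to see what error messages the system has produced. Now we can use ras-mc-ctl directly to view the counts and statistics of hardware-related errors.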