Linux – EDAC & rasdaemon

Test environment: CentOS 8 x86_64.

  • About rasdaemon – https://github.com/mchehab/rasdaemon
  • About EDAC – https://www.kernel.org/doc/html/latest/driver-api/edac.html

Rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors using the EDAC tracing events. EDAC is a Linux kernel subsystem which handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as arm, also exist.

Put simply, rasdaemon stands for RAS (Reliability, Availability, Serviceability) Daemon. It relies mainly on the Linux kernel's EDAC (Error Detection and Correction) facility to detect hardware errors and collect statistics on them. The events it records are mc_event (MC: memory controller), aer_event (PCI-E AER: Advanced Error Reporting), mce_record, and extlog_mem_event.

Required package:

[root@localhost ~]# dnf install rasdaemon

CentOS 7 additionally provides the commands edac-util (EDAC error reporting utility) and edac-ctl (EDAC admin utility); CentOS 8 uses ras-mc-ctl (RAS memory controller admin utility) instead.

[root@localhost ~]# yum install rasdaemon
[root@localhost ~]# yum install edac-utils

Once rasdaemon is installed, the service can be started.

[root@localhost ~]# rasdaemon --enable
rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: mce:mce_record event enabled
rasdaemon: ras:extlog_mem_event event enabled
[root@localhost ~]# systemctl enable rasdaemon
Created symlink /etc/systemd/system/multi-user.target.wants/rasdaemon.service → /usr/lib/systemd/system/rasdaemon.service.
[root@localhost ~]# systemctl start rasdaemon
[root@localhost ~]# systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
   Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-03-05 23:36:29 CST; 5s ago
  Process: 2514 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
 Main PID: 2513 (rasdaemon)
    Tasks: 1 (limit: 23493)
   Memory: 612.0K
   CGroup: /system.slice/rasdaemon.service
           └─2513 /usr/sbin/rasdaemon -f -r

Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Enabled event ras:aer_event
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Family 6 Model 8e CPU: only decoding architectural errors
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: mce:mce_record event enabled
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Enabled event mce:mce_record
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: ras:extlog_mem_event event enabled
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: Enabled event ras:extlog_mem_event
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording mc_event events
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording aer_event events
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording extlog_event events
Mar 05 23:36:29 localhost.localdomain rasdaemon[2513]: rasdaemon: Recording mce_record events

Use the command ras-mc-ctl --summary to check whether the system hardware has any problems. The errors in the sample below are taken from https://www.setphaserstostun.org/posts/monitoring-ecc-memory-on-linux-with-rasdaemon/ .

[root@localhost ~]# ras-mc-ctl --summary
Memory controller events summary:
  Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5

PCIe AER events summary:
  1 Uncorrected (Non-Fatal) errors: BIT21

No Extlog errors.

No devlink errors.
Disk errors summary:
  0:0 has 6646 errors
No MCE errors.

Or:

[root@localhost ~]# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

The difference between the two:

  1. --summary : Presents a summary of the logged errors.
  2. --errors : Shows the errors stored at the error database.
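To act on the summary in a script, the per-DIMM lines can be reduced to a label and a count. The following is only a minimal sketch: the function name summarize_dimm_errors is made up for this example, and it assumes the memory controller lines always look like the sample above (label in single quotes, count as the last field), which may differ between rasdaemon versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ras-mc-ctl --summary` output on stdin and
# prints "<DIMM label> <error count>" for each memory controller line.
# Assumed line format (taken from the sample output above):
#   Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5
summarize_dimm_errors() {
    awk -v q="'" '
        /DIMM Label\(s\):/ {
            split($0, parts, q)   # parts[2] is the text between the single quotes
            print parts[2], $NF   # $NF is the number that follows "errors:"
        }
    '
}
```

It would be used as ras-mc-ctl --summary | summarize_dimm_errors, and every other line of the summary is ignored.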

The other options accepted by ras-mc-ctl:

[root@localhost ~]# ras-mc-ctl
Usage: ras-mc-ctl [OPTIONS...]
 --quiet            Quiet operation.
 --mainboard        Print mainboard vendor and model for this hardware.
 --status           Print status of EDAC drivers.
 --print-labels     Print Motherboard DIMM labels to stdout.
 --guess-labels     Print DMI labels, when bank locator is available.
 --register-labels  Load Motherboard DIMM labels into EDAC driver.
 --delay=N          Delay N seconds before writing DIMM labels.
 --labeldb=DB       Load label database from file DB.
 --layout           Display the memory layout.
 --summary          Presents a summary of the logged errors.
 --errors           Shows the errors stored at the error database.
 --error-count      Shows the corrected and uncorrected error counts using sysfs.
 --help             This help message.

Most of the options deal with DIMMs. Let's look at the --error-count option:

[root@localhost ~]# ras-mc-ctl --error-count
Label                 CE  UE
mc#0csrow#2channel#0  0   0
mc#0csrow#2channel#1  0   0
mc#0csrow#3channel#1  0   0
mc#0csrow#3channel#0  0   0

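CE and UE counts are only available with DIMMs that support Error-Correcting Code (ECC); this kind of memory is commonly used in servers.

  • CE : the number of Correctable Errors that have occurred.
  • UE : the number of Uncorrectable Errors that have occurred.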
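According to the help text, --error-count takes these numbers from sysfs, so they can also be read directly. The sketch below assumes an EDAC driver is loaded and that the per-controller totals sit in ce_count and ue_count under /sys/devices/system/edac/mc/mcN, as the kernel EDAC documentation describes. The function name edac_counts and its optional root-directory argument are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "<controller> CE=<n> UE=<n>" for every EDAC
# memory controller. The optional first argument overrides the sysfs root
# (assumed default: /sys/devices/system/edac/mc).
edac_counts() {
    local root="${1:-/sys/devices/system/edac/mc}"
    local mc
    for mc in "$root"/mc[0-9]*; do
        # Skip when no EDAC driver is loaded (the glob did not match anything).
        [ -r "$mc/ce_count" ] || continue
        printf '%s CE=%s UE=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}
```

On a machine without ECC memory or without a matching EDAC driver, the function simply prints nothing.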

What the parts of a Label mean:

  • mc : Memory Controller
  • csrow : Chip-Select Row
  • channel : Memory Channel
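As a small illustration of that naming scheme, a label such as mc#0csrow#2channel#0 can be split back into its three numbers. This is only a sketch: decode_edac_label is a name invented for this example, and it handles only the mc#/csrow#/channel# form shown in the output above, not labels registered from a label database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: splits a label like "mc#0csrow#2channel#0" into its parts.
# Returns 1 for anything that is not in the mc#/csrow#/channel# form.
decode_edac_label() {
    local label="$1"
    local re='^mc#([0-9]+)csrow#([0-9]+)channel#([0-9]+)$'
    if [[ $label =~ $re ]]; then
        printf 'controller=%s csrow=%s channel=%s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        return 1
    fi
}
```

For example, decode_edac_label 'mc#0csrow#2channel#0' prints controller=0 csrow=2 channel=0.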

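We usually use the journalctl command to see what error messages the system has produced; now ras-mc-ctl can be used directly to view the counts and statistics of hardware-related errors.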
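For a periodic check, the --error-count table can be filtered down to the rows that have a non-zero counter. The following is a minimal sketch, assuming the three-column layout shown earlier (label, CE, UE, with the two counters as the last two fields). check_error_counts is a made-up name, not part of rasdaemon.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ras-mc-ctl --error-count` output on stdin and
# prints every row whose CE or UE counter is non-zero.
# Exit status: 0 if all counters are zero, 1 if at least one is not.
check_error_counts() {
    awk '
        $NF !~ /^[0-9]+$/ { next }            # skip the header and any non-data line
        NF >= 3 && ($(NF-1) > 0 || $NF > 0) {
            print $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

It would be used as ras-mc-ctl --error-count | check_error_counts, for instance from a cron job that sends a notification when the exit status is non-zero.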
