Test environment: Ubuntu 14.04.
How do you check whether the common I/O interfaces, namely hard disks (HD), network, and FC (Fibre Channel), have accumulated error counts?
HD (SATA, SAS)
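The S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data that a drive keeps about itself lets you check that drive's error counts. The first transcript below shows the SCSI-style error counter log (errors corrected by ECC fast | delayed, errors corrected by rereads/rewrites, total errors corrected, correction algorithm invocations, total uncorrected errors), printed by smartctl -l error. The second shows the ATA attribute table printed by smartctl -A, where the error-related entries include Reallocated_Sector_Ct, Reported_Uncorrect, Current_Pending_Sector, Offline_Uncorrectable, and UDMA_CRC_Error_Count.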
root@ubuntu:~# smartctl -l error /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     123263.011           0
write:         0        0         0         0          0       5218.671           0
verify:        0        0         0         0          0         24.721           0

Non-medium error count:      148
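To use these numbers in a script, one option is to pull the last column (total uncorrected errors) out of the read, write, and verify rows. The following is only a minimal sketch that assumes the SCSI-style error counter log layout shown above; the function name uncorrected_errors is made up for illustration and is not part of smartmontools.

```shell
# Hypothetical helper: reads `smartctl -l error` output on stdin and prints
# "<operation> <total uncorrected errors>" for the read/write/verify rows.
# Assumption: the SCSI-style "Error counter log" layout, in which the
# uncorrected count is the last column of each row.
uncorrected_errors() {
    awk '$1 == "read:" || $1 == "write:" || $1 == "verify:" {
        sub(":", "", $1)      # "read:" -> "read"
        print $1, $NF         # operation name and last column
    }'
}

# Example use (needs root and smartmontools installed):
#   smartctl -l error /dev/sdb | uncorrected_errors
```

Any row that prints a value other than 0 is worth a closer look, since it counts errors the drive could not correct.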
[root@localhost ~]# smartctl -A /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       13
  5 Reallocated_Sector_Ct   0x0032   100   100   001    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       1586
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       554
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       44
171 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
173 Unknown_Attribute       0x0033   100   100   000    Pre-fail  Always       -       39
174 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       466
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   078   065   000    Old_age   Always       -       22 (Min/Max 14/35)
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       1908
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0018   100   100   001    Old_age   Offline      -       0
206 Unknown_SSD_Attribute   0x000e   100   100   001    Old_age   Always       -       0
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       928166108
248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       21320209
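The attribute table can be filtered the same way. The sketch below prints only a handful of error-related attributes whose RAW_VALUE is non-zero. The function name nonzero_error_attrs and the choice of attributes are illustrative assumptions; raw values are vendor-specific, so attributes such as Raw_Read_Error_Rate are deliberately left out.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "<attribute name> <raw value>" for selected error-related attributes
# whose raw value (10th column) is non-zero.
# Assumption: the ATA attribute table layout shown above. The attribute
# list is an illustrative choice, not a vendor-defined set.
nonzero_error_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|End-to-End_Error|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ &&
        $10 + 0 != 0 { print $2, $10 }
    '
}

# Example use (needs root and smartmontools installed):
#   smartctl -A /dev/sdb | nonzero_error_attrs
```

For the drive in the transcript above, all six selected attributes have a raw value of 0, so the filter would print nothing.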
For more on using smartctl, see:
- smartctl – https://benjr.tw/95984
- smartctl -t TEST, --test=TEST – https://benjr.tw/96015
- smartctl (RAID controllers) – https://benjr.tw/96471
- smartctl (S.M.A.R.T. attributes) – https://benjr.tw/98889
- smartd – https://benjr.tw/96013
HD (NVMe)
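Can NVMe storage devices also be inspected with smartctl? The smartmontools site documents NVMe support at https://www.smartmontools.org/wiki/NVMe_Support , but smartctl reports very little information for NVMe devices, and the site itself suggests using the nvme command (provided by the nvme-cli package) instead. See https://benjr.tw/98887 .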
[root@localhost ~]# smartctl -A /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x04
[root@localhost ~]# nvme error-log /dev/nvme0
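Besides error-log, nvme-cli also has a smart-log subcommand that prints the device's SMART/Health page as "name : value" lines. The sketch below pulls out three error-related fields from that output. It assumes the plain-text field names critical_warning, media_errors, and num_err_log_entries; the exact formatting can differ between nvme-cli versions, and the function name nvme_error_fields is hypothetical.

```shell
# Hypothetical helper: reads `nvme smart-log` output on stdin and prints
# "<field>=<value>" for a few error-related fields.
# Assumption: each field is printed as "name   : value" on its own line;
# check this against the nvme-cli version you actually have installed.
nvme_error_fields() {
    awk -F: '$1 ~ /^(critical_warning|media_errors|num_err_log_entries)[ \t]*$/ {
        gsub(/[ \t]/, "", $1)   # strip padding around the field name
        gsub(/[ \t]/, "", $2)   # strip padding around the value
        print $1 "=" $2
    }'
}

# Example use (needs root and nvme-cli installed):
#   nvme smart-log /dev/nvme0 | nvme_error_fields
```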
Network
Adding the -s (statistics) option to the ip command shows the RX (receive) and TX (transmit) counters: bytes, packets, errors, dropped, overrun, mcast, carrier, and collsns.
root@ubuntu:~# ip -s link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 08:00:27:cb:a9:8b brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    2756198    8601     0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    682123     7551     0       0       0       0
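To condense that output to one line per interface and direction, a small awk filter is enough. This is a sketch that assumes the layout above, where errors and dropped are the third and fourth values on the line following each RX: or TX: header; the function name ip_link_errors is made up for illustration.

```shell
# Hypothetical helper: reads `ip -s link` output on stdin and prints
# "<iface> <RX|TX> errors=<n> dropped=<n>".
# Assumption: the values line directly follows each "RX:" / "TX:" header
# and lists bytes, packets, errors, dropped, ... in that order.
ip_link_errors() {
    awk '
        /^[0-9]+: / { iface = $2; sub(/:$/, "", iface); next }
        $1 == "RX:" || $1 == "TX:" {
            dir = $1; sub(":", "", dir)
            getline                      # move to the line holding the values
            print iface, dir, "errors=" $3, "dropped=" $4
        }
    '
}

# Example use:
#   ip -s link | ip_link_errors
```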
You can also read the error counters directly from the statistics directory of each network device under /sys/class/net.
root@ubuntu:~# ls /sys/class/net/eth0/statistics/
collisions     rx_dropped        rx_missed_errors   tx_carrier_errors  tx_heartbeat_errors
multicast      rx_errors         rx_over_errors     tx_compressed      tx_packets
rx_bytes       rx_fifo_errors    rx_packets         tx_dropped         tx_window_errors
rx_compressed  rx_frame_errors   tx_aborted_errors  tx_errors
rx_crc_errors  rx_length_errors  tx_bytes           tx_fifo_errors
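Each file in that directory holds one decimal counter, so a short loop can report every non-zero error or drop counter across all interfaces. A minimal sketch; the function name net_error_counters and the optional base-directory argument (handy for testing against a copy of the tree) are assumptions for illustration.

```shell
# Hypothetical helper: prints "<iface> <counter> <value>" for every non-zero
# *errors, *dropped, or collisions counter under <base>/<iface>/statistics/.
# <base> defaults to /sys/class/net.
net_error_counters() {
    base=${1:-/sys/class/net}
    for f in "$base"/*/statistics/*; do
        [ -r "$f" ] || continue
        name=${f##*/}                    # file name = counter name
        case $name in
            *errors|*dropped|collisions) ;;
            *) continue ;;               # skip bytes/packets/etc.
        esac
        value=$(cat "$f")
        [ "$value" -ne 0 ] 2>/dev/null || continue
        iface=${f%/statistics/*}
        iface=${iface##*/}
        echo "$iface $name $value"
    done
}

# Example use:
#   net_error_counters
```

Running it periodically and comparing the output shows which counters are still growing, which matters more than their absolute values.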
For other ways to check for network errors, see https://benjr.tw/94371
FC
For Fibre Channel, you can read the error counters directly from the statistics directory of each FC host under /sys/class/fc_host.
root@ubuntu:~# ls /sys/class/fc_host/host2/statistics/
dumped_frames         fcp_output_requests    loss_of_sync_count           seconds_since_last_reset
error_frames          invalid_crc_count      nos_count                    tx_frames
fcp_control_requests  invalid_tx_word_count  prim_seq_protocol_err_count  tx_words
fcp_input_megabytes   link_failure_count     reset_statistics
fcp_input_requests    lip_count              rx_frames
fcp_output_megabytes  loss_of_signal_count   rx_words
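The same kind of loop works for FC hosts, with one difference: this sketch assumes the fc_host counters are printed as hexadecimal strings (for example 0x1f), and that a counter the driver does not support reads as 0xffffffffffffffff. The function name fc_error_counters, the optional base-directory argument, and the selection of counters treated as errors are illustrative assumptions.

```shell
# Hypothetical helper: prints "<host> <counter> <decimal value>" for every
# non-zero error counter under <base>/<host>/statistics/.
# <base> defaults to /sys/class/fc_host.
# Assumptions: values are hex strings such as 0x1f, and
# 0xffffffffffffffff marks a counter the driver does not support.
fc_error_counters() {
    base=${1:-/sys/class/fc_host}
    for f in "$base"/*/statistics/*; do
        name=${f##*/}
        case $name in
            error_frames|dumped_frames|invalid_crc_count|invalid_tx_word_count|\
link_failure_count|loss_of_signal_count|loss_of_sync_count|prim_seq_protocol_err_count) ;;
            *) continue ;;               # not treated as an error counter here
        esac
        [ -r "$f" ] || continue
        raw=$(cat "$f")
        [ -n "$raw" ] || continue
        [ "$raw" != "0xffffffffffffffff" ] || continue
        value=$(($raw))                  # shell arithmetic converts hex to decimal
        [ "$value" -ne 0 ] || continue
        host=${f%/statistics/*}
        host=${host##*/}
        echo "$host $name $value"
    done
}

# Example use:
#   fc_error_counters
```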