Checking the logs
Kernel log
- Corrected error (CE) reported by MC1 (memory controller 1)
- It occurred on the DIMM labeled "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0"
- The address of the affected page is "2ab71f000"
[57034.062252] mce: [Hardware Error]: Machine check events logged
[57034.062274] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[57034.062277] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 11: cc002002000800c2
[57034.062279] EDAC sbridge MC1: TSC 0
[57034.062281] EDAC sbridge MC1: ADDR 2ab71f000
[57034.062283] EDAC sbridge MC1: MISC 90848988190848c
[57034.062284] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1654220627 SOCKET 0 APIC 0
[57034.907899] EDAC MC1: 128 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 page:0x2ab71f offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:4 rank:255)
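The last EDAC line carries everything needed for triage: controller, CE count, DIMM label, channel, and page. The following is a minimal sketch of pulling those fields out with `sed`; `edac_ce_summary` is a hypothetical helper name, the pattern is written against the line format shown above only, and the address calculation assumes 4 KiB pages.

```shell
# edac_ce_summary (hypothetical helper): reads kernel log text on stdin and
# prints one summary line per "EDAC MCn: <count> CE ... on <label> (...)" entry.
edac_ce_summary() {
  sed -n 's/.*EDAC \(MC[0-9]*\): \([0-9]*\) CE .* on \([^ ]*\) (channel:\([0-9]*\) .*page:\(0x[0-9a-f]*\) .*/\1 count=\2 dimm=\3 channel=\4 page=\5/p'
}

# Sample line copied from the log above; on a live system: dmesg | edac_ce_summary
echo '[57034.907899] EDAC MC1: 128 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 page:0x2ab71f offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:4 rank:255)' | edac_ce_summary

# Assuming 4 KiB pages, page 0x2ab71f corresponds to the ADDR value in the log.
printf 'addr=0x%x\n' $((0x2ab71f << 12))
```

The page number shifted left by 12 bits gives 0x2ab71f000, which matches the `ADDR` line logged by the sbridge driver.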
mcelog
- The event is reported on CPU 0, bank 11
> mcelog
CPU 0 BANK 11
MISC 90848988190848c ADDR 2ab71f000
TIME 1654169011 Thu Jun 2 20:23:31 2022
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
STATUS cc002002000800c2 MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
PPIN 88e84d587df6b085
MICROCODE b00001a
CPUID Vendor Intel Family 6 Model 79
Hardware event. This is not a software error.
...
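The flags that mcelog prints (overflow, corrected, MISC/ADDR valid) come straight from the bits of the `STATUS` value. As a rough sketch, the value can be decoded by hand using the IA32_MCi_STATUS layout from the Intel SDM (VAL bit 63, OVER 62, UC 61, MISCV 59, ADDRV 58, corrected error count in bits 52:38, model-specific code in 31:16, MCA code in 15:0). `decode_mci_status` is a hypothetical helper name; it splits the value into 32-bit halves so that plain shell arithmetic does not overflow.

```shell
# decode_mci_status (hypothetical helper): decode a 64-bit MCi_STATUS hex value.
decode_mci_status() {
  s=$(printf '%16s' "${1#0x}" | tr ' ' '0')   # left-pad to 16 hex digits
  hi=$((0x$(printf '%s' "$s" | cut -c1-8)))   # bits 63..32
  lo=$((0x$(printf '%s' "$s" | cut -c9-16)))  # bits 31..0
  # Flag bits 63, 62, 61, 59, 58 sit at positions 31, 30, 29, 27, 26 of the high half.
  echo "VAL=$(( (hi >> 31) & 1 )) OVER=$(( (hi >> 30) & 1 )) UC=$(( (hi >> 29) & 1 )) MISCV=$(( (hi >> 27) & 1 )) ADDRV=$(( (hi >> 26) & 1 ))"
  # Corrected error count: bits 52..38, i.e. bits 20..6 of the high half.
  echo "corrected_count=$(( (hi >> 6) & 0x7fff ))"
  # Model-specific error code (31..16) and MCA error code (15..0).
  printf 'model_code=%04x mca_code=%04x\n' $(( (lo >> 16) & 0xffff )) $(( lo & 0xffff ))
}

# STATUS value taken from the mcelog output above
decode_mci_status cc002002000800c2
```

For this value the sketch reports UC=0 and OVER=1 with a corrected count of 128 and codes 0008:00c2, which lines up with the "128 CE" and `err_code:0008:00c2` fields in the kernel log.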
Identifying the DIMM by CE count
# check the CE counts
> grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:895
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
# check the DIMM label
> cat /sys/devices/system/edac/mc/mc1/csrow0/ch2_dimm_label
CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
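The two steps above (find the non-zero counter, then read its label) can be combined. This is a minimal sketch that assumes the same `csrow*/ch*_ce_count` and `ch*_dimm_label` layout shown in this post; `list_ce_dimms` is a hypothetical helper name, and its optional argument only exists so the sysfs root can be overridden.

```shell
# list_ce_dimms (hypothetical helper): print every channel with a non-zero
# corrected-error count together with its DIMM label.
list_ce_dimms() {
  base=${1:-/sys/devices/system/edac/mc}
  for f in "$base"/mc*/csrow*/ch*_ce_count; do
    [ -r "$f" ] || continue                 # glob did not match or file unreadable
    count=$(cat "$f")
    [ "$count" -gt 0 ] || continue          # skip channels without errors
    label=$(cat "${f%_ce_count}_dimm_label" 2>/dev/null)
    echo "${f#"$base"/} count=$count label=${label:-unknown}"
  done
}

list_ce_dimms
```

On the machine in this post it would be expected to print a single line for `mc1/csrow0/ch2_ce_count`.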
edac-utils
- Install
yum install edac-utils
- Check
- mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 895 Corrected Errors
> edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 895 Corrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
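On a machine with many DIMMs most of this output is zeros. A small filter, sketched here against the line format shown above, keeps only the entries with a non-zero count; `edac_nonzero` is a hypothetical helper name.

```shell
# edac_nonzero (hypothetical helper): keep only edac-util lines whose
# corrected or uncorrected error count is greater than zero.
edac_nonzero() {
  grep -E ': [1-9][0-9]* (Corrected|Uncorrected) Errors'
}

# Sample lines copied from the output above; on a live system: edac-util -v | edac_nonzero
printf '%s\n' \
  'mc1: 0 Corrected Errors with no DIMM info' \
  'mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 895 Corrected Errors' \
  'mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors' | edac_nonzero
```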
Checking physical memory information
dmidecode
- Look up the DIMM that corresponds to CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
> dmidecode -t memory | grep -C 4 "Channel2_Dimm0"
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM020
Bank Locator: _Node0_Channel2_Dimm0
Type: DRAM
Type Detail: Synchronous Registered (Buffered)
Speed: 2133 MT/s
Manufacturer: Hynix
---
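Searching for "Channel2_Dimm0" alone can match the same slot on every socket, so it may help to build the full Bank Locator string from the EDAC label. The sketch below assumes that `SrcID#N` corresponds to `_NodeN_` in the Bank Locator, which holds for the output shown in this post but is vendor-specific and may not hold on other boards; `label_to_bank_locator` is a hypothetical helper name.

```shell
# label_to_bank_locator (hypothetical helper): turn an EDAC DIMM label into the
# Bank Locator string used by this board's SMBIOS tables.
# Assumption: SrcID#N maps to _NodeN_; verify this on your own hardware.
label_to_bank_locator() {
  printf '%s\n' "$1" |
    sed -n 's/^CPU_SrcID#\([0-9]*\)_Ha#[0-9]*_Chan#\([0-9]*\)_DIMM#\([0-9]*\)$/_Node\1_Channel\2_Dimm\3/p'
}

label_to_bank_locator 'CPU_SrcID#0_Ha#0_Chan#2_DIMM#0'
# On a live system, for example:
#   dmidecode -t memory | grep -C 4 "$(label_to_bank_locator 'CPU_SrcID#0_Ha#0_Chan#2_DIMM#0')"
```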
- Alternatively, the DIMM can be looked up by physical address
- The address of the affected page is "2ab71f000"; the address range mapped to each memory device can be listed as follows.
dmidecode -t 20
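Comparing the address against each range by eye is error-prone, so here is a rough sketch that does the comparison. It relies on the `Starting Address`, `Ending Address`, and `Physical Device Handle` fields that dmidecode prints for type 20 entries; `find_mapped_device` is a hypothetical helper name. With channel interleaving enabled, a type 20 range may cover several DIMMs or be reported only coarsely, so treat the result as a hint rather than proof.

```shell
# find_mapped_device (hypothetical helper): reads `dmidecode -t 20` text on stdin
# and prints the device handle of every mapped range containing the given address.
find_mapped_device() {
  addr=$(($1))
  start=; end=
  while IFS= read -r line; do
    case $line in
      *"Starting Address:"*) start=${line##*: } ;;
      *"Ending Address:"*)   end=${line##*: } ;;
      *"Physical Device Handle:"*)
        if [ -n "$start" ] && [ -n "$end" ] &&
           [ "$addr" -ge $(($start)) ] && [ "$addr" -le $(($end)) ]; then
          echo "handle=${line##*: } range=$start-$end"
        fi
        start=; end= ;;
    esac
  done
}

# Usage on a live system (requires root):
#   dmidecode -t 20 | find_mapped_device 0x2ab71f000
```

The printed handle can then be matched against the `Handle` line of the corresponding entry in `dmidecode -t 17`.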
Reference
- Type numbers accepted by the dmidecode -t option
0 BIOS
1 System
2 Baseboard
3 Chassis
4 Processor
5 Memory Controller
6 Memory Module
7 Cache
8 Port Connector
9 System Slots
10 On Board Devices
11 OEM Strings
12 System Configuration Options
13 BIOS Language
14 Group Associations
15 System Event Log
16 Physical Memory Array
17 Memory Device
18 32-bit Memory Error
19 Memory Array Mapped Address
20 Memory Device Mapped Address
21 Built-in Pointing Device
22 Portable Battery
23 System Reset
24 Hardware Security
25 System Power Controls
26 Voltage Probe
27 Cooling Device
28 Temperature Probe
29 Electrical Current Probe
30 Out-of-band Remote Access
31 Boot Integrity Services
32 System Boot
33 64-bit Memory Error
34 Management Device
35 Management Device Component
36 Management Device Threshold Data
37 Memory Channel
38 IPMI Device
39 Power Supply
40 Additional Information
41 Onboard Devices Extended Information
42 Management Controller Host Interface