
[hardware] Memory EDAC-related logs (1)

ploz 2021. 3. 18. 18:14

1. 2021.02.21 03:44:01 ~ 02.22 08:55:11 (errors kept occurring afterwards)

## mcelog recorded nothing; the entries below were written to the kernel log
[root@localhost]# cat /var/log/messages | grep -v "snmp\|ACPI"
Feb 21 03:44:01 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1446" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Feb 21 04:48:52 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 21 04:48:52 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
Feb 21 04:48:52 localhost kernel: TSC 0 ADDR 2f66e53c0 MISC 20400e0e86 PROCESSOR 0:206d7 TIME 1613850532 SOCKET 0 APIC 0
Feb 21 04:48:52 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 21 04:48:52 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 11: 8800004600800093
Feb 21 04:48:52 localhost kernel: TSC 0 ADDR 0 MISC 4900002000200c8c PROCESSOR 0:206d7 TIME 1613850532 SOCKET 0 APIC 0
Feb 21 04:48:53 localhost kernel: EDAC MC0: CE row 3, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x2f66e53c0 => socket=0, Channel=3(mask=8), rank=0
Feb 21 04:48:53 localhost kernel:
Feb 22 08:55:10 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 22 08:55:10 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
Feb 22 08:55:10 localhost kernel: TSC 0 ADDR 2f66e53c0 MISC 2040020286 PROCESSOR 0:206d7 TIME 1613951710 SOCKET 0 APIC 0
Feb 22 08:55:10 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 22 08:55:10 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 11: 8800004600800093
Feb 22 08:55:10 localhost kernel: TSC 0 ADDR 0 MISC 4900002000200c8c PROCESSOR 0:206d7 TIME 1613951710 SOCKET 0 APIC 0
Feb 22 08:55:11 localhost kernel: EDAC MC0: CE row 3, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x2f66e53c0 => socket=0, Channel=3(mask=8), rank=0
## EDAC CE (correctable error) log: an "Unknown error" was reported on the memory labeled "CPU_SrcID#0_Channel#3_DIMM#0".
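
The two status words printed after "Machine Check Exception" can be unpacked by hand to see why the kernel treats them as corrected errors. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL at bit 63, UC at bit 61, ADDRV at bit 58, corrected error count in bits 52:38, MSCOD in bits 31:16, MCACOD in bits 15:0) and the memory-controller error code pattern `0000 0000 1MMM CCCC`. The function name `decode_mci_status` is made up for illustration and does not belong to mcelog or any other tool.

```python
# Sketch: decode the value printed as "Machine Check Exception: 0 Bank N: <status>".
# Assumes the architectural IA32_MCi_STATUS bit layout (Intel SDM).

MEM_OPS = {0: "generic", 1: "memory read", 2: "memory write",
           3: "address/command", 4: "memory scrubbing"}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into the fields discussed in this post."""
    mcacod = status & 0xFFFF                      # bits 15:0, architectural code
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),  # UC=0 means a corrected error
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,            # model-specific code
        "mcacod": mcacod,
    }
    # Memory controller errors follow the pattern 0000 0000 1MMM CCCC.
    if (mcacod & 0xFF80) == 0x0080:
        fields["operation"] = MEM_OPS.get((mcacod >> 4) & 0x7, "reserved")
        channel = mcacod & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 0xF: unspecified
    return fields


if __name__ == "__main__":
    # The two status words from the kernel log above.
    for raw in ("8c00004000010093", "8800004600800093"):
        print(raw, decode_mci_status(int(raw, 16)))
```

Under these assumptions both words decode to valid, corrected (UC=0) memory-read errors on channel 3, and the Bank 5 word yields MSCOD 0001 and MCACOD 0093, which agrees with `Err=0001:0093 (ch=3)` in the EDAC line.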



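2. EDAC (Error Detection and Correction)

  • EDAC is one of the Linux kernel modules that support hardware error detection and correction.
  • It also supports detecting PCI bus transfer errors and errors on peripheral devices.
  • MCE-related logs are recorded by EDAC, the OS-side memory monitoring facility, which is less precise than the hardware's own memory monitoring.
    Occasionally the OS's sensitive EDAC engine records an error even though no real fault exists.
  • When such a message appears, cross-check it against the hardware information (iLO, IML). If the hardware reports no problem, it is reasonable to ignore the message or to disable MCE detection in the OS.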


3. Types of errors

  • Correctable Error (CE) - the error detection mechanism detected and corrected the error. Such errors are usually not fatal, although some Kernel mechanisms allow the system administrator to consider them as fatal.
  • Uncorrected Error (UE) - the number of errors exceeded the error correction threshold, and the system was unable to auto-correct them.
  • Fatal Error - when a UE happens on a critical component of the system (for example, a piece of the Kernel got corrupted by a UE), the only reliable way to avoid data corruption is to hang or reboot the machine.
  • Non-fatal Error - when a UE happens on an unused component, such as a CPU in a power-down state or an unused memory bank, the system may keep running and eventually replace the affected hardware with a hot spare, if available.


4. Locating the memory module

## Not a dual-controller setup; there is a single controller, mc0.
[root@localhost mc0]# pwd
[root@localhost mc0]# tree
├── ce_count
├── ce_noinfo_count
├── csrow0
│   ├── ce_count
│   ├── ch0_ce_count
│   ├── ch0_dimm_label
│   ├── dev_type
│   ├── edac_mode
│   ├── mem_type
│   ├── size_mb
│   └── ue_count
├── csrow1
│   ├── ce_count
│   ├── ch0_ce_count
│   ├── ch0_dimm_label
│   ├── dev_type
│   ├── edac_mode
│   ├── mem_type
│   ├── size_mb
│   └── ue_count
├── csrow2
│   ├── ce_count
│   ├── ch0_ce_count
│   ├── ch0_dimm_label
│   ├── dev_type
│   ├── edac_mode
│   ├── mem_type
│   ├── size_mb
│   └── ue_count
├── csrow3
│   ├── ce_count
│   ├── ch0_ce_count
│   ├── ch0_dimm_label
│   ├── dev_type
│   ├── edac_mode
│   ├── mem_type
│   ├── size_mb
│   └── ue_count
├── device -> ../../../../pci0000:3f/0000:3f:0e.0
├── mc_name
├── reset_counters
├── sdram_scrub_rate
├── seconds_since_reset
├── size_mb
├── ue_count
└── ue_noinfo_count
## Total ue_count and ce_count
## There are no UE entries; CE entries total 34.
[root@localhost mc0]# cat ue_count
[root@localhost mc0]# cat ce_count
## ce_count per csrow
## CE entries were recorded on csrow0 and csrow3.
[root@localhost mc0]# cat csrow*/c*
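
The counters read with `cat` above can also be collected in one pass. This is a minimal sketch, assuming the legacy csrow layout shown in the `tree` output (`ce_count`, `ue_count`, and `csrowN/ch0_dimm_label`). The directory `/sys/devices/system/edac/mc/mc0` is the usual location of the first controller, but it is an assumption here because the `pwd` output is not shown. `read_ce_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_int(path: Path) -> int:
    """Read a sysfs counter file that holds a single decimal number."""
    return int(path.read_text().strip())


def read_ce_counts(mc_dir: str) -> dict:
    """Return controller totals plus per-csrow CE counts and DIMM labels."""
    mc = Path(mc_dir)
    result = {
        "ce_count": read_int(mc / "ce_count"),
        "ue_count": read_int(mc / "ue_count"),
        "csrows": {},
    }
    for row in sorted(mc.glob("csrow*")):
        label_file = row / "ch0_dimm_label"
        label = label_file.read_text().strip() if label_file.exists() else ""
        result["csrows"][row.name] = {
            "ce_count": read_int(row / "ce_count"),
            "label": label,
        }
    return result


if __name__ == "__main__":
    # Assumed default location of the first memory controller.
    default = "/sys/devices/system/edac/mc/mc0"
    if Path(default).is_dir():
        info = read_ce_counts(default)
        print("total CE:", info["ce_count"], "total UE:", info["ue_count"])
        for name, row in info["csrows"].items():
            print(name, row["ce_count"], row["label"])
    else:
        print("no EDAC memory controller found at", default)
```

Taking the directory as a parameter keeps the helper usable on machines with more than one controller (mc0, mc1, and so on).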
## Memory information
## The machine has a single CPU socket with four DIMMs installed, DIMM_A1 through DIMM_A4.
[root@localhost mc0]# dmidecode -t memory | more
# dmidecode 2.12
SMBIOS 2.7 present.
Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 768 GB
    Error Information Handle: Not Provided
    Number Of Devices: 24
Handle 0x1100, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x1000
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: 1
    Locator: DIMM_A1
    Bank Locator: Not Specified
    Type: DDR3
    Type Detail: Synchronous Registered (Buffered)
    Speed: 1333 MHz
    Manufacturer: 00CE04B300CE
    Serial Number: 8606B48E
    Asset Tag: 02104811
    Part Number: M393B1K70CH0-YH9 
    Rank: 2
    Configured Clock Speed: 1333 MHz
Handle 0x1101, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x1000
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: 1
    Locator: DIMM_A2
    Bank Locator: Not Specified
    Type: DDR3
    Type Detail: Synchronous Registered (Buffered)
    Speed: 1333 MHz
    Manufacturer: 00CE04B300CE
    Serial Number: 8606ADAD
    Asset Tag: 02104811
    Part Number: M393B1K70CH0-YH9 
    Rank: 2
    Configured Clock Speed: 1333 MHz
Handle 0x1102, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x1000
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: 1
    Locator: DIMM_A3
    Bank Locator: Not Specified
    Type: DDR3
    Type Detail: Synchronous Registered (Buffered)
    Speed: 1600 MHz
    Manufacturer: 00CE04B300CE
    Serial Number: 39839EF9
    Asset Tag: 02144421
    Part Number: M393B1K70PH0-YK0 
    Rank: 2
    Configured Clock Speed: 1333 MHz
Handle 0x1103, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x1000
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: 1
    Locator: DIMM_A4
    Bank Locator: Not Specified
    Type: DDR3
    Type Detail: Synchronous Registered (Buffered)
    Speed: 1600 MHz
    Manufacturer: 00CE04B300CE
    Serial Number: 39839F14
    Asset Tag: 02144421
    Part Number: M393B1K70PH0-YK0 
    Rank: 2
    Configured Clock Speed: 1333 MHz
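
To compare the installed modules side by side, the `dmidecode -t memory` text can be reduced to one record per "Memory Device" block. This is a minimal sketch that only relies on the `Key: Value` layout visible in the output above. `parse_memory_devices` and `SAMPLE` are illustrative names, and `SAMPLE` is an abbreviated excerpt of that output rather than a complete dump.

```python
# Abbreviated excerpt of the dmidecode output shown above.
SAMPLE = """\
Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
    Number Of Devices: 24
Handle 0x1100, DMI type 17, 34 bytes
Memory Device
    Size: 8192 MB
    Locator: DIMM_A1
    Speed: 1333 MHz
    Part Number: M393B1K70CH0-YH9
Handle 0x1102, DMI type 17, 34 bytes
Memory Device
    Size: 8192 MB
    Locator: DIMM_A3
    Speed: 1600 MHz
    Part Number: M393B1K70PH0-YK0
"""


def parse_memory_devices(text: str) -> list:
    """Return one dict per 'Memory Device' block of `dmidecode -t memory`."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Handle "):
            current = None                 # a new DMI structure begins
        elif stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


if __name__ == "__main__":
    for dev in parse_memory_devices(SAMPLE):
        # Empty slots report "No Module Installed" as their size; skip them.
        if dev.get("Size") == "No Module Installed":
            continue
        print(dev.get("Locator"), dev.get("Size"),
              dev.get("Speed"), dev.get("Part Number"))
```

Listing the fields this way makes it easy to see that DIMM_A1/A2 and DIMM_A3/A4 carry different part numbers and rated speeds, while all four are configured at 1333 MHz.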

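The exact physical location of each module cannot be determined from this information, so the mapping can only be estimated, as in the table below.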

        ch0      ch1      ch2      ch3
csrow0  DIMM_A1  -        -        -
csrow1  -        DIMM_A2  -        -
csrow2  -        -        DIMM_A3  -
csrow3  -        -        -        DIMM_A4
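
The EDAC label from the kernel log carries the socket, channel, and DIMM index, so it can be combined with the estimated table. This is a minimal sketch: `GUESSED_SLOTS` simply mirrors the estimate in the table above and is not confirmed by vendor documentation, and `parse_label` and `guess_slot` are hypothetical helper names.

```python
import re

# Label format as it appears in the kernel log in section 1.
LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_Channel#(\d+)_DIMM#(\d+)")

# Estimated channel -> slot mapping taken from the table above.
# This is a guess, not vendor documentation.
GUESSED_SLOTS = {0: "DIMM_A1", 1: "DIMM_A2", 2: "DIMM_A3", 3: "DIMM_A4"}


def parse_label(label: str) -> dict:
    """Extract socket, channel and DIMM index from an EDAC DIMM label."""
    match = LABEL_RE.search(label)
    if match is None:
        raise ValueError(f"unrecognized EDAC label: {label!r}")
    socket, channel, dimm = (int(g) for g in match.groups())
    return {"socket": socket, "channel": channel, "dimm": dimm}


def guess_slot(label: str) -> str:
    """Map a label to a slot name under the estimated mapping."""
    return GUESSED_SLOTS.get(parse_label(label)["channel"], "unknown")


if __name__ == "__main__":
    label = "CPU_SrcID#0_Channel#3_DIMM#0"
    print(parse_label(label), "->", guess_slot(label))
```

If the real slot order differs, only `GUESSED_SLOTS` needs to change; the label parsing stays the same.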



5. 결론

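- ce_count is currently nonzero for DIMM_A1 and DIMM_A4, and based on the kernel log, DIMM_A4 appears to be the module that produced the most recent errors.
- For now the plan is to keep watching the CE log; if entries keep accumulating or other kinds of errors appear, the module should probably be replaced.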
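
The "keep watching" step can be made concrete by comparing two snapshots of the per-csrow CE counters. This is a minimal sketch: the threshold `max_new_ce` and the policy in `needs_attention` are hypothetical placeholders, not a vendor replacement criterion, and the numbers in the demo are made up for illustration.

```python
def ce_deltas(before: dict, after: dict) -> dict:
    """Return {csrow: increase} for rows whose CE count grew between snapshots."""
    deltas = {}
    for row, count in after.items():
        grown = count - before.get(row, 0)
        # A negative value means the counters were reset (reboot, module
        # reload, or reset_counters); such rows are ignored here.
        if grown > 0:
            deltas[row] = grown
    return deltas


def needs_attention(deltas: dict, ue_count: int, max_new_ce: int = 10) -> bool:
    """Hypothetical policy: any UE, or more than max_new_ce new CEs on one row."""
    return ue_count > 0 or any(n > max_new_ce for n in deltas.values())


if __name__ == "__main__":
    # Illustrative snapshots only; these are not measurements from the post.
    yesterday = {"csrow0": 5, "csrow3": 29}
    today = {"csrow0": 5, "csrow3": 45}
    grown = ce_deltas(yesterday, today)
    print("growth:", grown, "attention:", needs_attention(grown, ue_count=0))
```

Tracking the increase rather than the absolute count matters because the counters start again from zero after a reset, so a low total does not by itself mean the module is healthy.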
