1. 2021.02.21 03:44:01 ~ 02.22 08:55:11 (still occurring afterward)
## mcelog recorded nothing; the entries below are from the kernel log
[root@localhost]# cat /var/log/messages | grep -v "snmp\|ACPI"
Feb 21 03:44:01 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1446" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Feb 21 04:48:52 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 21 04:48:52 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
Feb 21 04:48:52 localhost kernel: TSC 0 ADDR 2f66e53c0 MISC 20400e0e86 PROCESSOR 0:206d7 TIME 1613850532 SOCKET 0 APIC 0
Feb 21 04:48:52 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 21 04:48:52 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 11: 8800004600800093
Feb 21 04:48:52 localhost kernel: TSC 0 ADDR 0 MISC 4900002000200c8c PROCESSOR 0:206d7 TIME 1613850532 SOCKET 0 APIC 0
Feb 21 04:48:53 localhost kernel: EDAC MC0: CE row 3, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x2f66e53c0 => socket=0, Channel=3(mask=8), rank=0
Feb 21 04:48:53 localhost kernel:
...
Feb 22 08:55:10 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 22 08:55:10 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
Feb 22 08:55:10 localhost kernel: TSC 0 ADDR 2f66e53c0 MISC 2040020286 PROCESSOR 0:206d7 TIME 1613951710 SOCKET 0 APIC 0
Feb 22 08:55:10 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
Feb 22 08:55:10 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 11: 8800004600800093
Feb 22 08:55:10 localhost kernel: TSC 0 ADDR 0 MISC 4900002000200c8c PROCESSOR 0:206d7 TIME 1613951710 SOCKET 0 APIC 0
Feb 22 08:55:11 localhost kernel: EDAC MC0: CE row 3, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x2f66e53c0 => socket=0, Channel=3(mask=8), rank=0
## EDAC CE (correctable error) log: an "Unknown error" occurred on the memory labeled "CPU_SrcID#0_Channel#3_DIMM#0".
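The hexadecimal value after "Bank N:" in the MCE lines is the bank's status register, and its fields can be checked by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM; the function name `decode_mci_status` and the dictionary keys are illustrative, and the corrected-error count field is only meaningful on CPUs that implement it.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.

# MMM field of the memory-controller error code (0000 0000 1MMM CCCC)
MEM_TXN = {
    0b000: "generic/undefined",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status_hex):
    """Return a dict of decoded fields for a hex status string from the kernel log."""
    s = int(status_hex, 16)
    mca_code = s & 0xFFFF
    fields = {
        "valid": bool((s >> 63) & 1),            # VAL
        "overflow": bool((s >> 62) & 1),         # OVER
        "uncorrected": bool((s >> 61) & 1),      # UC (0 means corrected)
        "misc_valid": bool((s >> 59) & 1),       # MISCV
        "addr_valid": bool((s >> 58) & 1),       # ADDRV
        "context_corrupt": bool((s >> 57) & 1),  # PCC
        "corrected_count": (s >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (s >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors use the compound code 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["transaction"] = MEM_TXN.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return fields
```

Calling `decode_mci_status("8c00004000010093")` on the Bank 5 value reports a valid, corrected (UC=0) memory read on channel 3 with a valid address, which agrees with the `Err=0001:0093 (ch=3)` text in the EDAC line.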
2. EDAC
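- EDAC (Error Detection And Correction) is a Linux kernel subsystem that supports detecting and correcting hardware errors.
- It also supports detecting PCI bus transfer errors and errors in peripheral devices.
- MCE-related logs are recorded by EDAC, the OS-level memory monitoring facility, which is less precise than the hardware's own memory monitoring.
Occasionally EDAC's sensitive engine logs an error even though no real fault exists.
- When such a message appears, cross-check it against the hardware logs (iLO, IML); if they show no problem, either ignore the message or disable MCE detection in the OS.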
3. Types of errors
- Correctable Error (CE) - the error detection mechanism detected and corrected the error. Such errors are usually not fatal, although some Kernel mechanisms allow the system administrator to consider them as fatal.
- Uncorrected Error (UE) - the number of errors exceeded the error correction threshold, and the system was unable to auto-correct.
- Fatal Error - when a UE happens on a critical component of the system (for example, a piece of the Kernel got corrupted by a UE), the only reliable way to avoid data corruption is to hang or reboot the machine.
- Non-fatal Error - when a UE happens on an unused component, like a CPU in power down state or an unused memory bank, the system may still run, eventually replacing the affected hardware by a hot spare, if available.
4. Locating the memory
## Not a dual-controller setup; there is a single controller, mc0.
[root@localhost mc0]# pwd
/sys/devices/system/edac/mc/mc0
[root@localhost mc0]# tree
.
├── ce_count
├── ce_noinfo_count
├── csrow0
│ ├── ce_count
│ ├── ch0_ce_count
│ ├── ch0_dimm_label
│ ├── dev_type
│ ├── edac_mode
│ ├── mem_type
│ ├── size_mb
│ └── ue_count
├── csrow1
│ ├── ce_count
│ ├── ch0_ce_count
│ ├── ch0_dimm_label
│ ├── dev_type
│ ├── edac_mode
│ ├── mem_type
│ ├── size_mb
│ └── ue_count
├── csrow2
│ ├── ce_count
│ ├── ch0_ce_count
│ ├── ch0_dimm_label
│ ├── dev_type
│ ├── edac_mode
│ ├── mem_type
│ ├── size_mb
│ └── ue_count
├── csrow3
│ ├── ce_count
│ ├── ch0_ce_count
│ ├── ch0_dimm_label
│ ├── dev_type
│ ├── edac_mode
│ ├── mem_type
│ ├── size_mb
│ └── ue_count
├── device -> ../../../../pci0000:3f/0000:3f:0e.0
├── mc_name
├── reset_counters
├── sdram_scrub_rate
├── seconds_since_reset
├── size_mb
├── ue_count
└── ue_noinfo_count
## Total ue_count and ce_count
## No UEs were logged; 34 CEs were logged in total.
[root@localhost mc0]# cat ue_count
0
[root@localhost mc0]# cat ce_count
34
## ce_count per csrow
## CEs were logged on csrow0 and csrow3.
[root@localhost mc0]# cat csrow*/c*
7
7
CPU_SrcID#0_Channel#0_DIMM#0
0
0
CPU_SrcID#0_Channel#1_DIMM#0
0
0
CPU_SrcID#0_Channel#2_DIMM#0
27
27
CPU_SrcID#0_Channel#3_DIMM#0
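The `cat csrow*/c*` output above prints `ce_count`, `ch0_ce_count`, and `ch0_dimm_label` for each csrow in turn, which is easy to misread. As a minimal sketch, the same files can be read into labeled rows; it assumes the legacy per-csrow sysfs layout shown in the tree above with a single `ch0` per csrow, and the function name `read_csrow_counts` is illustrative.

```python
from pathlib import Path


def read_csrow_counts(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return (csrow, label, ce_count, ue_count) tuples for one memory controller.

    Assumes each csrow directory holds ce_count, ue_count and ch0_dimm_label,
    as in the tree listing above. Other kernels may lay these files out differently.
    """
    rows = []
    # Plain lexicographic sort; adequate for csrow0..csrow9.
    for csrow in sorted(Path(mc_dir).glob("csrow*")):
        label = (csrow / "ch0_dimm_label").read_text().strip()
        ce = int((csrow / "ce_count").read_text())
        ue = int((csrow / "ue_count").read_text())
        rows.append((csrow.name, label, ce, ue))
    return rows


# Example use on the affected host:
#   for name, label, ce, ue in read_csrow_counts():
#       print(f"{name}  {label}  ce={ce}  ue={ue}")
```

Printing one row per csrow keeps each counter next to its DIMM label, so the 7 and 27 counts can be attributed to Channel#0 and Channel#3 without counting lines.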
## Memory information
## The machine has one CPU socket, with four DIMMs installed in DIMM_A1 through DIMM_A4.
[root@localhost mc0]# dmidecode -t memory | more
# dmidecode 2.12
SMBIOS 2.7 present.
Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 768 GB
Error Information Handle: Not Provided
Number Of Devices: 24
Handle 0x1100, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: 1
Locator: DIMM_A1
Bank Locator: Not Specified
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Manufacturer: 00CE04B300CE
Serial Number: 8606B48E
Asset Tag: 02104811
Part Number: M393B1K70CH0-YH9
Rank: 2
Configured Clock Speed: 1333 MHz
Handle 0x1101, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: 1
Locator: DIMM_A2
Bank Locator: Not Specified
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Manufacturer: 00CE04B300CE
Serial Number: 8606ADAD
Asset Tag: 02104811
Part Number: M393B1K70CH0-YH9
Rank: 2
Configured Clock Speed: 1333 MHz
Handle 0x1102, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: 1
Locator: DIMM_A3
Bank Locator: Not Specified
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1600 MHz
Manufacturer: 00CE04B300CE
Serial Number: 39839EF9
Asset Tag: 02144421
Part Number: M393B1K70PH0-YK0
Rank: 2
Configured Clock Speed: 1333 MHz
Handle 0x1103, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: 1
Locator: DIMM_A4
Bank Locator: Not Specified
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1600 MHz
Manufacturer: 00CE04B300CE
Serial Number: 39839F14
Asset Tag: 02144421
Part Number: M393B1K70PH0-YK0
Rank: 2
Configured Clock Speed: 1333 MHz
The exact physical location of each DIMM cannot be determined from this output, so the mapping can only be estimated as in the table below.
| | ch0 | ch1 | ch2 | ch3 |
|---|---|---|---|---|
| csrow0 | DIMM_A1 | | | |
| csrow1 | DIMM_A2 | | | |
| csrow2 | DIMM_A3 | | | |
| csrow3 | DIMM_A4 | | | |
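5. Conclusion
- ce_count is currently nonzero on DIMM_A1 and DIMM_A4, and judging by the kernel log, DIMM_A4 is the module producing the recent errors.
- For now we will keep watching the CE log; if the errors keep accumulating or other kinds of errors appear, we should plan to replace the module.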
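To keep watching the CE log, the EDAC lines in `/var/log/messages` can be tallied per DIMM label. The following is a minimal sketch, assuming the message format seen in the kernel log excerpts in section 1; that format differs between kernel versions, and the names `EDAC_CE` and `tally_ce` are illustrative.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kernel: EDAC MC0: CE row 3, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): ...
# Groups: controller, row, channel, label, error count.
EDAC_CE = re.compile(
    r'EDAC MC(\d+): CE row (\d+), channel (\d+), label "([^"]*)": (\d+) '
)


def tally_ce(lines):
    """Sum the reported correctable-error counts per DIMM label."""
    counts = Counter()
    for line in lines:
        m = EDAC_CE.search(line)
        if m:
            counts[m.group(4)] += int(m.group(5))
    return counts


# Example use on the affected host:
#   with open("/var/log/messages") as fh:
#       for label, n in tally_ce(fh).most_common():
#           print(n, label)
```

Running this periodically and comparing the totals shows whether errors are still accumulating on the same label or starting to appear on another one. The sysfs counters restart from zero after a reboot or driver reload, so the log tally is a useful second view.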