[hardware] Memory EDAC-related logs (2)

Checking the logs


Kernel log

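  • MC1 (Memory Controller 1) reported a CE (correctable error)
  • It occurred on the DIMM labeled "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0"
  • The error address is "2ab71f000" (page 0x2ab71f, offset 0x0 in the EDAC line)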
[57034.062252] mce: [Hardware Error]: Machine check events logged
[57034.062274] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[57034.062277] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 11: cc002002000800c2
[57034.062279] EDAC sbridge MC1: TSC 0 
[57034.062281] EDAC sbridge MC1: ADDR 2ab71f000 
[57034.062283] EDAC sbridge MC1: MISC 90848988190848c 
[57034.062284] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1654220627 SOCKET 0 APIC 0
[57034.907899] EDAC MC1: 128 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 page:0x2ab71f offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:4 rank:255)
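
The ADDR and TIME fields in the log above can be converted by hand. The following is a minimal sketch, assuming 4 KiB pages (so the page number is the address shifted right by 12 bits) and GNU `date` for the epoch conversion. The function names `mce_page` and `mce_time_utc` are hypothetical helpers, not part of any tool.

```shell
# Split a physical address (hex digits, no 0x prefix) into page and offset.
# Assumption: 4 KiB pages, i.e. the low 12 bits are the in-page offset.
mce_page() {
    addr=$((0x$1))
    printf 'page:0x%x offset:0x%x\n' "$((addr >> 12))" "$((addr & 0xfff))"
}

# Convert the epoch seconds printed in the TIME field to a UTC timestamp.
# Assumption: GNU date (the -d @SECONDS form).
mce_time_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}
```

For example, `mce_page 2ab71f000` prints `page:0x2ab71f offset:0x0`, which agrees with the `page:0x2ab71f offset:0x0` fields in the EDAC line.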

 

mcelog

  • Reported on CPU 0, bank 11
> mcelog
CPU 0 BANK 11 
MISC 90848988190848c ADDR 2ab71f000 
TIME 1654169011 Thu Jun  2 20:23:31 2022
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
STATUS cc002002000800c2 MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0 
PPIN 88e84d587df6b085
MICROCODE b00001a
CPUID Vendor Intel Family 6 Model 79
Hardware event. This is not a software error.
...
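
The flags that mcelog prints can also be read directly from the STATUS value. The sketch below assumes the architectural IA32_MCi_STATUS layout (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 52:38 corrected error count, 31:16 model-specific code, 15:0 MCA code) and a full 16-digit hex input. `decode_mci_status` is a hypothetical helper name. The value is split into two 32-bit halves so the arithmetic stays within the positive range of shell integers.

```shell
# Decode selected fields of a 16-hex-digit MCi_STATUS value (no 0x prefix).
# Assumption: architectural bit layout as described in the lead-in.
decode_mci_status() {
    hi_hex=$(printf '%s' "$1" | cut -c1-8)    # bits 63..32
    lo_hex=$(printf '%s' "$1" | cut -c9-16)   # bits 31..0
    st_hi=$((0x$hi_hex))
    st_lo=$((0x$lo_hex))
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        "$(( (st_hi >> 31) & 1 ))" "$(( (st_hi >> 30) & 1 ))" \
        "$(( (st_hi >> 29) & 1 ))" "$(( (st_hi >> 28) & 1 ))" \
        "$(( (st_hi >> 27) & 1 ))" "$(( (st_hi >> 26) & 1 ))" \
        "$(( (st_hi >> 25) & 1 ))"
    # Bits 52:38 of the full value are bits 20:6 of the upper half.
    printf 'ce_count=%d\n' "$(( (st_hi >> 6) & 0x7fff ))"
    printf 'mscod=0x%04x mcacod=0x%04x\n' \
        "$(( (st_lo >> 16) & 0xffff ))" "$(( st_lo & 0xffff ))"
}
```

Running `decode_mci_status cc002002000800c2` reports OVER=1 and UC=0 (overflow, corrected), MISCV=1 and ADDRV=1, `ce_count=128`, and `mscod=0x0008 mcacod=0x00c2`, which lines up with the "128 CE" and `err_code:0008:00c2` fields in the kernel log.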

 

 

Identifying the DIMM by CE count


# check the CE count
> grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:895
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0

# check the label
> cat /sys/devices/system/edac/mc/mc1/csrow0/ch2_dimm_label 
CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
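
The two lookups above can be combined so that only channels with a nonzero count are printed together with their labels. This is a minimal sketch that assumes the csrow-style sysfs layout shown above. `list_nonzero_ce` is a hypothetical helper, and its optional argument exists only so that a different base directory can be supplied.

```shell
# Print "count label path" for every channel whose CE counter is nonzero.
# Assumption: the chN_ce_count / chN_dimm_label layout shown above.
list_nonzero_ce() {
    edac_base=${1:-/sys/devices/system/edac/mc}
    for ce_file in "$edac_base"/mc*/csrow*/ch*_ce_count; do
        [ -r "$ce_file" ] || continue            # unmatched glob or unreadable
        ce_count=$(cat "$ce_file")
        [ "$ce_count" -gt 0 ] 2>/dev/null || continue
        label_file=${ce_file%_ce_count}_dimm_label
        label=unknown
        [ -r "$label_file" ] && label=$(cat "$label_file")
        printf '%s %s %s\n' "$ce_count" "$label" "$ce_file"
    done
}
```

Calling `list_nonzero_ce` with no argument reads the live counters under /sys/devices/system/edac/mc.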

 

edac-utils

  • Install
yum install edac-utils
  • Check
    • Result: mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 895 Corrected Errors
> edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 895 Corrected Errors
mc1: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
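
On a machine with many DIMMs the verbose listing gets long, so it can help to keep only the per-DIMM lines with a nonzero corrected count. The filter below is a sketch that assumes the `edac-util -v` line format shown above, where the count is the third field from the end. `nonzero_corrected` is a hypothetical name.

```shell
# Read `edac-util -v` output on stdin and keep per-DIMM lines whose
# corrected-error count is greater than zero.
# Assumption: lines end in "<count> Corrected Errors" as in the output above.
nonzero_corrected() {
    awk 'NF >= 3 && $(NF-1) == "Corrected" && $NF == "Errors" && $(NF-2) + 0 > 0'
}
```

Usage: `edac-util -v | nonzero_corrected`.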

 

Checking physical memory information


dmidecode

  • Look up the entry that corresponds to CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
> dmidecode -t memory | grep -C 4 "Channel2_Dimm0"
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM020
	Bank Locator: _Node0_Channel2_Dimm0
	Type: DRAM
	Type Detail: Synchronous Registered (Buffered)
	Speed: 2133 MT/s
	Manufacturer: Hynix
---
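
The grep pattern above was derived from the EDAC label by hand. The sketch below does the same conversion, assuming that the board's Bank Locator strings follow the `_Node<N>_Channel<C>_Dimm<D>` form seen in this output and that the node number equals the SrcID. Both are assumptions that depend on the vendor's SMBIOS tables. `edac_label_to_bank_locator` is a hypothetical helper.

```shell
# Turn an sbridge-style EDAC label into a Bank Locator search string.
# Assumptions: Node number == SrcID, and the board uses the
# Node<N>_Channel<C>_Dimm<D> naming seen in the dmidecode output above.
edac_label_to_bank_locator() {
    printf '%s\n' "$1" | sed -n \
        's/^CPU_SrcID#\([0-9][0-9]*\)_Ha#[0-9][0-9]*_Chan#\([0-9][0-9]*\)_DIMM#\([0-9][0-9]*\)$/Node\1_Channel\2_Dimm\3/p'
}
```

Usage: `dmidecode -t memory | grep -C 4 "$(edac_label_to_bank_locator 'CPU_SrcID#0_Ha#0_Chan#2_DIMM#0')"`. A label that does not match the expected form produces an empty string, so check the result before passing it to grep.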

 

 

  • Alternatively, the DIMM can be searched for by physical memory address
    • The error address is "2ab71f000"; the address range mapped to each physical memory device can be listed as follows.
dmidecode -t 20
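
Comparing the address against each range by eye is error-prone, so the comparison can be scripted. This is a sketch only: it assumes each type 20 record prints `Starting Address: 0x...`, `Ending Address: 0x...` and `Physical Device Handle: 0x...` in that order with plain hexadecimal byte addresses, and `find_dimm_handle` is a hypothetical name. On systems that interleave memory across channels, a range may not correspond to a single DIMM, so treat the result as a hint rather than a definitive answer.

```shell
# Read `dmidecode -t 20` output on stdin and print the Physical Device
# Handle of every mapped range that contains the given hex address.
# Assumption: field names and ordering as described in the lead-in.
find_dimm_handle() {
    target=$((0x$1))
    start_hex=0
    end_hex=0
    while IFS= read -r line; do
        case $line in
            *"Starting Address: 0x"*) start_hex=${line##*0x} ;;
            *"Ending Address: 0x"*)   end_hex=${line##*0x} ;;
            *"Physical Device Handle: "*)
                handle=${line##*: }
                if [ "$target" -ge "$((0x$start_hex))" ] &&
                   [ "$target" -le "$((0x$end_hex))" ]; then
                    printf '%s\n' "$handle"
                fi ;;
        esac
    done
}
```

Usage: `dmidecode -t 20 | find_dimm_handle 2ab71f000`. The printed handle can then be matched against the `Handle 0x...` header lines of `dmidecode -t 17` to find the Locator of the device.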

 

 

 

Reference

  • dmidecode -t type numbers
0   BIOS
1   System
2   Baseboard 
3   Chassis   
4   Processor      
5   Memory Controller      
6   Memory Module    
7   Cache       
8   Port Connector    
9   System Slots      
10   On Board Devices     
11   OEM Strings    
12   System Configuration Options  
13   BIOS Language     
14   Group Associations  
15   System Event Log       
16   Physical Memory Array   
17   Memory Device    
18   32-bit Memory Error  
19   Memory Array Mapped Address   
20   Memory Device Mapped Address 
21   Built-in Pointing Device      
22   Portable Battery   
23   System Reset      
24   Hardware Security  
25   System Power Controls  
26   Voltage Probe    
27   Cooling Device   
28   Temperature Probe  
29   Electrical Current Probe   
30   Out-of-band Remote Access  
31   Boot Integrity Services     
32   System Boot    
33   64-bit Memory Error     
34   Management Device      
35   Management Device Component     
36   Management Device Threshold Data     
37   Memory Channel 
38   IPMI Device    
39   Power Supply    
40   Additional Information     
41   Onboard Devices Extended Information     
42   Management Controller Host Interface