system panic and mpstat

2007-12-25 10:27:00

Dear sun-managers,

My original question was:

I have an Ultra-1 SPARC machine with a 200 MHz CPU (properly patched) which

panicked with an Ecache Data Parity Error. I found some entries in the

database of sun-managers saying that some CPUs had a problem with the

cache, but I do not know whether revision 270-2702-04 Rev.: 01 (printed

on the bottom of the processor card) is part of the defective series. Can

anyone tell me, please?

What does the number 270-2702-04 Rev.: 01 mean, field by field? Is there a

special meaning in these numbers at all?

Another question is about the output of mpstat:

My server is an Ultra-2 with 2x300 MHz CPUs (also properly patched). In the last

few days we had some panics with "Copyout Data Parity Error", and mpstat says

that there are up to 160 minor and 1600 (!!) major faults on the CPU. What do

minor and major faults mean? Do we have a hardware failure?

Thank you for all replies!

Answers, in no particular order, came from:

Ray Delany

Kevin

Robert Hill

Earl Locken

Thomas Anders

The answer (from Earl Locken) was:

>Dear sun-managers,

>

>I have a Ultra-1 Sparc machine with a 200 Mhz cpu (properly patched) which

>panicked with an Ecache Data Parity Error. I found some entries in the

>database of sun-managers, saying that some CPUs had a problem with the

>cache. But I do not know, if the revision of 270-2702-04 Rev.: 01 (standing

>on the bottom of the processor card) is part of the defective series. Can

>anyone tell me please ?

     The CPU is bad. It may not be a manufacturing defect; it could

just be an electrical component on the module starting to fail.

...

>Another question is about the output of mpstat :

>My server is an Ultra-2 at 2x300 MHz (even properly patched). In the last

>days we had some panics with "Copyout Data Parity Error" and mpstat says

>that there are up to 160 minor and 1600 (!!) major fault on the cpu. What do

>minor and major faults mean ? Do we have a hardware failure ?

    A copyout error occurs when one CPU reports that another did not

respond within a timeout. The destination CPU, not the one reporting,

is the one that is defective. Odds are the ecache errors and the copyout

errors show the same defective CPU.

     Major faults and minor faults are paging statistics. A major fault

means the OS had to go all the way to disk to get the page. A minor fault

means the OS found the page still cached in RAM even though the page was

no longer referenced. Compared with major faults, minor faults are a

significant performance win.
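
To watch these two counters over time, the per-CPU figures can be pulled out of saved mpstat output. The following is only a minimal sketch: it assumes the usual Solaris column header, in which minf is minor faults and mjf is major faults. The sample text is invented for illustration apart from the 160 and 1600 figures quoted in the question, and the helper names and the threshold of 100 are hypothetical.

```python
# Illustrative mpstat-style text. Only the 160/1600 fault counts come from the
# question above; every other number here is made up for the example.
SAMPLE = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  160 1600   0   412  210  350   12    5    3    0   980   10  25  40  25
  1   35    2   0   120   15  300   10    5    2    0   870   12   8   5  75
"""


def parse_mpstat(text):
    """Return {cpu_id: {"minf": int, "mjf": int}} from mpstat-style text.

    If the text holds several reporting intervals, later rows for the same
    CPU overwrite earlier ones, so the result reflects the last interval.
    """
    faults = {}
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            # Header row: remember the column names for the rows that follow.
            header = fields
            continue
        if header is None or len(fields) != len(header):
            # Skip anything that does not line up with the header.
            continue
        row = dict(zip(header, fields))
        faults[int(row["CPU"])] = {
            "minf": int(row["minf"]),
            "mjf": int(row["mjf"]),
        }
    return faults


def cpus_over(faults, mjf_limit):
    """Return the sorted CPU ids whose major-fault count exceeds mjf_limit."""
    return sorted(cpu for cpu, f in faults.items() if f["mjf"] > mjf_limit)


if __name__ == "__main__":
    parsed = parse_mpstat(SAMPLE)
    for cpu in cpus_over(parsed, 100):  # 100 is an arbitrary example limit
        print("CPU", cpu, "major faults:", parsed[cpu]["mjf"])
```

Because these are paging statistics, a high mjf value by itself points at pages being fetched from disk; it is the parity-error panics, not the fault counts, that the answer above attributes to a failing CPU.
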

Many thanks again; that solved the problem.

Many thanks in this way as well to Mr. Locken.

Juergen Schultz

Juergen.Schultz@m.dasa.de
