(LONG): Locating bad memory chip in 3/160 memory board

2007-12-25 7:46:00

Hello all,

About a month ago, I had asked:

>A bit in one of the memory boards in one of our 3/160's has apparently

>gone bad. Booting this machine with the 'diag' switch set gives the

>error

>

> Err 11: Parity Error at 0x0063C004

> Exp 0x5A972C5A, Obs 0x5A172C5A, Xor 0x00800000

>

>during the memory test. Anyone know how to find the offending chip

>given the address reported in the error? The board in question is a

>Sun 4MB memory board, Sun part# 501-1132 (Rev 52), is located in the

>second slot of the card cage, next to the CPU board (which has 4MB of

>memory itself), and is populated with an 8x18 array of Mitsubishi

>MN41256-12 chips. Anyone know the specs for this chip (ie., static or

>dynamic, size, speed (120ns I assume))?

With the help of this newsgroup, I was able to track down the offending

chip and replace it. The board is now working again with all 4MB in our

3/160. Total repair cost: $4 (I replaced two chips, the first by

mistake!) and probably 10 cents in solder.

I've attached below a summary of how to track down the faulty chip which

I prepared from the information I received.

Many thanks to those who responded:

uug@cpsc.ucalgary.ca (William Graham)

nanook@eskimo.celestial.com (Robert Dinse)

tarsa@elijah.mv.com (Greg ...)

boyle@wrl.dec.com (Patrick Boyle)

And any others I've left out!

John Valdes                        Department of the Geophysical Sciences
valdes@geosun.uchicago.edu         University of Chicago

---------------------------------------------------------------------------

                      Repairing a Sun 4MB Memory Board

                                John Valdes

                           University of Chicago

                         valdes@geosun.uchicago.edu

                                   6/4/92

Introduction

------------

  This report is based on my experiences and the information I received from

the newsgroup comp.sys.sun.hardware when a Sun Microsystems Inc. 4MB memory

board, part# 501-1132, in one of our Sun 3/160's failed. The information

below may also apply to Sun's 2MB memory board, part# 501-1131, as this

board may have the same chip arrangement (I haven't verified this, however).

These boards were used in Sun's 3/1x0 series of computers (and perhaps in

others). The information given below is correct to the best of my

knowledge, but of course, it is presented without warranty and neither I nor

the U. of Chicago can assume any responsibility for anything that may happen

as a result of it (hey, it worked for me...!). I would be glad to receive

any corrections or clarifications.

Diagnosing the Problem

----------------------

  If a chip in one of the memory boards in your Sun3 fails, you will most

likely discover it while the computer is running unix. Your system will

probably panic with an "unknown memory error" similar to:

  vmunix: Memory Error Register d4<INTR,INTENA,CHECK,ERR16>

          d4<INTR,INTENA,CE_ENA,WBACKERR>

  vmunix: DVMA = 0, context 4, virtual address = de06008

  vmunix: pme = d300031e, physical address = 63c008

  vmunix: panic: unknown memory error

  vmunix: syncing file systems...

The exact message, of course, will depend on the location of the error, what

the machine was doing at the time, and the version of SunOS running on your

machine.

  To find the exact address of the error, you will need to boot your Sun3 in

DIAG mode in order to run a complete memory test. If not already there,

bring your machine down to the PROM monitor (the prompt will be a '>') and

move the switch on the back of the CPU board from 'NORM' to 'DIAG'. Then,

with a terminal attached to serial port A on the CPU board (set the terminal

characteristics to 9600 baud, 8 data bits, 1 stop bit, no parity;

alternatively, you can connect a terminal to serial port B at 1200 baud),

either type 'k2' at the monitor prompt or press the 'RESET' button on the

CPU board to restart the machine. The machine will then run through a

series of self tests, as it normally does, but with the switch set to DIAG,

the machine will also echo the progress of the self tests to the terminal.

For 3/1x0 machines your terminal display should look something like:

  Boot PROM Selftest

    PROM Checksum Test

    DVMA Reg Test

    Context Reg Test

    ...

    Parity Test

    Memory Size = 0x00000xxx Megabytes

    Memory Test (testing xxxxxxxx MBytes)

The last test to run is the memory test. If your machine doesn't make it

this far, then something else is wrong with your system. Consult Sun's

"PROM User's Manual" for a description of the test which failed.

  Once at the memory test, the firmware will test all of the memory

installed in the machine, regardless of the current setting of the EEPROM

memory test parameter. If your system does indeed have a bad memory chip,

the memory test will fail with an error similar to

  Err 11: Parity Error at 0x0063C004

  Exp 0x5A972C5A, Obs 0x5A172C5A, Xor 0x00800000

and the system will continue to loop at this point (and, hence, any other

bad addresses following this one will not be found). The exact message may

depend on your PROM level. In any case, the message will tell you the start

address of the 4-byte word containing the error, the value written to the

word at that address (Exp), and the value read from the address (Obs).

Write down the message, as you will need the address, the xor value and the

error number to determine the defective chip. The xor value indicates which

data bit is in error (if no xor value is reported, simply compute it from

the Exp and Obs values). If the xor value is zero (0x00000000), then the

error is in a parity bit. In this case the error number will indicate which

bit of the four parity bits in the word is bad (the error number should be

0xd8, 0xd4, 0xd2 or 0xd1).
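
If your PROM does not print the xor value, it is easy to compute by hand or
with a few lines of code. The following is only a minimal sketch in Python
(my choice of language for illustration; nothing like it ships with the
PROM). The function name failing_bits is made up. It XORs the Exp and Obs
values and lists the bit positions that differ, counting bit 0 as the least
significant bit.

```python
def failing_bits(exp, obs):
    """Return (xor, bits) for two 32-bit words.

    xor  -- Exp XOR Obs, masked to 32 bits
    bits -- list of bit positions (0 = least significant) that differ
    """
    xor = (exp ^ obs) & 0xFFFFFFFF
    bits = [b for b in range(32) if (xor >> b) & 1]
    return xor, bits


# Values taken from the example error message above.
xor, bits = failing_bits(0x5A972C5A, 0x5A172C5A)
print(hex(xor), bits)   # prints: 0x800000 [23]
```

An empty list of bits corresponds to an xor value of zero, i.e. the parity
bit case described above.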

  If the memory self-test completes successfully without reporting any

errors, run the self-test two or three more times. If all tests succeed,

credit the initial error to cosmic rays, reboot the machine (don't forget

to set the DIAG switch back to NORM), and get back to work!

Locating the Offending Chip

---------------------------

  Given the address and the xor value reported by the memory self-test, it

is fairly straightforward to locate the bad memory chip. From the address

first determine which memory board (if you have more than one) contains the

bad chip. The board with the bad chip is the one which has the largest base

address which is less than or equal to the address of the error. Memory is mapped

sequentially between boards, so if your system has two memory boards, for

example, with 4MB installed on the CPU board, 2MB of memory on the first

memory board, and 4MB on the second memory board, then the 2MB board has a

base address of 0x400000, and the 4MB board has a base address of 0x600000

(the CPU board always has a base address of 0x0, of course). Typically, the

memory boards are installed from left to right in the card cage (when

looking at the system from the back) in order of increasing base address, so

that the first memory board will be the first memory board located to the

right of the CPU board (there may be an FPA or graphics board between the

CPU board and the first memory board), the second memory board will be the

second memory board to the right of the CPU board, and so on. Ultimately,

however, the base address of the board is determined by a set of switches or

jumpers on the board itself, so it may be possible to have the order of the

memory boards shuffled in the card cage. Assuming that the boards are

installed in order, then for each one, subtract the base address of the

board from the address reported by the memory error, and if the result lies

within the capacity of the board, then that's the board with the bad chip.

If your system has 4MB of memory on the CPU board, and the memory error is

located at an address less than 0x400000 (or at an address less than

0x200000 if there's only 2MB of memory on the CPU board), then the bad chip

is on the CPU board itself. In this case, the information below will be of

little use in helping you find it.
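
The subtraction procedure above can be sketched in a few lines of Python.
This is only an illustration under the assumption stated in the text, namely
that memory is mapped sequentially and the boards are listed in order of
increasing base address with the CPU board first. The function name
find_board and its argument layout are made up; the real base address is
whatever the switches on the board say.

```python
def find_board(addr, capacities_mb):
    """Find which board holds a physical address.

    capacities_mb -- memory sizes in MB, in address order, CPU board first
    Returns (index, base, offset), or None if addr is beyond installed memory.
    """
    base = 0
    for index, mb in enumerate(capacities_mb):
        size = mb * 0x100000          # 1MB = 0x100000 bytes
        if base <= addr < base + size:
            return index, base, addr - base
        base += size                  # next board starts where this one ends
    return None


# Example configuration from the text: 4MB on the CPU board,
# then a 2MB memory board, then a 4MB memory board.
index, base, offset = find_board(0x63C004, [4, 2, 4])
print(index, hex(base), hex(offset))   # prints: 2 0x600000 0x3c004
```

Index 0 is the CPU board, so in this configuration the error falls on the
second memory board (the 4MB board based at 0x600000).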

  Once you've located the board you believe to contain the bad chip, remove

it from the card cage (be sure that the power is OFF before removing it!!!)

and verify that the base address of the board is set to what you think it

should be by checking the switch settings. For the Sun 4MB board (and the

2MB board) there are two DIP switches, U3118 and U3119, located as shown

below for setting the base address of the board.

        V |

        M +-|

        E | |

           | |

        C | | +----- short for 2MB Board

        o | | |

        n | | | +-- short for 4MB Board

        n | | | |

        e | | V V

        c | | o o +------+ +------+ +------+ +------+

        t | | I | DIP | | DIP | | DIP | | DIP | . . .

        o | | o o +------+ +------+ +------+ +------+

        r +-| jumper

             |

             | +----+ +----+

             | | | | |

             | | | | |

             | | | | |

             | +----+ +----+

             | U3118 U3119

             |

                                      

        Location of switches U3118 and U3119 (Based on diagram from

             "Sun 3/160 Hardware Installation Manual," pg. 50)

The switches will set the base address of the board as given in the table

below.

           +----------------------------------------------------+
           |  Base Address  | U3118 setting^  | U3119 setting^  |
           |----------------|-----------------|-----------------|
           |    0x200000    |      2 ON       |      3 ON       |
           |    0x400000    |      3 ON       |      4 ON       |
           |    0x600000    |      4 ON       |      5 ON       |
           |    0x800000    |      5 ON       |      6 ON       |
           |    0xA00000    |      6 ON       |      7 ON       |
           |    0xC00000    |      7 ON       |      8 ON       |
           +----------------------------------------------------+

            ^switches other than the one specified are OFF

                                      

             Switch settings for 4MB board (Based on table from

             "Sun 3/160 Hardware Installation Manual," pg. 51)

(The switch settings for Sun's 2MB board are:

                    +----------------------------------+
                    |  Base Address  |  U3118 setting  |
                    |----------------|-----------------|
                    |    0x200000    |      2 ON       |
                    |    0x400000    |      3 ON       |
                    |    0x600000    |      4 ON       |
                    |    0x800000    |      4 ON       |
                    |    0xA00000    |      4 ON       |
                    |    0xC00000    |      4 ON       |
                    |    0xE00000    |      4 ON       |
                    +----------------------------------+

                                      

             Switch settings for 2MB board (Based on table from

             "Sun 3/160 Hardware Installation Manual," pg. 51)

The settings for 0x800000 through 0xE00000 look odd to me, but this is what

the manual shows.)

If the board you removed wasn't the last memory board in the system, you

should reconfigure the other memory boards to plug the hole in the address

space left by the one you removed. To do this, you'll have to set the

appropriate switches on one or more of the boards to set the correct base

address for it. It is also a good idea to physically order the boards by

base address as mentioned previously--this may actually be necessary in

order for the system to work.

  With the suspect memory board removed and the other memory boards

correctly configured, and with the DIAG switch still set on the CPU board,

power up the system in order to run memory diagnostics again. If the system

now passes the memory test, then you've found the correct board and the

others are properly configured. If the memory test fails with the same

error at the same location (or at an integral multiple of MB elsewhere),

then you've removed the wrong board; power down the system and try again (if

the error was off by an integral multiple of MB, then the bad board is one

of the ones you've reconfigured). If the memory test fails with a

completely different error, then you may have another bad memory board, or

perhaps a problem with the VME backplane or VME connectors to the memory

board. After the system passes the diagnostic self-tests, and if you decide

to fully reboot the machine without the memory board, be sure to set the

DIAG switch back to NORM and to adjust the EEPROM values for memory size

(q14) and memory to test (q15) appropriately.

  Finally, you're ready to locate the bad chip on the board itself. The

501-1132 4MB memory board has an 8 row by 18 column array of memory chips.

The rows are indexed by letter (A, B, C, D, E, F, H, J) while the columns

are indexed by number (4-21) as silk-screened onto the board. The 8 rows

can be subdivided into 4 "row pairs" or "banks", which are similar to SIMM

banks in that each 4-byte word is contained within a single bank. The four

banks are formed by the row pairs as follows:

   Bank 0: Row pair J,H

   Bank 1: Row pair F,E

   Bank 2: Row pair D,C

   Bank 3: Row pair B,A

The bits for each word in a bank are arranged among the columns according to

  21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  <- column number
  -----------------------------------------------------
   P 16 17 18 19 20 21 22 23  P 24 25 26 27 28 29 30 31  <- bit numbers of
                                                            rows J,F,D,B
   P  0  1  2  3  4  5  6  7  P  8  9 10 11 12 13 14 15  <- bit numbers of
                                                            rows H,E,C,A

where P stands for a parity bit. Bit 31 is the most significant bit in the

word, and bit 0 the least.
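
The bit-to-column table above follows a regular pattern, which can be
captured in a short Python sketch. The function name column_for_bit is made
up, and the arithmetic is simply my reading of the table (within each row,
the first eight data bits occupy columns 20 down to 13 and the next eight
occupy columns 11 down to 4), so check the result against the table before
reaching for the soldering iron.

```python
def column_for_bit(bit):
    """Map a data bit (0..31) to its place in a bank, per the table above.

    Returns (upper_row, column):
      upper_row -- True for rows J,F,D,B (bits 16-31),
                   False for rows H,E,C,A (bits 0-15)
      column    -- silk-screened column number (4..21, never 12 or 21,
                   which hold the parity chips)
    """
    if not 0 <= bit <= 31:
        raise ValueError("data bit must be in the range 0..31")
    upper_row = bit >= 16
    pos = bit % 16                    # position within the row's 16 data bits
    column = 20 - pos if pos < 8 else 19 - pos   # skip parity column 12
    return upper_row, column


print(column_for_bit(23))   # prints: (True, 13)
print(column_for_bit(0))    # prints: (False, 20)
```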

  The 4MB on the board are divided into 0x800 byte (2K) partitions and

distributed circularly among the four banks. The first 2K is mapped to

bank 0, the next 2K to bank 1, the next to bank 2, then bank 3, then back

to bank 0, bank 1, and so on. The mapping for the first 256K is given

below.

  Bank 0 (Row pair J,H):

    00000->007FF, 02000->027FF, 04000->047FF,

    06000->067FF, 08000->087FF, 0A000->0A7FF,

    0C000->0C7FF, 0E000->0E7FF

    10000->107FF, 12000->127FF, 14000->147FF,

    16000->167FF, 18000->187FF, 1A000->1A7FF,

    1C000->1C7FF, 1E000->1E7FF

    20000->207FF, 22000->227FF, 24000->247FF,

    26000->267FF, 28000->287FF, 2A000->2A7FF,

    2C000->2C7FF, 2E000->2E7FF

    30000->307FF, 32000->327FF, 34000->347FF,

    36000->367FF, 38000->387FF, 3A000->3A7FF,

    3C000->3C7FF, 3E000->3E7FF

  Bank 1 (Row pair F,E):

    00800->00FFF, 02800->02FFF, 04800->04FFF,

    06800->06FFF, 08800->08FFF, 0A800->0AFFF,

    0C800->0CFFF, 0E800->0EFFF

    10800->10FFF, 12800->12FFF, 14800->14FFF,

    16800->16FFF, 18800->18FFF, 1A800->1AFFF,

    1C800->1CFFF, 1E800->1EFFF

    20800->20FFF, 22800->22FFF, 24800->24FFF,

    26800->26FFF, 28800->28FFF, 2A800->2AFFF,

    2C800->2CFFF, 2E800->2EFFF

    30800->30FFF, 32800->32FFF, 34800->34FFF,

    36800->36FFF, 38800->38FFF, 3A800->3AFFF,

    3C800->3CFFF, 3E800->3EFFF

  Bank 2 (Row pair D,C):

    01000->017FF, 03000->037FF, 05000->057FF,

    07000->077FF, 09000->097FF, 0B000->0B7FF,

    0D000->0D7FF, 0F000->0F7FF

    11000->117FF, 13000->137FF, 15000->157FF,

    17000->177FF, 19000->197FF, 1B000->1B7FF,

    1D000->1D7FF, 1F000->1F7FF

    21000->217FF, 23000->237FF, 25000->257FF,

    27000->277FF, 29000->297FF, 2B000->2B7FF,

    2D000->2D7FF, 2F000->2F7FF

    31000->317FF, 33000->337FF, 35000->357FF,

    37000->377FF, 39000->397FF, 3B000->3B7FF,

    3D000->3D7FF, 3F000->3F7FF

  Bank 3 (Row pair B,A):

    01800->01FFF, 03800->03FFF, 05800->05FFF,

    07800->07FFF, 09800->09FFF, 0B800->0BFFF,

    0D800->0DFFF, 0F800->0FFFF

    11800->11FFF, 13800->13FFF, 15800->15FFF,

    17800->17FFF, 19800->19FFF, 1B800->1BFFF,

    1D800->1DFFF, 1F800->1FFFF

    21800->21FFF, 23800->23FFF, 25800->25FFF,

    27800->27FFF, 29800->29FFF, 2B800->2BFFF,

    2D800->2DFFF, 2F800->2FFFF

    31800->31FFF, 33800->33FFF, 35800->35FFF,

    37800->37FFF, 39800->39FFF, 3B800->3BFFF,

    3D800->3DFFF, 3F800->3FFFF
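
The listing above only covers the first 256K, but the same pattern repeats
through the rest of the board. As a rough Python sketch (the function name
bank_ranges and its limit argument are made up for illustration), the ranges
for any bank can be regenerated from the 2K partition size and the four-bank
rotation described in the text:

```python
def bank_ranges(bank, limit=0x40000):
    """Yield (first, last) offsets of each 2K partition mapped to a bank.

    bank  -- bank number 0..3
    limit -- stop before this offset (0x40000 = the first 256K)
    """
    start = bank * 0x800              # first partition belonging to this bank
    while start < limit:
        yield start, start + 0x7FF
        start += 4 * 0x800            # the same bank comes around every 8K


# Print the first three ranges of each bank for comparison with the listing.
for bank in range(4):
    first = list(bank_ranges(bank))[:3]
    print(bank, ", ".join(f"{lo:05X}->{hi:05X}" for lo, hi in first))
```

Raising limit to 0x400000 covers the full 4MB of the board.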

  You can determine which bank contains the bad chip using the formula

    bank = ( addr / 0x800 ) % 0x4

where addr is the address of the error, '/' is the integer division operator

and '%' is the modulus operator. Then from the xor value of the error, you

can finally locate the bad chip using the bit-to-column mapping given above;

simply convert the xor value to binary and see which bit contains the '1'.

If the xor value is 0, then the error is in one of the four parity bits for

the word. In this case, use the error number to find the chip as follows:

    Error# 0xd8: parity bit for MSByte: {J,F,D,B}12

    Error# 0xd4: parity bit for MSByte-1: {J,F,D,B}21

    Error# 0xd2: parity bit for MSByte-2: {H,E,C,A}12

    Error# 0xd1: parity bit for LSByte: {H,E,C,A}21

For example, for the error

    Err 11: Parity Error at 0x0063C004

    Exp 0x5A972C5A, Obs 0x5A172C5A, Xor 0x00800000

bank = 0 and bit = 23. Hence, the bad chip is located at J13 on the memory

board.
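
The whole lookup (bank from the address, then row and column from the xor
value or the error number) can be strung together as in the Python sketch
below. It is an illustration built only from the tables in this report, not
a tested diagnostic tool. The names locate_chips, BANK_ROWS and PARITY_CHIPS
are made up, and I am assuming the error number for a parity-bit failure is
given as one of the four hex values listed above.

```python
# bank number -> row pair, upper row (bits 16-31) first, lower row second
BANK_ROWS = ["JH", "FE", "DC", "BA"]

# error number -> (index into the row pair, column) for parity-bit errors
PARITY_CHIPS = {0xD8: (0, 12), 0xD4: (0, 21), 0xD2: (1, 12), 0xD1: (1, 21)}


def locate_chips(addr, xor, err=None):
    """Return the list of suspect chip positions, e.g. ['J13']."""
    bank = (addr // 0x800) % 4        # 2K partitions rotate over four banks
    rows = BANK_ROWS[bank]
    if xor == 0:
        # Data bits agree, so the fault is in one of the four parity chips.
        if err not in PARITY_CHIPS:
            raise ValueError("xor is 0; need error number 0xd8/0xd4/0xd2/0xd1")
        which, column = PARITY_CHIPS[err]
        return [f"{rows[which]}{column}"]
    chips = []
    for bit in range(32):
        if (xor >> bit) & 1:
            pos = bit % 16            # position within the row's data bits
            column = 20 - pos if pos < 8 else 19 - pos
            row = rows[0] if bit >= 16 else rows[1]
            chips.append(f"{row}{column}")
    return chips


# The example error from the text.
print(locate_chips(0x0063C004, 0x00800000))   # prints: ['J13']
```

If more than one bit is set in the xor value, the sketch returns one chip per
bit; whether that really means several bad chips or some other fault on the
board is something the tables here cannot tell you.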

Replacing the Chip

------------------

  I won't say too much about replacing the chip itself. The memory chips

used on the 4MB board are 256Kx1, 120ns, DIP DRAM and are very cheap and

easy to find.

  When removing the chip from the board, it is best to clip the leads on the

chip close to the body, and then remove the individual leads from the board

one at a time, heating them from the bottom and pulling them up from the

top. Be sure to be gentle when clipping and pulling the leads to prevent

breaking any of the traces on the board. Finally, after removing all the

leads, clean away all the excess solder with a solder sucker, and solder the

new chip into place.

Finishing Up

------------

  Once you've replaced the chip, reinstall the board in the system--

resetting its base address, if necessary--and run the memory diagnostics on

it. If the test passes, congratulations! You've just repaired your board!

If the test fails with another error, repeat all of the above until all of

the errors are gone (remember, the self-test can only diagnose one error at

a time). If the test gives the same error (possibly offset by a few MB if

you've reconfigured the board), then either you've replaced the wrong chip,

or something else is wrong with the board. Double check your work to make

sure you located the chip correctly. (I did this once!)

  With the board finally working again, set the DIAG switch on the CPU board

back to NORM, adjust the EEPROM values in q14 and q15 if necessary, and

reboot. You're now back in business.

Acknowledgements

----------------

  I gathered most of this information from Sun's "PROM User's Manual" and

"Sun 3/160 Hardware Installation Manual". Many thanks are also due to

  William Graham, uug@cpsc.ucalgary.ca

  Robert Dinse, nanook@eskimo.celestial.com

  Greg (sorry, I don't have your last name), tarsa@elijah.mv.com

  Patrick Boyle, boyle@wrl.dec.com

from whom I received most of the information on tracking down the chip

location given its memory address.

----------------------------------------------------------
