System with RAID array fails to reboot after power failure

2007-12-25 10:55:00

I am going to summarize while the number of responses is still
manageable. Thanks to everyone who responded, and apologies to
those who responded after I composed this note.

Original question:

I am running a SPARC 20 (Solaris 2.4) with a Sun SPARC Storage Array.
The disks on the array are configured as RAID 5 volumes managed
by the Veritas volume manager (supplied with the array by Sun).
The system gives me no problems except when we have a power failure
which outlasts the UPS. In this situation, the disk array apparently
takes longer to complete its initialization process than the main
CPU. Thus when the CPU tries to mount the RAID volumes it gets
a large number of errors and the boot process fails. Since I administer
this system from a remote site, I would like everything to be as
automatic as possible (when this happens I cannot access the SPARC 20
via the network). Does anyone have a solution to this problem?

___________________________________________________________________________

Summary:

Most likely, I will try the approach suggested by
mshon@sunrock.East.Sun.COM (Michael J. Shon), which is Solution 3 in
his response below.

He sent me two files:

etc_nvramrc.fth, to be placed in /etc/nvramrc.fth
        This is a Forth program which inserts an 80 second delay at
        startup, breakable by hitting any key.

init.d_nvramrc, to be placed in /etc/init.d/nvramrc
        This shell script adds the above program to NVRAM at bootup.
        Put a link to this file in one of the /etc/rc directories. As
        I understand it, this script just re-enters nvramrc.fth in case
        the CPU is swapped, so it does not matter too much when it is run.
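The attached files themselves are not reproduced in this summary. As a rough
idea of what an /etc/init.d/nvramrc script along these lines could look like,
here is a minimal sketch built only from the three eeprom commands listed in
step 2 of Solution 3 in Michael Shon's response below. It is not his script:
the function name install_nvramrc, the NVRAMRC_FILE and EEPROM variables, and
the handling of the start argument are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of /etc/init.d/nvramrc; NOT the script Michael Shon sent.
# It re-enters the Forth program into NVRAM using the eeprom commands from
# step 2 of Solution 3.

# Assumed names: both can be overridden from the environment.
NVRAMRC_FILE=${NVRAMRC_FILE:-/etc/nvramrc.fth}
EEPROM=${EEPROM:-eeprom}

install_nvramrc() {
    # Leave NVRAM untouched if the Forth source is missing.
    if [ ! -f "$NVRAMRC_FILE" ]; then
        echo "nvramrc: $NVRAMRC_FILE not found, NVRAM left unchanged" >&2
        return 1
    fi
    program=`cat "$NVRAMRC_FILE"`
    "$EEPROM" 'fcode-debug?=true'
    "$EEPROM" 'use-nvramrc?=true'
    "$EEPROM" "nvramrc=$program"
}

# rc scripts are normally run with the argument "start" at boot.
case "${1:-}" in
start)
    install_nvramrc
    ;;
esac
```

A link to it from one of the rc directories (for example a hypothetical
/etc/rc2.d/S99nvramrc) would then run it with the argument start on each boot.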

___________________________________________________________________________

Detailed Responses:

From: amy.hollander@amp.com (Amy Hollander)

In the EEPROM, put an 80 second delay. That gives the SSA time to come up.

___________________________________________________________________________

>From MHILL@graver.com Thu Aug 22 10:22 EDT 1996

From: "Matt Hill" <MHILL@graver.com>

Date: Thu, 22 Aug 1996 10:24:43 EST

Subject: Re: System with RAID array fails to reboot after power failure

put a sleep command somewhere near the beginning of /etc/rc2.d?
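One way to read this suggestion is a small script that sorts early in
/etc/rc2.d and simply sleeps. The sketch below is only an illustration: the
file name S00ssawait, the SSA_DELAY variable, and the 80 second default
(borrowed from the delay used elsewhere in this summary) are assumptions, and
nothing in this thread confirms that rc2.d runs early enough to come before
the first access to the RAID volumes.

```shell
#!/bin/sh
# Hypothetical /etc/rc2.d/S00ssawait: pause so the SSA can finish starting up.
# The file name and the SSA_DELAY variable are assumptions for illustration.

SSA_DELAY=${SSA_DELAY:-80}    # seconds; 80 matches the delay used elsewhere here

wait_for_ssa() {
    echo "Waiting $SSA_DELAY seconds for SSA"
    sleep "$SSA_DELAY"
    echo "SSA wait complete"
}

# rc scripts are normally run with the argument "start" at boot.
case "${1:-}" in
start)
    wait_for_ssa
    ;;
esac
```

Unlike the Forth versions further down, a plain sleep like this cannot be cut
short by pressing a key.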

___________________________________________________________________________

>From erich@decux.nvg.com Thu Aug 22 10:34 EDT 1996

From: erich@s1000e.nvg.com (Erich Breu)

Subject: Re: System with RAID array fails to reboot after power failure

Our two different UPSs notify our sparcserver once the battery power

is running out and the system performs a clean shutdown.

Liebert is our main UPS and they supply software that runs on the

SUN and communicates to the UPS via a serial line.

Erich

erich@decux.nvg.com

___________________________________________________________________________

>From mshon@sunrock.East.Sun.COM Thu Aug 22 10:45 EDT 1996

From: mshon@sunrock.East.Sun.COM (Michael J. Shon {*Prof Services} Sun Rochester)

Subject: Re: System with RAID array fails to reboot after power failure

The attachments implement Solution 3.

Subject: nvramrc for disk delay Internal SRDBs document 11298

http://sunservice.Corp.Sun.COM:80/cgi-bin/sunsolvei/doc2html?intsrdb/11298

                        Internal SRDBs document 11298

----------------------------------------------------------------------------

SRDB ID: 11298

SYNOPSIS: SSA requires delay to boot

DETAIL DESCRIPTION:

If a host machine boots faster than the attached SPARCstorage Array (SSA)
can become ready, the host machine might fail to initialize correctly.

The SSA has an internal diagnostic sequence that takes a minimum of 75
seconds. The host machine can often boot from a cold power-on faster
than that, so when it starts its routine of looking for devices
the SSA has not yet become ready.

SOLUTION SUMMARY:

SOLUTION 1

----------

The following Forth code will delay the host for 80 seconds,
printing a running count of the seconds as it goes, so that the
SSA is ready before the OS boots.

>ok nvedit
probe-all install-console banner
: wait_for_ssa
  ." Waiting 80 seconds for SSA" cr
  d# 80 0 do
    i .d (cr
    d# 1000 ms
  loop
;
wait_for_ssa

Then press Ctrl-C to exit nvedit, and continue with:

>ok nvstore
>ok setenv use-nvramrc? true

Note that this loop is not breakable, and will pause for 80 seconds after
every reset.

SOLUTION 2

----------

        1. Place the following in a file, /etc/nvramrc.fth
                This version lets you break out simply
                by pressing any single key.

                        -----------------------
probe-all \ install devices
install-console \ install console device
banner \ output banner
: abort-on-key ( -- ) \ Define the function, ( -- ) is a comment
                                \ that means 'no stack change'
  key? \ key pressed?
  abort" Booting continuing. Waiting timer aborted." \ abort with message if true
; \ finish function definition
: timed-startup ( -- ) \ Define the function, ( -- ) is a
                                \ comment that means 'call this with
                                \ nothing on the stack, nothing left
                                \ on the stack on return'
  ." Waiting for the SPARCstorage Array disks to spin-up."
  cr
  ." Press any key to abort timer and continue to boot." cr
  0 B4 do i \ do loop (count down 180 seconds)
    ." Booting will continue in " \ construct the count down string
    .d \ print the value in decimal
    i 1 = if ." second." \ with the correct plural!
    else ." seconds..." \ I am a fussy programmer
    then (cr \ let us get the plurals correct
    d# 1000 ms \ spin for 1000 milliseconds
    abort-on-key \ check for a key
  -1 +loop \ subtract 1 from the i index then loop
  cr
  ." Booting continuing. Waiting time elapsed." cr \ all done
;
timed-startup \ call timed-startup

                        -----------------------

        2. Do the following. You may want to place this in one of the

                startup scripts, so that if a CPU is swapped it is automatically

                re-installed.

                eeprom fcode-debug?=true

                eeprom use-nvramrc?=true

                eeprom nvramrc="`cat /etc/nvramrc.fth`"

        3. Reboot the machine.

SOLUTION 3

----------

There have been reports of SOLUTION 2 not running on some platforms,

especially 1000E and 2000E machines.

The following code has been tested on 1000E and 2000E machines. Please use

this code in place of the above.

        1. Place the following in a file, /etc/nvramrc.fth

probe-all \ install devices
install-console \ install console device
banner \ output banner
  .( Timer implemented to allow SSA's to boot from cold start)
  cr
: abort-on-key ( -- ) \ Define the function, ( -- ) is a comment
                                \ that means 'no stack change'
  key? \ key pressed?
  abort" Start delay aborted" \ abort with message if true
; \ finish function definition
: timed-startup ( -- ) \ Define the function, ( -- ) is a
                                \ comment that means 'call this with
                                \ nothing on the stack, nothing left
                                \ on the stack on return'
  .( Press any key to abort timer) cr
  d# 80 0 do \ set up loop parameters
    i .d (cr \ print the value
    d# 1000 ms \ wait a second
    abort-on-key \ Key pressed?
  loop \ do it again
  ." Timer complete" cr \ all done
; \ finish function definition
timed-startup \ call the routine

        2. Do the following. You may want to place this in one of the

                startup scripts, so that if a CPU is swapped it is automatically

                re-installed.

                eeprom fcode-debug?=true

                eeprom use-nvramrc?=true

                eeprom nvramrc="`cat /etc/nvramrc.fth`"

        3. Reboot the machine.

----------------------------------------------------------------------------

If you have access to sunsolve, see srdb/11298 for some forth

code you enter to delay the boot process while waiting for the

SSA to get ready.

Here's one example:

SOLUTION 1

----------

The following Forth code will delay the host for 80 seconds,
printing a running count of the seconds as it goes, so that the
SSA is ready before the OS boots.

>ok nvedit
probe-all install-console banner
: wait_for_ssa
  ." Waiting 80 seconds for SSA" cr
  d# 80 0 do
    i .d (cr
    d# 1000 ms
  loop
;
wait_for_ssa

Then press Ctrl-C to exit nvedit, and continue with:

>ok nvstore
>ok setenv use-nvramrc? true

Note that this loop is not breakable, and will pause for 80
seconds after every reset.

----------------------------------------------------------------------------
