Storage Server Recovery Process

In document: Cartridge System (pages 54-59)

Overview

Storage Server recovery procedures take place automatically under the following circumstances:

• The Storage Server is initiated. See the Storage Server Initiation section in this chapter for details.

• A major Storage Server failure occurs.

Recovery processing does not need to be initiated by the System Administrator.

During Storage Server recovery, the ACSLM performs the following processes for each ACS in the library:

• Verifies that all online ports can communicate with the ACS.

• Verifies that the library configuration recorded in the data base matches that recorded in the LMU.

• If possible, varies each ACS and its LSMs online, and marks them online in the data base.

• Directs the LSM robot to scan the physical contents of each of the following locations, and updates the data base to match:

- Reserved storage cells

- Cartridge drives

- Last known location of each cartridge selected for use

Once these processes are completed successfully, request processing can resume.

Storage Server Recovery Process

The following are the steps the ACSLM goes through in performing Storage Server recovery. All data base changes that occur as a result of this procedure are logged in the Event Log. If the recovery fails, additional error messages detailing the reasons for the failure will also be found in the Event Log. See Appendix A: Event Log for the Event Log entries that may be made during recovery.

Note: The ACSLM will not be able to verify configuration or contents of LSMs that were in the offline or diagnostic state at the time the Storage Server failed or was terminated. This is because an offline LSM is unable to provide configuration data and the LSM robot is unable to scan storage cells and tape drives for their contents. The ACSLM will perform as much of the recovery procedure as possible and will note in the Event Log that the LSM is offline.

ACSLM Processes Storage Server Recovery

1. Issues the following unsolicited message to the Display Area of the Command Processor:

Server system recovery started

2. Updates all ACS records in the data base as follows:

- ACSs in the recovery state are changed to online.

- ACSs in the diagnostic or offline-pending states are changed to offline.

3. Attempts to communicate with each ACS, using each port that the data base indicates is online. The ACSLM must find at least one port that can successfully communicate with the library in order for recovery processing to continue.

4. Verifies that the LSM and drive configurations in the Storage Server data base match those defined in the LMU. Discrepancies are noted in the Event Log.

5. Varies online all LSMs attached to an online ACS, if possible.

Cartridge recovery is performed as part of this step.

6. Directs the LSM robot to scan the contents of all cell locations marked "reserved" in the data base. These are locations that tape cartridges were being moved either to or from at the time the system failure occurred. The ACSLM updates the data base to reflect the actual physical contents of these cells, as determined by the robot.

7. Updates the data base to reflect the true status of all library tape drives (that is, available, in use, offline).

8. Directs the LSM robot to scan the contents of all library drives that the data base indicates are in use. Updates the data base to reflect the true physical contents.

9. Directs the LSM robot to scan the contents of the last known location of each cartridge selected for use at the time of the system failure. Updates the data base with the true contents of these cells. If a cartridge is not found in its last known location, it is deleted from the data base.

10. Displays either of the following unsolicited messages in the Display Area of the Command Processor, based on whether the recovery process was successful or not.

Server system recovery complete

-or-

Server system recovery failed
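The state changes in step 2 and the cartridge check in step 9 can be illustrated with a minimal sketch. It is not taken from the Storage Server software: the function names, the string state labels, and the dictionary layout standing in for the data base are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Step 2: ACS states rewritten at the start of recovery (labels are illustrative).
ACS_STATE_AFTER_STEP_2 = {
    "recovery": "online",
    "diagnostic": "offline",
    "offline-pending": "offline",
}


def normalize_acs_state(state: str) -> str:
    """Return the ACS state after step 2; states not listed are left unchanged."""
    return ACS_STATE_AFTER_STEP_2.get(state, state)


def reconcile_selected(
    last_known: Dict[str, str],
    selected: List[str],
    scanned: Dict[str, Optional[str]],
) -> List[str]:
    """Step 9 sketch.

    last_known: hypothetical data base view, cartridge id -> last known location.
    selected:   cartridges that were selected for use when the failure occurred.
    scanned:    location -> cartridge id the robot found there (None if empty).
    A selected cartridge that is not at its last known location is deleted
    from last_known. Returns the ids that were deleted.
    """
    deleted = []
    for cartridge in selected:
        location = last_known.get(cartridge)
        if location is None or scanned.get(location) != cartridge:
            last_known.pop(cartridge, None)
            deleted.append(cartridge)
    return deleted
```

In this sketch an ACS that is already online or offline keeps its state, which matches the two cases the step lists; how the real software treats other states is not described here.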


UNSOLICITED MESSAGES

The ACSLM sends an unsolicited message to the ACSSA whenever an event requiring operator or System Administrator action occurs.

The ACSSA, in turn, displays the message in the Display Area of the Command Processor screen and sends the message to the Event Logger. The Event Log entry may show additional detail concerning the event. See Appendix A: Event Log for the specific entries that may be written to the Event Log.

Unsolicited messages are "asynchronous," meaning that their timing is not necessarily related to the processing of a particular request.

Most unsolicited messages indicate an error, although some (particularly those related to CAP processing) serve to notify the library operator when a particular routine action can be taken.

The status codes for all unsolicited messages are listed below in alphabetical order.

STATUS_ACSLM_IDLE if the ACSLM has been placed in the idle state and is therefore unavailable for requests using library resources.

See Library Request Processing in this chapter for details on ACSLM states.

STATUS_ACTIVITY_START when the ACSLM has been placed in the run state.

STATUS_CARTRIDGES_IN_CAP if cartridges are detected in the CAP and need to be removed by the operator.

STATUS_CLEAN_DRIVE if a drive needs to be cleaned.

STATUS_CONFIGURATION_ERROR if the library configuration specified in the Storage Server data base is not the same as that defined in the LMU by a Customer Services Engineer, or if a component appears in the data base but fails to respond to LMU commands.

STATUS_DATABASE_ERROR if the ACSLM is unable to access the data base.

STATUS_DEGRADED_MODE if the library hardware is operable, but with degraded performance.

STATUS_DIAGNOSTIC if the specified device has been varied to the diagnostic state and is therefore available for requests submitted through the Command Processor only. See the vary command description in Chapter 4 for additional details.

STATUS_EVENT_LOG_FAILURE if the Event Logger is unable to open or write to the Event Log file.


STATUS_EVENT_LOG_FULL if the Event Log has reached the maximum size defined during installation. This unsolicited message will be sent at one minute intervals until the size of the Log is reduced. See the Event Logging section in this chapter for details.

STATUS_IDLE_PENDING if the ACSLM is in an idle-pending state and is therefore unavailable for requests using library resources.

See the Library Request Processing section in this chapter for details on ACSLM states.

STATUS_INPUT_CARTRIDGES if a CAP is ready to receive cartridges.

STATUS_IPC_FAILURE if the ACSLM or CSI cannot communicate with another Storage Server process.

STATUS_LIBRARY_FAILURE if a library hardware error occurred while the ACSLM was processing a request.

STATUS_NI_TIMEDOUT if the CSI is unable to establish a connection with the Network Interface. Data may have been lost.

STATUS_OFFLINE if a device has been varied offline. See the vary command description in Chapter 4 for additional details.

STATUS_ONLINE if a device has been varied online. See the vary command description in Chapter 4 for additional details.

STATUS_RECOVERY_COMPLETE when Storage Server recovery has been completed successfully. See the Storage Server Recovery section in this chapter for details.

STATUS_RECOVERY_FAILED if Storage Server recovery has failed.

See the Storage Server Recovery section in this chapter for details.

STATUS_RECOVERY_INCOMPLETE if the specified LSM has failed to recover in-transit cartridges during Storage Server recovery. See the Storage Server Recovery section in this chapter for details.

STATUS_RECOVERY_STARTED when Storage Server recovery has been initiated. See the Storage Server Recovery section in this chapter for details.

STATUS_REMOVE_CARTRIDGES if a CAP contains cartridges and is ready for the operator to remove them.

STATUS_RPC_FAILURE if the CSI has encountered a Remote Procedure Call (RPC) failure. Data may have been lost.
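As a rough illustration of the flow described at the start of this section, the sketch below lists the status codes above and models the ACSSA step of displaying an unsolicited message and forwarding it to the Event Logger. The function name and the use of plain lists for the Display Area and the Event Log are assumptions, and the status code itself stands in for the message text that would actually be shown.

```python
from typing import List

# Status codes as listed in this section, in the same alphabetical order.
STATUS_CODES = (
    "STATUS_ACSLM_IDLE",
    "STATUS_ACTIVITY_START",
    "STATUS_CARTRIDGES_IN_CAP",
    "STATUS_CLEAN_DRIVE",
    "STATUS_CONFIGURATION_ERROR",
    "STATUS_DATABASE_ERROR",
    "STATUS_DEGRADED_MODE",
    "STATUS_DIAGNOSTIC",
    "STATUS_EVENT_LOG_FAILURE",
    "STATUS_EVENT_LOG_FULL",
    "STATUS_IDLE_PENDING",
    "STATUS_INPUT_CARTRIDGES",
    "STATUS_IPC_FAILURE",
    "STATUS_LIBRARY_FAILURE",
    "STATUS_NI_TIMEDOUT",
    "STATUS_OFFLINE",
    "STATUS_ONLINE",
    "STATUS_RECOVERY_COMPLETE",
    "STATUS_RECOVERY_FAILED",
    "STATUS_RECOVERY_INCOMPLETE",
    "STATUS_RECOVERY_STARTED",
    "STATUS_REMOVE_CARTRIDGES",
    "STATUS_RPC_FAILURE",
)


def handle_unsolicited(status: str, display_area: List[str], event_log: List[str]) -> None:
    """Hypothetical ACSSA handler: show the message, then pass it to the Event Logger."""
    if status not in STATUS_CODES:
        raise ValueError(f"unknown unsolicited status: {status}")
    display_area.append(status)  # stand-in for the Command Processor Display Area
    event_log.append(status)     # stand-in for the message sent to the Event Logger
```

Because unsolicited messages are asynchronous, a handler of this kind would be called whenever a message arrives, independently of any request being processed.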


EVENT LOGGING

Description

One system-wide Event Log contains information about library events and errors. All Storage Server software components log events to the Log through the centralized Event Logger.

The information in this Log permits later analysis and tracking of normal library events as well as errors. Logged events include:

• Library errors. Both fatal and nonfatal hardware and software errors are logged. Examples include LSM failures, problems with cartridges, data base errors, interprocess and library communications failures, and software failures not normally handled by the operating system.

• Significant events. These are normal events that may be of significance in monitoring library operations. For example, events are logged when an audit is initiated or terminated, a device changes state, or a CAP is opened or closed.

The Event Log is automatically created when the Storage Server software is installed. The Log exists in the file

acsss_home/log/acsss_event.log

where acsss_home is the directory in which the Storage Server software was installed, usually /usr/ACSSS.

How Events Are Logged

To log an event, a Storage Server component such as the ACSLM, ACSSA, or CSI, sends a message to the centralized Event Logger.

The Event Logger accepts the message and updates the Event Log in the following manner.

1. Reformats the message by applying a standard prefix.

2. Opens the Event Log file, or creates it if it does not already exist.

3. Appends the Event Log message to the end of the file.

4. Checks the current file size against the limit parameter specified at installation. If the current size exceeds the specified limit, the Event Logger sends an unsolicited message to the ACSSA to alert the System Administrator.

5. Closes the Event Log file.

Updating the Event Log in this manner keeps the Log entries sequential and allows the System Administrator to truncate or delete the file at any time during system operation.
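The five steps above can be sketched as a single function. This is an illustration only, not the Event Logger's actual code: the function name, its parameters, the prefix layout, and the `notify` callback standing in for the unsolicited message to the ACSSA are all assumptions.

```python
import os
from datetime import datetime
from typing import Callable


def log_event(
    path: str,
    component: str,
    message: str,
    size_limit: int,
    notify: Callable[[str], None],
) -> None:
    """One open / append / size check / close cycle for a single event."""
    # Step 1: apply a standard prefix (this particular layout is made up).
    stamp = datetime.now().isoformat(timespec="seconds")
    entry = f"{stamp} {component}: {message}\n"

    # Steps 2 and 3: open the file (mode "a" creates it if missing) and append.
    with open(path, "a", encoding="utf-8") as log:
        log.write(entry)
        log.flush()
        # Step 4: compare the current file size with the installed limit.
        if os.fstat(log.fileno()).st_size > size_limit:
            notify("STATUS_EVENT_LOG_FULL")
    # Step 5: the file is closed when the with block exits.
```

Because the file is opened and closed for every entry, truncating or deleting it between calls does no harm in this sketch: the next call simply appends to the shortened file or creates a new one.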
