Checkpointing

 

Checkpointing is a method of periodically saving the state of a job step so that, if the step does not complete, it can be restarted from the saved state. When checkpointing is enabled, checkpoints can be initiated from within the application at major milestones, or from outside the application by the user, an administrator, or LoadLeveler. Both serial and parallel job steps can be checkpointed.

 

Once a job step has been successfully checkpointed, if that step terminates before completion, the checkpoint file can be used to resume the job step from its saved state rather than from the beginning. When a job step terminates and is removed from the LoadLeveler job queue, it can be restarted from the checkpoint file by submitting a new job and setting the restart_from_ckpt = yes job command file keyword. When a job step terminates but remains on the LoadLeveler job queue, such as when it is vacated, it is automatically restarted from the latest valid checkpoint file. A job can be vacated as a result of flushing a node, issuing a checkpoint and hold, stopping or recycling LoadLeveler, or a node crash.

 

 

User Portal – Reservations

The Portal is a Web interface that allows interaction with SDSC computing and data resources via portlets (portal-specific applications). The initial launch of the Portal features a portlet for User Settable Reservations. Once you have logged into the Portal, you will be able to view available opportunities on DataStar [and TeraGrid] and to make reservations using a form.

The Reservations Portlet allows you to reserve nodes in advance on DataStar (P655) and TeraGrid [IA-64]. From the portal, you can see which nodes are available in an easy calendar format. You choose the number of nodes, the start time, and the duration of the reservation in a Web form and receive immediate confirmation without e-mailing the help desk. Once a reservation has been made successfully, you may run any number of jobs under it.

Charges:

Creating a reservation incurs a premium that depends on the job size (node count):

Class name                             Size              Premium

Small jobs (TeraGrid [IA-64] only)     < 32 nodes        100% (Total charge = job size x 2)

Medium jobs (both DataStar & IA-64)    32 to 128 nodes   50% (Total charge = job size x 1.5)

Large jobs (both DataStar & IA-64)     > 128 nodes       20% (Total charge = job size x 1.2)
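
The premium schedule above can be worked through as a short calculation. The following is a minimal sketch, not portal code: the function names are hypothetical, "job size" is assumed to mean the job's base charge in SUs, and the sketch does not model which system each class is offered on.

```python
def premium_percent(nodes):
    """Return the reservation premium, in percent, for a given node count."""
    if nodes < 1:
        raise ValueError("node count must be positive")
    if nodes < 32:        # small jobs
        return 100
    if nodes <= 128:      # medium jobs: 32 to 128 nodes
        return 50
    return 20             # large jobs: more than 128 nodes


def reservation_charge(nodes, base_sus):
    """Total charge = base charge plus the premium for this node count.

    base_sus is assumed to be the charge the job would incur without
    a reservation (hypothetical input, not defined by the portal).
    """
    return base_sus * (100 + premium_percent(nodes)) / 100


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, "nodes:", reservation_charge(n, 1000), "SUs for a 1000 SU job")
```

Under these assumptions, a 64-node job that would normally cost 1000 SUs is charged 1500 SUs when run under a reservation.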

 

How to Request a Reservation Using the SDSC Portal:

·        You must log in to set a reservation

·        Setting a reservation requires 3 steps:

o       Find available opportunities:

§         Fill in the minimum number of nodes

§         Choose a start date

§         Choose the minimum duration that you wish to run

§         Click on the [ Show Opportunities ] button

o       Choose an opportunity

§         From the drop-down list of opportunities for each day and time, select the radio button next to the period you wish to reserve

§         Click on the [ Select an Opportunity ] button

o       Complete the requested form

§         Supply Project ID and current contact information, including e-mail address and phone number. (Roll over the help icon for a list of your project IDs and remaining SUs.)

§         Double-check your available allocation time (also in the rollover with your project ID).

§         Review the earliest start time field. This is based on the opportunity you selected from the calendar.

§         Confirm the number of nodes and requested duration; you may change these values up to the maximum available in the Opportunities Calendar.

§         Supply the latest end time. This is required to place your reservation. Together with the duration, this value will help determine the actual start time in case of a conflict with another reservation submitted simultaneously.

§         Click on the [ Submit Reservation ] button
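
The relationship between the earliest start time, the duration, and the latest end time in the form can be illustrated with a small calculation. This is only a sketch under an assumption: that a reservation may start anywhere between the earliest start time and the latest end time minus the duration. The portal's actual conflict-resolution logic is not described here, and the function name `start_window` is hypothetical.

```python
from datetime import datetime, timedelta


def start_window(earliest_start, latest_end, duration):
    """Return (earliest, latest) feasible start times for a reservation.

    Assumes the reservation must finish by latest_end, so the latest
    feasible start is latest_end minus the requested duration.
    """
    latest_start = latest_end - duration
    if latest_start < earliest_start:
        raise ValueError("duration does not fit before the latest end time")
    return earliest_start, latest_start


if __name__ == "__main__":
    first, last = start_window(
        datetime(2006, 1, 1, 8, 0),    # earliest start from the calendar
        datetime(2006, 1, 1, 20, 0),   # latest end supplied on the form
        timedelta(hours=4),            # requested duration
    )
    print("Reservation may start between", first, "and", last)
```

A later latest end time widens this window, which gives the scheduler more room to place the reservation if another request conflicts with it.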

 

Reservation Policies:

Available resources & opportunities

DataStar [P655 nodes] & TeraGrid [IA-64]

Reservations may be made only on dates that are displayed in the Opportunities Calendar

Minimum and maximum advance period

Reservations may be made between 10 minutes and 4 weeks in advance.

Minimum SUs: 64 SUs on DataStar & 16 SUs on TeraGrid [IA-64]
Minimum nodes: 8 nodes
Minimum time: 1 hour
Maximum time: 18 hours

Cancellation refunds

Cancel 24 hours or more before the scheduled start time to receive a refund.

Run-time limits

Your entire job must run within the allotted reservation time. This includes the few minutes that the scheduler and other back-end processes need to complete initial activities at the beginning of your job, so allow 5-10 minutes at the start.
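
The numeric limits in the policies above can be collected into a simple pre-submission check. The sketch below is illustrative only: the names `policy_violations` and `refundable` and the system keys "datastar" and "teragrid" are hypothetical, and the portal itself remains the authority on whether a reservation is accepted.

```python
from datetime import datetime, timedelta

# Limits taken from the policy list above.
MIN_NODES = 8
MIN_DURATION = timedelta(hours=1)
MAX_DURATION = timedelta(hours=18)
MIN_ADVANCE = timedelta(minutes=10)
MAX_ADVANCE = timedelta(weeks=4)
MIN_SUS = {"datastar": 64, "teragrid": 16}   # hypothetical system keys
REFUND_CUTOFF = timedelta(hours=24)


def policy_violations(system, nodes, duration, sus, now, start):
    """Return a list of policy limits that a reservation request breaks."""
    if system not in MIN_SUS:
        return ["unknown system: " + system]
    problems = []
    if nodes < MIN_NODES:
        problems.append("fewer than 8 nodes")
    if not MIN_DURATION <= duration <= MAX_DURATION:
        problems.append("duration outside 1 to 18 hours")
    if sus < MIN_SUS[system]:
        problems.append("below the minimum SUs for " + system)
    if not MIN_ADVANCE <= start - now <= MAX_ADVANCE:
        problems.append("start not between 10 minutes and 4 weeks ahead")
    return problems


def refundable(cancel_time, start):
    """True if cancelling at cancel_time is 24 hours or more before start."""
    return start - cancel_time >= REFUND_CUTOFF


if __name__ == "__main__":
    now = datetime(2006, 1, 1, 12, 0)
    start = now + timedelta(days=2)
    print(policy_violations("datastar", 8, timedelta(hours=1), 64, now, start))
    print(refundable(now, start))
```

The check does not cover the Opportunities Calendar itself; a request that passes these limits can still be refused if the date is not displayed there.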