Checkpointing
Checkpointing is a method of periodically saving the state of a job step so that, if the step does not complete, it can be restarted from the saved state. When checkpointing is enabled, checkpoints can be initiated from within the application at major milestones, or from outside the application by the user, an administrator, or LoadLeveler. Both serial and parallel job steps can be checkpointed.

Once a job step has been successfully checkpointed, if that step terminates before completion, the checkpoint file can be used to resume the job step from its saved state rather than from the beginning. When a job step terminates and is removed from the LoadLeveler job queue, it can be restarted from the checkpoint file by submitting a new job with the job command file keyword restart_from_ckpt = yes. When a job step terminates but remains on the LoadLeveler job queue, such as when it is vacated, it is automatically restarted from the latest valid checkpoint file. A job can be vacated as a result of flushing a node, issuing a checkpoint and hold, stopping or recycling LoadLeveler, or a node crash.
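
The save-and-resume idea described above can be illustrated with a small Python sketch. This is not how LoadLeveler implements checkpointing; it is only an application-level analogy under stated assumptions. The function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the fail_at parameter are all hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomically replaces the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint file exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, ckpt_every, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every ckpt_every steps.

    fail_at (hypothetical) simulates the step terminating before completion.
    """
    state = load_checkpoint(path)  # resume from the saved state if there is one
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated termination before completion")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % ckpt_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a first call to run is interrupted, a second call with the same path picks up at the last saved step rather than at step 0, which mirrors restarting a job step from its latest valid checkpoint file.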
User Portal – Reservations
The Portal is a Web interface that allows interaction with SDSC computing and data resources via portlets (portal-specific applications). The initial launch of the Portal features a portlet for User Settable Reservations. Once you have logged into the Portal, you will be able to view available opportunities on DataStar [and TeraGrid] and to make reservations using a form.
The Reservations Portlet allows you to reserve nodes in advance on DataStar (P655) and TeraGrid [IA-64]. From the portal, you can see which nodes are available in an easy-to-read calendar format. You choose the number of nodes, the start time, and the duration of the reservation in a Web form and get an immediate confirmation, without e-mailing the help desk. Once a reservation has been made successfully, you may run any number of jobs under that reservation.
Class name | Size | Charge [or Premium]
Small jobs – only on TeraGrid [IA-64] | < 32 nodes | 100% (Total charge = job size x 2)
Medium jobs – both DataStar & IA-64 | 32 to 128 nodes | 50% (Total charge = job size x 1.5)
Large jobs – both DataStar & IA-64 | > 128 nodes | 20% (Total charge = job size x 1.2)
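
The premium arithmetic in the table above can be sketched in Python. This is a minimal illustration, not an official charge calculator. It assumes that "job size" means the base SU cost of the job, and the function name reservation_charge and the resource keys "datastar" and "teragrid" are hypothetical.

```python
def reservation_charge(nodes, job_size_su, resource):
    """Return the total charge for a reservation, premium included.

    nodes       -- number of nodes reserved
    job_size_su -- base job size in SUs (assumed meaning of "job size")
    resource    -- "datastar" or "teragrid" (hypothetical keys)
    """
    if resource not in ("datastar", "teragrid"):
        raise ValueError(f"unknown resource: {resource}")
    if nodes < 32:
        # Small class: offered only on TeraGrid [IA-64], 100% premium.
        if resource != "teragrid":
            raise ValueError("reservations under 32 nodes are TeraGrid only")
        premium_pct = 100
    elif nodes <= 128:
        premium_pct = 50  # Medium class: 32 to 128 nodes
    else:
        premium_pct = 20  # Large class: more than 128 nodes
    # Integer percentages keep the multiplication free of rounding surprises.
    return job_size_su * (100 + premium_pct) / 100
```

For example, a 64-node reservation with a base job size of 100 SUs falls in the Medium class and is charged 100 x 1.5 = 150 SUs.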
Available resources & opportunities | DataStar [P655 nodes] & TeraGrid [IA-64]. Reservations may be made only on dates that are displayed in the Opportunities Calendar.
Minimum and maximum advance period | Reservations may be made between 10 minutes and 4 weeks in advance.
Minimum SUs | 64 SUs on DataStar & 16 SUs on TeraGrid [IA-64]
Cancellation refunds | Cancel 24 hours or more before scheduled start time to receive a refund.
Run-time limits | Your entire job request must run within the allotted reservation time. This includes a few minutes for the scheduler and other back end processes to complete initial activities at the beginning of your job. So allow for 5-10 minutes at the start.
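
The advance-period, minimum-SU, and cancellation rules above can be expressed as simple checks. The following Python sketch is illustrative only and is not the Portal's validation logic. It assumes the 10-minute and 4-week limits and the 24-hour cutoff are inclusive; the names check_reservation and refund_on_cancel and the resource keys are hypothetical. Availability in the Opportunities Calendar and the run-time limit are not modeled.

```python
from datetime import datetime, timedelta

MIN_ADVANCE = timedelta(minutes=10)
MAX_ADVANCE = timedelta(weeks=4)
REFUND_CUTOFF = timedelta(hours=24)
MIN_SUS = {"datastar": 64, "teragrid": 16}  # hypothetical resource keys


def check_reservation(now, start, sus, resource):
    """Return a list of policy problems; an empty list means the request passes."""
    problems = []
    advance = start - now
    if advance < MIN_ADVANCE:
        problems.append("start time is less than 10 minutes away")
    if advance > MAX_ADVANCE:
        problems.append("start time is more than 4 weeks away")
    if sus < MIN_SUS[resource]:
        problems.append(f"below the {MIN_SUS[resource]} SU minimum")
    return problems


def refund_on_cancel(cancel_time, start):
    """True if cancelling at cancel_time is 24 hours or more before start."""
    return start - cancel_time >= REFUND_CUTOFF


if __name__ == "__main__":
    now = datetime(2006, 1, 1, 12, 0)  # arbitrary example date
    print(check_reservation(now, now + timedelta(days=2), 64, "datastar"))
    print(refund_on_cancel(now, now + timedelta(hours=30)))
```

When sizing the reservation itself, remember that the 5-10 minutes of scheduler start-up time counts against the reserved duration.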