it’s time to investigate why the tasks are failing, immediately in this case.

Debugging Failing Jobs

Using our current pipeline, the log files for workers are stored in the /tmp/$USER directory on the compute node. Unfortunately, the log for a process is deleted as soon as that process fails, making debugging tricky.

First, log in to the worker node. SSH access to the compute nodes is turned off on HPC3, so instead find the jobid of the worker and submit an interactive job attached to that job:

srun --pty --jobid <jobid> --overlap /bin/bash
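If you don't already know the worker's jobid, one way to look it up is with squeue. This is a minimal sketch: the function name worker_jobs is made up for illustration, and it assumes you can recognise the worker by its job name or node in the listing.

```shell
# Hypothetical helper: list your running Slurm jobs so you can pick out the worker.
worker_jobs() {
    # -h drops the header line; %i = job ID, %j = job name, %N = node list
    squeue -u "${USER:-$(id -un)}" -h -t RUNNING -o '%i %j %N'
}

# Usage: run worker_jobs, then pass the job ID in the first column to srun --jobid
```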

Navigate to /tmp/$USER and into the directory of the worker job you are interested in (by default named something like worker-2491543-177014). Note that there may be many directories and you'll have to identify which one is of interest. In the worker directory there should be a directory called t.XXXX, where XXXX is the task ID of whatever task is currently running, and inside is that task’s log file.
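When there are many worker directories, listing every task directory at once can make it easier to spot the one you want. The sketch below assumes the layout described above (task directories named t.XXXX somewhere under /tmp/$USER). The function name list_task_dirs is hypothetical.

```shell
# Hypothetical helper: print every task directory (t.XXXX) under the log root.
list_task_dirs() {
    # $1 = directory to search; defaults to the per-user log directory
    root="${1:-/tmp/$USER}"
    find "$root" -type d -name 't.*' | sort
}

# Usage: list_task_dirs            (searches /tmp/$USER)
#        list_task_dirs /some/dir  (searches another location)
```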

You can use the following script to monitor the log file of a running process. Run it in the worker directory, with the task ID as its one command line argument. The script prints the content of the log file as soon as it is written, so the output is preserved on your terminal even after the process crashes and the file is deleted. Because it monitors the file in real time, it may take a while to print something useful.

Code Block
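#!/bin/bash

# Brent's script to isolate a failing ForceBalance job to try to find the root cause
# Run from the worker directory; $1 is the task number to look for

if [ -z "$1" ]; then
        echo "usage: $0 <task ID>" >&2
        exit 1
fi

# Poll once per second until the task directory t.<task ID> appears
while true; do
        dir=$(find . -name "t.$1" | head -n 1)
        if [ -n "$dir" ]; then
                echo "found $dir"
                # -F keeps retrying if rtarget.out has not been created yet
                tail -F "$dir/rtarget.out"
                exit 0
        fi
        sleep 1
done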

Once the final output is printed and the task crashes, you can use the error message to debug the issue.

One common reason for failed jobs, especially if all the jobs are failing, is the presence of constraints in the initial force field.

After the run finishes

ForceBalance output files are long. When your ForceBalance run finishes “normally” (i.e. the job didn’t crash), it will print All done at the end of the file. This only means the run ended cleanly; it doesn’t mean the optimization converged.

If your optimization converged

To find out if the optimization converged, search for Optimization Converged. If it converged, congrats! ForceBalance will print the final value of the objective function in the output file, and will save the final force field to result/optimize/force-field.offxml (unless you specified another location). Time to benchmark your run.

If your optimization didn’t converge

To find out if the optimization failed, search for Convergence Failure. ForceBalance should print the reason for the failure right before the line where it says Convergence Failure. Now it’s time to debug the failure.
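The searches described above can be combined into one quick check. This is a sketch rather than an official tool: the function name fb_status is made up, the match ignores case in case the exact capitalisation in your output differs, and the five lines of context printed before the failure line are an arbitrary choice.

```shell
# Hypothetical helper: report how a ForceBalance run ended, based on its output file.
fb_status() {
    # $1 = ForceBalance output file, e.g. master.out
    if grep -qi "Optimization Converged" "$1"; then
        echo "converged"
    elif grep -qi "Convergence Failure" "$1"; then
        echo "failed"
        # The reason is printed just before the failure line, so show some context
        grep -i -B 5 "Convergence Failure" "$1"
    elif grep -q "All done" "$1"; then
        echo "finished, but no convergence message found"
    else
        echo "no end-of-run message found (still running, or crashed)"
    fi
}

# Usage: fb_status master.out
```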

Step size is too small

If ForceBalance prints something like Step size is too small to continue (4.371e-07 < 1.000e-06), it means that it has tried to take many steps in parameter space, but has determined that the optimal step is very small and thus there is likely a problem.

Try running grep "Hessian diagonal search" master.out | less. It should print something like this, for every step. Find the last few steps and examine the output.

Code Block
Starting Hessian diagonal search with step size 1.9969e-04
Hessian diagonal search: H+77321405.2954*I, length 1.9969e-04, result  1.6949e+01
Hessian diagonal search: H+1237353531.9447*I, length 1.2867e-05, result  2.2369e+01
Hessian diagonal search: H+1148869094.2832*I, length 1.3855e-05, result  3.7423e+01
Hessian diagonal search: H+56433052.7945*I, length 2.7047e-04, result  3.1612e+01
Hessian diagonal search: H+356098901.8934*I, length 4.4477e-05, result -4.5025e+00
Hessian diagonal search: H+629945332.6519*I, length 2.5222e-05, result  9.1052e+00
Hessian diagonal search: H+288436713.8387*I, length 5.4817e-05, result  1.3866e+01
Hessian diagonal search: H+451171109.7705*I, length 3.5158e-05, result -6.5705e-01

In this part of the calculation, ForceBalance is trying to determine what size step to take. Here, “length” refers to the proposed step length, and “result” is the resulting change in the objective function. For this step, most of the proposed step sizes lead to an increase in the objective function, rather than a decrease. Additionally, there is a lot of variability in the objective function change, even when the change in step size is very small. This suggests we are on a weird part of the objective function surface.
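To avoid counting by eye, the last search in the file can be tallied with awk. This sketch assumes the line format shown in the sample output above, with the objective function change as the last field on each line. The function name last_search_summary is hypothetical.

```shell
# Hypothetical helper: for the LAST Hessian diagonal search in the file, count how
# many proposed steps increased vs. decreased the objective function.
last_search_summary() {
    # $1 = ForceBalance output file, e.g. master.out
    awk '
        # A new search begins: forget the counts from the previous one
        /Starting Hessian diagonal search/ { up = 0; down = 0; next }
        /Hessian diagonal search:/ {
            change = $NF + 0          # last field is the "result" value
            if (change > 0) up++
            else if (change < 0) down++
        }
        END { printf "increases=%d decreases=%d\n", up, down }
    ' "$1"
}

# Usage: last_search_summary master.out
```

For the sample output above, this reports increases=6 decreases=2.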

One way to fix this is to simply restart from a checkpoint; sometimes the restarted run converges. Another way is to restart the run with a smaller initial step size. We use finite_difference_h 0.01 as a default, so try decreasing it to finite_difference_h 0.001.
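The finite_difference_h setting can be changed by hand in the input file, or with a small helper like the one below. This is a sketch under assumptions: the function name set_fd_step is made up, and it expects the keyword to be the first word on its own line of the input file. It writes the modified file to standard output and leaves the original untouched. Any indentation on the changed line is dropped.

```shell
# Hypothetical helper: print a copy of the input file with finite_difference_h replaced.
set_fd_step() {
    # $1 = ForceBalance input file, $2 = new finite_difference_h value
    awk -v new="$2" '
        $1 == "finite_difference_h" { print "finite_difference_h " new; next }
        { print }
    ' "$1"
}

# Usage: set_fd_step optimize.in 0.001 > optimize_small_step.in
```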

Geometry failed to minimize

Another possible error is that the geometry fails to minimize. This could be a one-off issue, where it fails once but gets resubmitted and succeeds that time. However, if it is consistently failing to minimize, it may be an issue with the parameters. You can visualize the parameter changes using this script to diagnose the issue.

Restarting from a checkpoint

ForceBalance may stop mid-way through an optimization because the job was killed, the maximum number of steps was reached, etc. Luckily, it saves a checkpoint file called <inputfilename>.sav that records the last round of parameters and can be used to restart the job. Make sure the initial force field is the same as the initial force field from the previous run, as the parameters in the sav file are changes relative to the initial force field parameters.

Assuming your input file is called optimize.in and output files are called master.out and master.err, you can use the following script to move all the files from the previous run to a directory called round1 to keep for your records, and copy optimize.sav to optimize.in to facilitate resubmitting.

Code Block
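#!/bin/bash
# Archive the previous run in round1 and set up a restart from the checkpoint

# Stop if round1 already exists, rather than mixing two runs together
mkdir round1 || exit 1

# Input and checkpoint files, ForceBalance output, scheduler and worker logs, results
mv optimize* round1
mv master* round1
mv slurm* round1
mv worker_logs round1
mv host round1
mv copy.log round1
mv result round1

# The checkpoint becomes the input file for the next round
cp round1/optimize.sav ./optimize.in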
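If you end up restarting more than once, a helper can pick the next unused directory name instead of hardcoding round1. This is only a sketch: the function name next_round_dir and the round2, round3, ... naming are assumptions that extend the round1 convention used above.

```shell
# Hypothetical helper: print the first roundN name that does not exist yet
# in the current directory (round1, round2, ...).
next_round_dir() {
    n=1
    while [ -e "round$n" ]; do
        n=$((n + 1))
    done
    echo "round$n"
}

# Usage, in place of the hardcoded round1 in the restart script:
#   dest=$(next_round_dir)
#   mkdir "$dest" && mv optimize* master* "$dest"
```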