This tutorial assumes you’ve generated ForceBalance inputs using my valence-fitting repository through step 04 and have the corresponding fb-fit, parameters-to-optimize, and scripts directories.
Preparing Files
Copy over
fb-fit/
forcefield/force-field.offxml
optimize.in
targets.tar.gz
oe_license.txt (probably not needed, but I had it)
parameters-to-optimize/ (actually these appear to be unused too)
scripts/
The valence-fitting scripts produce the directory fb-fit/targets, so either tar it up to match the list above or wait for the whole directory to transfer.
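If you go the tarball route, something like the following should work. This is a minimal sketch, not part of the valence-fitting repository: pack_targets is a hypothetical helper name, and it assumes the targets directory sits directly inside fb-fit.

```shell
#!/bin/bash
# Sketch: bundle fb-fit/targets into fb-fit/targets.tar.gz before transferring.
# pack_targets is a hypothetical helper, not a script from the repository.
pack_targets() {
    local fit_dir="${1:-fb-fit}"
    # -C makes the archive paths start at "targets/", so that a plain
    # `tar -xzf targets.tar.gz` recreates the targets directory in place.
    tar -czf "$fit_dir/targets.tar.gz" -C "$fit_dir" targets
}

# Usage (from the directory that contains fb-fit/):
#   pack_targets fb-fit
```

Listing the archive with tar -tzf fb-fit/targets.tar.gz before copying is a cheap way to confirm the paths start with targets/.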
Preparing Scripts
The scripts/hpc3_master.sh script is a template file, so you next have to modify it with your account/environment details. Specifically, I updated the #SBATCH -p line to use the free queue instead of the standard queue, and I added my UCI email to the --mail-user line. I also added the -o and -e lines to capture the output and error streams separately in files of my choosing. At least for my first attempt, I also commented out the COMPRESSED_CONDA_ENVIRONMENT stuff, leaving me with a somewhat simpler script than the one currently in the repo:
#!/bin/bash
#SBATCH -J valence-fit
#SBATCH -p free
#SBATCH -t 72:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10000mb
#SBATCH --account dmobley_lab
#SBATCH --export ALL
#SBATCH --mail-user=bwestbr1@uci.edu
#SBATCH --constraint=fastscratch
#SBATCH -o master.out
#SBATCH -e master.err

rm -rf /tmp/$SLURM_JOB_NAME
source $HOME/.bashrc
mamba activate valence-fitting

rsync -avzIi $SLURM_SUBMIT_DIR/optimize.in $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync -avzIi $SLURM_SUBMIT_DIR/targets.tar.gz $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync -avzIi $SLURM_SUBMIT_DIR/forcefield $SLURM_TMPDIR/$SLURM_JOB_NAME

tar -xzf targets.tar.gz

datadir=$(pwd)
mkdir -p $SLURM_SUBMIT_DIR/worker_logs
echo $(hostname) > $SLURM_SUBMIT_DIR/host

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

if ForceBalance.py optimize.in ; then
    tar -czvf optimize.tmp.tar.gz optimize.tmp
    rsync -avzIi --exclude="optimize.tmp" --exclude="optimize.bak" --exclude="fb_193*" --exclude="targets*" $TMPDIR/* $SLURM_SUBMIT_DIR > copy.log
    rm -rf $TMPDIR
fi

echo "All done"
Because I wanted to run the submission scripts from the fb-fit directory instead of the scripts directory, I also edited the call to wq_worker_local.sh in scripts/submit_hpc3_worker_local.sh to point to ../scripts/wq_worker_local.sh. I also commented out the CONDA_ENVIRONMENT stuff therein and ran the script from the proper conda environment. Some of these changes will probably be reflected in the repository in the future.
Submitting
With these preparations done, you can run
sbatch ../scripts/hpc3_master.sh
Once that starts running, it will create a host file containing the HOST argument for submit_hpc3_worker_local.sh. You can obtain the PORT argument from the ForceBalance optimize.in file. Then, run the worker script with
../scripts/submit_hpc3_worker_local.sh HOST PORT
Or for a “one-liner”
../scripts/submit_hpc3_worker_local.sh $(sed 1q host) $(awk '/port/ {print $NF}' optimize.in)
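The one-liner only works once the master job has actually started and written the host file. As a rough sketch of how you might wait for that, here is a small wrapper. worker_args is a hypothetical helper name, and the 5-second polling interval and default of 60 tries are arbitrary choices of mine, not anything the repository prescribes.

```shell
#!/bin/bash
# Sketch: wait for the master job to write the host file, then print
# "HOST PORT" for submit_hpc3_worker_local.sh.
# worker_args is a hypothetical helper, not a script from the repository.
worker_args() {
    local host_file="${1:-host}" input="${2:-optimize.in}" tries="${3:-60}"
    # The master job creates the host file only after it leaves the queue.
    while [ ! -s "$host_file" ] && [ "$tries" -gt 0 ]; do
        sleep 5
        tries=$((tries - 1))
    done
    [ -s "$host_file" ] || { echo "no host file yet" >&2; return 1; }
    local host port
    host=$(sed 1q "$host_file")
    # Last field of the first line mentioning "port" in the input file.
    port=$(awk '/port/ {print $NF; exit}' "$input")
    echo "$host $port"
}

# Usage (from the fb-fit directory):
#   ../scripts/submit_hpc3_worker_local.sh $(worker_args)
```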
Monitoring Progress
If you use the master script from above, the primary output from ForceBalance will be directed to master.out. This is where you will see output like this:
Thu Jul 20 06:27:50 2023 : 1/10 workers busy; 1/2 jobs complete
while jobs are running, and hopefully a steady stream of output like this:
Task 'opt-geo-batch-126' (task 1449) finished successfully on host hpc3-22-07.local (1801 seconds)
Task 'opt-geo-batch-83' (task 1406) finished successfully on host hpc3-14-07.local (2524 seconds)
Task 'opt-geo-batch-101' (task 1424) finished successfully on host hpc3-21-15.local (1773 seconds)
Task 'opt-geo-batch-88' (task 1411) finished successfully on host hpc3-14-06.local (2017 seconds)
Task 'opt-geo-batch-107' (task 1430) finished successfully on host hpc3-22-07.local (2085 seconds)
Task 'opt-geo-batch-102' (task 1425) finished successfully on host hpc3-22-07.local (1956 seconds)
Task 'opt-geo-batch-103' (task 1426) finished successfully on host hpc3-20-07.local (2272 seconds)
Task 'opt-geo-batch-127' (task 1450) finished successfully on host hpc3-14-06.local (2423 seconds)
Task 'opt-geo-batch-111' (task 1434) finished successfully on host hpc3-22-07.local (2992 seconds)
Task 'opt-geo-batch-120' (task 1443) finished successfully on host hpc3-14-06.local (3048 seconds)
when tasks start finishing. However, if you instead see output like this:
Task 'torsion-18536176' (task 18161) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18164
Task 'torsion-18536217' (task 18162) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18165
Task 'opt-geo-batch-113' (task 18163) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18166
Task 'opt-geo-batch-144' (task 18157) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18167
Task 'opt-geo-batch-62' (task 18160) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18168
Task 'torsion-18536176' (task 18164) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18169
Task 'opt-geo-batch-61' (task 18143) failed on host hpc3-22-07.local (11 seconds), resubmitted: taskid 18170
Task 'torsion-18536217' (task 18165) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18171
Task 'opt-geo-batch-113' (task 18166) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18172
it’s time to investigate why the tasks are failing. In this case they fail within seconds of starting, so investigate immediately.
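To get a quick read on which situation you are in without scrolling through the whole file, you can count the two kinds of lines. This is only a sketch that assumes the log lines look exactly like the samples above; task_summary and failing_hosts are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: summarize task outcomes in the ForceBalance master log.
# Both helpers are hypothetical and rely on the "finished successfully on host"
# and "failed on host" phrases seen in the sample output.
task_summary() {
    local log="${1:-master.out}"
    local ok failed
    ok=$(grep -c "finished successfully on host" "$log")
    failed=$(grep -c "failed on host" "$log")
    echo "succeeded=$ok failed=$failed"
}

# Count failures per compute node, most failures first.
failing_hosts() {
    local log="${1:-master.out}"
    grep "failed on host" "$log" \
        | sed -E 's/.*failed on host ([^ ]+).*/\1/' \
        | sort | uniq -c | sort -rn
}

# Usage:
#   task_summary master.out
#   failing_hosts master.out
```

If the failures cluster on one host, the problem may be specific to that node; if they are spread evenly, the input files are the more likely culprit.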
Debugging Failing Jobs
Using our current pipeline, the log files for workers are stored in the /tmp/$USER directory on the compute node. Unfortunately, the log for a process is deleted as soon as that process fails, making debugging tricky.
First, log in to the worker node. SSH access to the compute nodes is turned off on HPC3, so instead find the jobid of the worker and submit an interactive job attached to that job:
srun --pty --jobid jobid --overlap /bin/bash
Navigate to /tmp/$USER and into the directory of the worker job you are interested in (by default named something like worker-2491543-177014). Note that there may be many directories and you'll have to identify which one is of interest. In the worker directory there should be a directory called t.XXXX, where XXXX is the task ID of whatever task is currently running, and inside is that task’s log file.
You can use this script to monitor the log file of a running process. It should be run in the worker directory, and the one command line input is the task ID. The script prints the content of the file as soon as it is written, so that it will be preserved when the process crashes. Because it monitors the file in real time, it may take a while to print something useful.
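#!/bin/bash
# Brent's script to isolate a failing ForceBalance job to try to find the root cause
# $1 is the task number to look for
while true; do
    # take the first match in case find returns more than one directory
    dir=$(find . -name "t.$1" | head -n 1)
    if [ -n "$dir" ]; then
        echo "found $dir"
        # -F keeps following the file by name, even if it appears late
        tail -F "$dir/rtarget.out"
        exit 0
    fi
    sleep 1
done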
Once the final output is printed and the task crashes, you can use the error message to debug the issue.
One common reason for failed jobs, especially if all the jobs are failing, is the presence of constraints in the initial force field.
After the run finishes
ForceBalance output files are long. When your ForceBalance run finishes “normally” (i.e., the job didn’t crash), it will print All done at the end of the file. This doesn’t mean the optimization converged.
If your optimization converged
To find out if the optimization converged, search for Optimization Converged. If it converged, congrats! ForceBalance will print the final value of the objective function in the output file, and will save the final force field to result/optimize/force-field.offxml (unless you specified another location). Time to benchmark your run.
If your optimization didn’t converge
To find out if the optimization failed, search for Convergence Failure. ForceBalance should print the reason for the failure right before the line where it says Convergence Failure. Now it’s time to debug the failure.
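The three markers mentioned so far (All done, Optimization Converged, Convergence Failure) can be checked in one go. Here is a minimal sketch, assuming those strings appear in master.out as described; fb_status is a hypothetical helper name, and the labels it prints are my own.

```shell
#!/bin/bash
# Sketch: classify a ForceBalance run from its master log.
# fb_status is a hypothetical helper; the search strings are the ones
# quoted in the text, matched case-insensitively to be tolerant.
fb_status() {
    local log="${1:-master.out}"
    if grep -qi "Optimization Converged" "$log"; then
        echo "converged"
    elif grep -qi "Convergence Failure" "$log"; then
        echo "convergence-failure"
    elif grep -q "All done" "$log"; then
        # Finished without crashing, but neither marker was found.
        echo "finished-unknown"
    else
        echo "running-or-crashed"
    fi
}

# Usage:
#   fb_status master.out
#   grep -i -B 5 "Convergence Failure" master.out   # show the stated reason
```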
Step size is too small
If ForceBalance prints something like Step size is too small to continue (4.371e-07 < 1.000e-06), it means that it has tried to take many steps in parameter space, but has determined that the optimal step is very small and thus there is likely a problem.
Try running grep "Hessian diagonal search" master.out | less. It should print something like this for every step. Find the last few steps and examine the output.
Starting Hessian diagonal search with step size 1.9969e-04
Hessian diagonal search: H+77321405.2954*I, length 1.9969e-04, result 1.6949e+01
Hessian diagonal search: H+1237353531.9447*I, length 1.2867e-05, result 2.2369e+01
Hessian diagonal search: H+1148869094.2832*I, length 1.3855e-05, result 3.7423e+01
Hessian diagonal search: H+56433052.7945*I, length 2.7047e-04, result 3.1612e+01
Hessian diagonal search: H+356098901.8934*I, length 4.4477e-05, result -4.5025e+00
Hessian diagonal search: H+629945332.6519*I, length 2.5222e-05, result 9.1052e+00
Hessian diagonal search: H+288436713.8387*I, length 5.4817e-05, result 1.3866e+01
Hessian diagonal search: H+451171109.7705*I, length 3.5158e-05, result -6.5705e-01
In this part of the calculation, ForceBalance is trying to determine what size step to take. Here, “length” refers to the proposed step length, and “result” is the resulting change in the objective function. For this step, most of the proposed step sizes lead to an increase in the objective function, rather than a decrease. Additionally, there is a lot of variability in the objective function change, even when the change in step size is very small. This suggests we are on a weird part of the surface.
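Rather than eyeballing the signs, you can tally them for the last step in the log. The following is a sketch that assumes the lines have exactly the format shown in the sample output; hessian_summary is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: for the LAST Hessian diagonal search in the log, count how many
# proposed steps raised the objective function and how many lowered it.
# hessian_summary is a hypothetical helper; it assumes "result" is the
# final field on each "Hessian diagonal search:" line.
hessian_summary() {
    local log="${1:-master.out}"
    awk '
        # A new search begins: reset the counters so only the last one is reported.
        /Starting Hessian diagonal search/ { up = 0; down = 0 }
        # Trial lines have a colon after "search"; the "Starting" line does not.
        /Hessian diagonal search:/ {
            if ($NF + 0 < 0) down++; else up++
        }
        END { printf "increase=%d decrease=%d\n", up, down }
    ' "$log"
}

# Usage:
#   hessian_summary master.out
```

For the sample output this tallies six increases against two decreases, which matches the observation that most proposed steps make the objective function worse.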
One way to fix this is to simply restart from a checkpoint; sometimes it then converges. Another way is to restart the run with a smaller initial step size. We use finite_difference_h 0.01 as a default, so try decreasing it to finite_difference_h 0.001.
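Changing the value is a one-line edit to the input file, but if you'd rather script it, here is a sketch. It assumes finite_difference_h appears on its own line in optimize.in followed by its value; set_fd_h is a hypothetical helper name, and the .bak backup is my own addition.

```shell
#!/bin/bash
# Sketch: change the finite_difference_h value in a ForceBalance input file.
# set_fd_h is a hypothetical helper. It keeps a backup as <input>.bak.
set_fd_h() {
    local value="$1" input="${2:-optimize.in}"
    # Replace only the value that follows the keyword, leaving spacing intact.
    sed -i.bak -E \
        "s/^([[:space:]]*finite_difference_h[[:space:]]+)[^[:space:]]+/\1${value}/" \
        "$input"
    # Show the resulting line so the change can be confirmed by eye.
    grep "finite_difference_h" "$input"
}

# Usage:
#   set_fd_h 0.001 optimize.in
```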
Restarting from a checkpoint
ForceBalance may stop midway through an optimization because the job was killed, the maximum number of steps was reached, etc. Luckily, it saves a checkpoint file called <inputfilename>.sav that records the last round of parameters and can be used to restart the job. Make sure the initial force field is the same as the initial force field from the previous run, as the parameters in the sav file are changes relative to the initial force field parameters.
Assuming your input file is called optimize.in and output files are called master.out and master.err, you can use the following script to move all the files from the previous run to a directory called round1 to keep for your records, and copy optimize.sav to optimize.in to facilitate resubmitting.
#!/bin/bash
mkdir round1
mv optimize* round1
mv master* round1
mv slurm* round1
mv worker_logs round1
mv host round1
mv copy.log round1
mv result round1
cp round1/optimize.sav ./optimize.in
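If you end up restarting more than once, round1 will already exist the second time around. One way to handle that, sketched here under the assumption that you keep the roundN naming, is to pick the first unused number. next_round_dir is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the first roundN directory name that does not exist yet
# in the current directory. next_round_dir is a hypothetical helper.
next_round_dir() {
    local n=1
    while [ -e "round$n" ]; do
        n=$((n + 1))
    done
    echo "round$n"
}

# Usage (in place of the hard-coded round1 in the script above):
#   dir=$(next_round_dir)
#   mkdir "$dir"
#   mv optimize* master* "$dir"
#   cp "$dir/optimize.sav" ./optimize.in
```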