This tutorial assumes you’ve generated ForceBalance inputs using my valence-fitting repository through step 04 and have the corresponding fb-fit, parameters-to-optimize, and scripts directories.

Preparing Files

Copy over the following files and directories:

  • fb-fit/

    • forcefield/force-field.offxml

    • optimize.in

    • targets.tar.gz

    • oe_license.txt (probably not needed, but I had it)

  • parameters-to-optimize/

  • scripts/

The valence-fitting scripts produce the directory fb-fit/targets rather than a tarball, so either tar it up yourself to match the targets.tar.gz entry above or wait for the whole directory to transfer.
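
For example, something like this works from the machine where you ran the valence-fitting steps; the login hostname and destination directory below are placeholders for your own HPC3 account and path:

# pack the targets directory into a single compressed file before transferring
tar -czf fb-fit/targets.tar.gz -C fb-fit targets

# copy the three directories to the cluster (hostname and destination are placeholders)
rsync -avz fb-fit parameters-to-optimize scripts your_ucinetid@hpc3.rcic.uci.edu:valence-fit/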

Preparing Scripts

The scripts/hpc3_master.sh script is a template file, so you next have to modify it with your account/environment details. Specifically, I updated the #SBATCH -p line to use the free queue instead of the standard queue, and I added my UCI email to the --mail-user line. I also added the -o and -e lines to capture the output and error streams separately in files of my choosing. At least for my first attempt, I also commented out the COMPRESSED_CONDA_ENVIRONMENT stuff, leaving me with a somewhat simpler script than the one currently in the repo:

#!/bin/bash
#SBATCH -J valence-fit
#SBATCH -p free
#SBATCH -t 72:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10000mb
#SBATCH --account dmobley_lab
#SBATCH --export ALL
#SBATCH --mail-user=bwestbr1@uci.edu
#SBATCH --constraint=fastscratch
#SBATCH -o master.out
#SBATCH -e master.err

rm -rf /tmp/$SLURM_JOB_NAME
source $HOME/.bashrc
mamba activate valence-fitting

# stage the input files into the job's scratch directory
rsync -avzIi $SLURM_SUBMIT_DIR/optimize.in $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync -avzIi $SLURM_SUBMIT_DIR/targets.tar.gz $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync -avzIi $SLURM_SUBMIT_DIR/forcefield $SLURM_TMPDIR/$SLURM_JOB_NAME

tar -xzf targets.tar.gz

datadir=$(pwd)
mkdir -p $SLURM_SUBMIT_DIR/worker_logs
# record the master node's hostname; this becomes the HOST argument for the worker script
echo $(hostname) > $SLURM_SUBMIT_DIR/host

# keep the numerical libraries single-threaded; parallelism comes from the work queue workers
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# run the optimization; on success, archive the results and copy them back to the submit directory
if ForceBalance.py optimize.in ; then
   tar -czvf optimize.tmp.tar.gz optimize.tmp
   rsync -avzIi --exclude="optimize.tmp" --exclude="optimize.bak" --exclude="fb_193*" --exclude="targets*" $TMPDIR/* $SLURM_SUBMIT_DIR > copy.log
   rm -rf $TMPDIR
fi

echo "All done"

Because I wanted to run the submission scripts from the fb-fit directory instead of the scripts directory, I also edited the call to wq_worker_local.sh in scripts/submit_hpc3_worker_local.sh to point to ../scripts/wq_worker_local.sh, as sketched below. As in the master script, I commented out the CONDA_ENVIRONMENT lines and instead ran the script from the proper conda environment. Some of these changes will probably be reflected in the repository in the future.
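
The exact contents of submit_hpc3_worker_local.sh may differ in the current repository, so treat this as an illustrative sketch rather than a copy of the script; the argument forwarding shown here is assumed, and the only point is that the call now reaches back into the scripts directory:

# before (run from inside scripts/):
#   bash wq_worker_local.sh "$@"
# after (run from fb-fit/):
bash ../scripts/wq_worker_local.sh "$@"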

Submitting

With these preparations done, you can run

sbatch ../scripts/hpc3_master.sh

Once that starts running, it will create a host file containing the HOST argument for submit_hpc3_worker_local.sh. You can obtain the PORT argument from the ForceBalance optimize.in file. Then, run the worker script with

../scripts/submit_hpc3_worker_local.sh HOST PORT

Or, as a “one-liner”:

../scripts/submit_hpc3_worker_local.sh $(sed 1q host) $(awk '/port/ {print $NF}' optimize.in)
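
The one-liner just takes the first line of the host file and the last field of whatever line in optimize.in mentions the port. In a typical ForceBalance input that line sits in the $options block as the wq_port option and looks something like this (the port number here is only an example):

wq_port 55125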

Monitoring Progress

If you use the master script from above, the primary output from ForceBalance will be directed to master.out. This is where you will see output like this:

Thu Jul 20 06:27:50 2023 : 1/10 workers busy; 1/2 jobs complete

while jobs are running, and hopefully a steady stream of output like this:

Task 'opt-geo-batch-126' (task 1449) finished successfully on host hpc3-22-07.local (1801 seconds)
Task 'opt-geo-batch-83' (task 1406) finished successfully on host hpc3-14-07.local (2524 seconds)
Task 'opt-geo-batch-101' (task 1424) finished successfully on host hpc3-21-15.local (1773 seconds)
Task 'opt-geo-batch-88' (task 1411) finished successfully on host hpc3-14-06.local (2017 seconds)
Task 'opt-geo-batch-107' (task 1430) finished successfully on host hpc3-22-07.local (2085 seconds)
Task 'opt-geo-batch-102' (task 1425) finished successfully on host hpc3-22-07.local (1956 seconds)
Task 'opt-geo-batch-103' (task 1426) finished successfully on host hpc3-20-07.local (2272 seconds)
Task 'opt-geo-batch-127' (task 1450) finished successfully on host hpc3-14-06.local (2423 seconds)
Task 'opt-geo-batch-111' (task 1434) finished successfully on host hpc3-22-07.local (2992 seconds)
Task 'opt-geo-batch-120' (task 1443) finished successfully on host hpc3-14-06.local (3048 seconds)

when tasks start finishing. However, if you instead see output like this:

Task 'torsion-18536176' (task 18161) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18164
Task 'torsion-18536217' (task 18162) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18165
Task 'opt-geo-batch-113' (task 18163) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18166
Task 'opt-geo-batch-144' (task 18157) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18167
Task 'opt-geo-batch-62' (task 18160) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18168
Task 'torsion-18536176' (task 18164) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18169
Task 'opt-geo-batch-61' (task 18143) failed on host hpc3-22-07.local (11 seconds), resubmitted: taskid 18170
Task 'torsion-18536217' (task 18165) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18171
Task 'opt-geo-batch-113' (task 18166) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18172

it’s time to investigate why the tasks are failing. In this case the tasks fail within a couple of seconds of starting, which means something is going wrong immediately rather than partway through a calculation.
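
Before digging into failures, a few ordinary shell commands run from the fb-fit directory are enough to keep an eye on progress; nothing here is specific to ForceBalance:

tail -f master.out                           # follow the ForceBalance log as it grows
grep -c "finished successfully" master.out   # count completed tasks
grep -c "failed on host" master.out          # count failures and resubmissions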

Debugging Failing Jobs
