...

The scripts/hpc3_master.sh script is a template, so the next step is to fill in your own account and environment details. Specifically, I changed the #SBATCH -p line to use the free queue instead of the standard queue and added my UCI email to the --mail-user line. I also added the -o and -e lines to capture the output and error streams in separate files of my choosing. At least for my first attempt, I also commented out the COMPRESSED_CONDA_ENVIRONMENT section, leaving me with a somewhat simpler script than the one currently in the repo:

Code Block
languagebash
#!/bin/bash
#SBATCH -J valence-fit
#SBATCH -p free
#SBATCH -t 72:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10000mb
#SBATCH --account dmobley_lab
#SBATCH --export ALL
#SBATCH --mail-user=bwestbr1@uci.edu
#SBATCH --constraint=fastscratch
#SBATCH -o master.out
#SBATCH -e master.err

rm -rf /tmp/$SLURM_JOB_NAME
source $HOME/.bashrc
mamba activate valence-fitting

rsync  -avzIi  $SLURM_SUBMIT_DIR/optimize.in  $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync  -avzIi  $SLURM_SUBMIT_DIR/targets.tar.gz  $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync  -avzIi  $SLURM_SUBMIT_DIR/forcefield  $SLURM_TMPDIR/$SLURM_JOB_NAME

tar -xzf targets.tar.gz

datadir=$(pwd)
mkdir -p $SLURM_SUBMIT_DIR/worker_logs
echo $(hostname) > $SLURM_SUBMIT_DIR/host

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

if ForceBalance.py optimize.in ; then
   tar -czvf optimize.tmp.tar.gz optimize.tmp
   rsync  -avzIi --exclude="optimize.tmp" --exclude="optimize.bak" --exclude="fb_193*" --exclude="targets*" $TMPDIR/* $SLURM_SUBMIT_DIR > copy.log
   rm -rf $TMPDIR
fi

echo "All done"

...

Code Block
languagebash
../scripts/submit_hpc3_worker_local.sh $(sed 1q host) $(awk '/port/ {print $NF}' optimize.in)
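To see what the two command substitutions hand to the worker script, it can help to run them on their own. The sketch below recreates them against throwaway files. The hostname, the port number, and the exact shape of the port line in optimize.in are assumptions made up for illustration, not values from a real run.

```bash
#!/bin/bash
# Illustration only: these files are hypothetical stand-ins for the real
# `host` and `optimize.in` files in the submit directory.
workdir=$(mktemp -d)

# The master script writes its hostname to `host`.
echo "hpc3-00-00.local" > "$workdir/host"

# Assumed shape of the port line in optimize.in (the value is made up).
printf '%s\n' '$options' 'wq_port 55125' '$end' > "$workdir/optimize.in"

# First line of `host`: where the master is running.
master_host=$(sed 1q "$workdir/host")

# Last field of every line containing "port": the port the master listens on.
master_port=$(awk '/port/ {print $NF}' "$workdir/optimize.in")

echo "workers would connect to $master_host on port $master_port"
rm -rf "$workdir"
```

Note that the awk pattern prints a value for every line that contains "port", so if optimize.in ever has more than one such line, the worker script will receive extra arguments.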

Monitoring Progress

If you use the master script from above, the primary output from ForceBalance will be directed to master.out. This is where you will see output like this:

Code Block
Thu Jul 20 06:27:50 2023 : 1/10 workers busy; 1/2 jobs complete

while jobs are running, and hopefully a steady stream of output like this:

Code Block
Task 'opt-geo-batch-126' (task 1449) finished successfully on host hpc3-22-07.local (1801 seconds)
Task 'opt-geo-batch-83' (task 1406) finished successfully on host hpc3-14-07.local (2524 seconds)
Task 'opt-geo-batch-101' (task 1424) finished successfully on host hpc3-21-15.local (1773 seconds)
Task 'opt-geo-batch-88' (task 1411) finished successfully on host hpc3-14-06.local (2017 seconds)
Task 'opt-geo-batch-107' (task 1430) finished successfully on host hpc3-22-07.local (2085 seconds)
Task 'opt-geo-batch-102' (task 1425) finished successfully on host hpc3-22-07.local (1956 seconds)
Task 'opt-geo-batch-103' (task 1426) finished successfully on host hpc3-20-07.local (2272 seconds)
Task 'opt-geo-batch-127' (task 1450) finished successfully on host hpc3-14-06.local (2423 seconds)
Task 'opt-geo-batch-111' (task 1434) finished successfully on host hpc3-22-07.local (2992 seconds)
Task 'opt-geo-batch-120' (task 1443) finished successfully on host hpc3-14-06.local (3048 seconds)

when tasks start finishing. However, if you instead see output like this:

Code Block
Task 'torsion-18536176' (task 18161) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18164
Task 'torsion-18536217' (task 18162) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18165
Task 'opt-geo-batch-113' (task 18163) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18166
Task 'opt-geo-batch-144' (task 18157) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18167
Task 'opt-geo-batch-62' (task 18160) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18168
Task 'torsion-18536176' (task 18164) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18169
Task 'opt-geo-batch-61' (task 18143) failed on host hpc3-22-07.local (11 seconds), resubmitted: taskid 18170
Task 'torsion-18536217' (task 18165) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18171
Task 'opt-geo-batch-113' (task 18166) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18172

it’s time to investigate why the tasks are failing. In this case they fail almost immediately (within 2 to 11 seconds), get resubmitted, and then fail again.
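A quick way to tell which situation you are in is to count the two kinds of lines in master.out and tally the failures per host. The following is a minimal sketch that assumes the log lines look exactly like the samples above. The function name summarize_tasks is my own, and the sample log is built from a few of the lines shown earlier.

```bash
#!/bin/bash
# Summarize task outcomes in a ForceBalance master log.
# Assumes the "finished successfully on host" / "failed on host" formats shown above.
summarize_tasks() {
    local log=$1
    local ok failed
    ok=$(grep -c "finished successfully on host" "$log")
    failed=$(grep -c "failed on host" "$log")
    echo "finished: $ok"
    echo "failed: $failed"
    # Tally failures per host: the hostname is the field right after the word "host".
    awk '/failed on host/ {
             for (i = 1; i <= NF; i++) if ($i == "host") count[$(i + 1)]++
         }
         END { for (h in count) print count[h], h }' "$log" | sort -rn
}

# Small sample log assembled from lines shown earlier on this page.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Task 'opt-geo-batch-126' (task 1449) finished successfully on host hpc3-22-07.local (1801 seconds)
Task 'torsion-18536176' (task 18161) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18164
Task 'torsion-18536217' (task 18162) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18165
Task 'opt-geo-batch-113' (task 18163) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18166
EOF

summarize_tasks "$sample"
```

On a real run you would call `summarize_tasks master.out` from the submit directory. If nearly all of the failures land on one host, the problem is more likely that node than your inputs.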

Debugging Failing Jobs