...
The `scripts/hpc3_master.sh` script is a template, so you next have to modify it with your account/environment details. Specifically, I updated the `#SBATCH -p` line to use the `free` queue instead of the `standard` queue, and I added my UCI email to the `--mail-user` line. I also added the `-o` and `-e` lines to capture the output and error streams separately in files of my choosing. At least for my first attempt, I also commented out the `COMPRESSED_CONDA_ENVIRONMENT` stuff, leaving me with a somewhat simpler script than the one currently in the repo:
```bash
#!/bin/bash
#SBATCH -J valence-fit
#SBATCH -p free
#SBATCH -t 72:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10000mb
#SBATCH --account dmobley_lab
#SBATCH --export ALL
#SBATCH --mail-user=bwestbr1@uci.edu
#SBATCH --constraint=fastscratch
#SBATCH -o master.out
#SBATCH -e master.err

# clean up any leftovers from a previous run and activate the environment
rm -rf /tmp/$SLURM_JOB_NAME
source $HOME/.bashrc
mamba activate valence-fitting

# copy the ForceBalance inputs to node-local scratch
rsync -avzIi $SLURM_SUBMIT_DIR/optimize.in $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync -avzIi $SLURM_SUBMIT_DIR/targets.tar.gz $SLURM_TMPDIR/$SLURM_JOB_NAME
rsync -avzIi $SLURM_SUBMIT_DIR/forcefield $SLURM_TMPDIR/$SLURM_JOB_NAME

tar -xzf targets.tar.gz

datadir=$(pwd)
mkdir -p $SLURM_SUBMIT_DIR/worker_logs
# record the master's hostname so workers know where to connect
echo $(hostname) > $SLURM_SUBMIT_DIR/host

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# run the fit; on success, archive and copy the results back to the submit directory
if ForceBalance.py optimize.in ; then
    tar -czvf optimize.tmp.tar.gz optimize.tmp
    rsync -avzIi --exclude="optimize.tmp" --exclude="optimize.bak" --exclude="fb_193*" --exclude="targets*" $TMPDIR/* $SLURM_SUBMIT_DIR > copy.log
    rm -rf $TMPDIR
fi

echo "All done"
```
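With those edits in place, the job can be submitted with `sbatch` from the directory containing `optimize.in`, `targets.tar.gz`, and `forcefield/`, since the script reads its inputs from `$SLURM_SUBMIT_DIR`. A minimal sketch, assuming the edited script was saved as `hpc3_master.sh` next to those files:

```bash
# submit the master job from the fitting directory; adjust the path if you
# kept the script under scripts/ instead
sbatch hpc3_master.sh

# check on the job in the free queue afterwards
squeue -u $USER
```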
...
```bash
../scripts/submit_hpc3_worker_local.sh $(sed 1q host) $(awk '/port/ {print $NF}' optimize.in)
```
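For reference, the two command substitutions just pull out the master's hostname and the Work Queue port: `host` is the one-line file written by the master script above, and the port comes from `optimize.in`. A quick illustration, assuming `optimize.in` sets the port with a line like `wq_port 9230` (the hostname and port values here are made up):

```bash
sed 1q host                            # first line of the `host` file, e.g. hpc3-14-00.local
awk '/port/ {print $NF}' optimize.in   # last field of the port line, e.g. 9230
```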
Monitoring Progress
If you use the master script from above, the primary output from ForceBalance will be directed to `master.out`. This is where you will see output like this:
```
Thu Jul 20 06:27:50 2023 : 1/10 workers busy; 1/2 jobs complete
```
while jobs are running, and hopefully a steady stream of output like this:
```
Task 'opt-geo-batch-126' (task 1449) finished successfully on host hpc3-22-07.local (1801 seconds)
Task 'opt-geo-batch-83' (task 1406) finished successfully on host hpc3-14-07.local (2524 seconds)
Task 'opt-geo-batch-101' (task 1424) finished successfully on host hpc3-21-15.local (1773 seconds)
Task 'opt-geo-batch-88' (task 1411) finished successfully on host hpc3-14-06.local (2017 seconds)
Task 'opt-geo-batch-107' (task 1430) finished successfully on host hpc3-22-07.local (2085 seconds)
Task 'opt-geo-batch-102' (task 1425) finished successfully on host hpc3-22-07.local (1956 seconds)
Task 'opt-geo-batch-103' (task 1426) finished successfully on host hpc3-20-07.local (2272 seconds)
Task 'opt-geo-batch-127' (task 1450) finished successfully on host hpc3-14-06.local (2423 seconds)
Task 'opt-geo-batch-111' (task 1434) finished successfully on host hpc3-22-07.local (2992 seconds)
Task 'opt-geo-batch-120' (task 1443) finished successfully on host hpc3-14-06.local (3048 seconds)
```
when tasks start finishing. However, if you instead see output like this:
```
Task 'torsion-18536176' (task 18161) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18164
Task 'torsion-18536217' (task 18162) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18165
Task 'opt-geo-batch-113' (task 18163) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18166
Task 'opt-geo-batch-144' (task 18157) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18167
Task 'opt-geo-batch-62' (task 18160) failed on host hpc3-21-23.local (4 seconds), resubmitted: taskid 18168
Task 'torsion-18536176' (task 18164) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18169
Task 'opt-geo-batch-61' (task 18143) failed on host hpc3-22-07.local (11 seconds), resubmitted: taskid 18170
Task 'torsion-18536217' (task 18165) failed on host hpc3-22-07.local (2 seconds), resubmitted: taskid 18171
Task 'opt-geo-batch-113' (task 18166) failed on host hpc3-21-15.local (2 seconds), resubmitted: taskid 18172
```
it’s time to investigate why the tasks are failing. In this case the tasks die within a few seconds and are immediately resubmitted, only to fail again, so it’s worth stopping to investigate right away rather than letting the resubmission loop churn.
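When tasks fail that quickly, a reasonable first step is to look at the worker side rather than the master. A hedged sketch of where I would start, assuming the workers write their logs into the `worker_logs/` directory created by the master script (the exact log file names depend on the worker submission script):

```bash
# how many failures the master has recorded so far
grep -c "failed on host" master.out

# inspect the most recently modified worker log for a traceback
# or a missing-module / activation error
tail -n 50 worker_logs/"$(ls -t worker_logs/ | head -n 1)"
```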