🗓 Date

👥 Participants

...

🗣 Discussion topics


Updates from MolSSI

  • BP – No real updates. Still working on the next branch. Took some time off last week. The server seems to be running fine; I ran backups over the weekend. Still have plenty of space.

  • BP – Made an offer to a postdoc candidate, and he accepted. Some details about his project aren’t settled yet; he may work on this project, or he may work on another. He will begin within the next two months.

    • JW – Let me know if you’d like OpenFF’s involvement in the management of this postdoc.

Infrastructure advances

  • BP – Same as above.

  • H – I couldn’t get the master branch to work at all, but I got the next branch working.

Throughput status

  1. New OpenFF sets from Jessica:

    • OpenFF multiplicity correction torsion drive data v1.1 - 92/99 TDs were done in v1.0; the remaining errors were due to either in-ring torsions or strong internal H-bonds. PB made a new submission after getting feedback from Jessica. The new set totals 131 torsion scans, including the previous 99.

  2. New OpenFF sets from Chapin:

    • OpenFF Protein Capped 1-mers 3-mers Optimization Dataset v1.0 - the remaining 3 were geometry convergence errors; the dataset was moved to end-of-life.

    • OpenFF Protein Capped 3-mer Backbones v1.0 - 0/54 TDs complete. Optimizations are up to 91,884 from 19,493 in the last two weeks. May need deeper inspection.

  3. SPICE sets: around 73K calculations in the last two weeks

    • SPICE PubChem Set 4 Single Points Dataset v1.2: 20 persistent errors, 6 stale jobs.

    • SPICE PubChem Set 5 Single Points Dataset v1.2: up to 122,976 from 80,892; around 21 errors and 153 stale jobs (incomplete).

    • SPICE PubChem Set 6 Single Points Dataset v1.2: up to 31,041 from 0; ~92K remaining.

    • BP – I’ll run the stuck-jobs script to resume the stale jobs (a rough sketch of such a script follows this list).
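
A rough sketch of what the stuck-jobs reset might look like, assuming qcportal’s PortalClient from the next branch; the server address is a placeholder, the query_records/reset_records usage is an assumption, and the real script may detect staleness differently (e.g., via manager heartbeats).

```python
# Hypothetical stuck-jobs script; assumes qcportal's PortalClient from the
# `next` branch. The address, filters, and staleness logic are assumptions.
from qcportal import PortalClient

client = PortalClient("https://qcfractal.example.org")  # placeholder address

# Records still marked "running" whose manager has gone away are stale;
# a real script would presumably check manager liveness before resetting.
stale_ids = [rec.id for rec in client.query_records(status="running")]

# Reset the stale records so managers can claim them again.
client.reset_records(stale_ids)
print(f"Requested reset of {len(stale_ids)} records")
```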

User questions

  • H – Questions about how the server interacts with managers - I couldn’t find documentation on how the server decides where to send jobs. E.g., Fireworks has lots of workflow-management capabilities; can we use all of those?

    • BP – The server itself doesn’t “send out” tasks. Instead, you start a manager, and the manager contacts the server and pulls down tasks. It’s up to the manager to decide how to run them (it could use Fireworks, Dask, or Parsl).

    • H – So, if you have multiple managers running, do they just pull down the next available job? Or can they inspect their own resources to decide which job to run?

    • BP – Each manager has a list of tags, which are freeform strings. Managers use these tags to decide which tasks to pull down (see the pull-model sketch after this section).

    • H – Regarding Parsl - we see that it’s recommended for HPC environments. In using it, I see that it has a node limitation. Is there a reason it’s recommended instead of Dask?

    • BP – I’m not aware of that Parsl limitation. On my HPC with Slurm it’s always worked.

    • H – I’m interested to see how other people do it. Heejune (LPW lab) submits managers as jobs.

    • BP – That also works. Our recommended usage requires having a manager running on a head node, which a lot of HPC admins don’t like.

    • PB – I generally submit the manager using the “pool” setting:

      • https://github.com/pavankum/blogpost/blob/main/Step3_dataset_submission_on_cluster.ipynb

      • BP – That works, though it will be limited if the worker nodes can’t reach out to the QCF server.

    • BP – The big mistake I see a lot is people trying to run the server on a worker node. This is bad because the server then sometimes stores the database on a networked disk. That means slow disk access, which causes a lot of trouble, since Postgres needs fast disk access.

      • H – What would be symptoms to look for here?

      • BP – I can’t exactly recall, but the place to look would be the postgres logs.

  • JW – Plot of throughput by month/day for the past two years?

    • BP – I can do that. I could run a query for job completion by hostname and count how many OpenFF jobs there are (see the query sketch after this section).

    • JW – Thanks! My talk will be in 7 days, so if you can send that this week it would be super helpful.
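
Regarding the server/manager question above: a purely illustrative Python sketch of the tag-based pull model BP described. This is not QCFractal’s actual code; the class and method names here are invented for illustration.

```python
# Purely illustrative sketch of the pull model: the server never pushes
# work; it only answers claim requests, and each manager filters by tags.
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: int
    tag: str  # freeform routing tag set at submission time

@dataclass
class Server:
    queue: list = field(default_factory=list)

    def claim_tasks(self, tags, limit):
        """Hand out queued tasks whose tag matches the requesting manager."""
        matched = [t for t in self.queue if t.tag in tags][:limit]
        for t in matched:
            self.queue.remove(t)
        return matched

@dataclass
class Manager:
    name: str
    tags: tuple  # the manager decides what it is willing to run

    def pull(self, server, open_slots):
        # The manager chooses when and how much to pull; how tasks are then
        # executed (Fireworks, Dask, Parsl, ...) is also its decision.
        return server.claim_tasks(self.tags, open_slots)

server = Server([Task(1, "openff"), Task(2, "spice"), Task(3, "openff")])
mgr = Manager("hpc-worker", ("openff",))
print([t.task_id for t in mgr.pull(server, open_slots=2)])  # -> [1, 3]
```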
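
For the throughput plot, a hedged sketch of the kind of query BP mentioned, again assuming qcportal’s PortalClient from the next branch; the query_records parameters and the manager_name/modified_on record attributes are assumptions, not confirmed in this meeting.

```python
# Hypothetical throughput query; assumes qcportal's PortalClient and that
# records expose manager_name and modified_on. The address is a placeholder.
from collections import Counter
from datetime import datetime
from qcportal import PortalClient

client = PortalClient("https://qcfractal.example.org")

# Tally completed records per month and manager hostname, so OpenFF-run
# managers can be separated out afterwards.
by_month = Counter()
for rec in client.query_records(status="complete",
                                modified_after=datetime(2020, 1, 1)):
    by_month[(rec.modified_on.strftime("%Y-%m"), rec.manager_name)] += 1

for (month, manager), count in sorted(by_month.items()):
    print(month, manager, count)
```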

Science support needs

✅ Action items

  •  

⤴ Decisions