These will be limited to 100 characters and are all lowercased
The 100 character limit will fail for molecules with 30-40 heavy atoms
The lowercasing will mangle SMILES, because upper/lower case distinguishes aliphatic from aromatic atoms. We could reconstruct the aromaticity by writing explicit-hydrogen SMILES, but this would hit the character limit and require some weird interpretation.
For now, the molecule name will be the canonical isomeric SMILES
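To make the two failure modes above concrete, here is a minimal pure-Python sketch. It assumes nothing about the actual submission code: `MAX_NAME_LENGTH` and `name_survives` are hypothetical names used only for illustration, and no cheminformatics toolkit is involved.

```python
# Hypothetical helper, not part of any existing script.
MAX_NAME_LENGTH = 100  # the character limit mentioned in the notes


def name_survives(smiles: str, max_length: int = MAX_NAME_LENGTH) -> bool:
    """Return True if the SMILES fits the limit and is unchanged by lowercasing."""
    return len(smiles) <= max_length and smiles == smiles.lower()


cyclohexane = "C1CCCCC1"  # aliphatic ring: uppercase C
benzene = "c1ccccc1"      # aromatic ring: lowercase c

# Lowercasing turns the cyclohexane string into the benzene string,
# so two different molecules end up with the same name.
assert cyclohexane.lower() == benzene

# Two-letter element symbols are damaged too: "Cl" becomes "cl".
chloroethane = "CCCl"
assert chloroethane.lower() == "cccl"

# A string longer than the limit fails regardless of case.
assert not name_survives("C" * 101)
assert not name_survives(cyclohexane)
assert name_survives(benzene)
```

A check like this could flag names that would be altered before they are stored, but it only detects the problem; it does not recover the lost aromaticity.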
We’re making a “polishing” script and will commit it to a branch.
Should we include hessians?
JW – Let’s not, for now, since this is a bit experimental
There are lots of options in Josh’s JSON user optimization_procedure. Should we include all of them?
Yes
Why are IDs being appended with characters (like a, b, c)? See the polishing script’s increment_mapping and such.
We’d prefer for these increments to be integers (like, when the same molecule appears many times), but there’s a comment here that indicates that this is a SECOND layer of mapping, on top of integers. So we’re going to leave this as-is.
Notes for Josh
BP – program, method, basis, driver keywords should move into QC spec.
TG – Should be able to support multiple specs in one submission.
Where should metadata be injected? (it’s probably obvious, we just haven’t dealt with it)
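One way to picture the QC spec notes above is the sketch below. It is only an assumption about the shape such a structure might take: the class names `QCSpec` and `Submission`, the field layout, and all example values are hypothetical placeholders, not Josh's actual schema. It also reads "driver keywords" as two separate fields, which may not be what was meant.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class QCSpec:
    """Hypothetical container for the fields BP listed."""
    program: str
    method: str
    basis: str
    driver: str
    keywords: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Submission:
    """Hypothetical submission carrying several specs, per TG's note."""
    dataset_name: str
    qc_specs: List[QCSpec] = field(default_factory=list)


# Placeholder values for illustration only.
submission = Submission(
    dataset_name="example-dataset",
    qc_specs=[
        QCSpec(program="psi4", method="b3lyp-d3bj", basis="dzvp", driver="gradient"),
        QCSpec(program="psi4", method="hf", basis="sto-3g", driver="gradient"),
    ],
)

# Each spec is independent, so one submission can cover several levels of theory.
methods = [spec.method for spec in submission.qc_specs]
```

Where the metadata should live (on the submission, on each spec, or elsewhere) is left open here, as in the notes.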
Notes from David + Trevor:
Trevor:
we want to aim for COMPLETE on all datasets
for cases where COMPLETE is not possible for a molecule, drop in a new version of dataset
major version e.g. 1.0 would give us an indication of complete