MT – Could be handy if you have existing code that does this/shows what you want.
LM – So far, I run a SMIRKS search over the QM dataset to build these filtered sets. BW also had something that slices by parameter.
BW – Yeah, I go through QCA IDs and search by SMILES. This would seem to be best as a post-processing step, since otherwise it’d just use a lot of disk space.
MT – We could implement tagging by parameter, or searching by SMIRKS. I wasn’t thinking of storing this in the database, but rather in a second location.
LM – Yeah, agree this would be best in post processing.
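(Sketch added to illustrate the "tag by parameter in a second location" idea: a minimal post-processing step that inverts a record-to-parameters mapping into a parameter-to-records index and writes it to a sidecar JSON file. The record IDs, parameter IDs, file name, and function names are hypothetical, and how the labels are obtained is left out. This is not existing code from anyone in the meeting.)

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def build_parameter_index(labels_by_record):
    """Invert {record_id: [parameter_id, ...]} into {parameter_id: [record_id, ...]}."""
    index = defaultdict(list)
    for record_id, parameter_ids in labels_by_record.items():
        # A record may exercise the same parameter several times; tag it once.
        for parameter_id in set(parameter_ids):
            index[parameter_id].append(record_id)
    # Sort for a stable, diff-friendly file.
    return {pid: sorted(ids) for pid, ids in sorted(index.items())}


def write_index(index, path):
    """Store the index outside the main database, as a plain JSON sidecar file."""
    Path(path).write_text(json.dumps(index, indent=2))


def load_index(path):
    return json.loads(Path(path).read_text())


# Hypothetical labels: which parameters each record exercises.
labels = {
    "rec-001": ["b1", "a2", "t17"],
    "rec-002": ["b1", "t17", "t17"],
    "rec-003": ["a2"],
}
index = build_parameter_index(labels)

with tempfile.TemporaryDirectory() as tmp:
    sidecar = Path(tmp) / "parameter_index.json"
    write_index(index, sidecar)
    reloaded = load_index(sidecar)
```

Looking up `reloaded["t17"]` then gives the records that exercise that parameter without touching the dataset itself, and the index can be regenerated whenever the force field changes.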
JW – Is OFFMol I/O being slow an issue here? I’ve put very little work into optimizing this.
BW – My normal thing is to … Takes 45 mins on 9800 mols.
LW – 45 minutes is a bit long; 5 minutes wouldn’t be bad.
…
(The slow part is molecule.from_smiles.)
JW – 5 min for 10k mols is quite slow for toolkits like RDKit. It might be easier not to go through OpenFF.
BW – I’d tried using RDKit before but there was lots of back-and-forth
LW – The slow/hard part was kekulization.
BW – I use AddHs, then sanitize.
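(Sketch added to illustrate filtering directly in RDKit, as suggested above. It is an assumption about the workflow, not BW's or LM's actual code: the SMILES list and the pattern are made up, and the SMIRKS is simply parsed as SMARTS. RDKit's own aromaticity perception is used, which may differ from what the OpenFF toolkit would match.)

```python
from rdkit import Chem


def filter_by_smarts(smiles_list, pattern):
    """Return the SMILES whose molecules contain the given SMARTS/SMIRKS pattern."""
    query = Chem.MolFromSmarts(pattern)
    if query is None:
        raise ValueError(f"could not parse pattern: {pattern!r}")

    hits = []
    for smiles in smiles_list:
        # MolFromSmiles sanitizes by default (including kekulization)
        # and returns None if parsing or sanitization fails.
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        # Make hydrogens explicit so connectivity primitives such as X2
        # count them, as force field SMIRKS typically expect.
        mol = Chem.AddHs(mol)
        if mol.HasSubstructMatch(query):
            hits.append(smiles)
    return hits


# Hypothetical input: keep molecules with a carbon bonded to a hydroxyl oxygen.
hits = filter_by_smarts(["CCO", "c1ccccc1", "CC(=O)O"], "[#6:1]-[#8X2H1:2]")
```

Here ethanol and acetic acid match and benzene does not. Unparseable SMILES are skipped silently in this sketch; a real filter would probably want to log them.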
LW (irrelevant side note) – ff.label_molecules(Topology.from_molecules([...])) could be hacked to create a super RDKit mol that could be labelled all at once.