Probably not worth spending too much extra time on this
BW: the scaling difference is what worries me most. It’s probably not a unit issue since it’s not constant, which also means coming up with some kind of scaling modifier wouldn’t be straightforward
BW will upload scripts + necessary files to this document for searchability
msm_torsions.tar.gz
BW: one possibility is espaloma
Torsion splitting in 2.2
BW: split the torsions and set up the valence force field for it
BW: ended up with ~38 new parameters
LW: next step: check that everything has data
Training and benchmark
LW: there are some torsions in Sage that don’t have commas but could still cover multiple multiplicities. These weren’t picked up earlier by Meghan, so it’s worth going back to have a look
BW: I have access to the folder. Haven’t looked at them myself. Should be pretty straightforward
LW: next step after that probably a refit + benchmark
QCA dataset statuses
All done!
Fragment dataset curation code
BW: inserting RecapDecomposition at the start of the process
BW: previously loading molecules from ChEMBL, storing SMILES in database. Then run query for SMARTS of interest, doing fragmentation on the fly. However, Recap is very slow. 180 molecules took 15 min.
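Note (illustrative sketch, not BW's actual code): one way to read "inserting RecapDecomposition at the start of the process" is to fragment each molecule once at ingest and store the fragments next to the parent SMILES, so later queries never re-run the slow step. The table names, `build_fragment_cache`, and the toy fragmenter below are all hypothetical; in practice the callable would wrap the real fragmentation routine (for example RDKit's Recap decomposition).

```python
import sqlite3
from typing import Callable, Iterable


def build_fragment_cache(
    conn: sqlite3.Connection,
    smiles_list: Iterable[str],
    fragmenter: Callable[[str], Iterable[str]],
) -> None:
    """Fragment each new molecule once at ingest and store the results."""
    # Hypothetical schema: one row per parent molecule, one row per fragment.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS molecules "
        "(id INTEGER PRIMARY KEY, smiles TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fragments "
        "(molecule_id INTEGER REFERENCES molecules(id), smiles TEXT, "
        "UNIQUE(molecule_id, smiles))"
    )
    for smiles in smiles_list:
        cur = conn.execute(
            "INSERT OR IGNORE INTO molecules (smiles) VALUES (?)", (smiles,)
        )
        if cur.rowcount == 0:
            continue  # already ingested, so skip the expensive fragmentation
        mol_id = cur.lastrowid
        conn.executemany(
            "INSERT OR IGNORE INTO fragments VALUES (?, ?)",
            [(mol_id, frag) for frag in fragmenter(smiles)],
        )
    conn.commit()


def cached_fragments(conn: sqlite3.Connection) -> list[str]:
    """Return the distinct stored fragments without re-fragmenting anything."""
    rows = conn.execute("SELECT DISTINCT smiles FROM fragments ORDER BY smiles")
    return [row[0] for row in rows]


# Toy stand-in for the slow step: it only splits disconnected components
# and records each call so we can see how often it runs.
calls: list[str] = []


def toy_fragmenter(smiles: str) -> list[str]:
    calls.append(smiles)
    return smiles.split(".")


conn = sqlite3.connect(":memory:")
build_fragment_cache(conn, ["CCO.CC", "c1ccccc1"], toy_fragmenter)
build_fragment_cache(conn, ["CCO.CC"], toy_fragmenter)  # second pass is a no-op
fragments = cached_fragments(conn)
```

The trade-off is a slower one-time ingest in exchange for fast repeated queries. A SMARTS search would then run against the stored fragments instead of fragmenting on the fly.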
LW: last time we discussed a project using a different fragmentation algorithm – RecapDecomposition gives us larger molecules or fragments than we’d like. Something that fragments more on rotatable bonds may give us smaller fragments. Ideally what we’re after is something like what XtalPi used to generate their elementary and secondary fragments
BW: all my dataset processing code is in Rust. I can write a Python interface to the database.
LW: OpenFF will find it hard to maintain any code outside Python.
BW: it’s a little hard for me to wrap my head around preferring Python. When I was first looking at dataset curation, I couldn’t parse the SDF file with our toolkit. Using RDKit directly, processing the SDF file and converting everything to RDKit Molecules + sanitizing would take 36-48 hours. Whereas using C++ or Rust, it would have taken 8 minutes.
LW: let’s stick to Python for now.
LW: next steps – focus on fragmentation algorithms, take small subset of 1000 molecules or so and check which algorithm gives best results. May have to write our own.
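Note (illustrative sketch of the comparison in the next step above): draw a reproducible subset, run each candidate algorithm over it, and tabulate fragment counts, sizes, and wall time. `compare_fragmenters`, the toy algorithms, and the use of string length as a size measure are placeholders; a real run would plug in the actual fragmenters and something like heavy-atom count.

```python
import random
import statistics
import time
from typing import Callable

Fragmenter = Callable[[str], list[str]]


def compare_fragmenters(
    smiles: list[str],
    algorithms: dict[str, Fragmenter],
    size_of: Callable[[str], int] = len,  # placeholder size metric
    n_sample: int = 1000,
    seed: int = 0,
) -> dict[str, dict[str, float]]:
    """Run every algorithm on the same random subset and summarise the output."""
    rng = random.Random(seed)  # fixed seed so every run sees the same subset
    subset = rng.sample(smiles, min(n_sample, len(smiles)))
    report: dict[str, dict[str, float]] = {}
    for name, fragment in algorithms.items():
        start = time.perf_counter()
        sizes = [size_of(frag) for s in subset for frag in fragment(s)]
        elapsed = time.perf_counter() - start
        report[name] = {
            "n_fragments": len(sizes),
            "mean_size": statistics.fmean(sizes) if sizes else 0.0,
            "max_size": max(sizes, default=0),
            "seconds": elapsed,
        }
    return report


# Toy stand-ins for real fragmentation algorithms.
toy_algorithms: dict[str, Fragmenter] = {
    "whole": lambda s: [s],
    "halves": lambda s: [s[: len(s) // 2], s[len(s) // 2 :]],
}

report = compare_fragmenters(["CCCC", "CCCCCC"], toy_algorithms)
for name, stats in report.items():
    print(name, stats["n_fragments"], stats["mean_size"], stats["max_size"])
```

Keeping the algorithms behind a common callable signature means a custom fragmenter, if we end up writing one, can be dropped into the same comparison unchanged.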