2021-07-07 Wagner/Dalke Check in

Participants

  • @Jeffrey Wagner

  • @Andrew Dalke (Deactivated)

Goals

Discussion topics

Item

Notes

Wrapping up in-progress work

  • Coverage tool

    • JW – Would be willing to merge this today

    • AD – Based on my new understanding of OFFTK (that from_file, from_file_obj, and from_object do different things), the previous results may not be what we thought they were. So I worked on this some more, but there’s more work to be done. Also, I’m finding more interesting cases of failed comparisons, which I could write up.

    • (General) – AD will work on polishing this with his remaining time, and documenting how JW can re-run the toolkit molecule comparison to start debugging the different categories of failures.

  • Coverage-based feature reduction --> Ready for review?

    • AD – This is ready for review+merge.

    • JW – I’ll take over this branch + PR, and will review + merge it when ready.

    • AD – This also doesn’t process molecules until they’re needed for a test, and handles datasets as dict-like objects containing the raw bytes of each record, instead of loading+processing all of the molecules at once.

  • Safely handling multicomponent, or otherwise unusual SMILES

    • Not necessary to do everything – We’d be happy to just get multicomponent filtering.

    • AD will not work on this, and will instead work on wrapping up the coverage tool + providing instructions to rerun his analysis.

  • Stereo issues (#1011) discussion

  • Multicomponent molecule input

    • (General) – AD will open an issue describing the approach for this.
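
As a rough illustration of the multicomponent filtering mentioned above (not the approach AD will describe in his issue), the sketch below treats a "." in the SMILES token as a component separator. The function names are hypothetical, and the check is only a heuristic: it may also flag unusual but connected SMILES in which a ring closure spans the dot.

```python
def is_multicomponent(smiles: str) -> bool:
    """Heuristic check: a '.' in the SMILES token separates components."""
    # Look only at the first whitespace-delimited token, so that dots in a
    # trailing identifier are not mistaken for component separators.
    tokens = smiles.split(maxsplit=1)
    token = tokens[0] if tokens else ""
    return "." in token


def filter_single_component(smiles_list):
    """Keep only the entries whose SMILES token has no '.' separator."""
    return [s for s in smiles_list if not is_multicomponent(s)]


examples = ["CCO", "[Na+].[Cl-]", "c1ccccc1"]
kept = filter_single_component(examples)
```

Here `kept` retains the ethanol and benzene entries and drops the sodium chloride entry.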

Safer processing of SMILES files?

  • AD – from_smiles and from_file (given a SMILES file) are doing different things. The best solution would be to parse the SMILES file yourself and then feed each entry into from_smiles.

    • (General) – AD will write up an Issue describing the approach he’d take to this.

  • AD – Two styles: OE and RDK

    • Variations:

      • Whether there’s a header (then we know what the columns will be)

      • If there’s no header - Each line is SMILES[whitespace]identifier

        • Some complication around whether the identifier can have a space in it

        • In CXSMILES the first whitespace will be a tab, and after that the info may be split by spaces
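
To make the "parse the SMILES file yourself" idea concrete, here is a minimal sketch for the headerless case only. It assumes each line is SMILES[whitespace]identifier and resolves the space-in-identifier question by treating everything after the first run of whitespace as the identifier. The function names are hypothetical; header rows and CXSMILES are not handled.

```python
import io


def parse_smiles_line(line: str):
    """Split one headerless line into (smiles, identifier), or None if blank."""
    stripped = line.strip()
    if not stripped:
        return None  # skip blank lines
    # Split once at the first run of whitespace (spaces or tabs); the rest of
    # the line, including any internal spaces, is treated as the identifier.
    parts = stripped.split(maxsplit=1)
    smiles = parts[0]
    identifier = parts[1] if len(parts) > 1 else ""
    return smiles, identifier


def iter_smiles_records(file_obj):
    """Yield (smiles, identifier) pairs from an open text file object."""
    for line in file_obj:
        record = parse_smiles_line(line)
        if record is not None:
            yield record


sample = io.StringIO("CCO ethanol\nc1ccccc1\tbenzene ring\n\nC\n")
records = list(iter_smiles_records(sample))
# Each SMILES string in `records` could then be passed to from_smiles,
# as suggested in the notes above.
```

Keeping the line splitting separate from molecule construction means the whitespace rules can be changed (for example, to split on tabs only) without touching the code that builds molecules.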

Action items

Decisions
