03-05-2020: Data set clustering & data set selection

Participants: @David Mobley @Hyesu Jang @Lee-Ping Wang @Christopher Bayly @Daniel Smith @Jessica Maat

 

Discussion:

DLM: Quantify diversity using graph similarity, then eventually incorporate WBO (Wiberg bond order).

CB: LINGOs, a graph-based similarity computed from the SMILES string (Citation to LINGOs method: Link)
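
As a rough illustration of the LINGO idea mentioned here, the sketch below splits a SMILES string into overlapping 4-character substrings and compares two molecules by a multiset Tanimoto over the substring counts. This is a simplification and an assumption on our part: the published method applies its own SMILES preprocessing and similarity formula, which are not reproduced here. The function names `lingos` and `lingo_similarity` are hypothetical.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count the overlapping length-q substrings of a SMILES string."""
    # A string shorter than q yields an empty Counter.
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_similarity(smiles_a: str, smiles_b: str, q: int = 4) -> float:
    """Multiset Tanimoto over substring counts (a simplified stand-in)."""
    a, b = lingos(smiles_a, q), lingos(smiles_b, q)
    shared = sum((a & b).values())  # per-substring minimum counts
    total = sum((a | b).values())   # per-substring maximum counts
    return shared / total if total else 0.0


if __name__ == "__main__":
    # Toy comparison of two short SMILES strings.
    print(lingo_similarity("CCCCO", "CCCCN"))
```

Because the comparison works directly on strings, it depends on how the SMILES were written, so in practice the inputs would need to be canonicalized consistently first.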

DLM: Why should we use LINGOs versus graph fingerprint similarity?

CB: We used SMILES for everything, so it might be simpler. It links the data representation directly to the clustering. Graph similarity is better for 2D similarity.

DLM: Then let’s proceed with graph fingerprint similarity.
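
A minimal sketch of how fingerprint similarity could be turned into a diversity measure, assuming each molecule's graph fingerprint is already available as a set of on-bit indices (how the fingerprints are generated is outside this sketch). The helper names and the toy fingerprints are hypothetical. It computes the Tanimoto coefficient for a pair, then averages it over all pairs within one data set and over all cross pairs between two data sets.

```python
from itertools import combinations, product
from typing import AbstractSet, Sequence


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as on-bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def mean_similarity_within(fps: Sequence[AbstractSet[int]]) -> float:
    """Average Tanimoto over all unordered pairs in one data set."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def mean_similarity_between(
    fps_a: Sequence[AbstractSet[int]], fps_b: Sequence[AbstractSet[int]]
) -> float:
    """Average Tanimoto over all cross pairs between two data sets."""
    pairs = list(product(fps_a, fps_b))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy fingerprints, invented for illustration only.
    set_a = [frozenset({1, 2, 3}), frozenset({1, 2, 4})]
    set_b = [frozenset({7, 8})]
    print(mean_similarity_within(set_a), mean_similarity_between(set_a, set_b))
```

A low within-set average suggests a diverse set, and a low between-set average suggests the two sets cover different chemistry. Both loops are quadratic in the number of molecules, which may matter for large collections.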

 

 

DLM: We have Bayer's patented collection, and it has higher similarity within its own data set compared to the other data sets.

DS: We are running 1000 torsion drives for Silicon Therapeutics in QCA. It could be a good data set to use for the upcoming fitting, although we might want to consider running Fragmenter on it.

DLM: We should consider updating the training data set for future releases.