AshSage updates
LW will upload slides here
JR (slide 9) – Can you put some bromoalkanes in the training set to fix issues?
JR (slide 10) – What are the green series above the line? And it would be helpful to show r².
JW (slide 12) – Could the nonzero values in the bottom two rows have been avoided?
JC (slide 19) – Is there a difference between ash-nagl charges for alcohols compared to the previous partial charges (not that the partial charges would be worse, but less “enmeshed” with the non-bonded parameters)? I also wonder whether differentiation between primary/secondary alcohols is sufficiently baked into NAGL models.
LW – I haven’t looked at charges for these mols specifically, but ash-nagl charges have performed well for alcohols in general; things like sulfurs might make them worse instead. But I’d fully expect NAGL to learn the difference between primary/secondary/tertiary alcohols.
JR (slide 22) – Plotting against RMSE: what is the actual value of the density (to understand the magnitude of the error)?
JW (slide 25) – These seem to strongly imply that we should pull more of the validation set into the training set.
JW (slide 28) – Seems like there were some really large differences between training and testing, often going in the wrong direction altogether. Do you think that indicates a larger training set is in order?
DM – The issue when looking at really generic functional groups like alkanes is that whatever is attached to them will dictate the amount of error, so it’s not that the alkanes themselves are off.
LW – Like DM said,
DM – In doing a similar analysis, whether a compound was an alkane or aromatic was incidental, since most of the error was contributed by other components/functional groups. So it may be useful to do an analysis to see whether ADDING a functional group ADDS error.
JC – In Genentech work, we found that train-test set contamination can be really deceptive, and you can come to a completely different conclusion from the same data depending on how it’s split.
LW – Since our vdW terms are so general, we’d hope that overtraining would be easier to keep at bay. So I still think it is worth doing things like ensuring we have a mixture of alkyl bromides and alcohols in each set.
JR – NAGL charging only looks at one molecule, so it’s not aware of the other components of the mixture.
PB (chat) – Regarding David's idea of plotting overrepresented parameters: for Sage we were looking at the ratio of (fraction of mols a parameter is matched to in the subset with large errors, say ddE > 5 kcal/mol or any other metric to pick the subset) to (fraction of mols the same parameter is matched to in the whole set). So any parameter with a value greater than 1 is overrepresented in the large-discrepancy set, and the higher the value, the more the parameter is overrepresented.
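The overrepresentation ratio PB describes could be computed roughly as follows. This is only a minimal sketch, not the script used for Sage: the function name, the input dictionaries, and the molecule and parameter labels (`mol_a`, `p1`, ...) are hypothetical, and it assumes the parameter-to-molecule matches and per-molecule errors have already been collected elsewhere.

```python
from collections import Counter


def overrepresentation_ratios(matches, errors, threshold=5.0):
    """Ratio of a parameter's match fraction in the large-error subset
    to its match fraction in the whole set.

    matches: dict of molecule id -> set of parameter ids matched to it (hypothetical input)
    errors:  dict of molecule id -> error metric, e.g. ddE in kcal/mol (hypothetical input)
    """
    # Subset of molecules whose error magnitude exceeds the chosen cutoff
    high = [mol for mol in matches if abs(errors[mol]) > threshold]
    if not high:
        return {}

    # Count how many molecules each parameter is matched to, in each set
    all_counts = Counter(p for params in matches.values() for p in set(params))
    high_counts = Counter(p for mol in high for p in set(matches[mol]))

    n_all, n_high = len(matches), len(high)
    return {
        p: (high_counts[p] / n_high) / (all_counts[p] / n_all)
        for p in all_counts
    }


# Toy data, for illustration only
matches = {
    "mol_a": {"p1", "p2"},
    "mol_b": {"p1"},
    "mol_c": {"p2", "p3"},
    "mol_d": {"p1"},
}
errors = {"mol_a": 6.0, "mol_b": 1.0, "mol_c": 7.5, "mol_d": 0.5}

ratios = overrepresentation_ratios(matches, errors, threshold=5.0)
for param, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{param}: {ratio:.2f}")
```

In the toy data, `p2` and `p3` come out above 1 (overrepresented among the large-error molecules), while `p1` comes out below 1. The cutoff and the error metric are interchangeable, as PB notes.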
JR (slide 50) – More mols here than in enthalpies of mixing. Not surprising, since phys prop data is harder to come by. I wonder whether training on this high a number of compounds is better.