2020-02-21 ThermoML API Meeting notes

Date

Feb 21, 2020

Participants

  • @Jeffrey Wagner

  • Demian Riccardi (NIST)

  • Chris Muzny (NIST)

  • @Simon Boothroyd

  • @Owen Madin

  • @Daniel Smith (Deactivated)

  • @Michael Shirts

  • @John Chodera

Goals

  • Help ThermoML design an API and standards that enable us and others to use their data more effectively

Discussion topics

Time

Item

Presenter

Notes

DR – Shows slides on the legacy of NIST TRC. ThermoData Engine drives data discovery and formatting. The main infrastructure is an Oracle database. ThermoML is expensive to collect and maintain, so it was “hard to get”: XML format, uncertainties not released. The tarball is available and free, but hard to access. Working on making the data available via JSON-LD, which is embedded in webpages and searchable through datasetsearch.research.google.com. Looking at adding an API for searching using Cordra, which offers a REST API and user authentication. Currently defining a JSON schema for datasets.

Each object has citation, compounds, and dataset.
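As a rough illustration of that three-part structure, here is a minimal Python sketch. All field names (`doi`, `inchi`, `property_name`, and so on) and the sample values are hypothetical placeholders chosen for this example; they are not the schema NIST is defining.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Citation:
    doi: str    # placeholder value below, not a real DOI
    title: str


@dataclass
class Compound:
    name: str
    inchi: str


@dataclass
class DataPoint:
    property_name: str
    value: float
    units: str
    temperature_k: float
    pressure_kpa: float


@dataclass
class ThermoRecord:
    """One object: a citation, the compounds involved, and the data points."""
    citation: Citation
    compounds: List[Compound] = field(default_factory=list)
    dataset: List[DataPoint] = field(default_factory=list)


# Illustrative values only.
record = ThermoRecord(
    citation=Citation(doi="10.0000/placeholder", title="Placeholder title"),
    compounds=[Compound(name="water", inchi="InChI=1S/H2O/h1H2")],
    dataset=[DataPoint("density", 997.0, "kg/m^3", 298.15, 101.325)],
)

# asdict() recurses into nested dataclasses, so the result serializes directly.
record_json = json.dumps(asdict(record), indent=2)
```

Keeping the citation inside every object, rather than in a separate lookup table, means it travels with the data when records are split up or filtered.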

Looking to have an API prototype running by June.

DGS – Cordra seems cool, never used it before, but it seems handy.

JC – Is there a data volume problem? Is there a query/download limit?

CM – It will be access token validated. Anyone can get a token, but we need contact information. We will have some throttling up front to protect against naive users.

DR – Cordra has built-in user authentication, so we can set different policies for different users.

CM – What kinds of searches would you want to do? Would you want a new zip every quarter? Plan is to put up a new complete archive and a diff each quarter. API will expose the most up-to-date version each quarter.

SB – We would prefer to use the REST API. Currently we download the tarball, and it’s a pain to search using Python.

CM – What kinds of searches do you do?

SB – Looking for properties in certain ranges, e.g. densities at reasonable pressures and temperatures. Also the same compound having several measurements.
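The two searches SB describes could be sketched in Python as below. This runs against an in-memory list instead of any real API. The record keys, the helper names `in_range` and `with_repeats`, and the numeric values are all assumptions made for illustration.

```python
from collections import defaultdict

# Illustrative records only; keys and values are made up for this sketch.
records = [
    {"inchi": "InChI=1S/H2O/h1H2", "property": "density",
     "value": 997.0, "temperature_k": 298.15, "pressure_kpa": 101.325},
    {"inchi": "InChI=1S/H2O/h1H2", "property": "density",
     "value": 992.2, "temperature_k": 313.15, "pressure_kpa": 101.325},
    {"inchi": "InChI=1S/CH4O/c1-2/h2H,1H3", "property": "density",
     "value": 786.6, "temperature_k": 298.15, "pressure_kpa": 101.325},
    {"inchi": "InChI=1S/CH4O/c1-2/h2H,1H3", "property": "density",
     "value": 700.0, "temperature_k": 400.0, "pressure_kpa": 5000.0},
]


def in_range(records, prop, t_range, p_range):
    """Keep records of one property inside inclusive T and P windows."""
    t_lo, t_hi = t_range
    p_lo, p_hi = p_range
    return [
        r for r in records
        if r["property"] == prop
        and t_lo <= r["temperature_k"] <= t_hi
        and p_lo <= r["pressure_kpa"] <= p_hi
    ]


def with_repeats(records, minimum=2):
    """Group records by compound and keep compounds with enough measurements."""
    groups = defaultdict(list)
    for r in records:
        groups[r["inchi"]].append(r)
    return {inchi: rs for inchi, rs in groups.items() if len(rs) >= minimum}


# Densities near ambient conditions, then compounds measured more than once.
ambient = in_range(records, "density", (288.15, 318.15), (90.0, 110.0))
repeated = with_repeats(ambient)
```

With a REST API, the same window bounds would presumably become query parameters so that the filtering happens on the server.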

JC – It’s frequently useful to do population analysis on the results we COULD pull. So, getting a result count without downloading whole records.
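A count-only summary of the kind JC mentions might look like the following Python sketch. It works on a small in-memory list to show the shape of the answer; a real service would presumably compute this on the server. The record keys and the helpers `count_by` and `count_matching` are hypothetical.

```python
from collections import Counter

# Illustrative records only; a real query would not download these at all.
records = [
    {"property": "density", "temperature_k": 298.15},
    {"property": "density", "temperature_k": 313.15},
    {"property": "enthalpy of mixing", "temperature_k": 298.15},
    {"property": "density", "temperature_k": 400.0},
]


def count_by(records, key):
    """Return how many records exist for each distinct value of `key`."""
    return Counter(r[key] for r in records)


def count_matching(records, predicate):
    """Return only the number of records satisfying `predicate`."""
    return sum(1 for r in records if predicate(r))


per_property = count_by(records, "property")
near_ambient = count_matching(
    records, lambda r: 288.15 <= r["temperature_k"] <= 318.15
)
```

Returning only the counts lets a user judge whether a candidate selection is large enough to be worth pulling before requesting the full records.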

CM – Re: making this data public – We want to make sure that users cite the ORIGINAL data, which is why the citation tag goes everywhere. Lifecycle hooks are a method to keep citations attached, even as people strip and segment data.

JC – Strongly agree about the importance of citations. Manubot lets us throw in a table of DOIs and we can have a dynamically generated references table.

CM – Currently we’re making about 30% of our data publicly available – Only data from journals, which have citation records.

CM – What about chemical descriptors? Right now we have InChI.

SB – InChI sometimes doesn’t have explicit stereochemistry, and we either assume racemic or discard. What can we expect on that front?

CM – When possible, we include InChI that has specific stereochemistry. We leave it undefined when we’re unable to determine the stereochemistry.

JW – Protonation state?

CM – We basically don’t know the protonation state any better than the experimentalists, so we don’t provide it.

CM – Looking at having testing access available in April. We have to go through a complex approval process for posting externally, which may push things back a bit.

JC – We need to talk to advisory board about convincing NIH to support this sort of effort.

CM – NIST received some AI funding this year, and we may be able to pitch the ThermoML archive as a resource for the AI community to get some of this funding. OpenFF is a sort of machine learning, and we would benefit by showing that you’re using this.

JC – Absolutely. We can provide evidence of the utility of this data in whatever format you like.

JW – Our ability to cite/republish this data?

CM – Each tarball will be available for a long time (“forever” as long as NIST is around). Each one will be citable with a DOI.

JC – License?

CM – I’m speaking with lawyers about which license to use. We want to make it fully open, but I’m basically going to be told which license we’ll use.

MS – What other ways can we show support for ThermoML? Letters of support? Direct funding?

CM – We should talk about transferring money separately. But internal and external publicity is really good for us, so I can draft a writeup for an internal publication here and send it to you for review. You should imagine that this might go on the front page of NIST and/or in C&EN.

MS – Yes, I’ll be the point of contact for this.

MS – You can ping us for testers as well, or just come to U Colorado Boulder.

Action items

Coordinate with Demian Riccardi (NIST) on ThermoML API feedback and testing - @Simon Boothroyd, @Owen Madin
Coordinate with Chris Muzny (NIST) on publicity for OpenFF use of ThermoML - @Michael Shirts
Follow up with Chris Muzny (NIST) on OpenFF funding for ThermoML - @Michael Shirts

Decisions