...
Participants
Demian Riccardi (NIST)
Chris Muzny (NIST)
Goals
Help ThermoML design API/standards that enable us and others to use their data more effectively
Discussion topics
Time | Item | Presenter | Notes |
---|---|---|---|
DR – Shows slides on the legacy of NIST TRC. ThermoData Engine drives data discovery and formatting; the core is an Oracle database. ThermoML data is expensive to collect and maintain, so it was “hard to get”: XML format, uncertainties hard to get. The tarball is available and free, but hard to access. Working on making it available via JSON-LD, which is embedded in webpages and searchable through datasetsearch.research.google.com. Looking at adding an API for searching using Cordra, which offers a REST API and user authentication. Currently defining a JSON schema for datasets.
Each object has citation, compounds, and dataset.
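A rough sketch of what a consumer-side check on such an object might look like. Only the three top-level keys (citation, compounds, dataset) come from the discussion; every nested field name and the placeholder DOI are assumptions, since the JSON schema is still being defined.

```python
# Only these three top-level keys come from the meeting notes.
REQUIRED_KEYS = ("citation", "compounds", "dataset")


def missing_keys(record: dict) -> list[str]:
    """Return the required top-level keys that are absent from a record."""
    return [key for key in REQUIRED_KEYS if key not in record]


# Hypothetical record: nested field names and the DOI are placeholders,
# not the actual ThermoML JSON schema.
example_record = {
    "citation": {"doi": "10.0000/placeholder"},
    "compounds": [{"name": "water", "inchi": "InChI=1S/H2O/h1H2"}],
    "dataset": {"property": "density", "values": []},
}

if __name__ == "__main__":
    print(missing_keys(example_record))
```

Once a real JSON schema is published, a schema validator would be a better fit than this hand-rolled key check.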
Looking to have an API prototype flying by June.
DGS – Cordra seems cool, never used it before, but it seems handy.
JC – Is there a data volume problem? Is there a query/download limit?
CM – It will be access token validated. Anyone can get a token, but we need contact information. We will have some throttling up front to protect against naive users.
DR – Cordra has built-in user authentication, so we can set different policies for different users.
CM – What kinds of searches would you want to do? Would you want a new zip every quarter? Plan is to put up a new complete archive and a diff each quarter. API will expose the most up-to-date version each quarter.
SB – We would prefer to use the REST API. Currently we download the tarball, and it’s a pain to search using Python.
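For context, a minimal sketch of the kind of tarball scan being described, using only the Python standard library. The element name passed in (e.g. "Compound") is an assumption about the ThermoML XML layout, and `count_elements` is a hypothetical helper, not part of any existing tool.

```python
import tarfile
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}Compound' to 'Compound'."""
    return tag.rsplit("}", 1)[-1]


def count_elements(archive: tarfile.TarFile, element_name: str) -> dict[str, int]:
    """Count elements with the given local name in every .xml member.

    Each file is parsed in full, which is the slow part of searching
    the archive this way.
    """
    counts = {}
    for member in archive:
        if not (member.isfile() and member.name.endswith(".xml")):
            continue
        with archive.extractfile(member) as handle:
            root = ET.parse(handle).getroot()
        counts[member.name] = sum(
            1 for element in root.iter() if local_name(element.tag) == element_name
        )
    return counts
```

Usage would be along the lines of `with tarfile.open(path, "r:*") as tf: count_elements(tf, "Compound")`, where `path` points at a locally downloaded archive.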
CM – What kinds of searches do you do?
SB – Looking for properties in certain ranges, e.g. densities around reasonable pressures and temperatures.
JC – It’s frequently useful to do population analysis on the results we COULD pull. So, getting results count without getting whole records.
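A sketch of the two query patterns raised here (range filtering, and counting matches without pulling full records), run against in-memory data. The `Measurement` fields, the numeric values, and the placeholder DOIs are all illustrative assumptions; the real API and record layout were not yet defined at the time of the meeting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical flattened record; field names are assumptions."""
    prop: str
    value: float
    temperature_k: float
    pressure_kpa: float
    doi: str


def in_range(m: Measurement, prop: str, t_range: tuple, p_range: tuple) -> bool:
    """True if the record is for `prop` and lies inside both closed ranges."""
    return (
        m.prop == prop
        and t_range[0] <= m.temperature_k <= t_range[1]
        and p_range[0] <= m.pressure_kpa <= p_range[1]
    )


def count_matches(records, prop: str, t_range: tuple, p_range: tuple) -> int:
    """Population count only: no matching records are collected or returned."""
    return sum(1 for m in records if in_range(m, prop, t_range, p_range))


# Illustrative values only, not taken from the ThermoML archive.
records = [
    Measurement("density", 997.0, 298.15, 101.325, "10.0000/placeholder-a"),
    Measurement("density", 958.0, 373.15, 101.325, "10.0000/placeholder-b"),
    Measurement("density", 1018.0, 298.15, 50000.0, "10.0000/placeholder-c"),
    Measurement("viscosity", 0.89, 298.15, 101.325, "10.0000/placeholder-d"),
]

n_ambient = count_matches(records, "density", (273.15, 323.15), (90.0, 110.0))
```

Each record carries its DOI so the citation stays attached even when only a filtered subset is kept.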
CM – Re: making this data public – We want to make sure that users cite the ORIGINAL data, which is why the citation tag goes everywhere. Lifecycle hooks are a method to keep citations attached, even as people strip and segment data.
JC – Strongly agree about the importance of citations. Manubot lets us throw in a table of DOIs and we can have a dynamically generated references table.
CM – Currently we’re making about 30% of our data publicly available – Only things from journals, which have citation records.
CM – What about chemical descriptors? Right now we have InChI.
SB – InChI sometimes doesn’t have explicit stereochemistry, and we either assume racemic or discard. What can we expect on that front?
CM – When possible, we include InChI that has specific stereochemistry. When the undefined stereochemistry
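A rough sketch of how the "explicit stereochemistry or not" triage could be done on the InChI string alone, without a cheminformatics toolkit. It assumes the usual InChI layer prefixes (`/b` for double-bond and `/t` for tetrahedral stereo) and that undefined centers appear as `?`; it cannot tell an achiral molecule from one whose stereo was simply never specified, and it ignores repeated layers such as those in isotopic sections.

```python
def stereo_layers(inchi: str) -> dict[str, str]:
    """Collect the /b, /t, /m, /s layers of an InChI string, keyed by prefix."""
    layers = inchi.split("/")[2:]  # skip the version prefix and the formula
    return {
        layer[0]: layer[1:]
        for layer in layers
        if layer and layer[0] in "btms"
    }


def classify_stereo(inchi: str) -> str:
    """Coarse label based only on which stereo layers are present."""
    layers = stereo_layers(inchi)
    if "t" not in layers and "b" not in layers:
        # Either no stereocenters, or none were specified.
        return "no stereo layer"
    if "?" in layers.get("t", "") or "?" in layers.get("b", ""):
        return "partially undefined"
    return "fully specified"


# L-alanine (with stereo) versus alanine with no stereo layer.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALANINE_NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
```

Records labelled "no stereo layer" are the ones where the racemic-or-discard decision mentioned above would still have to be made.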
JW – Protonation state?
CM – We basically don’t know the protonation state any better than the experimentalists, so we don’t provide it.
CM – Looking at having testing access available in April. We have to go through a complex approval process for posting externally, which may push things back a bit.
JC – We need to talk to the advisory board about pushing them to convince NIH to support these sorts of efforts.
CM – NIST received some AI funding this year, and we may be able to pitch the ThermoML archive as a resource for the AI community to get some of this funding. OpenFF is a sort of machine learning, and we would benefit by showing that you’re using this.
JC – Absolutely. We can provide evidence of the utility of this data in whatever format you like.
JW – Our ability to cite/republish this data?
CM – Each tarball will be available for a long time (“forever” as long as NIST is around). Each one will be citable with a DOI.
JC – License?
CM – I’m speaking with lawyers about which license to use. We want to make it fully open, but I’m basically going to be told which license we’ll use.
MS – What other ways can we show support for ThermoML? Letters of support? Direct money?
CM – We should talk about transferring money separately. But internal and external publicity is really good for us, so I can draft a writeup for an internal publication here and send it to you for review. You should imagine that this might go on the front page of NIST and/or in C&EN.
MS – Yes, I’ll be the point of contact for this.
MS – You can ping us for testers as well, or just come to U Colorado Boulder.