
A lot of these MuPT needs overlap with tool requirements for OpenFF. Proteins/nucleic acids/glycoproteins are a subset of polymers . . . What are those overlaps/opportunities for shared tooling? DISCUSS! 

Afternoon

Meghan Osato: Partial Charge Variability

  • TG: When you report max partial charge diff, is that on a single atom?

    • MO: Right, the value reported for the molecule is the highest value of any atom in the molecule

  • JC: What’s the history behind this problem?

    • MO: We found several papers in the literature that documented similar problems, e.g. on sugars

    • DM: Can you explain how ELF works?

    • MO: (explains ELF) essentially it looks for conformers where the molecule is most spread out

    • DM: CB had found that AM1-BCC doesn’t generally have much conformational dependence, but it can when there are strong internal electrostatic interactions.

  • CC: Why did NAGL have any variation at all in free energies?

    • MO: That’s the baseline that comes from other sources

  • TG: Were you aware whether AT was using a threaded or single core library?

    • MO: Not sure

    • JW: The toolkit doesn’t change AT settings from default

    • TG: You could set #threads to 1

  • DM: The big picture is that these are more reasons to use NAGL.
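
The per-molecule number discussed above (the largest per-atom spread in partial charge across conformer-dependent charge sets) can be sketched in plain Python. This is a minimal illustration of the metric as described in the Q&A, not the actual analysis code; the array shapes and example values are invented.

```python
# Sketch of the reported metric: for each atom, take the spread (max - min)
# of its partial charge across per-conformer charge assignments, then report
# the largest per-atom spread for the molecule.

def max_partial_charge_diff(charge_sets):
    """charge_sets: list of per-conformer charge lists, each of length n_atoms."""
    n_atoms = len(charge_sets[0])
    per_atom_spread = [
        max(cs[i] for cs in charge_sets) - min(cs[i] for cs in charge_sets)
        for i in range(n_atoms)
    ]
    return max(per_atom_spread)

# Example: three conformers of a four-atom molecule (made-up charges).
charges = [
    [-0.10, 0.05, 0.03, 0.02],
    [-0.14, 0.06, 0.04, 0.04],
    [-0.11, 0.05, 0.02, 0.04],
]
print(round(max_partial_charge_diff(charges), 6))  # 0.04 (driven by atom 0)
```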

Jen Clark: Dataset Archival

  • JE: Each entry in the list corresponds to a single molecule?

    • JC: Correct

  • Proposal: Just store the SQLite file on Zenodo

  • (room consensus) That sounds great

  • CI: If QCA makes changes to the spec, would you be able to import those directly, or would the data have to be migrated? Also, is the idea to back it up or to make it easy for others to use? It’s trivial to dump and store sqlite files, but you could have another file format that’s easier for people to use.

    • JW: we don’t have a lot of clarity today on which format would be most useful to us. But inevitably we would have to convert sqlite → qcfractal → endpoint.

    • DM: our most obvious use case is that when we make a release, it relies on data in QCArchive that may not always be present and may get retired. We need some place to put it.

    • LW:

  • CI: with the QCArchive records, these can more or less be dumped to a dictionary, instead of relying on internal schemas (which are just pydantic models). Maybe dump these into dictionaries; that could be driven by the code.

    • JCl: the dict representation is what I was referring to with a JSON file. The sqlite contains this but with redundant info removed. There’s definitely a higher barrier to access this, but you don’t need anything from molssi to use this file, just an understanding of how databases work.

    • JW: Given that Zenodo is cheap/free, could we add both representations?

    • JCl: there’s still an issue of structure. Converting it to a JSON file meant I had to invent my own schema.

    • JE: to understand the best way to use this we also need an understanding of the use case.

  • TG: I really like this. This is also language-agnostic, I could write my own queries with e.g. C, would be much faster. Will the sqlite database have the same representation as the postgres server that molssi actually uses or is this a simplified version?

    • CI: I believe these are the internal representations – the QCFractal representations – so you would need QCFractal to understand them.

    • JC: No, you can query it independently.

    • JW: can you query the molecule info without QCFractal or do you get bytes?

    • TG: you get arrays. The schemas are typed.

  • JCl ~2.47pm PT: awesome live demo

    • TG: this looks much more friendly than the internal postgres
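
To illustrate TG's point that the file can be queried independently of QCFractal: Python's stdlib sqlite3 (or any sqlite client, in any language) is enough. The table and column names below are placeholders invented for the sketch; the real QCFractal/QCArchive schema will differ, and listing sqlite_master is the way to discover it.

```python
import sqlite3

# Build a tiny stand-in database in memory; with a downloaded dataset you
# would pass its file path to sqlite3.connect instead. The "records" table
# and its columns are invented for illustration -- the real schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(1, "complete"), (2, "error"), (3, "complete")])

# Step 1 with any unfamiliar sqlite file: list its tables.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['records']

# Plain SQL, no QCFractal required -- any language with a sqlite driver works.
complete_ids = [rid for (rid,) in conn.execute(
    "SELECT id FROM records WHERE status = 'complete'")]
print(complete_ids)  # [1, 3]
conn.close()
```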

  • JW: The compelling case against sqlite and JSON would be that there’s no agnostic way to export to JSON… is that correct?

    • JCl: yes. MolSSI have schemas for individual records and objects, but how you piece those together with the metadata is not agnostic.

  • JE: is it correct that the problem we’re trying to solve is that data will not always be on QCArchive, i.e. may disappear?

    • JW: yes, that’s one of the major problems about this

    • LW: also, not having to install an entire software stack to use some data

    • DM: options seem to be either download the software stack, or pull the data ourselves, or we put effort into fixing both and there’s no clear current use case

    • JM: wondering about the BLOB types, are we sure we can decode the BLOBs without software?

    • JN: that’s possible.

    • TG: that looks like a binary blob.

    • JW: if it is a blob and you need QCFractal to decode it, does that impact our conclusion?

    • (General): is there a spec that would define how to decode the blobs?

      • JW: unlikely, this is a very recent feature

    • JM: one solution is to store a Docker image

    • MG: or a reader

    • JW: a Docker image sounds like a good idea. There’s probably a whole discussion to be had about the best format, e.g. about the image. I propose we short-circuit that by just doing a Docker image and waiting for someone to complain

    • MT: this reminds me of discussion of MD simulation reproducibility; it could go on forever.

    • DM: we should do the simplest thing. If someone comes along later with a convincing argument, we can do that instead.

  • JM: …

  • JC: if they’re binary blobs, instead of doing a Docker image we could replace them with JSON strings

  • (General): ask Ben about blobs
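
Pending the answer from Ben, one quick triage step for the BLOB question is to probe a blob against a few common encodings. The candidate decoders below (raw UTF-8 JSON, zlib-compressed JSON) are guesses for illustration, not confirmed QCFractal behaviour:

```python
import json
import zlib

def sniff_blob(blob: bytes):
    """Try a few common encodings a stored blob might use.
    The candidates here are guesses, not a documented QCFractal format."""
    # Raw UTF-8 JSON?
    try:
        return "json", json.loads(blob.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        pass
    # zlib-compressed JSON? (zlib streams usually start with byte 0x78)
    try:
        return "zlib+json", json.loads(zlib.decompress(blob))
    except (zlib.error, UnicodeDecodeError, json.JSONDecodeError):
        pass
    return "unknown", None

# Demo on a synthetic blob:
blob = zlib.compress(json.dumps({"geometry": [0.0, 0.0, 0.0]}).encode())
print(sniff_blob(blob))  # ('zlib+json', {'geometry': [0.0, 0.0, 0.0]})
```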

  • CI: if the main goal is just that data isn’t stranded on QCA, publishing the SQL on Zenodo seems to solve that, even if you need to install QCFractal. Other discussion of formats could be more of a user problem; adding examples would be helpful, but you can’t solve everything for everyone.

  • JW: agree, this is a good option.

  • (General): agree the below sounds good:

    • SQLite + Docker image

  • TG: I’ve been using the DES dimer dataset and that’s a big CSV file. They provide a helper Python script to get xyz files.

  • DM: sounds like it requires human time to understand the dataset. Using the existing format like QCSchema means the work is already done.

  • TG: CSVs are immediately readable

  • JW: CSVs sound good to me. If we were willing to put in the work, I think CSV would be the way to go. But sqlite is free

  • LW: Agree, pros/cons of CSV sound the same as JSON to me.

  • (General): anyone else released an sqlite database? how did they do it?

    • (General): not that we know of immediately. It’s only recently people put effort into releasing data at all.

    • CI: I’ve seen a lot: XYZ files, XYZ + CSV. Even with CSV files, which are much easier to work with, once you add many other non-standard files alongside the CSV it gets annoying. Most datasets that aren’t awful to work with have been hdf5. There’ve been some database formats, but they’re all different. I worked with lib-something that was meant to be lightweight and let you go through entry by entry. In general, one unified single format with standardised keys is better than a random xyz or text file. SQL is totally fine; it should be fine to provide examples of how to extract data.

    • DM: https://xkcd.com/927/

    • CI: for a lot of things, it’s just that there’s no good examples currently on how to present data in a way that’s easy for people. People know how to use XYZ files so they use XYZ.

  • JN: A few things. From my perspective, sqlite is a pretty common format for databases to be distributed in. Right now the sqlite docs are under the caching section of the QCArchive docs, so that’s likely the intended use and the reason for the QCA objects vs. a general representation.

  • JN: Also, CSVs are maybe too simple to represent all the complex data available in the QCArchive with all the properties, etc.

  • PB: will this happen for all datasets on QCArchive?

    • JCl: just what we use for released force fields. Currently the JSON representation we publish requires pulling records from the server to recapitulate the dataset.

    • TG: is that a QCSubmit or QCArchive collection?

    • JCl: QCArchive collection.

    • JW: we did have a conversation with BP about dataset lifecycles where datasets would start on hot storage and maybe eventually move off the server and onto permanent end-of-life storage.

    • PB: why not do it for all the datasets we submitted

    • (General): discussion about releasing FB targets

  • JCl: my action items are:

    • Docker image

    • Talk to Ben about interpretability of BLOBs

    • Follow up with NIST contact on storing these as SQLite