BP – Took a break from database migration to focus on some new features. For example, some operations take a long time (like submitting a dataset), and I am working on a way to run these asynchronously (so you don’t have to maintain a connection to the server). This arose from my experience trying to delete a dataset from the experimental server. The issue was that, since the operation took so long, you’d sit there waiting for the server to respond while it processed the request, and the connection could drop while you waited.
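A minimal sketch of the submit-then-poll pattern BP describes, assuming a job table held in memory and a thread pool doing the work. The names `submit_job`, `job_status`, and `slow_delete` are hypothetical and are not QCFractal functions; the real server-side design was not described in the meeting.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory job registry: job id -> Future.
executor = ThreadPoolExecutor(max_workers=2)
jobs = {}


def slow_delete(dataset_name):
    """Stand-in for a long-running server operation."""
    time.sleep(0.1)
    return f"deleted {dataset_name}"


def submit_job(func, *args):
    """Start the work in the background and return an id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(func, *args)
    return job_id


def job_status(job_id):
    """Report progress without blocking on the job itself."""
    future = jobs[job_id]
    if not future.done():
        return {"status": "running"}
    if future.exception() is not None:
        return {"status": "error", "error": str(future.exception())}
    return {"status": "complete", "result": future.result()}


# The client gets an id back right away and can disconnect.
job_id = submit_job(slow_delete, "demo")

# Later, possibly over a new connection, the client polls for the result.
while job_status(job_id)["status"] == "running":
    time.sleep(0.02)
final = job_status(job_id)
executor.shutdown()
```

The point of the design is that a dropped connection only interrupts a short status request, not the operation itself.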
DD – For deleting datasets, is there a concern if one user tells it to delete the dataset, and another asks to list the datasets?
BP – It shouldn’t be a problem; all of this is atomic. So you can’t catch it mid-deletion; you’ll get the dataset list from either before or after the deletion.
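To illustrate what "atomic" means here, the following sketch uses Python's built-in `sqlite3` with a made-up two-table schema (not QCFractal's actual schema or database backend). The deletion runs inside a single transaction, so a failure partway through is rolled back and a listing never reflects a half-deleted dataset.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, dataset_id INTEGER)")
conn.execute("INSERT INTO datasets VALUES (1, 'demo')")
conn.executemany("INSERT INTO records (dataset_id) VALUES (?)", [(1,)] * 3)
conn.commit()


def list_datasets(conn):
    return [name for (name,) in conn.execute("SELECT name FROM datasets ORDER BY id")]


def count_records(conn):
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def delete_dataset(conn, dataset_id, fail_midway=False):
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so both DELETE statements take effect together or not at all.
    with conn:
        conn.execute("DELETE FROM records WHERE dataset_id = ?", (dataset_id,))
        if fail_midway:
            raise RuntimeError("simulated failure mid-deletion")
        conn.execute("DELETE FROM datasets WHERE id = ?", (dataset_id,))


# A failed deletion leaves everything as it was.
try:
    delete_dataset(conn, 1, fail_midway=True)
except RuntimeError:
    pass
names_after_failure = list_datasets(conn)
records_after_failure = count_records(conn)

# A successful deletion removes the dataset and its records together.
delete_dataset(conn, 1)
names_after_success = list_datasets(conn)
records_after_success = count_records(conn)
```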
BP – Last bit of hard work to do is the reactiondataset migration
BP – Server seems to be running just fine other than that.
DD – Sounds good. I’m working on testing our qcsubmit and qca-dataset-submission functionality.
Working on qcsubmit here:
DD – I recall that you were checking out export pathways, like SQLite.
BP – Two features are being discussed here, which the initial implementation of QCFractal conflated into one. One is “I want to download a dataset locally and use it without connecting to the server repeatedly.” The SQLite files are big, but that’s manageable.
BP – The other issue is “I have a dataset to use for ML, and I want to export this for use by pandas or something else”. It’s ambiguous what this means - hdf5 isn’t a molecule format, and everyone will need something different.
DD – On other projects, I’ve worked on having a REST API for accessing individual records, but then also having a bulk API for downloading whole databases or large chunks of them.
BP – Yeah, the latter is what I’d like for the SQLite version. I basically take the models that come out of QCF, convert them to messagepack, and compress them. So this requires users to use the QCFractal models, and it’ll be tricky to parse in another program.
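A rough sketch of the serialize, compress, and store idea BP outlines. To stay within the standard library it uses JSON where BP uses messagepack, and plain dicts where QCFractal would use its own models; the table layout, field names, and values are invented for illustration.

```python
import json
import sqlite3
import zlib


def pack_record(record: dict) -> bytes:
    """Serialize to bytes (JSON standing in for messagepack), then compress."""
    return zlib.compress(json.dumps(record, sort_keys=True).encode("utf-8"))


def unpack_record(blob: bytes) -> dict:
    """Reverse of pack_record: decompress, then deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Placeholder records; the values carry no scientific meaning.
records = {
    1: {"molecule": "mol-a", "status": "complete", "energy": -1.0},
    2: {"molecule": "mol-b", "status": "error", "energy": None},
}

# Hypothetical single-table layout for the local export file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload BLOB)")
with conn:
    conn.executemany(
        "INSERT INTO records VALUES (?, ?)",
        [(rec_id, pack_record(rec)) for rec_id, rec in records.items()],
    )

# Reading a record back needs no server connection, only the local file.
(blob,) = conn.execute("SELECT payload FROM records WHERE id = ?", (1,)).fetchone()
restored = unpack_record(blob)
```

As BP notes, anyone reading such a file has to know both the serialization format and the model layout, which is what makes it awkward to parse from other programs.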
DD – That’s better than what I was proposing
BP – The hard thing with ML dataset standards is that everyone has different requirements, and our efforts to define a standard haven’t borne fruit so far.
DD – BP, are you up for a working session later this week?
BP – Yes.
BP – I’d be interested in user feedback from using the new versions+host
PB – I’ll give this a shot on Friday and post on the QCFractal channel.
CC – I can try this out as well.
DD – Great, you should be able to follow BP’s pinned post on the MolSSI QCFractal channel
Infrastructure needs and advances
DD worked on #715
DD – I’ve opened a PR to fix, but it was never built to work in the way they wanted. JH and SB had a way to pull down datasets and save them to disk, and I think that’s a more suitable start for this functionality. So I asked for more details about what an ideal solution here would look like.
DD – Working on getting MPI compute online. I have institutional account access, but not cluster account access. So I’m working with Bert de Groot on this.
Throughput status
OpenFF Protein Capped 1-mer Sidechains v1.2: 42/46 from 42 last week
stuck here
CC – It seems like, of the 4 that are incomplete, 3 are close enough to move ahead with. But one of them is just roughly midway through, so I will run locally to determine what’s going on there.
PB – Can I help with the local testing?
CC – I’ll give it a shot first, then I’ll let you know.
CC – I think we can move this to scientific review until we figure out what’s going on
SPICE PubChem Set 2 Single Points Dataset v1.2: 121383 from 121086, almost complete, around 100+ remaining.
PB – Suggest moving this to EOL next week.
SPICE PubChem Set 3 Single Points Dataset v1.2: 49181 from 13219
PB – Was running fast until the last few days, but now it’s slowed down.
DD – That’s because I’ve had to pull back on lilac - The admins reached out to me and said that I was clogging up the system. PRP also fluctuates a lot
PB – How many array jobs are you using? We had trouble at UCI with loading python envs.
DD – 12. This is a technical issue on their end, but not related to the file system as far as I can tell.
JW – We have TG running a reduced number of QC workers at UCI, to make room for MT’s interchange testing. So I’ll check with MT tomorrow about whether we can give those cores back.
User questions/issues
Science support needs
DOI minting
DD: BP, how do you envision DOI minting working? One challenge is that at submission the dataset isn’t complete at all. Dependent on point-in-time access.
BP: may only make sense for datasets that are frozen, which we don’t have yet
JW: I’m envisioning it as a downloaded dataset blob uploaded to zenodo (with a conda env yaml and a “date accessed”) and generate a DOI since that would be immutable.
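One way to picture the snapshot JW describes is a small manifest stored alongside the downloaded blob. This is only a sketch: the field names are assumptions, the blob is a placeholder, and the actual upload to Zenodo is not shown.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(dataset_name, blob, env_file, accessed=None):
    """Describe an immutable snapshot: what it is, when it was pulled, and its checksum."""
    accessed = accessed or datetime.now(timezone.utc)
    return {
        "dataset": dataset_name,
        "date_accessed": accessed.isoformat(),
        "sha256": hashlib.sha256(blob).hexdigest(),
        "size_bytes": len(blob),
        "environment_file": env_file,  # e.g. the conda env yaml shipped with the blob
    }


# Placeholder bytes standing in for a downloaded dataset file.
blob = b"placeholder dataset bytes"
manifest = build_manifest(
    "example-dataset",
    blob,
    "environment.yaml",
    accessed=datetime(2022, 1, 1, tzinfo=timezone.utc),
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

The checksum lets anyone confirm later that the archived copy is the same bytes that were downloaded on the recorded date.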
BP – I think MolSSI HAS the ability to mint DOIs, but I don’t know exactly what that means.
DD – Yeah, like, how does that connect to, for example, a DOI search engine?
BP – How, in general, does a DOI apply to datasets?
DD – With the new REST API, if we do have a mechanism to download datasets, could there be an API point that takes a DOI as input?
BP – That sounds interesting, but kinda frightening.
DD – If the data could be output in any form as a response to a DOI-based query, that would be cool.
BP – I think the issue of forward-compatibility is a hard one. Like, do we keep dragging datasets forward as standards update?
DD – There had been earlier talk of Frozen Datasets, might that still go forward?
BP – That’s possible, but it’ll also be really complicated to keep it from changing if it has any connection at all to the original database.
BP – Seems like we’re trying to accomplish a lot of different things:
JW - If it is a living database I would still prefer making a hard copy of the datasets that won’t change over time as standards are updated or modified.
BP: I will check on how to cite a living database, where forward compatibility is not guaranteed
JW: For our FF releases we have a practical solution of zipping up the molecules used in force balance fits and strapping a DOI.
PB – Looking ahead, with the SQLite download functionality in the next branch, can we use that to download datasets for static recording on zenodo?
BP – That would work. But I don’t think that we actually want to do that in the long run, I’d prefer to have a pathway that’s directly provided by MolSSI.
DD – One good way to narrow the scope of this would be a data retention policy - That can limit the amount of forward-compatibility that we’d need to support.
✅ Action items
David Dotson will set up a working session with Ben Pritchard, work through remaining hurdles with openff-qcsubmitnext
Pavan Behara and Chapin Cavender will try retrieving sample datasets from the next server; instructions for environment install in QCArchive slack, qcfractal_next channel