Participants
Jenke Scheen
Levi Naden
David Dotson
David Swenson
Mike Henry
Iván Pulido
John Chodera
Jeffrey Wagner
Irfan Alibay
Richard Gowers
Goals
DD : current sprint status - deadline today 12/13
- architecture overview : https://drive.google.com/file/d/1Elw5vWYXuGKSuO-E3jNMxkYnSQaioVF8/view?usp=share_link
- coordination board status : fah-alchemy : Phase 1 - MVP
- updates on In Progress cards
- seeking volunteers for unassigned cards in Available
- please try to finish out any issues/PRs you are currently working on before the holiday
- next sprint will begin January 10

DD : fah-alchemy 0.1.0 milestone
DS : deployment - CLI for service startups
MH : deployment - Docker image build and push
LN : security - scope enforcement by APIs
IP : Nonequilibrium Cycling Protocol (perses #1066) update
IA : protein-ligand-benchmark : blockers and priorities
DD : cards in play
Discussion topics
| Item | Notes |
|---|---|
| DD : current sprint status - deadline today 12/13 | |
| DD : fah-alchemy 0.1.0 milestone | |
| DS : deployment - CLI for service startups | |
| MH : deployment - Docker image build and push | |
| LN : security - scope enforcement by APIs | |
| IP : Nonequilibrium Cycling Protocol (perses #1066) update | |
| IA : protein-ligand-benchmark : blockers and priorities | |
| DD : cards in play | |
Transcripts with no edits

F@H interface meeting - December 13

@00:00 - David Dotson: You make it so you have some. Oh, perfect.

@00:03 - Levi Naden: I think it's set up so that it defines a scope that we can actually use. There are several scopes established.

@00:08 - David Dotson: There are several scopes established that the token doesn't define. So that's all tested against.

@00:13 - Levi Naden: A lot of the PR is also — I can save it for the meeting.

@00:17 - David Dotson: No, no, that's excellent. I read through your summary, so I think I know exactly what I'll see when I really review the code changes. It sounds like you've done a great job here — thank you so much for that. You've got all the right ideas here.

@00:32 - Levi Naden: You want to say bye-bye?

@00:34 - David Dotson: Bye.

@00:35 - Levi Naden: All right. No worries.

@00:55 - David Dotson: Hey, everyone. Morning, or whatever time it is for you. Thanks for joining. Sorry, I'm a little under the weather. SCREEN SHARING: David started screen sharing. I was out sick yesterday, so I still have a bit of a sore throat, but we'll go ahead and get started. I think we've got most of our folks here. And I think Richard's on vacation. So — oh, there he is. Richard, aren't you on vacation?

@01:20 - Richard Gowers: So, a long vacation from being in charge.

@01:24 - David Dotson: Oh, I see. Okay. All right — well, thank you for joining. Appreciate it. Okay, thanks. Thank you, Jeff, as usual, and others who want to take notes. Go ahead, Jeff.

@01:35 - Jeffrey Wagner: Yeah — Richard, do you want to appoint anybody else to be your approver today? Or do you want to keep that?

@01:42 - Richard Gowers: No. Is Irfan here? Okay.

@01:46 - Jeffrey Wagner: Okay. Irfan, are you good with that?

@01:48 - Irfan Alibay: Sure. He doesn't get to choose.

@01:51 - David Dotson: The illusion of choice feels good, right? Okay, cool.

@01:58 - Jeffrey Wagner: Thank you.

@02:00 - David Dotson: Oh — Jeff, anything else?

@02:03 - Jeffrey Wagner: Nope.

@02:04 - David Dotson: Okay. Okay, thanks everyone. We'll go ahead and look at the agenda. Before we jump in, is there anything folks want to add to this list of discussion items? Mostly just operating in script mode here. Okay — if you think of something, let us know. So we'll start off with current sprint status. We have a deadline for this sprint today, so what we'll do is close out today's sprint. We'll look at what we've got in progress, look at what we've got in review, and also look at what we closed out and celebrate a bit — I want us to make sure we're celebrating our victories here — and then also work through any problems we still have. I will say I'd ask that folks try to finish out any PRs or issues you're currently working on before the holiday, if you can, because that'll help me. I'm still aiming to get our first deployment out by the end of this year; I'll happily take that on myself, but it would really help me if you've completed anything you've got in flight. We'll continue using a sprint approach in the new year for the next major milestone, 0.2 — the next major release — starting January 10th. At least that's my plan so far. Any questions on that?

@03:34 - Jeffrey Wagner: I think that makes sense. I wonder if there's... It sounds like you implied this, but I would like to make it formal.
Perhaps there is a day at which we discontinue work — where we don't expect work from anybody until January 10th. I think the sprint thing is great for moving quickly, but it's not great for stopping. So it may be good to make sure that people are all unassigned or something, so there's no pressure to work over the holidays.

@04:12 - David Dotson: Yeah — what I want to make clear is that I'm happy to continue working myself, but I'm not expecting that of anyone else here. I would ask: if you can try to finish out any issues or PRs, great; if you can't, that's also fine — just do the best you can. I'm happy to take anything wherever it's at and continue forward with it. But Jeff, would you like to set a particular cutoff date, if that makes it better for everyone here?

@04:44 - Jeffrey Wagner: Yeah — give me a quick show of hands in the reaction thing if you're going to be off next week, or a thumbs up if you'll be on next week. I'll put it that way.

@05:01 - John Chodera (he/him/his): I'm on, but most of the lab is off.

@05:03 - Mike Henry (he/him): Yeah, I think honestly choosing the 16th is probably safe. I'm also going to be working some, but I think that's when most people are going to stop, and then we restart.

@05:17 - Jeffrey Wagner: Okay, perfect. So maybe we plan on sort of a work shutdown starting the 16th, and then we'll come back on — did you say January 10? Yes.

@05:32 - David Swenson: Okay — for myself, that might not actually be best, because I'm off Wednesday, Thursday, Friday this week. So from tomorrow until the 16th I'm off. I'm happy to contribute some next week, though. I'm off in the sense of being really unavailable.

@05:53 - David Dotson: Understood. Okay. So, Swenson, would you be willing to meet with me today? I think we can knock out the rest of your PR in flight.

@06:03 - David Swenson: I was going to suggest that when it came to my bullet point.

@06:05 - David Dotson: Yeah. Perfect. Okay, thank you. Okay, cool. Any objection to: basically nobody in the working group is expected to do anything after the 16th this year, until we start back up on the 10th? Okay. Do we get any thumbs up from our approvers on that?

@06:30 - Jeffrey Wagner: Thumbs up from me. John, yes; Irfan, yes.

@06:34 - David Dotson: Okay, thank you.

@06:35 - Jeffrey Wagner: It's official.

@06:36 - David Dotson: Thanks, everyone. Thank you. And thanks, everyone here, for all the work you've done. This working group has been fantastic, and we've come a long way in a relatively short period of time. I want everyone to at least feel like they've contributed — it's made a difference here. So thank you, and we'll start again in the new year. Okay. As far as architecture review: does anybody feel they need a refresher on what the overall architecture is? Okay, if not, we'll go forward. So we'll hit board status here. The objective at the moment is just to walk through — tell me if we need to change which column anything is in, if any of these that are currently In Progress need to go to In Review. We'll hit each one that's in progress individually; I want to make sure folks get a chance to stand out here. We'll also cover any closed items so we can celebrate those. So, In Progress column, starting at the top: gufe #101 — this is improvements for settings
serialization. Is this still in progress, or is it ready for review? Or is some of it still in progress?

@07:55 - Mike Henry (he/him): I'm the current blocker on that one.

@08:00 - David Dotson: Thanks, Mike. Also, Docker Compose — status?

@08:04 - Mike Henry (he/him): It's also still in progress.

@08:07 - David Dotson: Perfect. Thank you. Swenson — CLI service startup.

@08:10 - David Swenson: We just mentioned it. Yeah, still in progress.

@08:13 - David Dotson: Okay, hopefully we can get that done today. Perfect. Thank you. Is Iván with us?

@08:19 - Mike Henry (he/him): He said he might be late, but I can also mention that this is now on me as well.

@08:25 - David Dotson: He's here. Well — current status? Still in progress, or?

@08:33 - Iván Pulido (John Chodera): Yeah, I'm studying the notebook for analyzing the results that I've shared, and I'll be implementing the last part of the protocol results that we're missing.

@08:49 - David Dotson: Okay, thank you. Irfan — perses protein-ligand-benchmark minimum spanning graphs? Sorry — that's still in progress. Okay, I think you said you'd have something later in the week. Okay, thank you. And then I've got several in flight as well for me. I'm working on these all primarily in the main branch of fah-alchemy, so I'm hitting them all and sort of sweeping across them — except for this one; this one is a PR. So thanks, folks, I think the board is capturing current state. Levi, I've put the scope enforcement PR into In Review.

@09:25 - Levi Naden: I think that's accurate. Okay.

@09:27 - David Dotson: Cool. Thanks again for that — it's looking good. That's fantastic. And then we've also had some things that were completed, which we'll talk about in a bit. So thanks, folks. Any questions or comments on the current state of the board? Okay. Thank you. All right. As I said, if you can, please try to finish out any issues or PRs you're currently working on, as much as possible. My objective is still to finish out the 0.1.0 milestone before the end of the year. Most of this will fall on my shoulders — that's perfectly fine — but anything you can do to help me out in the near term will make that much easier. Okay. The milestone is at 36% complete. We've got PRs open on many of these items; like I said, I'm sweeping through these three in the middle on the main branch. I think we're actually in a pretty good spot given all of this — I can't think of anything outside of this that is really needed for an MVP to be deployed and be deployable. So yeah, we're on the home stretch here. Thanks, everyone. Let's start off in detail. David, do you want to talk about what's needed for the CLI service startups?

@10:45 - David Swenson: Yeah — just to give the context of what's been done: one of the things we realized is that we need a few more parameters available than we had initially done for each command. I think I now have something in place that has all the parameters we need. But you had talked, David, about having a sort of different way of wiring that in, and I want to talk to you in detail about that when we have a chance. Also, because we've added more parameters, this has of course broken all of the tests — that needs to be fixed. But I also wanted to check with you on how you want to do some of the testing on a few of those aspects.
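For readers following along: a service-startup CLI of the kind discussed here might look like the sketch below, using click. The command and option names are illustrative assumptions, not the actual fah-alchemy interface.

```python
# Hypothetical sketch of a service-startup CLI; names are assumptions.
import click

@click.group()
def cli():
    """Manage fah-alchemy services (illustrative sketch)."""

@cli.command()
@click.option("--host", default="127.0.0.1", help="Host to bind the API to.")
@click.option("--port", default=8000, type=int, help="Port to bind the API to.")
@click.option("--state-store", required=True, help="URI of the Neo4j state store.")
@click.option("--object-store", required=True, help="S3 bucket for result objects.")
def api(host, port, state_store, object_store):
    """Start a user-facing API service."""
    click.echo(f"starting API on {host}:{port} (state store: {state_store})")
    # ... wire the parameters into the app and start the server here ...

if __name__ == "__main__":
    cli()
```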
@11:19 - David Dotson: Okay, yeah — I took a look at it yesterday. I think we're also getting something broken when I try to actually run the APIs, which wasn't there before. So we'll work through this in our session.

@11:31 - David Swenson: I think that's because things aren't actually wired in right now.

@11:37 - David Dotson: Okay. Yeah. Otherwise it was looking OK from my visual inspection — everything looked like what I would expect. So is there a time today that works for you?

@11:47 - David Swenson: We can do this in the afternoon, whenever works for you after this meeting.

@11:53 - David Dotson: We'll schedule something after this. Thank you. Any questions for David? Okay, thanks. Appreciate it. Mike — on the deployment front, you completed the Docker image build and push, so thank you for that. We should definitely celebrate our victories here, so thanks for doing that. I know you're now working on the Docker Compose pieces. Do you want to talk about that?

@12:26 - Mike Henry (he/him): Yeah. Really, my big question for it is: for the MVP, what's our threat model? Can we trust the host environment the server runs on? Really, the only bit left to decide on the Docker Compose side is the best way to handle secrets. Environment variables are the usual way to do it, but the new hotness with Docker is to use file mounts, because environment variables can be leaky via docker ps and other utilities — so the question is how to contain them in the MVP. If we're okay with assuming we trust the host we deploy this on, and trust the other users on it, I think for the MVP it'll be easiest to just use an environment secrets file, because the more sophisticated ways of sharing secrets between containers using filesystem mounts just require another level of DevOps on top. So if we're okay, for the MVP, with trusting our hosts, I don't think there's any major rearchitecting to do on this.

@13:37 - David Dotson: Okay, cool. Yeah, I agree with you. Let's go ahead and operate on the assumption that we can trust our hosts for the MVP. And can you make an issue that articulates what you just told me — that the next step, to go beyond that, would be to use, as you said, file mounts?

@13:55 - Mike Henry (he/him): Yeah — basically there's a directory that's encrypted, so the secrets are never... it prevents them from being leaked by a lot of the Docker utility commands. Theoretically you're not supposed to be running Docker as root, so it's in userland; if someone can do remote code execution in your environment, they can sometimes get Docker to leak those keys. If you have it encrypted on the filesystem, that helps cut those off, but it requires using Docker Swarm or another level on top of that, which is way overkill for what we're trying to do. I just wanted to make sure we're okay with that going forward, because if we want it, I want to make sure we do it right the first time. It's a lot easier to be as restrictive as you can up front than to be lazy and later find all these issues with your deployment process.
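For the env-file versus file-mounted-secrets tradeoff Mike describes, a minimal sketch of how a service could read a secret either way. The variable name and path conventions are illustrative assumptions, not the actual deployment code.

```python
import os
from pathlib import Path

def read_secret(name: str) -> str:
    """Read a secret from a Docker-style file mount if present,
    falling back to a plain environment variable.

    Docker secrets conventionally appear under /run/secrets/<name>;
    the env-file approach chosen for the MVP injects them as variables.
    """
    mounted = Path("/run/secrets") / name.lower()
    if mounted.exists():  # file-mount path: not visible via `docker ps`
        return mounted.read_text().strip()
    return os.environ[name]  # env-file path: simplest on a trusted host

# e.g. NEO4J_PASSWORD from an env file for the MVP now, or from a
# mounted secret later, without changing calling code:
# password = read_secret("NEO4J_PASSWORD")
```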
@15:00 - David Dotson: No — because this is Docker Compose, a single-host solution, right? If you were deploying to something like Kubernetes or Docker Swarm or some other container orchestrator, you would use the secrets mechanism that orchestrator provides, right? So I think it's fine for the Docker Compose approach — the single-host approach — to just proceed with the environment file. Like you said, it's single-host thinking here.

@15:26 - Mike Henry (he/him): Yeah. Then really the only bit that makes this PR kind of gross is that I had to merge in a couple of PRs that aren't merged yet, in order to have the environment CLI and things like that. So I might just end up nuking this one and restarting it once the other PRs get merged in, to keep the git history nice.

@15:49 - David Dotson: No, I appreciate that. Yeah — Swenson and I will try to get the CLI stuff wrapped up; that'll unblock this for you.

@15:54 - Mike Henry (he/him): Cool.

@15:56 - David Dotson: Awesome. Thanks for your work here. Any questions for Mike? Okay, thank you, Mike. Levi, do you want to tell us a bit more about the scope enforcement?

@16:17 - Levi Naden: Yeah — so I've got the PR up for it. What it does is go into the API and add checking of the requested scope against what the user provides as part of their token. A lot of that had to do with adding a fair amount of test infrastructure, as it were — getting all the correct settings and everything so we can test not only a correct scope but an incorrect scope, and all the infrastructure involved in doing that. That should also make further testing of the API easier going forward, because of all the extra infrastructure work put in behind the scenes of this PR. But yeah — otherwise, that's what this PR does: it provides a layer of security that prevents discoverability if somebody tries to use a scope they don't have access to as part of their token. It doesn't do anything beyond direct one-to-one matching, so you can't have a wildcard scope, for example. Eventually, when querying networks is enabled, the wildcard scope will be checked against, but it'll still only ever find things within your scope, at least the way it's implemented now. In the future, should everyone here decide they want to add the ability to have wildcard tokens, the infrastructure should make that easier to set up, implement, and test against.
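A minimal sketch of the direct one-to-one scope check Levi describes. The org/campaign/project structure and all names here are assumptions for illustration, not the PR's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    # assumed structure for illustration: org / campaign / project
    org: str
    campaign: str
    project: str

def scope_authorized(requested: Scope, token_scopes: list[Scope]) -> bool:
    """Direct one-to-one matching only: the requested scope must exactly
    equal one of the scopes carried in the user's token. No wildcards,
    which also prevents discoverability of scopes the user doesn't hold."""
    return any(requested == held for held in token_scopes)

# e.g. a token holding [Scope("openff", "fah", "p1")] authorizes exactly
# that scope and nothing else; wildcard support would relax this check.
```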
@17:48 - David Dotson: Yeah, we do intend to add wildcard support in the future. For example, I've got number 42, which would be a follow-on to this for the 0.2 milestone.

@17:56 - Levi Naden: And I had been looking at that one as part of considering this. Hopefully all the extra test infrastructure I've added will make it easier to check and make sure it's correct in future PRs and coding, so that process should be faster.

@18:12 - David Dotson: Perfect. Excellent. This is awesome — thank you for doing this. And I know this is also your first intro to the code base, so fantastic work given that you just jumped into it about a week ago. Do you have any questions for the working group that you'd like to pose?

@18:32 - Levi Naden: No, none right now, unless they have questions for me.

@18:36 - David Dotson: Any questions for Levi?

@18:42 - Jeffrey Wagner: No question — just a big thank you for hopping in and working on this. That's awesome; you got this over the finish line.

@18:48 - David Dotson: Cool. Oh yeah — side note: as part of this, OSX testing will work now. Good. Yeah, I managed to find that. I'm a little worried that we may have broken it again in another PR — I don't think so, but we'll work that out.

@19:04 - Levi Naden: Yeah, I don't think so, because all the CI tests for GitHub Actions pass with this PR, which implements that fix, and that's tested against the Unix VMs they have.

@19:17 - David Dotson: Okay, sweet. I think we're in good shape then. I was just a little worried when I was reviewing another PR that looked like something else.

@19:23 - Levi Naden: Should be okay. We can always add it if need be — worst case scenario, it's an if statement checking the OS.

@19:32 - David Dotson: Exactly. Perfect. All right, I'll give this a review today — like I said, it's on my plate to do here.

@19:37 - Levi Naden: Great, thank you.

@19:43 - Iván Pulido (John Chodera): Iván, do you want to give us an update on non-equilibrium cycling? Yeah, I already commented some things in the notes. Basically, I'm working on the analysis of the results with the notebook I linked in the notes. What I plan to do is extend the analysis module that perses has to include non-equilibrium cycling, because I want to avoid redundant code — we will need that in perses, not only for the gufe protocol. I'm also waiting on the changes we discussed last week for the settings objects; I think Mike is working on those. It's not a blocker, but I do need them to fully test things end to end. Other than that, that's the status.

@20:39 - David Dotson: Okay, thank you. And I know you and I met — I think we figured out what we want to do with the protocol results for the get_estimate method. In particular, from John's comments on Tuesday: we only want to be using a single accumulated work value for a given protocol DAG, forward and reverse, one of each, and we want to take a whole set of those and pop them into BAR, basically.

@21:09 - Iván Pulido (John Chodera): Yeah, this is what the notebook actually does. I was just misinterpreting what it was doing, but I'm studying it in detail, and this is what it does.

@21:19 - David Dotson: Perfect. Thank you.

@21:23 - John Chodera (he/him/his): I'm happy to take another look at this, synchronously or asynchronously, if you just want to send it my way.

@21:31 - Iván Pulido (John Chodera): When I have implemented something, I will create a PR — or maybe it is the same PR here — and I will point to the changes for you to review. Please. Thank you.
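The gather-then-BAR step David describes — one forward and one reverse accumulated work value per protocol DAG, pooled and fed to BAR — might look like the following sketch, assuming the pymbar 3.x `BAR` estimator and work values already in units of kT; the data here is made up for illustration.

```python
import numpy as np
from pymbar import BAR  # pymbar 3.x API: BAR(w_F, w_R) -> (DeltaF, dDeltaF)

# one accumulated work value per protocol DAG, forward and reverse,
# pooled across all replicas/gens of a transformation (in kT)
w_forward = np.array([12.3, 11.8, 12.9, 12.1])
w_reverse = np.array([-11.9, -12.4, -12.0, -12.6])

delta_f, d_delta_f = BAR(w_forward, w_reverse)
print(f"estimate: {delta_f:.2f} +/- {d_delta_f:.2f} kT")
```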
@21:47 - David Dotson: You did mention you also need some resolution on the settings. I know we're jumping back to this one, but I did want to ask — I know Mike, Iván, and Swenson met last week on this — do we have a clear path forward on how the settings will interact with gufe tokenizables? I think it was using the JSON encoder.

@22:23 - David Swenson: Yes — serialization is through the JSON encoder. There's actually another PR open on gufe; I put it in a separate PR. It just needs tests, I think. I don't know if Mike has time for that, or otherwise I can try to do that.

@22:44 - Mike Henry (he/him): Yeah — it's on me to finish up these tests and then incorporate using the JSON encoder this way into the PR that Iván opened.

@22:55 - David Dotson: Okay. It looks like you've already got some review on this, so the path forward is to merge this one. Would this actually fully address number 101?

@23:10 - David Swenson: Not entirely, because there are some things in 101 that deal with the fact that there are defaults that are not currently set. And it sounds, from the conversation we had last week, like that's also going to involve some refactoring — maybe complete changes — to the settings objects, especially in terms of how they handle force fields. So those things, I think, are part of 101, but the actual serialization aspect is handled in the other one.

@23:44 - David Dotson: Okay, cool. Swenson, do you have what you need on number 105, then? Or do you just need more review, or someone else to review as well?

@23:57 - David Swenson: There's a need for tests, and there are a couple of little things I noticed. The one you're pointing at there has to do, I think, with the problem in the defaults set in 101: there's something in the original object that doesn't carry units, and when you go through a serialization cycle it picks up units.

@24:16 - David Dotson: But I think these are things that were added in 101 that didn't do that.

@24:21 - Mike Henry (he/him): Yeah — 101 and 105 are coupled: 105 splits out the work of getting the JSON encoder to what it needs to be, while 101 is the implementation of a lot of stuff. That's why they're both kind of on my plate right now, to get them finalized.

@24:38 - David Swenson: If you scroll up a little, there's one review comment I made against the code that I think needs to be dealt with in a bit more detail — right there. What I have here, I realized after I wrote it, assumes just basic units. I don't think this is going to work if you have units that are squared, or meters per second squared, or something like that. So the actual details of these to_dict and from_dict functions might need to change, but the premise — the idea of how this should work — would be the same.

@25:22 - Mike Henry (he/him): Yeah, this will be easier to test once the other PR is finished, because then we'll be able to play around with it and make sure we call the right bit of code to get the units parsed correctly.

@25:38 - David Dotson: Mike, do you mind if I add you as a reviewer, officially? Oh, not at all. Okay. But it sounds like you've got a handle on this.

@25:48 - David Swenson: Yeah. I think Mike's actually moving this forward now, right?

@25:53 - Mike Henry (he/him): Yeah — David should be taken off of this PR, and it's my responsibility now to get it to the finish line.

@26:02 - David Dotson: Oh, okay. I can also just switch that over to you, then.

@26:05 - Mike Henry (he/him): Yeah, that would be good.

@26:06 - David Swenson: I think it's the assignment there.

@26:08 - Mike Henry (he/him): Yeah, there it is. It's like when you open a PR, it doesn't assign you to it. So yeah.
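On the review comment above about compound units: serializing a quantity's magnitude together with a unit string round-trips cleanly even for units like meters per second squared. A minimal sketch with pint — the function names are illustrative, not the gufe/openff-models code under discussion.

```python
import pint

ureg = pint.UnitRegistry()

def quantity_to_dict(q: pint.Quantity) -> dict:
    # str(q.units) handles compound units ("meter / second ** 2"),
    # not just basic ones, which was the concern raised in review
    return {"magnitude": q.magnitude, "unit": str(q.units)}

def quantity_from_dict(d: dict) -> pint.Quantity:
    # parsing the unit string recovers the full compound unit
    return d["magnitude"] * ureg(d["unit"])

a = 9.81 * ureg.meter / ureg.second**2
assert quantity_from_dict(quantity_to_dict(a)) == a  # round trip keeps units
```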
@26:16 - David Swenson: But obviously, if there's anything you need help with on that, let me know.

@26:20 - Mike Henry (he/him): Yeah — and I know you're going to be out this chunk of the rest of the week. So yeah, cool.

@26:27 - Jeffrey Wagner: Yeah — and Matt Thompson is pretty jazzed that anybody's using openff-models at all. So if there's anything you need from his end, feel free to contact him directly, or contact me and I'll make sure we prioritize any fixes you need.

@26:41 - Mike Henry (he/him): Yeah, and I owe him some stuff. But now that we're starting to implement this and things are kind of in flux, I want to wait to loop back onto that PR for openff-models until this stuff is all set, because these are all the changes we needed to make this work. Once we're happy with how this works, I can move back, get the tests in, and make sure things are performing.

@27:05 - Jeffrey Wagner: Okay, cool. To the best of my understanding, this is the only use of openff-models in the wild, so you're not going to break anyone else's code if you just push stuff into main over there.

@27:15 - Mike Henry (he/him): Cool — good to know.

@27:20 - David Dotson: Well, thank you. Iván, is there anything else you need for the non-equilibrium cycling protocol, as far as you can tell?

@27:30 - Iván Pulido (John Chodera): Maybe just a quick request for Mike: we made some changes to the settings code for the protocol. Can you upload those, even if they're not definite, just so I know which settings will actually be there? I don't remember them by heart now.

@27:54 - David Dotson: Okay. Any other questions for Iván? Okay, thank you. Irfan — on the protein-ligand-benchmark front, it looks like we got number 82 merged, so thank you for that. Do you want to give us some discussion on number 83?

@28:18 - Irfan Alibay: Yeah — it's essentially just waiting on me. I just need to get my workstation back up to run everything; it just takes a little bit of time. And I've started working on doing all the fixes for the proteins, so that should be available soon. By fixes I mean renaming, for example, so the various structures are compliant.

@28:51 - David Dotson: Okay. Is there anything anyone from this group can provide that would help, or no?

@28:55 - Irfan Alibay: Probably not at this time. I think I'll be looking for reviews, maybe tomorrow.

@29:00 - David Dotson: Okay. And Iván, are you still available to review anything that Irfan puts up?

@29:07 - Iván Pulido (John Chodera): Yeah — if it is this week, sure, because I'll be on vacation next week. But basically I'm just waiting for him to regenerate the edges, and then I'll try to run the whole dataset to see if it runs.

@29:24 - Irfan Alibay: Yeah. Okay, I'll try to get this to you tomorrow afternoon my time, so that should be the middle of your day.

@29:33 - John Chodera (he/him/his): Okay. And Iván, can you make sure to post instructions on how to do that in the perses Slack, so we can take a stab at it over the holidays? I'd love to make sure I can set these up and run them, but you shouldn't need to supervise that.

@29:48 - Iván Pulido (John Chodera): Oh yeah, that's a good idea.

@29:50 - John Chodera (he/him/his): Yeah, I will. We'll have the compute time at least, and I'm happy to babysit.
@29:54 - Iván Pulido (John Chodera): Yeah, I have everything kind of well organized — I'll just point to the relevant scripts.

@30:01 - David Dotson: Okay, cool. Irfan, is there anything else you'd like to talk about on the protein-ligand-benchmark front?

@30:09 - Irfan Alibay: No, nothing more.

@30:12 - David Dotson: Okay. Any questions for Irfan? Okay, thank you. Thanks again, Irfan, for all the work you've done on this front. I know PLB is not the most glamorous work, but we do appreciate it — it is a key component of everything. Okay, last but not least is my cards in play, so I can talk a bit about this. I've been doing a lot of this work on main, so it doesn't have a clear PR showing where things are happening, but I've added object store support to each of the APIs. I can talk about this in pieces. An object store is, in general, anything that can store file-like objects, and we have one for a start here, for AWS S3. So we now have a way to push ProtocolDAGResults to it and get back an object store ref that gets stored in Neo4j — that's our index, that's our database, that's our state store. We don't store the results themselves in Neo4j; we store them in S3, but we have a reference to them, so we know exactly where they are. Likewise, given that same object store ref, we can retrieve it from S3. Architecturally, when it comes to the compute services, those will be pushing results to the compute API, and the compute API will then push them to S3 using that S3 object store interface. The user-facing API will likewise be pulling. The compute client has a way to push — let me see here — oh, it's in set_task_result here. It takes a given ProtocolDAGResult and pushes it to its corresponding API endpoint. The compute service is what pushes these results forward: it calls its own client's set_task_result, and that makes its way through the API to S3. Then likewise, on the user-facing end, we have get_transformation_results. This is what you, Jenke, for example, would use to pull a result for a transformation you've previously submitted: you call get_transformation_results, give it the scoped key of the transformation you're interested in, and it gives you back the protocol result corresponding to that transformation's protocol. It gives you back the single thing you need to get a ΔG out of — it orchestrates all of the pulling from S3 to do that.
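A sketch of the push/pull flow David describes — result payloads in S3, lightweight references in the state store — using boto3. The bucket name, key scheme, and function names are illustrative assumptions.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "fah-alchemy-results"  # assumed bucket name for illustration

def push_result(scoped_key: str, protocoldagresult: dict) -> str:
    """Push a serialized ProtocolDAGResult to S3, returning the object
    store ref that would be recorded in Neo4j alongside the graph node."""
    ref = f"results/{scoped_key}.json"
    s3.put_object(Bucket=BUCKET, Key=ref,
                  Body=json.dumps(protocoldagresult).encode())
    return ref  # the ref lives in the state store; the payload does not

def pull_result(ref: str) -> dict:
    """Given an object store ref looked up in Neo4j, retrieve the result."""
    resp = s3.get_object(Bucket=BUCKET, Key=ref)
    return json.loads(resp["Body"].read())
```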
@33:59 - John Chodera (he/him/his): Does this contain — for example, for non-equilibrium cycling — the aggregated work values before they've been analyzed, or the post-analysis: here's the free energy estimate, the uncertainty, and the associated metadata?

@34:12 - David Dotson: You get two options here, actually. By default, it returns the protocol result object, which has three methods on it at the very least: get estimate, get uncertainty, and get rate of convergence. So it'll only have, at that point, the things needed to do those operations. That may include the whole list of accumulated work values for non-equilibrium cycling, forward and reverse, but you'd have to pull those out yourself. If you decided you wanted the more raw results, you could say: just give me a list of lists of ProtocolDAGResults. Then, instead of passing them in through here, it'll just spit back out the raw pieces, so you could do custom stuff with that if you wanted. Does that make sense?

@35:08 - John Chodera (he/him/his): Is the overhead for retrieving this result pretty substantial, or is it pretty lightweight to pull back? Like if you're grabbing all of the results for a graph, for example.

@35:20 - David Dotson: All of the results for an entire network? Yeah — so this is the single-transformation case, and we have yet to see what the performance bottlenecks of all of this will be. I'm expecting we'll initially have tons of performance bottlenecks, and we'll have to address those in subsequent releases. If you wanted to grab multiple transformations — like full data for a whole network — you could walk through each transformation, basically doing a loop on this, or you could do it in parallel too. It would be perfectly appropriate to do this with a process pool, or with dask, or something like that, to grab multiples at once. We can then add conveniences to the client itself to make it do multi-threaded requests. There are also more opportunities for optimization here — like I noted as I was writing it, what we'd like to do is make the requests as we're processing them, so we're not wasting compute cycles on the client side while we're just waiting for things to come back. That's not a question; I'm just pointing out there are ways to optimize this.

@36:36 - John Chodera (he/him/his): Once we start to get a feel for how clients will access this and what the access patterns look like, we may want to institute some sort of caching or other things — because you'll often want to say "what's the current estimate," and then later on, "what's the estimate now," and you may not need to reprocess everything if nothing has changed, for example.

@36:57 - David Dotson: Exactly — you don't want to have to pull down the 90% of stuff you've already pulled down. Because if you're just adding increments, most of what you'd be pulling down would be stuff you already had.

@37:07 - John Chodera (he/him/his): Or even, for the example here: if a bunch of edges are complete and you're just waiting on a few stragglers, you don't have to redo the bulk of the work on things that have literally not changed at all.

@37:19 - David Dotson: Yeah. At the moment, the user-facing client doesn't have any caching mechanisms built in, but that is something we could add — basically, it holds on to anything that's been pulled before, result-wise. The good news about the architecture we've done here is that these results don't change, right? Once you've pulled a given result — it has a UUID attached to it and all that; this could be at the ProtocolDAGResult level, which would be the appropriate spot — there's no reason to pull it down again, because it's not going to change on the system of record. So that's a clear opportunity for caching. We could put it in here; it would speed up subsequent pulls for any given transformation. And we can do that in a way that the user doesn't have to think about — they don't have to manage their cache manually; the client handles that. Does that satisfy you, John?
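Because pulled results are immutable, a client-side cache keyed by object store ref is always safe — a subsequent pull can skip anything already retrieved. A minimal sketch with hypothetical names, building on the pull_result sketch above:

```python
class CachingResultClient:
    """Transparently cache immutable results by object store ref;
    the user never manages the cache manually."""

    def __init__(self, pull_fn):
        self._pull = pull_fn   # e.g. pull_result from the sketch above
        self._cache: dict[str, dict] = {}

    def get(self, ref: str) -> dict:
        # results never change on the system of record, so a cache hit
        # is always valid; only new refs cost a network round trip
        if ref not in self._cache:
            self._cache[ref] = self._pull(ref)
        return self._cache[ref]
```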
@38:19 - John Chodera (he/him/his): As long as there's a path to that, it sounds great. We should, as you point out, amortize the cost of implementing anything for addressing performance bottlenecks as they arise.

@38:29 - David Dotson: Yeah, I think there is a clear path for that. It is another layer of things on the client — it won't be in the first pass of the client, but it can be in the second or third pass. So, as I said, the current strategy I'm taking for getting all these things out the door is to sweep through these three issues in particular, because they're all related, going back and forth through them — but we're getting near the home stretch here. My next step is to add tests for the client result pulls I just pointed out. Any questions for me? I know I just threw a lot at folks here. Okay, thank you. I did want to ask — this is more of a technical aside, and I think we've discussed this before, for Swenson and Richard. When it comes to ProtocolDAGResults, we have in some sense a concept of clones and gens, right — that's the Folding@home terminology for these — where... I'd probably do better to draw a picture, but it's basically: do I have replicas, or do I have extensions? And I know Swenson and I have debated whether we go with a label-based approach, where a given ProtocolDAGResult knows that it's an extension of something else, or whether we go with a model where results don't know anything about where they come from, and it's up to the storage system to know how these things are related. As I was building all of this apparatus out, it was starting to make more sense to me for Protocol.gather to take maybe a list of lists of ProtocolDAGResults, where each top-level element is a replica and each list inside is a chain of gens, basically. Does that make sense, or am I being unclear?

@41:14 - David Swenson: Actually, if you don't mind, I can share my screen. It happens that I put together some slides on this after you and I had a really good conversation about it, and I was waiting for the right time to have this conversation with the rest of the team. SCREEN SHARING: David started screen sharing. What we talked about was this idea that we need a label for a result. There are kind of two models we discussed, and I guess through all of this, most of the structure is clear: we know we can label using gufe keys, and we know we can have UUIDs as part of the label as well. These aren't super useful from a human-readable standpoint, but they are unique identifiers, and that's what we need. So, these two models. It sounds like you're leaning more toward the hierarchical model, which is funny, because you argued the topological one, and after we discussed it I kind of moved from hierarchical to topological myself.

@42:09 - David Dotson: Oh, shoot. We switched places, maybe.

@42:11 - David Swenson: I don't know.

@42:12 - David Dotson: Pretty much.

@42:13 - David Swenson: Yeah. So in that sense, you're thinking a result would be labeled by the transformation it came from, the clone, the generation — that's enough to give you a DAG result — and then the units themselves can be labeled by the unit, the retry number, and the name of the specific data within it.
Like, that would be a file name, because a given unit could create multiple files. So the difference here is between this clone/generation labeling, and talking about the key for this one and its parent.

@42:45 - David Dotson: Can I stop you there real quick for a question? So — yes and no. With the list-of-lists structure, what that would best accommodate is, you're right, a hierarchical model. It's like saying: for a given replica, there is only a chain of gens; we never do any sort of splitting within a chain and end up with a tree of things, right? And I'm fine with enforcing that. In our discussion we talked about maybe enforcing that for a start. Because — for everyone else listening — the idea is: let's say you say "I want to run this transformation," so I create an initial task to run it once. Then you say, "okay, I want to extend that," so I extend it once — now I've got a chain of two things. And then you say, "well, I'd also like to do another extension of the first one." So then you've got one followed by two: you could really do branching if you wanted. The model is flexible enough to handle that. The question is whether that's even useful for folks, and that was something we debated.

@44:10 - David Swenson: Exactly the question I had there. But the one thing I thought about is that this really reminds me of quantum Monte Carlo, with the birth and death of walkers. I don't know if this is something that's ever going to be useful to anyone in this space, but there are domains where this idea shows up. So it's worth being aware that we're potentially cutting off some things by reducing the flexibility.

@44:42 - John Chodera (he/him/his): Is there a way to have our cake and eat it too? I mean, essentially this could be useful for very fancy adaptive sampling — like analyzing with Markov state models to explore and reallocate trajectories to underexplored regions; I can see that being very useful. But the immediate use is: I just need a big bucket of things, where they all have different labels; I may or may not care about them, and I may need to do filtering. So you could have the same information with multiple different views — like a topological and a hierarchical view of the same data — by providing an API that lets you explore it in different ways, with the complexity hidden from the user. Is something like that possible? I know, again, you're looking at how you're storing things in terms of the data model, but if you could provide accessor functions that give different views on it, you might be able to get both worlds.

@45:41 - David Swenson: The challenge is that in a context like this, it's unclear that "generation" means anything. It's certainly not a unique identifier for anything. So if you provide that API and you have something like this, what do you expect out of it?

@46:00 - John Chodera (he/him/his): I think you'd expect to just get some different copies that flatten to the same information, right — the same generation or something like that. So that's okay, though, right? I mean, the protocol knows how to analyze its own data. It may or may not care about the complexity here.
These are just convenient ways for the programmer of the protocol to access the data, and they'll know what to do with whichever representations are meaningful — but you still get back all the data. It's just that you may get multiple copies of the same thing.

@46:32 - David Swenson: The other thing that's a little tricky here — so you're thinking that you'd get this whole chain back from here?

@46:41 - John Chodera (he/him/his): But maybe with a unique hash associated with each one of them, if they have a collision on gen and clone, or something like that, right?

@46:52 - David Swenson: I mean, if you ask for a specific thing — I want this one clone, so I get this clone — and I ask for gen two, do you expect to get two results back in that case?

@47:05 - John Chodera (he/him/his): Yes, with different hashes attached to them, right? You just do the best possible thing you can do with what was requested, with that particular view of the data. But then you'd know that if you really want all the details of the branching history, etc., you should explore this through the topological model.

@47:23 - David Swenson: Would you think, then, that in the more common basic case you'd still be returning a list of lists of one item? Or would you?

@47:32 - John Chodera (he/him/his): You know, honestly, most of the analyses will be "just give me a bag of results," with some annotations, right? Or "give me a bag of results where I discard the first iteration, or gen, and take everything beyond that." So much of the analysis will be: I just want a bunch of the results together, to analyze collectively, with some basic filtering. The detailed model of walking down the tree is actually super inconvenient for most of that, because you just want the bag of data. So most of the accessors will be "give me a bunch of data," but there might be different convenience accessors for how you do the filtering, how you obtain the data, and how the data is labeled when you get it back, in case you need to do more advanced processing.

@48:24 - David Swenson: Yeah. I guess I don't have a full sense of what the analysis things would be there, and how to make that work from an API standpoint that isn't potentially confusing. That's the one thing I was concerned about here — the more general thing would be that asking for a generation gives you a list of multiple items, but that would be weird to me as a user who's just thinking of it in this basic case.

@48:47 - John Chodera (he/him/his): Do you want to game through some use cases? That might help make this decision a little easier. We can do that offline, if it would be helpful, to get a bunch of use cases for different algorithms.

@48:58 - David Swenson: Yeah, I think that would be helpful here.

@49:00 - David Dotson: Yeah — I think the challenge David and I had in gaming some of this out is trying to see where the line of usefulness is. Because, of course, it's the classic case that we can come up with an arbitrarily complex model that tries to capture a bunch of theoretical use cases — but if we're trying to capture the whole universe, and we don't really need to capture the whole universe, we just need this piece over here.
And that makes life much easier for us. But if there really are use cases for the topological model, where you want to do this sort of splitting, we'd like to know sooner rather than later, so we don't paint ourselves into a corner.

@49:34 - David Swenson: And the other challenge with anything based on a topological model is that, to obtain the generation — you can write something that does that, but it requires all the data: you need to load up all the results that have been stored, reassemble the directed acyclic graph, and breadth-first search it just to get generations, right? Or, if not loading all the data, you at least have to reload all of the labels that give you the information to generate the graph. One thing to point out is that, at least in terms of doing adaptive things, adaptive sampling is going to involve a lot of very short units, so that can become costly at some point.

@50:32 - John Chodera (he/him/his): But again, most of the analysis will be: just give me a big bag of data, and then I'll do some crunching on it, maybe using that label information — probably not, though — and then I might spawn off some new trajectories.

@50:44 - David Dotson: For example, John — for non-equilibrium cycling, even if we did a bunch of combinations of extensions and replicas, ultimately the get estimate method really just needs to flatten this whole bag of things, take the forward and reverse works from each one, and toss them into BAR, right?

@51:09 - John Chodera (he/him/his): Exactly. So most of the filtering I might do is: exclude the first iteration, or exclude the last iteration, or only take the first iteration as a generation.

@51:19 - David Dotson: So you would need to have some information about what order these things happened in.

@51:25 - John Chodera (he/him/his): Possibly. Or, even better, just asking for that when I'm asking for the data: filter out — give me gens one through whatever.

@51:35 - David Dotson: Okay, I see.

@51:37 - John Chodera (he/him/his): The simplest thing is "give me everything." The second simplest is "give me everything, subject to some very simple filtering," through one of these hierarchical or topological kinds of APIs.

@51:49 - David Swenson: I see. Yeah — it's just that filtering, in the case of "I don't want the first generation," is easy, because you can filter out the one that doesn't have any parent. But for any other kind of filtering — even "I don't want the first two generations" — you need to actually reconstruct the graph entirely, which is an ordering process. It's not that bad, but.

@52:14 - David Dotson: Yeah. I can say, from the fah-alchemy side: we talked a bit about how we don't store the ProtocolDAGResults in Neo4j — we store them in S3, but what we store in Neo4j are references to those objects. So walking the graph of objects — I mean, it's literally a graph, and in the graph database it will be pretty fast, right? It would be very performant to say "give me the topology," get exactly that, and then choose from it what you actually decide to pull from S3. So we wouldn't have to grab a bunch of fat result objects to do that filtering, which is good news.
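A sketch of the reconstruction David Swenson describes: given only lightweight (ref, parent_ref) labels pulled from the state store, rebuild the lineage and assign generations by walking from the parentless roots — so a filter like "gens 1 through N" needs no result payloads at all. All names here are illustrative.

```python
from typing import Optional

def generations(labels: list[tuple[str, Optional[str]]]) -> dict[str, int]:
    """labels: (ref, parent_ref) pairs, with parent_ref None for gen 0.
    Returns ref -> generation via breadth-first walk from the roots;
    branching chains (a tree of extensions) are handled the same way."""
    children: dict[Optional[str], list[str]] = {}
    for ref, parent in labels:
        children.setdefault(parent, []).append(ref)

    gen: dict[str, int] = {}
    frontier = children.get(None, [])  # parentless results are gen 0
    g = 0
    while frontier:
        for ref in frontier:
            gen[ref] = g
        frontier = [c for ref in frontier for c in children.get(ref, [])]
        g += 1
    return gen

# e.g. keep only gens >= 1, then pull just those refs from S3:
# keep = [r for r, g in generations(labels).items() if g >= 1]
```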
@52:52 - John Chodera (he/him/his): That would be a great advanced way to do it — offering people that model for API access. But again, "give me everything" and then "let me tell you what I want" are maybe the two very flexible ways to implement that.

@53:05 - David Swenson: So I guess, maybe, something for us: if we implement something like this in OpenFE, that gives us this label — it'll actually be saved in one place, and then it's another search to get the actual data associated with it.

@53:21 - David Dotson: Yeah, but at that point it's a key-value grab, right? It's the same model — it turns into just key-value access at that point. Okay, cool. David, did that give us something to go off of?

@53:44 - David Swenson: I think so. It sounds like we have a slightly better sense: the topological approach is reasonable, but we want a simplified API to do certain kinds of filtering on it. It sounds like it's of interest, at least.

@53:59 - David Dotson: Irfan? You'd mentioned you could think of some use cases.

@54:02 - Irfan Alibay: Sorry — I think John already mentioned what I was thinking of.

@54:13 - David Dotson: Very cool. John, I think you mentioned you'd be willing to game out some use cases with us — I think that would be valuable for a start. It sounds like we already want to go that direction, but if you can get those use cases out sooner rather than later, that would help.

@54:30 - John Chodera (he/him/his): Where do you want me to put them? That's the question.

@54:32 - David Dotson: Issues are fine, even if they end up on fah-alchemy for a start and they're fundamentally relevant for gufe — we can route the effort accordingly. Okay, thank you. I think we can proceed with what we have for now, even without that, for non-equilibrium cycling. So for a start — I mentioned this earlier; I'll share my screen one more time. SCREEN SHARING: David started screen sharing. I was operating with maybe a list of lists of ProtocolDAGResults, taking the approach that each element of the outer list is a replica, and each list inside of that is a chain of things. But of course, if those chains have their own splittings, that becomes lists of lists of lists — you can build a data structure that captures that, but it's not necessary for a start. So what I'll probably do for a start is just flatten this out — just assume it is what we currently assume for gather down here, which is that this is just a list, a bag of ProtocolDAGResults. Iván, this is something you and I can talk about for non-equilibrium cycling; it sounds like we don't really care about the order of these things, so we'll be able to get away with this just fine. Later on we may evolve the model, so that may change with time.
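The interim shape David describes — accept a replicas-by-gens list of lists but treat it as an unordered bag for protocols like non-equilibrium cycling — could look like this sketch; the signature is hypothetical, not gufe's actual gather API.

```python
from itertools import chain

def gather(replica_chains: list[list["ProtocolDAGResult"]]) -> list["ProtocolDAGResult"]:
    """Each outer element is a replica; each inner list is a chain of gens.
    For a start we flatten to a single bag, since non-equilibrium cycling's
    get_estimate only needs the pooled forward/reverse works for BAR."""
    return list(chain.from_iterable(replica_chains))
```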
Any questions on this? I know we kind of got into the technical weeds here. Okay — so with that, we've got through the agenda. Are there any other items folks want to discuss? This is our last meeting of the year.

@56:30 - Jeffrey Wagner: I would just toss in, as the person responsible for the OpenFF share of your time, David: I had intended you to be included in the work shutdown. So if you're raring to do this and you wouldn't be doing anything else over the holidays, then by all means — but I would much rather you not do this, enjoy this time of year, and come back fresh in January.

@56:53 - David Dotson: Okay — well, I'll still enjoy the holidays, I promise. But no, I have plenty of time already allocated for this for the rest of the month, so it's not a problem. Like I said, I don't expect this of anyone else; these are separate things. But I still have, as my objective, to deploy to MSKCC by the end of the year — and it's fine if that doesn't work out. I'm also excited to finally get to deployment, so I'm personally kind of invested in this as well.

@57:30 - Jeffrey Wagner: Maybe a lump of coal in your stocking.

@57:32 - David Dotson: I'll take it. No, but thanks, everyone, for a fantastic nine months of working on this. It's been a pleasure. I know we're not done yet, but I'm very pleased with how far we've come. Thank you all.

@57:53 - Iván Pulido (John Chodera): Cool.

@57:55 - John Chodera (he/him/his): Thank you.

@57:55 - Mike Henry (he/him): 2023!

@57:56 - David Dotson: Yeah, see you. Have a great holiday.

@57:58 - Jenke Scheen (John Chodera): Happy holidays.

@57:59 - David Dotson: We'll see you.