2022-09-13 Protein-ligand benchmarks meeting notes

Participants

  • @David W.H. Swenson

  • @Jeffrey Wagner

  • @Irfan Alibay

  • @John Chodera

  • Jenke Scheen

  • @Diego Nolasco (Deactivated)

  • @Iván Pulido

  • @Mike Henry

  • @Richard Gowers

  • Levi Naden

  • @David Dotson

  • Melissa Boby

Goals

  • DD : announce - conference travel through 9/27; @Jeffrey Wagner will run meetings on 9/20, 9/27

  • DD : fah-alchemy - current board status

    • fah-alchemy : Phase 1 - MVP

    • 2 weeks out from 10/1 deadline for ASAP, biopolymer benchmarking

    • @David Dotson development effort now focused on Executor (FahAlchemyAPIServer), Client (FahAlchemyClient), and Scheduler (FahAlchemyComputeServer)

    • AlchemicalNetwork storage into ResultServer::neo4j fully roundtrips

    • Able to request and fully reconstitute AlchemicalNetworks via HTTP against FahAlchemyAPIServer pulling from neo4j. Can do the same for any other GufeTokenizable, in particular Transformations (edges) and ChemicalSystems (nodes)

  • DS, BR, RG : Protein serialization update (gufe#45):

  • IP : Nonequilibrium Cycling Protocol (perses#1066) update:

  • MH : ProtocolSettings taxonomy (gufe#37) update:

  • MB, IA : protein-ligand-benchmark update (protein-ligand-benchmark#52):

Discussion topics

Presenter

Notes


@David Dotson

  • announce - conference travel through 9/27; @Jeffrey Wagner will run meetings on 9/20, 9/27

@David Dotson

  • fah-alchemy - current board status

    • fah-alchemy : Phase 1 - MVP

    • 2 weeks out from 10/1 deadline for ASAP, biopolymer benchmarking

    • @David Dotson development effort now focused on Executor (FahAlchemyAPIServer), Client (FahAlchemyClient), and Scheduler (FahAlchemyComputeServer)

    • AlchemicalNetwork storage into ResultServer::neo4j fully roundtrips

    • Able to request and fully reconstitute AlchemicalNetworks via HTTP against FahAlchemyAPIServer pulling from neo4j. Can do the same for any other GufeTokenizable, in particular Transformations (edges) and ChemicalSystems (nodes)

  • (DD runs live demo hosted on choderalab server)

    • JC – This is awesome - One risk is, when specifying a transformation, does that object/request/protocol encode the atom mapping? It’s important that the atom mapping be linked between solvent and vacuum for the appropriate cancellation to occur.

    • DD – It’s not fundamentally linked in the data model, but we could ensure that the software that submits these jobs gets both of these assigned the same atom mapping, possibly even by just having them reference the same mapping object in the database.

    • JC – And will there be a way to pull out matching atom transformations under different conditions?

    • DD – Yes, the query language is designed to be flexible and performant, so this should be straightforward.

  • DD – Next steps for me are to continue building out the client so that users can connect to the data server. I think that the results objects will be added later, as well as the strategy components.
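The point about linking solvent and vacuum transformations through one mapping can be sketched in a few lines. This is not the fah-alchemy or gufe API; `content_key` and the dict-based transformations below are hypothetical stand-ins, illustrating how two edges that reference the same mapping object end up pointing at a single node in a content-addressed store:

```python
import hashlib
import json

def content_key(obj) -> str:
    """Deterministic content-based key, loosely analogous to a GufeTokenizable's token."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]

# One hypothetical atom mapping (atom index -> atom index), shared by both phases.
mapping = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5}

solvent_transform = {"stateA": "benzene+water", "stateB": "phenol+water",
                     "mapping": content_key(mapping)}
vacuum_transform = {"stateA": "benzene", "stateB": "phenol",
                    "mapping": content_key(mapping)}

# A store that deduplicates on content keys will point both edges at the
# same mapping node, which is what links the two phases for cancellation.
assert solvent_transform["mapping"] == vacuum_transform["mapping"]
```

Nothing in the data model forces this; it is the submitting software's job to hand both phases the same mapping, as discussed above.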



@David W.H. Swenson Ben Ries

  • Protein serialization update (gufe#45):

  • RG – Two facets to this - Roundtripping of a protein once it’s loaded into a chemical model. This is mostly finished (gufe#45) modulo a few quibbles.

  • RG – There’s a larger question about how to load a protein from PDB. We’ve got an open PR to OpenMM which is limited in scope to standard amino acids.

    • JC – Do you intend for this to always be limited to standard AAs or be extensible later?

    • RG – Probably the latter. We’re thinking that we need a trinity of bond orders, formal charges, elements - should be able to handle a lot but we do need to have some guide rails for more ambiguous cases in the future.

    • JC – what aromaticity model are you using, or are you using kekulized form?

    • RG – Kekulized form

  • JW – related question: how does the OpenMM PR (#3770) correspond to this?

    • RG – we’re trying to upstream the changes we’ve vendored in

    • JW – I think this is a funny game of telephone, where we (Mobley, Chodera) requested bond orders be added to OpenMM but we didn’t make use of it until now; however, it proved to be incomplete for our needs

  • RG – I think longer term I want to rip out PDBFile from OpenMM; it doesn’t need the heavyweight dependencies that OpenMM has

    • JC – plan is to standardize on openmmforcefields

    • JW – if CUDA as a hard dependency was removed, could that help?

    • RG – think we’re actually okay with being the custodian of PDBFile, since we’ll be so dependent on it

    • JC – if OpenFE wants to be the custodian of PDBFile, in particular its processing for things that are missing, that would be a huge help to OpenFF as well

    • JW – yeah, OpenFF tries to hew to the position of not making assumptions on inputs, and PDBFile runs counter to that philosophy; if OpenFE is more comfortable occupying that space, then I’m in support of it (and happy to help where needed)

    • RG – yeah I think we’re in a better position to take that responsibility, and we also need to have the ability to move quickly on making it work for users

    • MH – agree, and think over time we can create some high quality standards through our own iteration process

  • RG – Expecting gufe#45 to be merged by end-of-week.

  • MH – Re: OpenMM PR - We’re close to OMM 8.0.0 - So we may want to make sure this gets in under the cutoff

    • RG – I don’t think it’s essential - We need the GUFE PR, not the OMM PR.

    • MH – True, OMM isn’t too strict about semver so PDBFile updates can come in later.

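The "trinity" of elements, formal charges, and bond orders discussed above can be illustrated with a toy calculation. This is a sketch under simplifying assumptions, not the OpenFE/OpenMM loader; `DEFAULT_VALENCE` and `implicit_hydrogens` are hypothetical, and the formal-charge adjustment only holds for common organic cases (the ambiguous cases are exactly where guide rails are needed):

```python
# Default valences for a few standard elements (a sketch; a real loader
# needs guide rails for many more elements and ambiguous cases).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2}

def implicit_hydrogens(element: str, formal_charge: int, bond_orders: list) -> int:
    """Infer the implicit H count from the element / formal charge / bond-order trinity.

    For common organic atoms the formal charge shifts the bonding capacity
    (N+ bonds like valence 4, O- like valence 1), and kekulized integer
    bond orders fill the remaining slots.
    """
    return DEFAULT_VALENCE[element] + formal_charge - sum(bond_orders)

# Kekulized benzene carbon: one double + one single ring bond -> one implicit H.
assert implicit_hydrogens("C", 0, [2, 1]) == 1
# Protonated amine nitrogen (N+): three single bonds -> one implicit H.
assert implicit_hydrogens("N", +1, [1, 1, 1]) == 1
# Carboxylate oxygen (O-): one single bond -> no implicit H.
assert implicit_hydrogens("O", -1, [1]) == 0
```

Working from the kekulized form, as mentioned above, is what makes integer bond orders well-defined here.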

@Iván Pulido

  • Nonequilibrium Cycling Protocol (perses#1066) update:

  • IP: Perses objects rely on input files, so we need to refactor the code/API so that we can have a pure python library API which can skip the parsing of input files

  • IA: If you’re planning on refactoring HTF, I’ve got some ideas on it if you want to meet about it

  • IP + JC – Yes, would love input on openmmtools and the HybridTopologyFactory

  • JW – would we want interchange objects to interface with HybridTopology objects?

    • JC – perhaps yes, but will take some discussion on standardizations at the engine level

@Mike Henry

  • ProtocolSettings taxonomy (gufe#37) update:

  • MH – I have a meeting with MThompson tomorrow. We’ll evaluate whether the remaining stuff we need to do to get the model to work can be finished by EOW.

  • MH – Chatting with LNaden was really helpful - Discussed that, in addition to helpful pydantic education initiative, it would be great to emphasize educating trainees into something like messagepack serialization.


  • (General) – Data serialization can be highly performant if needed - by a combination of using fixed units in the data model, optimizing data structures for known fields in serialization.
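As a rough illustration of that last point: fixing units in the data model means values can be packed as raw doubles in a known field order, with no per-value unit metadata. MessagePack itself is a third-party library, so this sketch uses the stdlib `struct` module to show the same idea:

```python
import struct

# Positions stored in fixed, implicit units (say nanometres), so no
# per-value unit strings are needed; field order is fixed by the schema.
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]

n = 3 * len(positions)  # three coordinates per atom
packed = struct.pack(f"<{n}d", *(c for p in positions for c in p))
# 3 doubles per atom, 8 bytes each -- far denser than JSON with unit strings.
assert len(packed) == n * 8

unpacked = struct.unpack(f"<{n}d", packed)
roundtrip = [tuple(unpacked[i:i + 3]) for i in range(0, n, 3)]
assert roundtrip == positions
```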

Melissa Boby

  • protein-ligand-benchmark - 0.3.0 update

  • MB – Brief update is that all of the files + systems that we decided to keep (i.e. which didn’t have egregiously different assay conditions) are done being prepped using the Schrodinger command-line tools for consistency. IA is going to run through and check this. IP, you had some comments?

  • IP – In the recent commits, some binary files were uploaded, these are a few hundred MB. Are these needed? Or are the scripts a 1:1 mapping to these output binaries?

    • JC – Could upload these large files as an artifact and attach them to the release? This is assuming that we can quickly regenerate things using the latest version of the Schrodinger suite.

    • MB – Looks like the big files are docking files and grids. I’d think that we just need to keep the input scripts, reference ligand, grid+docking input+execution scripts, maegz docking results.

    • JC – So how about we store the following:

      • In git: input scripts, (with docs), pdb and sdf files

      • In release tarball: Docking grids, docking outputs, other raw results

    • JC – Issues with repo size after addition and deletion?

      • DD – If we squash-merge this won’t be an issue.
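The proposed split could be automated along these lines. This is a hypothetical sketch, not part of protein-ligand-benchmark; the `KEEP_IN_GIT` suffix set and file names are illustrative only:

```python
import tarfile
import tempfile
from pathlib import Path

# Hypothetical split: small, portable inputs stay in git; bulky docking
# grids/outputs go into a release tarball attached as an artifact.
KEEP_IN_GIT = {".py", ".pdb", ".sdf", ".md"}

def build_release_tarball(source: Path, out: Path) -> list:
    """Pack everything that should NOT live in git into a .tar.gz; return packed names."""
    names = []
    with tarfile.open(out, "w:gz") as tar:
        for path in sorted(source.rglob("*")):
            if path.is_file() and path != out and path.suffix not in KEEP_IN_GIT:
                arcname = str(path.relative_to(source))
                tar.add(path, arcname=arcname)
                names.append(arcname)
    return names

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "prep_inputs.py").write_text("# input script\n")
    (src / "grid.zip").write_bytes(b"\x00" * 16)       # stand-in for a docking grid
    (src / "results.maegz").write_bytes(b"\x00" * 16)  # stand-in for docking output
    packed = build_release_tarball(src, src / "artifact.tar.gz")
    assert packed == ["grid.zip", "results.maegz"]  # inputs stayed out of the tarball
```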

  • IP – I saw that some of the ligands changed - I know that we’re using a single ligand file - But with CDK8 there’s one more ligand.

    • MB – I think this is OK - I think it’s just that the commandline prep may have resolved a water clash or something and let one more ligand be docked.

    • IA – For the sake of reproducibility, could we rerun the scripts and ensure that we get the same number of ligands as output?

      • MB – Can do. We shouldn’t expect different results - I explicitly used glide XP because it shouldn’t be stochastic - This is probably due to the prep of the active site.

    • MB – Is it just CDK8?

      • IP – I didn’t check others.

      • MB – I think I saw it happening in a couple cases and noted this, could find this note if needed.

      • IP – I think it could be a good thing if we have

    • JW – Is this “more ligands relative to the Hahn benchmark” or something else?

      • MB – No, it’s “more ligands relative to earlier states of this PR” - Could be due to active site waters now getting removed or better handling of tricky edge cases of atom mappings during constrained docking.

    • IA – I’ll go through these shortly.


Transcript

F@H interface meeting - September 13
VIEW RECORDING - 64 mins (1 min of highlights)

 

@00:00 - David Dotson

Okay, thank you. We'll jump into board status. So I've updated the in progress column. So we've got everything that we're currently tracking here that we'll discuss in the agenda is a ticket on the board as well.

So all the work you folks are doing is also tracked here since it's relevant for this project. So thank you for that.

I'll talk a little bit about what I'm currently doing. So, just as a reminder, we're two weeks out from the October 1 deadline for ASAP.

I think we may still be optimistic, but we may actually be able to make this, at least for some kind of working system.

I'm optimistic, but we'll see. Let me pull up this diagram, which shows all the components here.

So I'm currently working on the API server as well as the clients that users or any other automated system will use to submit alchemical networks to the system.

And then I'm also working on the compute server components. I have Neo4j working with full round trips.

So we're able to store alchemical networks and pull them back out. And that was the key component, because that is the state of the system.

So that seems to be working just fine. Now we're also able to request and fully reconstitute alchemical networks via HTTP coming out of the API server.

So we're able to request things from Neo4j. We're able to pull them via the API server. And so that's a key component for users: they're going to submit their networks.

They also need to be able to pull them back, and that allows them to iterate; they can build new networks based on the existing ones.

I wanted to do a brief demo just to show kind of the stuff working, because it actually is working, if you'll bear with me.

So on the left, for this first notebook, I'm going to kind of breeze through this, so apologies. I'm just going to show shoving things into Neo4j.

So this is not the user view of it, but this is: I'm building an alchemical network, and then I'm popping it into our database.

Okay, I'll do this in a second. I want to spend no more than ten minutes on this. Anyway, this database is running on mskcc2. So that's a Folding@home work server in the Chodera lab environment. That's what we're using as our dev host for all of this infrastructure.

I can create an alchemical network, as we've done before. In this case, I'm building an alchemical network of benzene modifications.

Toluene and phenol, benzene and a few others. And I'm building two sets of chemical systems, one set that's solvated.

So I'm basically saying, here we're sticking our ligand in, we're setting the solvent to be water. And so this is a set of just solvated benzene variants.

We'll also create vacuum systems. So this is just ligand, no solvent at all. And so we've got a set of vacuum systems.

And now I'd like to actually create a star map for each set of these things with benzene centered. What I'll do is I'll connect those star maps with a transformation.

So the transformations are edges in these networks, and those have a protocol. In this case, this is currently a placeholder for the protocol we actually want to run in production, which is the OpenMM nonequilibrium cycling protocol.

We can talk about the name later if you want. Anyway, we build up our solvent transform. So this is a transformation.

Given our benzene solvated system, that's the list. And then the other one, as we iterate through all the solvated systems, we can do the same thing with the vacuum transforms, and then I could actually connect these two with a so-called absolute transform.

I'm going to slap this protocol in. For now this is just a placeholder, but ideally, if we have an experimental result for the solvation free energy of benzene, we could pop that in instead.

So one thing we'll probably put on the list is a protocol for experimental results, so that they sort of behave like a protocol, but we don't actually compute anything.

They just come back with the experimental results. But anyway, I'm just using this to connect these two star maps, the solvated ones and the vacuum ones.

And so I can take all these transformations that I've put together here, just these edges: the solvated transforms, the vacuum transforms, the absolute transform.

Slap a name on it and I'll create my network. We can visualize it if we want. So it is indeed two star maps with one connection in between.

And now we can try serializing these things. So I'm going to go ahead and delete my full database. I'll reconstitute in a second.

So I have this Neo4j object, this Neo4jStore. This is in fah-alchemy. So this is one of our objects; I can call create_network.

I can feed in the network we just created here locally. I'm going to slap a few labels on it and say, this is for OpenFF.

This is for this campaign one. This is for this project Tesla. These are just scoping things that are going to be important.

Since the system's going to support multiple organizations, multiple campaigns, and probably multiple projects. It also has an impact on deduplication.

If I try creating the same network again, I'm going to get an error, and that's deliberate. This tells you that this already exists.

So you should do something else about that. Or if that was intentional, then you could use Update network instead.

And what this will do is it will merge: it will take any objects that already exist in the database and just say, these already exist, I'll use those. So in this case it's an idempotent operation. You can run it over and over again if you want.
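The create/update semantics described here can be sketched with a toy in-memory store. `NetworkStore` is hypothetical and merely stands in for the Neo4j-backed implementation: create fails loudly on duplicates, while update merges by key and is idempotent:

```python
class NetworkStore:
    """Toy in-memory stand-in for the Neo4j-backed network store."""

    def __init__(self):
        self._nodes = {}  # content key -> object

    def create(self, objects: dict) -> None:
        """Strict insert: error out if any key already exists (deliberate duplicate guard)."""
        clashes = set(objects) & set(self._nodes)
        if clashes:
            raise ValueError(f"already exists: {sorted(clashes)}")
        self._nodes.update(objects)

    def update(self, objects: dict) -> None:
        """Merge semantics: reuse existing keys, add new ones; safe to repeat."""
        for key, obj in objects.items():
            self._nodes.setdefault(key, obj)

store = NetworkStore()
store.create({"benzene": 1, "phenol": 2})
store.update({"benzene": 1, "toluene": 3})   # no error, merges
store.update({"benzene": 1, "toluene": 3})   # idempotent: same state
assert sorted(store._nodes) == ["benzene", "phenol", "toluene"]
```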

The effect of this is that if I hop over, just to visualize, this is what our database looks like right now.

So this is what we just put in just now. This is our alchemical network object. It's connected to every other object in the system, including these transformations.

Those are the edges of our network. I should mention the reason why: transformations are edges in our network model,

but they are actually nodes in Neo4j, and that's due to the fact that they need to point to other nodes.

You could think of this graph here as a representation of all the gufe objects. So each node is a gufe object in and of itself.

So it's basically a visual map of what we've put together in memory. And these carry along with them things like org, project, campaign, so those are queryable components. I can go a little further. I can create a new network of just the solvent transforms, and I could submit that. If I run this again,

we'll see that we've now got two distinct networks in our database. So this is the one we just added down here; it only has solvated cases. If I go ahead and say, alright,

what if I add in the vacuum transforms? We can't modify networks in place, we can only add new networks. So notice here that I'm creating an alchemical network with vacuum transforms and submitting it, and so now, on the left, here's what we just submitted.

Here's one of the alchemical networks; this might be the solvated case, and then here's the other one. This is the vacuum case.

Notice they're connected by their protocol. So all these things are deduplicated. And we could go a little bit further and say, let's add back in an absolute transform.

This is the transform we just added in right here, so you can see it. This alchemical network points to a single transformation object, but it points to the pre-existing ones.

This is benzene in water. This is benzene in vacuum. And so we put things in, we can pull things back out.

So I can say query_networks, and I could use the name here. And this will get me back any network that matches that query.

In this case, it's just a single one. So I can go ahead and draw it. We can pop that out; we pull directly from Neo4j.

In this case, just to show you that we're actually pulling from the database. I can do this from a fresh process.

So this is a completely distinct notebook and python interpreter. So I can connect to my database. I can query just for all networks in this case.

And this will pull them all in. It will build the gufe objects from them. So it will recursively walk those graphs that you saw in the visualization.

It will pull up all the objects and will deserialize and build Python objects under it. I can query by name if I want, and I could draw it again, looking at different pieces of it.

We can also query just transformations. So if I was only interested in transformations, that is, edges of the network, I could query them with these queries.

And then I can also do the same thing for chemical systems. In this case, I get all chemical systems.

If I don't put anything in there, I'll add pagination and things like that later. But this is the basic idea.

The last thing I'm going to show: this was connecting directly to Neo4j.

This is not what we're having users do. What we're having users do is something closer to making HTTP requests to our service API.

So just to look at the architecture again real quick, so users will talk through a client, which will be a Python client.

They'll talk directly to this API server, which itself will talk to Neo4j. I've got this in here.

We may remove this; it may be necessary, we'll see. But the point is, users don't talk directly to Neo4j.

They talk to this. And so to show that I don't have an Alchemy client yet, but I do have requests,

And so I could say request get I've got a fast API instance running at this port and I've got here it is.

So it has an endpoint called networks. And we can say, okay, if you ask for network with this name, this query parameter, say benzene Barry installation, we can tell the API to go fetch from the networks with that name and then yield the dictionary forms of these and split these back and they'll get serialized into Jason and then sent back to us.

So over here I can make the request, and then here, this is stuff that would happen in our client that we've yet to build, but this is showing the basic idea.

We get back the single alchemical network, and so we're able to fully fetch the thing that we put into the database. Anyway, that's the end of my demo.

Just wanted to show you where we're at. Any questions for me?

 

@13:03 - John Chodera (he/him/his)

This is fantastic. First of all, I think everybody is extremely impressed by it. But the only risk I was looking at is: when specifying a transformation, is it the case that the atom mapping and all of the other details of that transformation are already fixed and frozen in that request, or encoded in that protocol?

Because it's important that the atom mapping, for example, be linked between the different phases, so between solvent and vacuum, in order for the appropriate cancellation to occur.

 

@13:39 - David Dotson

Hold on, let me think about that for a second. So between solvent and vacuum?

 

@13:44 - John Chodera (he/him/his)

Yeah, it just has to use the same atom mapping, but I think that's already frozen in the transformation object, right?

 

@13:51 - David Dotson

Is that correct? Yeah. Let me just pull up a visualization here. So this is visualizing more the alchemical network itself.

So this is our two star maps. Basically, there's two of them because we did two different submissions here in my demo.

So over here we've got benzene in vacuum. Over here we've got benzene in water. What you're saying is that if I want to do, let's see, let's do anisole in vacuum.

If I want to do the solvation free energy for this, I need to be using the same mapping for this transformation as I am for this transformation.

 

@14:36 - John Chodera (he/him/his)

Is that what you're saying? Correct. Yeah. They need to be generated with the same mapping. So I presume, though, that's already encoded in the configuration of that transformation object, so they can be emitted at the same time by a higher level process that generates the atom mapping at the same time?

 

@14:52 - David Dotson

Yes. So what I would say we would do in practice is that when you create these transformations, you'll notice I put in mapping=None, because these are just placeholders for now.

But what you would do as a user is to use the same mapping for those corresponding pieces. And that works because the mapping would be a gufe object.

Richard, you can correct me if I'm wrong. These would actually point to the same, you can't see it here, but if I double click these, you can see everything they're connected to.

They would point to the same mapping in the database, right?

 

@15:29 - John Chodera (he/him/his)

Exactly.

 

@15:31 - David Dotson

So that would work. So functionally, that would work. Now, there's nothing about our model that says you must do that, but as a user, you would have to do that to achieve that effect.

 

@15:41 - John Chodera (he/him/his)

There could be other alchemical methods that we support in the future that aren't based on atom mappings and don't have that constraint.

But the other part of that is just, when doing the analysis, you'll need to be able to extract the paired transformations so that you can analyze them together. But presumably

there's also going to be a way to do that, to make sure you can fish out matching pairs of these at the end.

 

@16:01 - David Dotson

Exactly. So that's something that the data model is intended to support. Let me just rerun this to make this clear.

If you wanted to calculate the solvation free energy of anisole, then you have to take both this transformation and this transformation and then take a difference.

Right. And piecing all of those together is something that should be a downstream process.

 

@16:42 - John Chodera (he/him/his)

Yeah, I think it's just a matter of offering the right way to iterate over the objects and the results in the downstream analysis.

 

@16:48 - David Dotson

But I think it's fantastic. Okay, cool. Yeah. As long as the data model as we currently conceive it is capturing what you think it needs to capture, then I think we're in good shape.

Any questions for me? Okay, thank you. I think we're at the 22 mark, so we've made it. Next steps for me then, is to continue on this path.

So I need to continue building out the API server that serves the HTTP requests.

I'll be working on a client that gives the Python interface to that. That's what users will be using. So Jenke, for example, you'll be using the client to submit alchemical networks to the system, and then I'm building out anything we need on the Neo4j side to give the client what it needs.

And then I'm also working on the compute component. So that will also be talking to the API server to get what it needs, the way results are going to be handled and all of this.

I still need to devise an approach for how strategies get processed.

So when we submit an Alchemical network, we don't just submit the network, we also submit a strategy for how to compute it that needs to make its way into here.

And what I don't have represented here yet is basically a concept of like a strategist, something that is a process that runs continuously and chooses which transformations to run next.

Based on the strategy coupled to a given alchemical network. But our storage system should be able to support all of these complex relationships.

This is why we made this choice of using Neo4j, and we should be able to dangle results off of these graphs with different relationships.

So anyway, I just wanted to lay that out that there is a path forward here and it's looking really good.

Okay, I'll yield. Any final questions for me?

 

@18:59 - Jeffrey Wagner

A lot of questions. But this is super impressive. It's really neat to see this coming together. And I think a lot of the other folks agree in the chat.

 

@19:08 - David Dotson

Oh, I didn't check the chat. Okay, cool. Yeah, great. I'm happy to hear that. We put a lot of effort into the data model precisely for this reason.

Right. These gufe objects are very easy to just pass around because of the approach we've taken for how they're represented, how they are connected to each other, and how we can serialize and deserialize.

This is why we spent so long on it, because it makes this far simpler. The code base for what I just showed you, the Neo4jStore, is less than 400 lines long, and most of that is probably due to black just putting things on one line.

So converting to and from Neo4j has been remarkably complex. There's a lot of little details here because the models aren't exactly the same.

But, yeah, it works. So I'm free to be spoken. So thank you to everyone here who's been involved. So all the work we've been doing has made this tail end far easier.

We're not done yet, but it's exciting. Okay. Sorry, Swenson. This is actually Ben. Is Ben here?

 

@20:29 - Iván Pulido

So I wanted to mention, I invited Melissa. Maybe we can talk about protein-ligand benchmark.

 

@20:37 - David Dotson

Let's do that first. Hey, Melissa. Thank you for joining. So Melissa and Irfan, actually, do you guys want to give an update?

 

@20:51 - Melissa Boby

Sure. Give me just a second. So I guess the brief update from me would be that all of the files,

so all of the systems that we decided to keep, so the ones that didn't have very egregiously different assay conditions from our prep conditions, are done being prepped using Schrodinger's command line interface for the protein prep wizard, so that it's replicable.

All of that has been pushed to the repo now, and I think Irfan's just going through it, checking it now.

I know, Iván, you said you had some comments to be addressed.

 

@21:40 - Iván Pulido

Yeah, I guess there are two things. One is, in the latest commits you uploaded binary files, and there are a few hundred megabytes.

Let me see.

 

@22:01 - Melissa Boby

It is hard to check the changes because it's so big, but it's all of the files that I uploaded that had the prep stuff, right?

 

@22:20 - Iván Pulido

Yeah. So one question is, do we need these files in the repo, or is it enough to have the command line arguments? If these are like a one-to-one relation, we wouldn't need the binary files.

Is that correct?

 

@22:42 - John Chodera (he/him/his)

What about putting the binary files in an artifact, so when the whole release is done, you can take all the binary files and shove them up as a giant downloadable zip file?

Shove it into the artifact; would that work? Presumably the process for releasing will be to automate the re-prep of everything with the latest version of Schrodinger, which is now possible, given that there's a command line version of that, if I understand correctly.

 

@23:20 - Melissa Boby

I think that should be reasonable.

 

@23:22 - John Chodera (he/him/his)

I'm just pulling up the PR now, because if they're huge, we don't want to go back to the large files, and we don't need them for checking out the repo and doing other stuff for replicating.

So they're just there for convenience, for exploratory examination, right?

 

@23:41 - Melissa Boby

Basically, yes. So the files that I have included are basically the directories in which I have all the accessories, the input files, the executable script and the logs and reference ligands, and everything related to the docking, the grid generation.

Those are the two big ones, actually. That's what's taking up so much space.

 

@24:07 - John Chodera (he/him/his)

The input scripts and the output PDB and SDF files. I think that's the ones we really need to keep, necessarily.

 

@24:14 - Melissa Boby

Yes. So the input script and the reference ligands would be the ones we need to keep, and the grid, so that's for the docking.

We need the input script, the reference ligand. And for the grid generation, I think the input script, the execution script if you want it, and then maybe the maegz file, because it's based on coordinates within that particular file.

So you wouldn't be able to just necessarily apply those coordinates to any random 1H1Q PDB, right?

 

@25:09 - John Chodera (he/him/his)

Sorry. The coordinates originally came from prepping the PDB file, though, with the script.

 

@25:14 - Melissa Boby

That's true, actually. Yes. I guess as long as that's the same, it should be fine. So just the input scripts and reference ligands for the docking should be fine.

 

@25:24 - John Chodera (he/him/his)

Would everybody be okay with just keeping the, I think we just have to make a list of this, but the input scripts that did all the preparation, obviously, the documentation about that, the PDB and SDF files that are portable, being the ones that are retained in the repo.

And then the rest of it can get pulled into a giant tarball of, here's all the log files, and here's all the intermediate files and final maegz files that you can inspect.

If you really want to figure out how this specific thing ended up for this release. But usually those aren't what you're working with.

So this would get uploaded as a big zip file artifact.

 

@26:12 - Iván Pulido

Yeah, I agree with that.

 

@26:19 - Melissa Boby

Cool.

 

@26:20 - John Chodera (he/him/his)

I can just pull the relevant files then, and then you can pull everything together into some useful directory structure and compress it.

But if there have been big files that are checked in, you have to be a bit careful about how to undo that, because otherwise the git history becomes enormous.

And even if you've removed them, then it still takes forever to check out the repository.

 

@26:46 - David Dotson

What we'll go ahead and do is we'll squash merge this, and so that should help. So I wouldn't worry too much about trying to rewrite history, but if we can make whatever the last commit is,

The contents as compressed as possible. That would help. Is that what you're getting at? Sorry, John. Melissa?

 

@27:12 - Melissa Boby

Yeah, I think that's the best.

 

@27:15 - David Dotson

Yeah. We'll squash merge at the end, so we should be okay.

 

@27:19 - Iván Pulido

Yeah. And I think the branch will be deleted after merge, right?

 

@27:24 - David Dotson

Correct, yeah. So, Melissa, do you feel like you have what you need from this discussion?

 

@27:35 - Melissa Boby

Yes, that should be fine. Iván, did you have any other comments?

 

@27:41 - Iván Pulido

The other comment is more about... so, I checked. I'm sorry that I didn't write the feedback on this in the thread.

I just haven't had the time. But I found that some of the ligands changed. I know we're now using a single ligand file for all the ligands, but I saw that the number of ligands also changed; namely for CDK8, one ligand was added compared to before this commit. So I wonder why that is, and whether that's expected and okay.

 

@28:25 - Melissa Boby

I think it's fine. I can pull up CDK8 really quickly to see what it looks like. I suspect it has something to do with the... do you want me to share my screen?

Yeah, I can share my screen, just let me pull it up. I have to get CDK8, which means switching my Schrödinger environments; just a moment. I can't have too many projects open in Schrödinger or it kind of freaks out.

I suspect what's going on there is just that with the command line prep, the active site might be just a little different.

We might have a water that is no longer in the way, or something like that. So we might have gotten one more ligand to actually dock successfully, and that is probably what's going on there.

So it doesn't concern me that we have one extra ligand popping up in CDK8, which otherwise seems fine.

SCREEN SHARING: Melissa started screen sharing

 

@29:41 - Irfan Alibay

Sorry, go ahead. I was going to say, for the sake of reproducibility, can we just rerun that and make sure that if we rerun your scripts, you still get the same number of ligands?

Just to make sure. I don't know how expensive this is; I'm just wondering, since we did all this work to make it command-line driven,

if it's actually stochastic, then we have a problem.

 

@30:03 - Melissa Boby

I can do that right now, actually. So it shouldn't be stochastic. I know some docking methods are; I specifically chose this one because it explicitly is not stochastic.

So unlike some of the OpenEye docking, where you have to set the seed, it should be fine.

It shouldn't be stochastic; I actually double-checked that it wasn't. So I think this is just a difference in the prep of the active site, that maybe something moved a little differently.

But I'll check that before the end of the meeting, actually. It'll take 5 seconds. Is it just that one?

 

@30:57 - Iván Pulido

I haven't checked the others, sorry, but I noticed that one. I can run, like, a quick script to check this and let you know in a few minutes.
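A quick check like the one described here can be done with the standard library alone, assuming V2000-style SDF records separated by `$$$$` with the ligand name on each record's title line (the file layout is an assumption, not confirmed from the repo):

```python
def sdf_ligand_names(text: str) -> list[str]:
    """Return the title line of each record in a (multi-record) SDF string."""
    names = []
    for record in text.split("$$$$"):
        record = record.strip("\n ")
        if record:
            names.append(record.splitlines()[0].strip())
    return names

def compare_ligand_sets(old_sdf: str, new_sdf: str) -> tuple[set, set]:
    """Return (added, removed) ligand names between two SDF file contents."""
    old = set(sdf_ligand_names(old_sdf))
    new = set(sdf_ligand_names(new_sdf))
    return new - old, old - new
```

Running this against the SDF from the previous push and the current one would surface exactly which ligand appeared in CDK8.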

 

@31:13 - Melissa Boby

Okay. Yeah, I would not be surprised. In fact, I think I did note that it happened in a couple of cases and it wasn't a cause for concern for me, because they all looked good.

 

@31:27 - Iván Pulido

Okay. Yeah. For me, in my opinion, it's better that we get more ligands, of course. But I just wanted to check; I mean, maybe what Irfan commented on, reproducibility.

That was one of the concerns.

 

@31:43 - Melissa Boby

Yeah, no worries.

 

@31:49 - Jeffrey Wagner

When we say more ligands, just for my understanding: this is relative to the old state of the set,

Like the year old state of the set and not just like a more recent state, right?

 

@32:06 - Melissa Boby

No, more ligands relative to the last push that I had done with this, because we had some fallout of some of the ligands during the docking stages.

Mostly due to an issue where we're using the maximum common substructure from the reference ligand to constrain the docking positions.

We're constraining that overlap to a maximum 0.25 angstrom difference. And so there are some ligands that fall out due to not being able to get into that position.

But more often than not, I think what I'm seeing things fall out for are instances where the MCS, like the core, actually is broken as a SMARTS string.

So if, for instance, you have a pyridine that's substituted at the four and the three position, and you have a pyridine that's substituted at the one and four position, or two and four position, with the same substituents,

it has a different core substructure now, and so it gets confused, and it isn't always able to do the alignment.

And there is perhaps a way to get around this that I'm playing with right now. But that's just a limitation of the docking methodology that we're working with at the moment.

 

@33:30 - Jeffrey Wagner

Okay, cool. Yeah. Then it's awesome to hear that we're getting more ligands. Thank you.

 

@33:37 - Melissa Boby

Sure thing.

 

@33:42 - David Dotson

Excellent. Irfan, as a reviewer, do you have anything you need?

 

@33:47 - Irfan Alibay

Not yet; I've not had time to look into things yet. I'm technically going back to work tomorrow.

 

@33:55 - David Dotson

No worries.

 

@33:58 - Irfan Alibay

I joined in just to make sure there weren't any burning fires, really; tomorrow I'll go through it.

I suspect everything that's left to do now is: finish some questions about the SDFs. What I might do is just add some CI bits to make sure that everything loads properly, at least in OpenMM and RDKit, and then essentially the PDB fixes from #58, and then hopefully that should all be fine.

 

@34:25 - David Dotson

Fantastic. Thank you. Thank you both. Any additional questions or comments?

 

@34:36 - Melissa Boby

No, not for me.

 

@34:37 - David Dotson

All right, thank you so much.

 

@34:39 - Melissa Boby

Sure thing.

 

@34:40 - David Dotson

I'm going to share my screen. I don't think we've got Ben here, but David or David can take it.

 

@34:55 - Richard Gowers

So, yeah, protein serialization sort of has two facets to it. There is the roundtripping of a protein

once you have it loaded into some sort of chemical model. I think that's finished; there's a PR that's mostly finished.

I think we've just got a few quibbles over code style and that sort of level, but that's essentially finished.

And then, related to this, there's the question of how you can load a protein in from a PDB file, because that's a nontrivial operation.

So we're also looking at essentially an on-ramp to this sort of ideal version of a protein that we want.

I think currently we've got a solution using OpenMM working, but we're just checking it's completely viable. This is sort of limited in scope to standard amino acids, obviously.

Good. I think that's also getting merged in the same PR.
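The roundtrip property described above (serialize a component to a dict, rebuild an equal object) can be sketched generically. `ProteinComponent` here is a toy stand-in, not the actual gufe API:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ProteinComponent:
    """Toy stand-in for a serializable chemical component."""
    name: str
    elements: tuple   # e.g. ("N", "C", "C", "O")
    bonds: tuple      # (atom_i, atom_j, integer_bond_order) triples

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "ProteinComponent":
        return cls(name=d["name"],
                   elements=tuple(d["elements"]),
                   bonds=tuple(tuple(b) for b in d["bonds"]))

def roundtrips(obj) -> bool:
    """The property under test: from_dict(to_dict(x)) == x."""
    return type(obj).from_dict(obj.to_dict()) == obj
```

The useful part is the property itself: a fixed dict representation means the same check can run against JSON, a database, or any other backend.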

 

@36:05 - John Chodera (he/him/his)

When you say limited in scope to the standard amino acids, do you mean that there's a route to making it usable for the whole thing in the future?

Right.

 

@36:14 - Richard Gowers

Or do you intend to swap it out wholesale but keep the API? Probably the latter, because I think the API for allowing arbitrary anything will not be the same API we currently have.

What we've currently done is we've managed to make OpenMM's PDBFile assign bond orders to the structures that we put in.

And I think we can also get charges for the standard amino acids at least off the back of that.

And so that will give you a data representation of the protein that has sort of your chemical graph, your bond orders and your charges.

And that's sort of the holy trinity of what we think of as complete. Obviously, if you start doing weird and wonderful things, OpenMM can't properly assign all of that, and you'll need to start thinking of a different API where you can supply extra information about the nonstandard stuff in your system.

I think that's a fair compromise. I think we're going to try and make that as easy to do as possible, because I think there are a lot of people in the world that just want to throw a PDB file at it and have it magically work somehow, and we're going to try and make that dream come true.

But there's also going to have to come a point where things are ambiguous and we have to sort of say, no this is ambiguous.

You have to sort of specify these parts of the system for us somehow. But for now, we've got the standard stuff done, I think.

 

@38:04 - Jeffrey Wagner

Could you clarify? You said we think that we need a trinity of bond orders, formal charges and something. And I missed the last something.

 

@38:12 - Richard Gowers

Formal charges, bond orders, chemical graph, maybe, is the thing.

 

@38:16 - John Chodera (he/him/his)

I'd say topology: it's just elements and bonds with bond orders. And so the big question I also have is: what aromaticity model are you using for that?

Or are you using the Kekulized form that's portable between different toolkits?

 

@38:32 - Richard Gowers

I think we're talking about the Kekulized form.

 

@38:35 - John Chodera (he/him/his)

Okay. Alternating integral bond orders.

 

@38:38 - Richard Gowers

Yes.

 

@38:41 - John Chodera (he/him/his)

And the last one is the formal charges per atom.
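A toy illustration of that trio (elements, integer Kekulé bond orders, per-atom formal charges) for a carboxylate fragment; the data layout is invented for illustration and is not any gufe or OpenFF structure:

```python
# Illustrative fragment: a carboxylate group in Kekulé form.
# Each atom: (element, formal_charge); each bond: (i, j, integer_order).
atoms = [("C", 0), ("O", 0), ("O", -1)]
bonds = [(0, 1, 2),   # C=O double bond
         (0, 2, 1)]   # C-O single bond on the charged oxygen

def is_fully_specified(atoms, bonds) -> bool:
    """Check the 'trinity': elements, integer bond orders, formal charges."""
    atoms_ok = all(isinstance(e, str) and isinstance(q, int) for e, q in atoms)
    bonds_ok = all(isinstance(o, int) and o >= 1 for _, _, o in bonds)
    return atoms_ok and bonds_ok

def net_charge(atoms) -> int:
    """With formal charges stored per atom, the net charge is unambiguous."""
    return sum(q for _, q in atoms)
```

The point of the Kekulized (alternating integral) form is exactly what the check encodes: every bond order is a plain integer, so no toolkit-specific aromaticity model is needed to interpret the graph.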

 

@38:52 - Jeffrey Wagner

I know that you and Ben and I have a quick chat about this scheduled for tomorrow. Are you open to doing...

doing formal charge in that? Okay, so we're talking about two things: there's the OpenMM PR that you guys have open, and there's the gufe PR stuff.

Will both of these have elements, bond orders, and formal charges, or will one of them only be working with bond orders?

 

@39:23 - Richard Gowers

No, I think the OpenMM PR is an attempt to upstream the changes we made. So we sort of temporarily vendored in a large part of OpenMM.

And because it was useful to us, it's nice to upstream that. I think they are identical,

but don't quote me on that, because I didn't write either of the PRs.

 

@39:44 - David Swenson

Okay, I'm not entirely sure, but I don't know if the formal charges aspect is in there; I think that may be in our thing.

 

@39:52 - Richard Gowers

That might be, because OpenMM has put the formal charges in a different place in their data structure. So that's on the...

 

@40:01 - David Swenson

And that's something that comes later with OpenMM, right? So the OpenMM Topology object had a slot for bond orders that was not being used, and the key thing we're trying to upstream there is to use that slot for the known residues.

 

@40:21 - Jeffrey Wagner

I think this is a funny historical game of telephone, where I think Mobley and Chodera years ago asked for the bond order slot to be put in, possibly before they knew the importance of also having formal charge. And now that we've had years of experience, we're like: oh,

we also need this. But you're seeing a historical artifact from an earlier state and inferring very logically from it.

 

@40:46 - David Swenson

From seeing that the slot existed but was unused?

 

@40:50 - John Chodera (he/him/his)

Correct. That's why we failed to make it work; we forgot the formal charges.

 

@40:56 - Richard Gowers

I think this is all plumbing-level; adding another field doesn't break any sort of API things. It's not really a drama to add an extra field, but we'll see.

 

@41:05 - David Swenson

Well, and again, even at the binary interface it just had an extra field that was always unset; it was there already.

 

@41:14 - Jeffrey Wagner

Yeah. If the purpose of the bond order stuff is just to upstream some nice changes but this isn't the final state, then I'm totally fine with that.

But if Peter has a limited number of API updates or behavior changes that he's willing to make, and we're only given a budget of one behavior change, then it could either have formal charges or not have formal charges.

And I'm going to push pretty hard for it to have formal charges, because short of that...

 

@41:43 - Richard Gowers

I think we get in a lot of trouble if we don't end up in a state where that's in, longer term.

I think eventually I'm going to want to rip PDBFile out of OpenMM, because it is in OpenMM, and it uses

a sort of horrible dependency stack. And I think the reading of the PDB files shouldn't require that; it's not using any of OpenMM to actually do any of that.

And so I'm kind of minded to do that. That might be a bit controversial, but I think that's sort of what we might have to do longer term.

 

@42:18 - John Chodera (he/him/his)

The OpenMM long-term vision is to replace all of the force field and file reading stuff with the Open Force Field tools.

So please don't depend upon OpenMM; the ultimate version is all meant to standardize on the Open Force Field tech.

 

@42:39 - Jeffrey Wagner

Okay, well, we should figure that out, because my long-term vision was relying on the OpenMM PDBFile class.

 

@42:47 - David Swenson

All right, let's make that a separate repository. Right, this is what we suggested before, and I think that in the short term you can do that and actually test

against the code in OpenMM. So if there are any changes upstream, you see that; in the long term, then, we can reverse the dependency order.

 

@43:08 - Jeffrey Wagner

So it's not the worst thing. I mean, the shorter path, with, I think, a nearly equivalent outcome: Richard, you said the primary issue is the onerous OpenMM dependency stack.

If CUDA were removed from that stack, would that also be a good solution for you?

 

@43:26 - Richard Gowers

Yeah, sort of. But then I also don't like that you have these builds which sort of don't have CUDA, but then you might sometimes need it,

and you have to switch between the two. Also, OpenMM's unit system: we're trying to sunset that too, aren't we,

and replace it with maybe openff-units? I think the whole Vec3 thing is kind of annoying when you want to work with NumPy arrays.

It's just a lot of niggles; if we just ripped out PDBFile, it would make our lives easier.

I think we're happy to maybe be the custodians of PDBFile, the class with capital PDB, if you guys are sort of scared of taking something on, because we need it more than anyone, right?

We need to read these stupid files.

 

@44:08 - John Chodera (he/him/his)

They need to end up as an OpenFF Topology to be able to assign parameters. There's an intermediate processing step, right?

Right. You read a PDB file that's missing critical information, and that needs to be prepared into some format that has fully specified atoms.

So it could be that if Open Free Energy wants to be the custodian of the modeling part of the pipeline, to complete the missing bits and then hand off to Open Force Field, that would be a great division of responsibility.

 

@44:37 - Jeffrey Wagner

Yes, I'm super interested in this as soon as we have formal charges and bond orders, but I think one of our sort of guiding principles has to be that our software can't infer stuff; atom types don't exist for us.

The minute that we start doing things based on atom types or residue names in the Open Force Field world, everything becomes much, much more complicated, and we start having a second entire design philosophy, which creates a lot of complexity.

So, yeah, I think this basically means I'm in agreement. If you guys are up to take the lead on maintaining it, then I would love to help build this stuff, and I can offer advice on ensuring that it will be interoperable.

 

@45:36 - Richard Gowers

Sounds good. I think maybe we're a bit more pragmatic about the fact that we're going to have a bunch of people that have PDB files and want them read in, and they don't really care about our quibbles over technicalities.

We might have to be a bit more aggressive in making things work than you get to be. But yeah, I get your point of view.

 

@45:56 - Mike Henry (he/him)

Yeah. And I think because we're early in our development cycle, we can be more agile. So I get not wanting to; I think part of the success for Open Force Field has been the commitment to stability and those kinds of things.

But since we're not there yet, I think it makes sense for us to do some of this PDB file stuff; it's still kind of in flux exactly what the final product wants to look like.

I think it makes sense for us to take that, figure it out, and then, once the dust settles around it, we can develop some high-quality standards that way.

Whether it ends up living with us or wherever it does, it'll be something that's robust. But I know it takes some iterations to figure that out.

 

@46:40 - Jeffrey Wagner

Sounds great to me. Thank you.

 

@46:42 - Richard Gowers

Cool. So this should be merged sort of by the end of the week. Thanks again for your patience on that.

The reason why it has this silly solvation system is because there are no proteins yet, but that will be fixed soon.

And then for the on-ramp to that: there's a limited-scope on-ramp, and we'll be looking to broaden that out progressively over time.

 

@47:05 - David Dotson

Yeah. By on-ramp, you mean from the file to the data models. Got it. Okay. Yeah. Making it kind of easy for users.

And that would probably be an OpenFE feature, right? Kind of building on-ramps.

 

@47:19 - Richard Gowers

Yeah. That's what we've been talking about. Yeah.

 

@47:23 - Mike Henry (he/him)

Thank you. And I should just real quick mention, while we're talking about this OpenMM PR: we're really close to cutting, like, the OpenMM 8.0 beta.

So we might want to figure out if we can get this in before that. It's not a drama, because we've vendored in all of that, right.

 

@47:42 - Richard Gowers

Because we expected that they'd have a slow feedback cycle, since the PR is, not controversial, but sort of out of the blue.

And that's okay. No rush on that.

 

@47:53 - Mike Henry (he/him)

No worries.

 

@48:00 - Jeffrey Wagner

Wait. It seems like maybe yes worries, if we're going to make a major behavior change and we have a major version coming.

I guess it's a question about whether OpenMM follows semver, and that's that.

 

@48:14 - Mike Henry (he/him)

It doesn't. For what it's worth, I wouldn't worry too much about trying to get this into a major version bump, because it doesn't really follow semver anyway.

 

@48:30 - Jeffrey Wagner

That's fair. And I guess it is strictly an extension if we're filling in fields that were never being filled in before, rather than adding a field that wasn't there before.

That's not the worst thing. Okay. Thank you.

 

@48:45 - David Dotson

Excellent. Any more questions or comments on protein serialization? I have one, just for my information. So do we think we'll get it in?

Are there any blockers? It's not clear to me at the moment.

 

@49:05 - Richard Gowers

I think we're just at the nitpicky stage; for the gufe PR it's just nitpicky stuff, like code quality.

 

@49:11 - David Swenson

Okay. Just cleaning up the code a little bit. I'm doing a more complete review of it. Basically stuff like do we need to include this file?

Or things like that.

 

@49:23 - David Dotson

Got it. Okay, cool, thank you. Is it reasonable to say maybe by end of week we might have it merged?

 

@49:30 - Richard Gowers

Yes.

 

@49:32 - David Swenson

Awesome.

 

@49:33 - David Dotson

Fantastic. I'm excited to use this serialization for the AlchemicalNetworks; they're a little boring without proteins. Okay, cool, thank you. All right, let's move on.

We've got about five minutes left. I think we can cover some ground here. Iván, do you want to give us an update on the nonequilibrium cycling protocol in perses?

 

@49:53 - Iván Pulido

Yes. So last time I was having trouble getting the nonequilibrium cycling simulation to run with vanilla perses. We already addressed that, and I have a fully working script that uses perses for the nonequilibrium switching simulation.

We met with David D. last week to talk about how to integrate this with the gufe objects. I'm working on that.

One of the main limitations is that the perses objects rely on input files, and when we want to write this for gufe, basically we don't want that.

There are ways to work around that, but I think it's better if we don't need them. And that will also help us refactor the perses API, which is a mess; this is a great excuse to do that.

So I'm working on refactoring this core part of the perses API to handle setting up relative free energy calculations from objects, like molecule objects, not only from files.

That's basically the update.

 

@51:20 - David Dotson

Excellent. And I won't be able to meet in our session after this call today. I have a doctor's appointment, so we'll have to move that to either later today or tomorrow.

 

@51:27 - Iván Pulido

Is that okay? Yeah, that works. Feel free to change it.

 

@51:31 - David Dotson

Will do. Okay. Any questions for Ivan?

 

@51:35 - Irfan Alibay

Iván, I actually wanted to talk to you about something similar. If you're planning to look at the HTF and refactoring that, and you're willing to have input,

I'd like to be involved in that conversation, because we vendored some things on our side about six months ago, and we want to get rid of stuff,

like the OpenMM migration we had a chat about a moment ago. But yeah, if you're looking at it, I'd be interested.

 

@52:04 - John Chodera (he/him/his)

All input on how to make it more usable, modular, and remixable is very much welcome. I would highly recommend that.

 

@52:12 - Iván Pulido

Yeah, totally. That's very useful feedback. So for now I'm prioritizing the things that need to be changed for the nonequilibrium cycling protocol with gufe objects, but sure.

I think the hybrid topology factory, I don't see that as a blocker right now, but maybe; I'll let you know.

And thank you for the feedback.

 

@52:43 - John Chodera (he/him/his)

It's important also to recognize that, as long as the nice outward-facing API is stable, we can do multiple stages of refactoring to get this into openmmtools.

You don't have to do it all at once.

 

@53:00 - David Dotson

Well, thank you. Any questions for Iván?

 

@53:10 - Jeffrey Wagner

Would it, in the long run, be of interest for the hybrid topology objects to interface with Interchange objects?

 

@53:21 - John Chodera (he/him/his)

Yes, I mean, it would be great if we could accept interchange objects and then zip them up together. I think that would be phenomenal.

There's one question, though, about how we want to represent alchemical systems and whether we need an engine independent way of representing it, or if that's even possible.

So, I mean, we can certainly accept those as input. Internally, they might have to be converted to OpenMM objects in order to be able to create these alchemical functions that go between them, because different engines have different ways of representing

how you create the alchemically modified systems, and there are quite a lot of limitations. So I think it's probably another discussion for folks who are interested in what kind of standardization at the engine level must occur.

 

@54:15 - Jeffrey Wagner

Okay, cool. Thank you, folks.

 

@54:25 - David Dotson

We have about one minute left. Any other questions for Iván? Does anybody have a hard stop? I'd like to hit this last item if possible.

Okay, we'll proceed. Mike, do you want to give us an update on protocol setting?

 

@54:42 - Mike Henry (he/him)

Cool. Yeah. And this will be brief. I've got a meeting with Matt scheduled for tomorrow, which I will basically use to evaluate whether the changes that we need to make to the models will work in the current PR, or within a few days

of work to tidy up, or if it's going to take longer. And if it's going to take longer, then I'm just going to go ahead and not worry about it for this first iteration of our settings object, because it's now needed: Iván needs to interface with it, and we just need to keep moving it forward.

So that's kind of where I'm at right now: to see if it's going to be, like, a couple of quick changes that we can iterate on, or if it's going to be, okay, maybe we should sit down and plan this out a little bit more.

So we're on that.

 

@55:30 - David Dotson

Okay, cool. And I think you're also working with Richard since he's working on a replica exchange protocol as well.

Okay, cool. Do you feel like you have what you need from this group?

 

@55:42 - Mike Henry (he/him)

Yeah.

 

@55:44 - David Dotson

All right.

 

@55:45 - Mike Henry (he/him)

Any questions from... I should real quick mention that chatting with Levi, I don't know if he's on the call;

yes, he is. Last week was really helpful in terms of some of the stuff he's doing for pydantic model serialization.

But it also kind of clued me into the idea that we should include a machine-oriented serialization, like MessagePack, for serializing objects.

That way, you can imagine that a human wants to read a settings input file and modify that,

But if we come up with good base models that we might want to use in other parts of the code base, it will be useful to have an efficient serialization method that will be much faster than JSON.

So that's something that I'm also going to talk to Matt about, because, well, it's not strictly needed for a settings-type object, since that's very human-focused.

If we want to expand this and start using it in other applications, then it will be slow if we don't have a quick way to serialize them.

Yeah, exactly.

 

@56:50 - David Dotson

Yeah. Or at least have a recommended protocol for serialization. MessagePack tends to meet the need better than JSON for whatever reason, whether it's,

you know, weird data type support. Yeah.

 

@57:06 - Mike Henry (he/him)

And, like, if we're sending something over the wire, it's silly not to use a binary format. I like to think that human-readable formats are good when things are at rest on disk and you want to be able to look at them or edit them.

But for any other use case, binary is just faster.

 

@57:22 - David Dotson

Exactly. Cool. Any other questions for Mike?

 

@57:28 - Jeffrey Wagner

If this is the last agenda item, I do have a dumb question about serialization in general.

 

@57:36 - David Dotson

Sure.

 

@57:43 - Jeffrey Wagner

My view of serialization in practice is basically that you turn your object into a dictionary of basic types, and then you call out to an external serialization library.

So when you say the emphasis is on serializing to MessagePack as opposed to JSON: in my experience, these things aren't different.

You turn the thing into a hierarchical dictionary of basic types, and then you say, now go be JSON, using some JSON library, or, now go become MessagePack.

Is there maybe something I'm missing about the difference between serialization formats, where, like, MessagePack-focused serialization would be different, I guess?

 

@58:23 - Mike Henry (he/him)

Quick answer is: JSON uses Unicode or ASCII code-point encoding to represent strings and data, and it takes a lot of bits to do that.

But when you use a binary format, you can encode things in a much more compact way, so these objects don't get as big, if that makes sense.
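The size argument here can be demonstrated with the standard library alone; `struct` stands in for a real binary encoder like MessagePack:

```python
import json
import struct

# 1000 doubles, e.g. coordinates: exactly 8 bytes each in binary.
values = [i * 0.123456789 for i in range(1000)]

# JSON spells each float out in decimal digits as text.
text_bytes = json.dumps(values).encode("utf-8")

# A binary packing stores the raw IEEE-754 representation.
binary_bytes = struct.pack(f"<{len(values)}d", *values)

# The binary form is a fixed 8 bytes per value; the text form is several
# times larger because full-precision floats need ~17 significant digits.
ratio = len(text_bytes) / len(binary_bytes)
```

A real MessagePack encoder adds a small type tag per element, so it is slightly larger than this raw packing, but the order-of-magnitude comparison against JSON text is the same.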

 

@58:52 - Jeffrey Wagner

You mean on disk? Yeah.

 

@58:57 - Levi Naden

Anecdotally, just with some testing we've done with QCArchive at MolSSI: MessagePack serialization takes up less space, and it's faster to serialize and deserialize than just doing, like, raw string compression or using the built-in JSON libraries, at least for the type of data we send and for the QCArchive project.

But if you're sending over the wire, it doesn't really matter how it gets there, so long as when you deserialize it, the human can read it; that's really all that matters.

So it turned into a convenient tool to just build in, and building that into our Pythonic objects for the data structures just made it easy to serialize and deserialize on either end.

 

@59:48 - Mike Henry (he/him)

Jeff, I would say that your mental model of what serialization is in general is right, whether the format on disk is a text-based format or a binary-based format.

It's the exact same process of taking an object, decomposing it into some sort of key-value dictionary-type object, and then that's either dumped to text or dumped to binary; and when it's deserialized, we rebuild the object.

So whether it's in MessagePack form or JSON form or YAML form or CSV or whatever representation you want, the same basic idea of serialization applies.

Okay, cool; but it's just, what is that output? Is it a file that you can head, or is it a file you have to hex dump?

 

@01:00:37 - Richard Gowers

I think maybe the missing link is that if you design for MessagePack, you can include your optimizations for each field as you go.

If you look at what MMTF did: they used delta encoding for residues, run-length encoding for different things, because they knew ahead of time what all the different fields would look like.

So you can get really good compression if you know what your data is going to look like. So that's where you would maybe design for MessagePack, rather than first dumping to JSON and then going sideways from JSON.
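The MMTF-style tricks mentioned here are straightforward to sketch; these are illustrative codecs, not the actual MMTF ones:

```python
from itertools import groupby

def delta_encode(seq):
    """Store differences; sorted atom indices become small repeated values."""
    if not seq:
        return []
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]

def delta_decode(deltas):
    """Cumulative sum recovers the original sequence."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def rle_encode(seq):
    """(value, count) pairs; per-atom residue ids compress massively."""
    return [(v, len(list(g))) for v, g in groupby(seq)]

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]
```

After delta encoding, a monotonically increasing index column becomes mostly 1s, which a generic compressor (or MessagePack's small-integer encoding) then stores very cheaply; that is the sense in which knowing the shape of each field ahead of time buys compression.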

 

@01:01:13 - Jeffrey Wagner

Interesting. Yeah. This is striking, because I've noticed this in a lot of our operations right now, even in-memory operations involving copies of objects and stuff.

I mean, we're coding in Python, so we're not super concerned about performance, but we've done a little bit of optimization, and the next thing we're running up against is actually the unit operations.

The wrapping and unwrapping and converting of units is fairly tricky. And that may just be because of some OpenFF choices.

Like, in our force fields, we don't say all bond lengths must be in angstrom; we say bond length is a string that we interpret at load time.

And so we have to do lots and lots of unit lookups.

 

@01:02:06 - Mike Henry (he/him)

It's slow. Unit stuff is slow. So I would say that when looking to improve the performance, the kind of model I've seen be successful is where you have the unit interface treat incoming values like untrusted user input, like in web development: you sanitize once at the boundary, for your security stuff, but once it's past that interface, you can make assumptions about what the units are, or what unit system things are already in, and then you can drop that kind of checking.

Just like in OpenMM: once you get past the Python layer and you're in C++, the unit system is all set.

We can do whatever optimization; if we want to do bit fiddling to do quick math, that's not quite legal, but we know what the values are and the ranges they can be in. And so I think, just like in Python land, if we make our interface to these...
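The sanitize-once boundary described above might look like this sketch; the conversion table and function names are hypothetical, and a real implementation would use something like openff-units rather than a hand-rolled factor dict:

```python
# Hypothetical conversion table; canonical internal unit is nanometers.
_TO_NANOMETERS = {"nm": 1.0, "angstrom": 0.1, "A": 0.1}

def sanitize_length(value: float, unit: str) -> float:
    """Boundary layer: validate and convert user input exactly once."""
    try:
        return value * _TO_NANOMETERS[unit]
    except KeyError:
        raise ValueError(f"unknown length unit: {unit!r}") from None

def bond_energy(r_nm: float, r0_nm: float, k: float) -> float:
    """Internal hot path: plain floats, canonical units assumed, no checks."""
    dr = r_nm - r0_nm
    return 0.5 * k * dr * dr
```

Everything inside the boundary works on bare floats in known units, so the per-call unit lookups disappear from the hot path.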

Action items

Decisions