Protein-Ligand Benchmarks

Code

 

 

Data

(currently on Google Drive, should be released on GitHub: https://drive.google.com/drive/folders/1A8ncO30eaS1vE1czGl3lwaIkjXgMyc4G?usp=sharing )

Content

  • CB: valid benchmark set:

    • what’s the protein? → identity (EC number)

    • what are the ligands? → smiles

    • what are the activities?

    • SB: state information (temperature, pressure, ion concentration, …)

    • everything else is an interpretation (methods, FF, poses, charges, …)

  • Stages

    1. above data

    2. + structures (PDB + poses)

    3. + partial charges, FF parameters

    4. method, method parameters

  • Ligands:

    • Structure as sdf file

      • coordinates

      • partial charges?

      • activity?

      • reference?

    • Charges (CB: if you want to evaluate e.g. another pose, we want to keep the charges constant)

  • Protein:

    • Structure as pdb file (protein.pdb + all other crystal molecules in <find_a_name>.pdb)

      • generated with GROMACS (gmx pdb2gmx)

      • <find_a_name> : ‘water+other’, ‘water+cofactors’, ‘other’?

  • Hybrid:

    • hybrid struct based on ligand A

    • hybrid struct based on ligand B

  • Problems with current version:

    • partial charges

    • boxes (dimensions, number of molecules)

    • position of waters and ions

  • Split up data into essential input (coordinates in sdf + pdb) and results (detailed topologies (exact FF version and files), MD engine version, …)?
    Essential input: a file that cannot be generated from the other files in the release data (unless generating it takes very long or needs too many resources)

  • Experimental data → OpenFF-wide centralized generalized storage?

  • conda env export / exact GROMACS version
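
The stages and the essential-input split above can be sketched as a small classification helper. This is only an illustration: the suffix-to-stage mapping, the names Stage, SUFFIX_STAGE, stage_of and is_essential, and the rule "essential = stage 1 or 2" are assumptions made here, not decisions recorded in these notes. A suffix alone is also too crude in practice (a hybrid PDB is derived, and an SDF may carry partial charges).

```python
from enum import IntEnum
from pathlib import PurePosixPath


class Stage(IntEnum):
    """The four stages from the list above."""
    DATA = 1        # protein identity, SMILES, activities, state
    STRUCTURES = 2  # PDB files and ligand poses
    PARAMETERS = 3  # partial charges, force-field parameters
    METHOD = 4      # method and method parameters


# Hypothetical suffix-to-stage mapping, for illustration only.
SUFFIX_STAGE = {
    ".yml": Stage.DATA,
    ".sdf": Stage.STRUCTURES,
    ".pdb": Stage.STRUCTURES,
    ".itp": Stage.PARAMETERS,
    ".top": Stage.PARAMETERS,
}


def stage_of(path: str) -> Stage:
    """Return the stage a file belongs to, judged by its suffix only."""
    suffix = PurePosixPath(path).suffix
    try:
        return SUFFIX_STAGE[suffix]
    except KeyError:
        raise ValueError(f"no stage assigned to suffix {suffix!r}") from None


def is_essential(path: str) -> bool:
    """Assumption: essential input means stage 1 or stage 2."""
    return stage_of(path) <= Stage.STRUCTURES
```

With this rule, coordinate and metadata files count as essential input, while topologies count as results that could be regenerated.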

Directory structure

Current directory structure

 


├── <date>_<target_name_1>
│   ├── 00_data
│   │   ├── edges.yml
│   │   ├── ligands.yml
│   │   └── target.yml
│   ├── 01_protein
│   │   ├── crd
│   │   │   └── protein.pdb
│   │   └── top
│   │       └── amber99sb-star-ildn-mut.ff
│   │           ├── topol.itp
│   │           └── topol.top
│   ├── 02_ligands
│   │   ├── lig_<name_1>
│   │   │   ├── crd
│   │   │   │   └── lig_<name_1>.sdf
│   │   │   └── top
│   │   │       └── openff-1.0.0.offxml
│   │   │           ├── fflig_<name_1>.itp
│   │   │           ├── lig_<name_1>.itp
│   │   │           ├── lig_<name_1>.top
│   │   │           └── posre_lig_<name_1>.itp
│   │   ├── lig_<name_2>
│   │   …
│   └── 03_hybrid
│       ├── edge_<name_1>_<name_2>
│       │   └── water
│       │       ├── crd
│       │       │   ├── mergedA.pdb
│       │       │   ├── mergedB.pdb
│       │       │   ├── pairs.dat
│       │       │   └── score.dat
│       │       └── top
│       │           └── openff-1.0.0.offxml
│       │               ├── ffmerged.itp
│       │               ├── ffMOL.itp
│       │               └── merged.itp
│       …
├── <date>_<target_name_2>

 

  • parent folder for target <target_name_1>

    • metadata

      • information about edges / perturbations

      • information about ligands

      • information about target

    • protein data

    • ligand data

      • ligand <name_1>

        • coordinate

          • SDF coordinate file

        • topology

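
The current layout can be navigated with a few path helpers. A minimal sketch, assuming the folder names shown in the tree stay fixed; the function names ligand_sdf, edge_dir and metadata_files are hypothetical and not part of any existing tool.

```python
from pathlib import Path


def ligand_sdf(root: Path, target_dir: str, ligand: str) -> Path:
    """Path to the SDF coordinate file of one ligand."""
    name = f"lig_{ligand}"
    return root / target_dir / "02_ligands" / name / "crd" / f"{name}.sdf"


def edge_dir(root: Path, target_dir: str, lig_a: str, lig_b: str) -> Path:
    """Path to the hybrid-structure folder of one edge (perturbation)."""
    return root / target_dir / "03_hybrid" / f"edge_{lig_a}_{lig_b}"


def metadata_files(root: Path, target_dir: str) -> dict:
    """Paths to the three metadata files in 00_data, keyed by stem."""
    data = root / target_dir / "00_data"
    return {stem: data / f"{stem}.yml" for stem in ("edges", "ligands", "target")}
```

Here target_dir stands for the full <date>_<target_name> folder name, so the helpers make no assumption about the date format.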



 

 

  • ligands.yml:

    • currently in

      • name (identifier)

      • smiles: CCOc1c(c(cc(n1)NC(=O)Cc2cc(c(cc2OC)Br)OC)N)C#N

      • (outdated: relative path to the docked SDF file: 03_docked/lig_17124-1/lig_17124-1.sdf)

      • measurement

        • activity with error (if available) and unit

        • reference (doi)

        • comment

    • maybe add

      • author (person who transcribed it/added it to the database)

      • data version
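
The fields listed above could map onto a small record type. This is a sketch under assumptions: the key names (value, unit, error, doi, comment, author, data_version) are invented here for illustration and are not necessarily the keys used in the actual file, and the input is taken to be an already-parsed mapping rather than raw YAML text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    value: float
    unit: str
    error: Optional[float] = None  # only "if available"
    doi: Optional[str] = None      # reference
    comment: str = ""


@dataclass
class LigandEntry:
    name: str                           # identifier
    smiles: str
    measurement: Measurement
    author: Optional[str] = None        # from the "maybe add" list
    data_version: Optional[str] = None  # from the "maybe add" list


def ligand_from_dict(raw: dict) -> LigandEntry:
    """Build a LigandEntry from an already-parsed mapping."""
    missing = [k for k in ("name", "smiles", "measurement") if k not in raw]
    if missing:
        raise KeyError(f"missing required keys: {missing}")
    m = raw["measurement"]
    return LigandEntry(
        name=raw["name"],
        smiles=raw["smiles"],
        measurement=Measurement(
            value=float(m["value"]),
            unit=m["unit"],  # the unit is required, the error is not
            error=m.get("error"),
            doi=m.get("doi"),
            comment=m.get("comment", ""),
        ),
        author=raw.get("author"),
        data_version=raw.get("data_version"),
    )


# The activity value and unit below are placeholders, not real data.
example = ligand_from_dict({
    "name": "lig_17124-1",
    "smiles": "CCOc1c(c(cc(n1)NC(=O)Cc2cc(c(cc2OC)Br)OC)N)C#N",
    "measurement": {"value": 1.0, "unit": "placeholder"},
})
```

Keeping author and data version optional lets existing entries load unchanged if those fields are added later.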