Sampling exhaustiveness and precision

Run the sampling exhaustiveness analysis with the imp_sampcon tool (described by Viswanath et al., 2017)

  1. Enter the output modelling directory

  2. Prepare the density.txt file:

    create_density_file.py --project_dir <path to the original project dir> config.json --by_rigid_body
    
  3. For complexes containing multiple copies of the same subunit, prepare the symm_groups.txt file, which stores the information necessary to properly align homo-oligomeric structures:

    create_symm_groups_file.py --project_dir <path to the original project dir> config.json params.py
    

    By default, all copies of the same subunit are grouped together; this is sufficient in most cases.

    In special cases where subunits are assigned to specific series, use the --by series option to group by series:

    create_symm_groups_file.py --project_dir <path to the original project dir> --by series config.json params.py
    

    To additionally combine series into larger groups, use the --extra_series_groups option, e.g.:

    create_symm_groups_file.py \
    --project_dir <path to the original project dir> \
    --by series \
    --extra_series_groups NR_1,NR2;CR_1,CR_2 \
    config.json params.py
    

    This groups by series but additionally considers ambiguity between the selected series.
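
    The grouping syntax in the example above can be sketched in Python. The interpretation here is an assumption based on the example value, not on documented behavior: semicolons presumably separate the larger groups, and commas the series within each group.

```python
# Parse an --extra_series_groups value like the one in the example above.
# Assumption: ';' separates the larger groups, ',' the series inside each.
value = "NR_1,NR2;CR_1,CR_2"
groups = [group.split(",") for group in value.split(";")]
print(groups)  # -> [['NR_1', 'NR2'], ['CR_1', 'CR_2']]
```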

  4. Run the setup_analysis.py script to prepare the input files for the sampling exhaustiveness analysis from your resulting models:

    setup_analysis.py -s <abs path to csv scores file produced by extract_all_scores.py> \
    -o <specified output dir> \
    -d <density.txt file generated in the previous step> \
    -n <number of top scoring models to be analyzed, default is all models> \
    -k <restraint score based on which to perform the analysis, default is total score>
    

    Example:

    setup_analysis.py -s all_scores.csv -o analysis -d density.txt -n 20000
    

    To see available options and default values run:

    setup_analysis.py -h
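
    As a quick sanity check before choosing -n, you can count and rank the models in the scores file. This sketch assumes only that the CSV has a header row and one row per model; the column names here are hypothetical, and in IMP lower scores are better.

```python
import csv
import io

# Hypothetical miniature scores CSV; the real file from extract_all_scores.py
# has one row per model (its column names may differ).
scores_csv = """model_id,total_score
0,-1523.4
1,-1498.7
2,-1510.2
"""

rows = list(csv.DictReader(io.StringIO(scores_csv)))
print(len(rows))  # number of models available for -n

# "Top scoring" means lowest score in IMP, so sort ascending.
top = sorted(rows, key=lambda r: float(r["total_score"]))[:2]
print([r["model_id"] for r in top])  # -> ['0', '2']
```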
    
  5. Run the imp_sampcon exhaust tool (a command-line tool provided with IMP) to perform the actual analysis:

    cd <output dir created by the above setup_analysis.py script>
    
    imp_sampcon exhaust -n <prefix for output files> \
    --rmfA sample_A/sample_A_models.rmf3 \
    --rmfB sample_B/sample_B_models.rmf3 \
    --scoreA scoresA.txt --scoreB scoresB.txt \
    -d <path to density.txt file>/density.txt \
    -m <calculator selection> \
    -c <number of cores to use> \
    -gp \
    -g <clustering threshold step (float)>
    

    To see available options and default values for imp_sampcon exhaust, run:

    imp_sampcon exhaust -h
    

    If the analysis will run on a SLURM-based cluster, prepare a bash script like the following and submit it with sbatch:

    #!/bin/bash
    #SBATCH --job-name=master_sampling_20000.job
    #SBATCH --output=./master_sampling_20000.out
    #SBATCH --error=./master_sampling_20000.err
    #SBATCH --nodes=1
    #SBATCH --time=10:00:00
    #SBATCH --qos=highest
    #SBATCH --cpus-per-task=15
    #SBATCH --mem-per-cpu=4000
    
    imp_sampcon exhaust -n CR_Y_test --rmfA sample_A/sample_A_models.rmf3 --rmfB sample_B/sample_B_models.rmf3 --scoreA scoresA.txt --scoreB scoresB.txt -d density.txt -m cpu_omp -c 15 -gp -g 5.0
    
  6. The output includes, among other files:

    • <prefix for output files>.Sampling_Precision_Stats.txt with the estimate of the sampling precision

    • directories and files whose names start with cluster, containing the clusters obtained at the above sampling precision, the models belonging to each cluster, and the cluster localization densities

    • <prefix for output files>.Cluster_Precision.txt listing the precision of each cluster

    • PDF files with plots of the exhaustiveness test results

    See Viswanath et al. 2017 for detailed explanation of these concepts.
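
    If you want to post-process the per-cluster precisions programmatically, a small parsing sketch follows. The line format shown is hypothetical; the exact wording in <prefix for output files>.Cluster_Precision.txt may differ between IMP versions, so adjust the pattern to your actual file.

```python
import re

# Hypothetical excerpt of a Cluster_Precision.txt file; check the real
# file's wording before relying on this pattern.
text = """Cluster 0 precision 14.2
Cluster 1 precision 18.9
"""

# Map cluster index -> precision by grabbing the two numbers on each line.
precision = {
    int(m.group(1)): float(m.group(2))
    for m in re.finditer(r"Cluster\s+(\d+)\D+?([\d.]+)", text)
}
print(precision)  # -> {0: 14.2, 1: 18.9}
```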

  7. Optimize the plots

    The fonts and the X- and Y-axis value ranges in the default plots from imp_sampcon exhaust are frequently suboptimal, so you may need to adjust them manually.

    1. Copy the original gnuplot scripts to the current analysis directory by executing:

      copy_sampcon_gnuplot_scripts.py
      

      This will copy four scripts to the current directory:

      • Plot_Cluster_Population.plt for the <prefix for output files>.Cluster_Population.pdf plot

      • Plot_Convergence_NM.plt for the <prefix for output files>.ChiSquare.pdf plot

      • Plot_Convergence_SD.plt for the <prefix for output files>.Score_Dist.pdf plot

      • Plot_Convergence_TS.plt for the <prefix for output files>.Top_Score_Conv.pdf plot

    2. Edit the scripts according to your needs

    3. Run the scripts again:

      gnuplot -e "sysname='<prefix for output files>'" Plot_Cluster_Population.plt
      gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_NM.plt
      gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_SD.plt
      gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_TS.plt
      

      For example:

      gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt
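
      The four gnuplot calls above can also be driven from a short Python wrapper. This sketch only builds and prints the commands; gnuplot must be on your PATH to actually execute them.

```python
# Build the four gnuplot invocations for a given output-file prefix.
sysname = "elongator"  # the <prefix for output files>
scripts = [
    "Plot_Cluster_Population.plt",
    "Plot_Convergence_NM.plt",
    "Plot_Convergence_SD.plt",
    "Plot_Convergence_TS.plt",
]
commands = [["gnuplot", "-e", f"sysname='{sysname}'", s] for s in scripts]
for cmd in commands:
    print(" ".join(cmd))
    # To run them: import subprocess; subprocess.run(cmd, check=True)
```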
      
  8. Extract cluster models

    For example, to extract the five top-scoring models:

    extract_cluster_models.py \
        --project_dir <full path to the original project directory> \
        --outdir cluster.0/ \
        --ntop 5 \
        --scores ../all_scores.csv \
        Identities_A.txt Identities_B.txt cluster.0.all.txt ../config.json
    
  9. If sampling exhaustiveness is not achieved, run more sampling jobs: