Sampling exhaustiveness and precision

Run the sampling exhaustiveness analysis with the imp_sampcon tool (described by Viswanath et al., 2017)

  1. Enter the output modelling directory

  2. Prepare the density.txt file:

    create_density_file.py --project_dir <path to the original project dir> config.json --by_rigid_body
    
  3. For complexes containing multiple copies of the same subunit, prepare the symm_groups.txt file, which stores the information necessary to properly align homo-oligomeric structures:

    create_symm_groups_file.py --project_dir <path to the original project dir> config.json params.py
    

    By default, all copies of the same subunit are grouped together; this is sufficient in most cases.

    In special cases where subunits are assigned to specific series, use the --by series option to group by series:

    create_symm_groups_file.py --project_dir <path to the original project dir> --by series config.json params.py
    

    To additionally combine series into larger groups, use the --extra_series_groups option, e.g.:

    create_symm_groups_file.py \
    --project_dir <path to the original project dir> \
    --by series \
    --extra_series_groups NR_1,NR2;CR_1,CR_2 \
    config.json params.py
    

    This groups by series but additionally considers ambiguity between the selected series.
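
    The grouping syntax in the example above can be sketched in Python. The interpretation here is an assumption based on the example value, not on documented behavior: semicolons presumably separate the larger groups, and commas the series within each group.

```python
# Parse an --extra_series_groups value like the one in the example above.
# Assumption: ';' separates the larger groups, ',' the series inside each.
value = "NR_1,NR2;CR_1,CR_2"
groups = [group.split(",") for group in value.split(";")]
print(groups)  # -> [['NR_1', 'NR2'], ['CR_1', 'CR_2']]
```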

  4. Run the setup_analysis.py script to prepare the input files for the sampling exhaustiveness analysis from your resulting models:

    setup_analysis.py -s <abs path to csv scores file produced by extract_all_scores.py> \
    -o <specified output dir> \
    -d <density.txt file generated in the previous step> \
    -n <number of top scoring models to be analyzed, default is all models> \
    -k <restraint score based on which to perform the analysis, default is total score>
    

    Example:

    setup_analysis.py -s all_scores.csv -o analysis -d density.txt -n 20000
    

    To see available options and default values run:

    setup_analysis.py -h
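
    As a quick sanity check before choosing -n, you can count and rank the models in the scores file. This sketch assumes only that the CSV has a header row and one row per model; the column names here are hypothetical, and in IMP lower scores are better.

```python
import csv
import io

# Hypothetical miniature scores CSV; the real file from extract_all_scores.py
# has one row per model (its column names may differ).
scores_csv = """model_id,total_score
0,-1523.4
1,-1498.7
2,-1510.2
"""

rows = list(csv.DictReader(io.StringIO(scores_csv)))
print(len(rows))  # number of models available for -n

# "Top scoring" means lowest score in IMP, so sort ascending.
top = sorted(rows, key=lambda r: float(r["total_score"]))[:2]
print([r["model_id"] for r in top])  # -> ['0', '2']
```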
    
  5. Run the imp_sampcon exhaust tool (a command-line tool provided with IMP) to perform the actual analysis:

    cd <output dir created by the above setup_analysis.py script>
    
    imp_sampcon exhaust -n <prefix for output files> \
    --rmfA sample_A/sample_A_models.rmf3 \
    --rmfB sample_B/sample_B_models.rmf3 \
    --scoreA scoresA.txt --scoreB scoresB.txt \
    -d <path to density.txt file>/density.txt \
    -m <calculator selection> \
    -c <number of cores to use> \
    -gp \
    -g <clustering threshold step (float)>
    

    To see available options and default values for imp_sampcon exhaust, run:

    imp_sampcon exhaust -h
    

    If the analysis will run on a SLURM-based cluster, prepare a bash script like the following and submit it with sbatch:

    #!/bin/bash
    #SBATCH --job-name=master_sampling_20000.job
    #SBATCH --output=./master_sampling_20000.out
    #SBATCH --error=./master_sampling_20000.err
    #SBATCH --nodes=1
    #SBATCH --time=10:00:00
    #SBATCH --qos=highest
    #SBATCH --cpus-per-task=15
    #SBATCH --mem-per-cpu=4000
    
    imp_sampcon exhaust -n CR_Y_test --rmfA sample_A/sample_A_models.rmf3 --rmfB sample_B/sample_B_models.rmf3 --scoreA scoresA.txt --scoreB scoresB.txt -d density.txt -m cpu_omp -c 15 -gp -g 5.0
    
  6. The output includes, among other files:

    • <prefix for output files>.Sampling_Precision_Stats.txt with the estimate of the sampling precision

    • directories and files whose names start with cluster, containing the clusters obtained at the above sampling precision, the models belonging to each cluster, and the cluster localization densities

    • <prefix for output files>.Cluster_Precision.txt listing the precision of each cluster

    • PDF files with plots of the exhaustiveness test results

    See Viswanath et al. 2017 for detailed explanation of these concepts.
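
    If you want to post-process the per-cluster precisions programmatically, a small parsing sketch follows. The line format shown is hypothetical; the exact wording in <prefix for output files>.Cluster_Precision.txt may differ between IMP versions, so adjust the pattern to your actual file.

```python
import re

# Hypothetical excerpt of a Cluster_Precision.txt file; check the real
# file's wording before relying on this pattern.
text = """Cluster 0 precision 14.2
Cluster 1 precision 18.9
"""

# Map cluster index -> precision by grabbing the two numbers on each line.
precision = {
    int(m.group(1)): float(m.group(2))
    for m in re.finditer(r"Cluster\s+(\d+)\D+?([\d.]+)", text)
}
print(precision)  # -> {0: 14.2, 1: 18.9}
```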

  7. Optimize the plots

    The fonts and the X- and Y-axis value ranges in the default plots from imp_sampcon exhaust are frequently suboptimal, so you may need to adjust them manually.

    1. Copy the original gnuplot scripts to the current analysis directory by executing:

      copy_sampcon_gnuplot_scripts.py
      

      This will copy four scripts to the current directory:

      • Plot_Cluster_Population.plt for the <prefix for output files>.Cluster_Population.pdf plot

      • Plot_Convergence_NM.plt for the <prefix for output files>.ChiSquare.pdf plot

      • Plot_Convergence_SD.plt for the <prefix for output files>.Score_Dist.pdf plot

      • Plot_Convergence_TS.plt for the <prefix for output files>.Top_Score_Conv.pdf plot

    2. Edit the scripts according to your needs

    3. Run the scripts again:

      gnuplot -e "sysname='<prefix for output files>'" Plot_Cluster_Population.plt
      gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_NM.plt
      gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_SD.plt
      gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_TS.plt
      

      For example:

      gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt
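
      The four gnuplot calls above can also be driven from a short Python wrapper. This sketch only builds and prints the commands; gnuplot must be on your PATH to actually execute them.

```python
# Build the four gnuplot invocations for a given output-file prefix.
sysname = "elongator"  # the <prefix for output files>
scripts = [
    "Plot_Cluster_Population.plt",
    "Plot_Convergence_NM.plt",
    "Plot_Convergence_SD.plt",
    "Plot_Convergence_TS.plt",
]
commands = [["gnuplot", "-e", f"sysname='{sysname}'", s] for s in scripts]
for cmd in commands:
    print(" ".join(cmd))
    # To run them: import subprocess; subprocess.run(cmd, check=True)
```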
      
  8. Extract cluster models

    For example, to extract the five top-scoring models:

    extract_cluster_models.py \
        --project_dir <full path to the original project directory> \
        --outdir cluster.0/ \
        --ntop 5 \
        --scores ../all_scores.csv \
        Identities_A.txt Identities_B.txt cluster.0.all.txt ../config.json
    
  9. If sampling exhaustiveness is not achieved, run more sampling jobs: