= Gaussian mixture model based atomic model refinement (2024) =

Refine atomic models to better fit into a given map, and get better PDB validation score. For detailed method (and to cite), see [[https://doi.org/10.1101/2024.09.27.615511 | Preprint here]]. 

== Dataset ==

In this tutorial, we use TRPV1 as an example. Starting from a relatively old PDB model (PDB: [[https://www.rcsb.org/structure/3J5R|3J5R]] ), we will fit it into a newer, higher resolution EMDB structure at a different conformation ( [[https://www.emdataresource.org/EMD-8117|EMD-8117]] ). This is quite similar to the typical modeling work, when we have an existing homolog or predicted model and a newer, better structure. 

In this case, the starting model was built before the PDB validation system came out, so the validation score is not pretty. 

{{attachment:pdbval_pdb.png | validation_pdb |width=400}}

First, roughly fit the model into the map. This can be done in ChimeraX. They have slightly different conformation, so the fit won't be very precise. It is ok to leave it as is.

{{attachment:pdb_fit.png | Fitting model to map |width=400}}


== Compile model parameters ==

First, we compute some parameters from the input model that will be used in the following refinement. 

{{{
e2gmm_model_compile.py --model 3j5q.pdb
}}}

This should create a `gmm_model_xx` folder, which contains information on the stereochemical constraints of the input model. It is critical to always refer to this folder using the `--path` option for the later `e2gmm_model_xxx` runs, so the metadata can be accessed. 

All output will be in PDB format if the input model here is PDB. If the input is CIF, the output will be in CIF. Otherwise, specifying `--writecif` will force output to be CIF.

For your own data, it is worth noting that some non-protein structures are not supported at the moment. It is recommended to remove them from input and add them back after refinement. It is also good to notify me about those so I can add them in the future. 

== Fit to density map ==

{{{
e2gmm_model_fit.py --path gmm_model_00 --map emd_8117.mrc --resolution 3 --writetxt --rebuild_rotamer
}}}

Here `gmm_model_00` is the folder generated by the `e2gmm_model_comple.py` command. This step can be skipped if there is no input map and you only want to refine the stereochemical score of the model. 

Make sure the model does fit in the map in the first place. In this case, the loss from command line output should start from around -0.4 and gradually decrease to -0.6 or lower throughout iterations.

The `--rebuild_rotamer` option will select the rotamer for each residue that fits best into the given map (by map-model FSC). The process can be very slow. It is necessary for this case because there are many sidechain outliers, but can be skipped if the starting model is more reasonable.

The option `--writetxt` is only useful if you intend to do the multi-model refinement later. It saves the amplitude and width of Gaussian functions in a txt file in addition to the PDB/CIF. It also can be reproduced later with another `e2gmm_model_fit.py` run. 

Take a look at `fit_03.pdb` and the model should fit into the map.

{{attachment:fit_map.png | model fit in map |width=400}}


== Full model refinement ==

{{{
e2gmm_model_refine.py --path gmm_model_00 --model gmm_model_00/fit_03.pdb
}}}

`gmm_model_00/fit_03.pdb` is produced by `e2gmm_model_fit.py`. If the fitting step is skipped, simply remove the `--model` option.

Normally, the default option should produce a good enough model. But in this case, since we started from a quite bad position, it would be necessary to run the command multiple times. For the second run, specify `--niter 0,20` to skip the initial iterations, and `--fixrotamer` to force good rotamers for the bad residues.

{{{
e2gmm_model_refine.py --path gmm_model_00 --model gmm_model_00/model_02.pdb --fixrotamer --niter 0,20
}}}

There is some randomness in the process, and the command can be run multiple times. In the end, this should give a relatively good PDB validation score. 

{{attachment:pdbval2.png | pdb validation output |width=400}}

The Q-score should also improve compared to the input model.

{{attachment:qscore.png | q score improve |width=400}}

Additonally, if you use !MolProbity and have concern about the C-beta deviation it reports, adding the `--cbeta` option should solve that. If you don't want the PDB validation server to complain about cis residues, you can enable the `--nocis` option.

== Multi-model refinement ==

To refine a series of models, we will need to process the dataset from [[https://www.ebi.ac.uk/empiar/EMPIAR-10059 | EMPIAR-10059]]. The dataset only contains particles, so one still needs to run the refinement and heterogeneity analysis. A brief and outdated tutorial can be found [[https://blake.bcm.edu/emanwiki/EMAN2/e2gmm_new |here]]. Hopefully, I will update that later...

At the end of the heterogeneity analysis, a stack of volumes is generated by `e2gmm_eval.py` that represents one mode of movement within the system. In this tutorial, we will build models for a continuous conformational change from there. 

{{attachment:vol_mov_trpv.gif | movement volume trpv1|width=300}}

We can actually work within the same `gmm_model_00` folder, although it may be safer to make a copy just in case things go wrong. Here we simply copy `gmm_model_00` to `gmm_model_01`.

To start, run

{{{
e2gmm_model_multi.py --path gmm_model_01 --maps gmm_00/ptcls_00_cls_0.hdf --modeltxt gmm_model_01/fit_03.txt --modelpdb gmm_model_01/model_02.pdb --resolution 5 --ndim 1
}}}

Here, `gmm_00/ptcls_00_cls_0.hdf` is the volume stack input, `gmm_model_01/fit_03.txt` is produced by `e2gmm_model_fit.py`, and `gmm_model_01/model_02.pdb` is produced by `e2gmm_model_refine.py`. The `--ndim` option here corresponds to the `--axis` options in `e2gmm_eval.py`. When a volume stack of a linear trajectory is used, i.e. `e2gmm_eval.py --axis 0`, set `--ndim=1`. The `e2gmm_eval.py` program also supports circular trajectories (`--axis 0,1` will generate a circular trajectory in the 2D space formed by eigen-axis 0 and 1). In this case, set `--ndim=2` for `e2gmm_model_multi.py`.

This will optimize the large-scale morphing and C-alpha model according to the input volume series. Next, we need to refine the full atom models. The same program is used but it has to be split into two separate runs because of the GPU memory limitation. 

{{{
e2gmm_model_multi.py --path gmm_model_01 --maps gmm_00/ptcls_00_cls_0.hdf --modelpdb gmm_model_01/model_02.pdb --modeltxt gmm_model_01/fit_03.txt --resolution 5 --load --niter 0,0,2 --batchsz 8 --ndim 1
}}} 

Note here we use `--load` and `--niter 0,0,2`, meaning we skip the first two refinements (morphing and C-alpha), and load the neural networks from the last run. We also use a smaller `--batchsz` to compensate for the GPU memory. 

Finally, we can generate a series of models at a finer sampling. 

{{{
e2gmm_model_multi.py --path gmm_model_01 --maps gmm_00/ptcls_00_cls_0.hdf --modelpdb gmm_model_01/model_02.pdb --modeltxt gmm_model_01/fit_00.txt --resolution 5 --load --eval 50 --ndim 1
}}}

And visualize the results here. Here every frame should have a relatively good stereochemistry score, and minimal clashing atoms. 

{{attachment:model_movie_trpv.gif | movement movie trpv1|width=400}}

This can also work on much larger system, like the spliceosome from the tutorial [[https://blake.bcm.edu/emanwiki/EMAN2/e2gmmOriginal#Continuous_heterogeneity | here]]. It takes more tweaking to get it to work, again, under the memory limitation of the current GPUs... 

{{attachment:model_movie.gif | movement movie spliceosome|width=800}}