Structural variability analysis with Gaussian mixture model
Available in EMAN release after 10/01/2020. Still under development. For details of the method, refer to Chen2021. Updates:
After 03/22/2021, a new GUI program, e2gmm_analysis is added to visualize the motion trajectory of the Gaussian model after training.
After 05/05/2021, a new --conv option is added that uses a convolutional neural network structure and improves convergence.
If you are interested in how this works, there are now Jupyter notebooks for the two tutorial datasets in Github here. Notebooks of more advanced usage of the method will be uploaded later.
Dataset
First, for the compositional heterogeneity analysis, we use the dataset of L17-Depleted 50S Ribosomal Intermediates from EMPIAR-10076. In this tutorial, we use a subset of only 3000 particles of box size 128. The small dataset can be downloaded here. The particles are artificially selected from the full dataset to keep the tutorial is as simple as possible. Inside the zip file is a minimum single particle refinement project from the new e2spa pipeline. This includes a particles folder that contains the particles, and a r3d_00 folder with particle alignment information, mask, and the 3D structure. There is also a strucfac.txt file that is used for optimal sharpening of the density maps.
If you have previously run e2refine_easy on this data, you can extract the necessary files from a refine_XX folder using e2evalrefine.py with the --extractorientptcl option, and then just use the corresponding threed_XX file as a starting volume for the analysis.
The single model refinement can also be performed with the new e2spa_refine.py pipeline, which directly outputs the aligned list file and average density map that can be used as input.
To convert a refinement from Relion, a script called e2convertrelion.py can be found in the examples folder (after 01/2021) that produces phase flipped particles and the aligned list file.
Build the Gaussian model
First, create a folder to contain related files.
mkdir gmm_00
To limit memory consumption, the model is trained with 2D projections of the model instead of the 3D map. So here we make projections of the averaged map first. If you start from a e2refine_easy run, the projections can also be found in the refinement folder.
e2project3d.py r3d_00/threed_01.hdf --outfile gmm_00/projections.hdf --orientgen=eman:delta=5 --parallel thread:12
Now we begin training the Gaussian model from scratch. We start from a tiny box size and a small amount of Gaussian functions so it converges faster. Then run with a Fourier box size of 48 so the Gaussian model matches the density map at ~15 Å. When increasing the number of Gaussian functions, the new Gaussian will be seeded near the old ones, so the overall shape of the Gaussian model is preserved at each run of the refinement. The mask will exclude the Gaussian centers outside the volume. The last command can be repeated if the Gaussian model does not match well with the density map.
e2gmm_refine.py --projs gmm_00/projections.hdf --npt 512 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1024 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1600 --maxboxsz 48 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt --evalmodel gmm_00/model_projs.hdf
The --evalmodel option of e2gmm_refine.py will write the projection of the model to an image stack so we can take a look at the 3D reconstruction and check its FSC with the input density map.
e2make3dpar.py --input gmm_00/model_projs.hdf --output gmm_00/model_avg.hdf --pad 128 --mode trilinear --keep 1 --threads 12 --setsf strucfac.txt e2proc3d.py gmm_00/model_avg.hdf gmm_00/fsc_map_model.txt --calcfsc r3d_00/threed_01.hdf
Heterogeneity analysis
Now we have a Gaussian model of the neutral structure, we can look into the heterogeneity of the particles. Run the e2gmm_refine.py command with particle input and --heter option to let the program embed them to a conformation space. The --gradout option writes the gradient with respect to the neutral structure to a file so the step can be skipped next time (read the saved file using --gradin). The middle layer output is saved in mid_00.txt. The --pas option controls the adjustment of position, amplitude, and sigma of the Gaussian functions. Setting it to 010 means changing amplitude only. Here use the new --conv mode to speed up convergence. For this dataset the improvement of the convolutional mode is minimal, but there are cases that it makes a significant difference.
e2gmm_refine.py --model gmm_00/model_gmm.txt --ptclsin r3d_00/ptcls_01.lst --heter --gradout gmm_00/allgrads.hdf --maxboxsz 48 --midout gmm_00/mid_00.txt --pas 010 --conv
Finally, the e2gmm_eval.py script runs PCA on the middle layer output, classifies the particles, and reconstructs 3D maps of each class.
e2gmm_eval.py --pts gmm_00/mid_00.txt --pcaout gmm_00/mid_pca.txt --ptclsin r3d_00/ptcls_01.lst --ptclsout gmm_00/ptcl_cls.lst --setsf strucfac.txt --ncls 3
Continuous heterogeneity
We use another dataset of pre-catalytic spliceosome from EMPIAR-10180 as an example of continuous conformational change. The sample dataset includes 20,000 particles with alignment information in their headers. The temporary download link can be found here. Similar to the ribosome dataset, here we intentionally selected a subset of particles that follows a certain motion trajectory.
First, create a new folder and generate projections of the given 3D structure.
mkdir gmm_00 e2project3d.py r3d_00/threed_01.hdf --outfile gmm_00/projections.hdf --orientgen=eman:delta=5 --parallel thread:12
Now build the Gaussian model. We start from lower resolution and fewer Gaussian functions, then increase both to have better convergence.
e2gmm_refine.py --projs gmm_00/projections.hdf --npt 512 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1024 --maxboxsz 64 --model gmm_00/model_gmm.txt --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --evalmodel gmm_00/model_projs.hdf
Check the Gaussian model and make sure it is consistent with the density map.
e2make3dpar.py --input gmm_00/model_projs.hdf --output gmm_00/model_avg.hdf --pad 128 --mode trilinear --keep 1 --threads 12 --setsf strucfac.txt e2proc3d.py gmm_00/model_avg.hdf gmm_00/fsc_map_model.txt --calcfsc r3d_00/threed_01.hdf
Now perform the heterogeneity analysis. Here we look at both movement trajectory and amplitude change, so the --pas option can be omitted.
e2gmm_refine.py --model gmm_00/model_gmm.txt --ptclsin r3d_00/ptcls_01.lst --heter --maxboxsz 64 --midout gmm_00/mid_00.txt --conv
Finally, classify particles based on their distribution in the conformational space. Since this is a continuous motion, we use --mode=regress to generate 3D density maps along the first eigenvector at a constant spacing.
e2gmm_eval.py --pts gmm_00/mid_00.txt --pcaout gmm_00/mid_pca.txt --ptclsin r3d_00/ptcls_01.lst --ptclsout gmm_00/ptcls_cls.lst --setsf strucfac.txt --mode regress --ncls 9