Differences between revisions 9 and 10

Structural variability analysis with Gaussian mixture model

Available in EMAN release after 10/01/2020. Still under development. For details of the method, refer to Chen2021. Updates:

After 03/22/2021, a new GUI program, e2gmm_analysis is added to visualize the motion trajectory of the Gaussian model after training.
After 05/05/2021, a new --conv option is added that uses a convolutional neural network structure and improves convergence.

Dataset

First, for the compositional heterogeneity analysis, we use the dataset of L17-Depleted 50S Ribosomal Intermediates from EMPIAR-10076. In this tutorial, we use a subset of only 3000 particles of box size 128. The small dataset can be downloaded here. The particles are artificially selected from the full dataset to keep the tutorial is as simple as possible. Inside the zip file is a minimum single particle refinement project from the new e2spa pipeline. This includes a particles folder that contains the particles, and a r3d_00 folder with particle alignment information, mask, and the 3D structure. There is also a strucfac.txt file that is used for optimal sharpening of the density maps.

If you have previously run e2refine_easy on this data, you can extract the necessary files from a refine_XX folder using e2evalrefine.py with the --extractorientptcl option, and then just use the corresponding threed_XX file as a starting volume for the analysis.

The single model refinement can also be performed with the new e2spa_refine.py pipeline, which directly outputs the aligned list file and average density map that can be used as input.

To convert a refinement from Relion, a script called e2convertrelion.py can be found in the examples folder (after 01/2021) that produces phase flipped particles and the aligned list file.

Build the Gaussian model

First, create a folder to contain related files.

mkdir gmm_00

To limit memory consumption, the model is trained with 2D projections of the model instead of the 3D map. So here we make projections of the averaged map first. If you start from a e2refine_easy run, the projections can also be found in the refinement folder.

e2project3d.py r3d_00/threed_01.hdf --outfile gmm_00/projections.hdf --orientgen=eman:delta=5 --parallel thread:12

Now we begin training the Gaussian model from scratch. We start from a tiny box size and a small amount of Gaussian functions so it converges faster. Then run with a Fourier box size of 48 so the Gaussian model matches the density map at ~15 Å. The mask will exclude the Gaussian centers outside the volume. The last command can be repeated if the Gaussian model does not match well with the density map.

e2gmm_refine.py --projs gmm_00/projections.hdf --npt 512 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf
e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1024 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt
e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1600 --maxboxsz 48 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt --evalmodel gmm_00/model_projs.hdf

The --evalmodel option of e2gmm_refine.py will write the projection of the model to an image stack so we can take a look at the 3D reconstruction and check its FSC with the input density map.

e2make3dpar.py --input gmm_00/model_projs.hdf --output gmm_00/model_avg.hdf --pad 128 --mode trilinear --keep 1 --threads 12 --setsf strucfac.txt
e2proc3d.py gmm_00/model_avg.hdf gmm_00/fsc_map_model.txt --calcfsc r3d_00/threed_01.hdf

Heterogeneity analysis

Now we have a Gaussian model of the neutral structure, we can look into the heterogeneity of the particles. Run the e2gmm_refine.py command with particle input and --heter option to let the program embed them to a conformation space. The --gradout option writes the gradient with respect to the neutral structure to a file so the step can be skipped next time (read the saved file using --gradin). The middle layer output is saved in mid_00.txt. The --pas option controls the adjustment of position, amplitude, and sigma of the Gaussian functions. Setting it to 010 means changing amplitude only. Here use the new --conv mode to speed up convergence. For this dataset the improvement of the convolutional mode is minimal, but there are cases that it makes a significant difference.

e2gmm_refine.py --model gmm_00/model_gmm.txt --ptclsin r3d_00/ptcls_01.lst --heter --gradout gmm_00/allgrads.hdf --maxboxsz 48 --midout gmm_00/mid_00.txt --pas 010 --conv

Finally, the ptcl_pca_kmean_m3d.py script (in examples folder) runs PCA on the middle layer output, classifies the particles, and reconstructs 3D maps of each class.

ptcl_pca_kmean_m3d.py --pts gmm_00/mid_00.txt --pcaout gmm_00/mid_pca.txt --ptclsin r3d_00/ptcls_01.lst --ptclsout gmm_00/ptcl_cls --setsf strucfac.txt --ncls 3

Continuous heterogeneity

We use another dataset of pre-catalytic spliceosome from EMPIAR-10180 as an example of continuous conformational change. The sample dataset includes 20,000 particles with alignment information in their headers. The temporary download link can be found here. Similar to the ribosome dataset, here we intentionally selected a subset of particles that follows a certain motion trajectory.

First, create a new folder and generate projections of the given 3D structure.

mkdir gmm_00
e2project3d.py r3d_00/threed_01.hdf --outfile gmm_00/projections.hdf --orientgen=eman:delta=5 --parallel thread:12

Now build the Gaussian model. We start from lower resolution and fewer Gaussian functions, then increase both to have better convergence.

e2gmm_refine.py --projs gmm_00/projections.hdf --npt 512 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf
e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1024 --maxboxsz 64 --model gmm_00/model_gmm.txt --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --evalmodel gmm_00/model_projs.hdf

Check the Gaussian model and make sure it is consistent with the density map.

e2make3dpar.py --input gmm_00/model_projs.hdf --output gmm_00/model_avg.hdf --pad 128 --mode trilinear --keep 1 --threads 12 --setsf strucfac.txt
e2proc3d.py gmm_00/model_avg.hdf gmm_00/fsc_map_model.txt --calcfsc r3d_00/threed_01.hdf

Now perform the heterogeneity analysis. Here we look at both movement trajectory and amplitude change, so the --pas option can be omitted.

e2gmm_refine.py --model gmm_00/model_gmm.txt --ptclsin r3d_00/ptcls_01.lst --heter --maxboxsz 64 --midout gmm_00/mid_00.txt --conv

Finally, classify particles based on their distribution in the conformational space. Since this is a continuous motion, we use --mode=regress to generate 3D density maps along the first eigenvector at a constant spacing.

ptcl_pca_kmean_m3d.py --pts gmm_00/mid_00.txt --pcaout gmm_00/mid_pca.txt --ptclsin r3d_00/ptcls_01.lst --ptclsout gmm_00/ptcls_cls --setsf strucfac.txt --mode regress --ncls 9

-  ⇤ ← Revision 9 as of 2021-05-08 03:31:08 → 
  Size: 9578
  Editor: MuyuanChen
  Comment: attachment
+   ← Revision 10 as of 2021-05-10 01:10:18 → ⇥
  Size: 7893
  Editor: MuyuanChen
  Comment: clean up
-Deletions are marked like this.
+Additions are marked like this.
 Line 10:
-First, for the compositional heterogeneity analysis, we use the dataset of L17-Depleted 50S Ribosomal Intermediates from [[https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10076| EMPIAR-10076]]. In this tutorial, we use a subset of only 6000 particles of box size 128. The small dataset can be downloaded [[https://drive.google.com/file/d/19jP_XmmTtlGPznHS5LtnOauQ6tkPY8Ww |here]]. The particles are artificially selected to keep the tutorial is as simple as possible. To skip the alignment step, a single model refinement is run with the full dataset and the orientation of each particle is saved in the header of the image (xform.projection).
+First, for the compositional heterogeneity analysis, we use the dataset of L17-Depleted 50S Ribosomal Intermediates from [[https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10076| EMPIAR-10076]]. In this tutorial, we use a subset of only 3000 particles of box size 128. The small dataset can be downloaded [[https://drive.google.com/file/d/19jP_XmmTtlGPznHS5LtnOauQ6tkPY8Ww |here]]. The particles are artificially selected from the full dataset to keep the tutorial is as simple as possible. Inside the zip file is a minimum single particle refinement project from the new e2spa pipeline. This includes a '''particles''' folder that contains the particles, and a '''r3d_00''' folder with particle alignment information, mask, and the 3D structure. There is also a '''strucfac.txt''' file that is used for optimal sharpening of the density maps.
 Line 18:
-To convert a refinement from Relion, a script called e2convertrelion.py can be found in the examples folder (after 01/2021) that produces phase flipped particles and aligned list file.

== Reconstruct density map ==

e2gmm_refine.py --projs gmm_00/projections.hdf --npt 512 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf

e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1024 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt

e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1600 --maxboxsz 48 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt --evalmodel gmm_00/model_projs.hdf

e2make3dpar.py --input gmm_00/model_projs.hdf --output gmm_00/model_avg.hdf --pad 128 --mode trilinear --keep 1 --threads 12 --setsf strucfac.txt

e2proc3d.py gmm_00/model_avg.hdf gmm_00/fsc_map_model.txt --calcfsc r3d_00/threed_01.hdf

e2gmm_refine.py --model gmm_00/model_gmm.txt --ptclsin r3d_00/ptcls_01.lst --heter --maxboxsz 48 --midout gmm_00/mid_00.txt --conv --pas 010

ptcl_pca_kmean_m3d.py --pts gmm_00/mid_00.txt --pcaout gmm_00/mid_pca.txt --ptclsin r3d_00/ptcls_01.lst --ptclsout gmm_00/ptcl_cls --setsf strucfac.txt --ncls 3


Skip this step if you have an averaged map already. For the short tutorial, it is a useful step to double-check all the information in the particles is correct.

First, make the 3D map using the particles (or list file) with alignment information.

{{{
e2make3dpar.py --input ptcls_small.hdf --output avg.hdf --pad 192 --mode trilinear --keep 1 --threads 12
}}}

Now do a quick map sharpening for the output. The sharpened then filtered map is saved as avgsf.hdf. Also, create a structure factor file for the filtered map to simplify future reconstruction processes.
{{{
e2spt_structfac.py avg.hdf --res 15
e2proc3d.py avgsf.hdf avgsf.hdf --calcsf sf_lp.txt
}}}

Then make a mask and apply it to the map. The mask will be used to exclude Gaussian functions outside the boundary later. Now we have a “neutral” structure to start with. 
{{{
e2proc3d.py avgsf.hdf msk.hdf --process=normalize.edgemean: --process=mask.auto3d.thresh:nshells=4:nshellsgauss=8:threshold1=5:threshold2=2:return_mask=1
e2proc3d.py avgsf.hdf avgmsk.hdf --process=normalize.edgemean --multfile msk.hdf
}}}

 {{attachment:averagemap.png| Averaged structure |width=400}}
+To convert a refinement from Relion, a script called e2convertrelion.py can be found in the examples folder (after 01/2021) that produces phase flipped particles and the aligned list file.
-Line 62:
+Line 23:
-To limit memory consumption, the model is trained with 2D projections of the model instead of the 3D map. So here we make projections of the averaged map first.
+First, create a folder to contain related files.
-Line 64:
+Line 25:
-e2project3d.py avgmsk.hdf --outfile avg_projs.hdf --orientgen=eman:delta=5 --parallel thread:12
+mkdir gmm_00
-Line 67:
+Line 28:
-Now we begin training the Gaussian model from scratch. We start from a tiny box size so it converges faster. Then run with a Fourier box size of 48 so the Gaussian model matches the density map at ~15 Å. The mask will exclude the Gaussian centers outside the volume. The second command can be repeated if the Gaussian model does not match well with the density map.
+To limit memory consumption, the model is trained with 2D projections of the model instead of the 3D map. So here we make projections of the averaged map first. If you start from a e2refine_easy run, the projections can also be found in the refinement folder.
{{{
e2project3d.py r3d_00/threed_01.hdf --outfile gmm_00/projections.hdf --orientgen=eman:delta=5 --parallel thread:12
}}}

Now we begin training the Gaussian model from scratch. We start from a tiny box size and a small amount of Gaussian functions so it converges faster. Then run with a Fourier box size of 48 so the Gaussian model matches the density map at ~15 Å. The mask will exclude the Gaussian centers outside the volume. The last command can be repeated if the Gaussian model does not match well with the density map.
-Line 70:
+Line 36:
-e2gmm_refine.py --projs avg_projs.hdf --npt 2048 --maxboxsz 24 --modelout model_gmm.txt --niter 20 --mask msk.hdf
e2gmm_refine.py --projs avg_projs.hdf --maxboxsz 48 --modelout model_gmm.txt --niter 20 --evalmodel model_projs.hdf --model model_gmm.txt --npt 2048  --mask msk.hdf
+e2gmm_refine.py --projs gmm_00/projections.hdf --npt 512 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf
e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1024 --maxboxsz 24 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt
e2gmm_refine.py --projs gmm_00/projections.hdf --npt 1600 --maxboxsz 48 --modelout gmm_00/model_gmm.txt --niter 20 --mask r3d_00/mask.hdf --model gmm_00/model_gmm.txt --evalmodel gmm_00/model_projs.hdf
-Line 76:
+Line 43:
-e2make3dpar.py --input model_projs.hdf --output model_avg.hdf --pad 192 --mode trilinear --keep 1 --threads 12 --setsf sf_lp.txt
e2proc3d.py model_avg.hdf fsc_00.txt --calcfsc avgmsk.hdf
+e2make3dpar.py --input gmm_00/model_projs.hdf --output gmm_00/model_avg.hdf --pad 128 --mode trilinear --keep 1 --threads 12 --setsf strucfac.txt
e2proc3d.py gmm_00/model_avg.hdf gmm_00/fsc_map_model.txt --calcfsc r3d_00/threed_01.hdf
-Line 85:
+Line 52:
-Now we have a Gaussian model of the neutral structure, we can look into the heterogeneity of the particles. Run the e2gmm_refine.py command with particle input and --heter option to let the program embed them to a conformation space. The --gradout option writes the gradient with respect to the neutral structure to a file so the step can be skipped next time (read the saved file using --gradin). The middle layer output is saved in mid_00.txt. The --pas option controls the adjustment of position, amplitude, and sigma of the Gaussian functions. Setting it to 010 means changing amplitude only.
+Now we have a Gaussian model of the neutral structure, we can look into the heterogeneity of the particles. Run the e2gmm_refine.py command with particle input and --heter option to let the program embed them to a conformation space. The --gradout option writes the gradient with respect to the neutral structure to a file so the step can be skipped next time (read the saved file using --gradin). The middle layer output is saved in mid_00.txt. The --pas option controls the adjustment of position, amplitude, and sigma of the Gaussian functions. Setting it to 010 means changing amplitude only. Here use the new '''--conv''' mode to speed up convergence. For this dataset the improvement of the convolutional mode is minimal, but there are cases that it makes a significant difference.
-Line 87:
+Line 54:
-e2gmm_refine.py --model model_gmm.txt --ptclsin ptcls_small.hdf --heter --maxboxsz 48 --midout mid_00.txt --pas 010
+e2gmm_refine.py --model gmm_00/model_gmm.txt --ptclsin r3d_00/ptcls_01.lst --heter --gradout gmm_00/allgrads.hdf --maxboxsz 48 --midout gmm_00/mid_00.txt --pas 010 --conv
-Line 92:
+Line 59:
-ptcl_pca_kmean_m3d.py --pts mid_00.txt --pcaout mid_pca.txt --ptclsin ptcls_small.hdf --ptclsout ptcls_cls --setsf sf_lp.txt
+ptcl_pca_kmean_m3d.py --pts gmm_00/mid_00.txt --pcaout gmm_00/mid_pca.txt --ptclsin r3d_00/ptcls_01.lst --ptclsout gmm_00/ptcl_cls --setsf strucfac.txt --ncls 3
-Line 99:
+Line 66:
-We use another dataset of pre-catalytic spliceosome from [[https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10180| EMPIAR-10180]] as an example of continuous conformational change. The sample dataset includes 20,000 particles with alignment information in their headers. Temporary download link [[https://drive.google.com/file/d/1th4OP4McoZFWYQcxDzo4SkahawqvNNTq | here]]. Similar to the ribosome dataset, here we intentionally selected a subset of particles that follows a certain motion trajectory. Inside the zip file is a minimum single particle refinement project, including a '''particles''' folder that contains the downsampled particles, and a '''r3d_00''' folder with particle alignment information and the 3D average structure. There is also a '''strucfac.txt''' file that is used for optimal sharpening of the density maps.
+We use another dataset of pre-catalytic spliceosome from [[https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10180| EMPIAR-10180]] as an example of continuous conformational change. The sample dataset includes 20,000 particles with alignment information in their headers. The temporary download link can be found [[https://drive.google.com/file/d/1th4OP4McoZFWYQcxDzo4SkahawqvNNTq | here]]. Similar to the ribosome dataset, here we intentionally selected a subset of particles that follows a certain motion trajectory.
-Line 101:
+Line 68:
-First, create a folder to contain related files.
+First, create a new folder and generate projections of the given 3D structure.
-Line 104:
+Line 71:
-}}}

Generate projections of the averaged structure:
{{{
-Line 111:
+Line 74:
-Now build the Gaussian model. We can start from lower resolution and fewer Gaussian functions, then increase both to have better convergence.
+Now build the Gaussian model. We start from lower resolution and fewer Gaussian functions, then increase both to have better convergence.
-Line 123:
+Line 86:
-Now perform the heterogeneity analysis. Here use the new '''--conv''' mode to speed up convergence. Here we look at both movement trajectory and amplitude change, so the '''--pas''' option can be omitted.
+Now perform the heterogeneity analysis. Here we look at both movement trajectory and amplitude change, so the '''--pas''' option can be omitted.