MPI Parallelism

MPI stands for 'Message Passing Interface', and over the last decade it has become the de-facto standard for running large scale computations on Linux clusters around the world. In most supercomputing centers this will be the ONLY option you have for running in parallel, and administrators may be actively hostile to trying to make use of any non-MPI software on their clusters.

PLEASE NOTE: Using MPI on any cluster is not a task for linux/unix novices. You need a fair amount of background knowledge to understand what's involved in using MPI with any program (not just EMAN). You should be comfortable with running MPI jobs before attempting this with EMAN2. If necessary, consult a cluster administrator for assistance. There is enough variation among linux clusters that we cannot provide specific advice for every situation. We have tried to provide as much generic advice as possible, but this is often not a cookie-cutter operation.

Installing MPI Support in EMAN2/SPARX

SPARX and EMAN2 have merged their MPI support efforts, and as of 4/19/2013, the legacy EMAN2 MPI system has been retired. To install the current combined system, start at the installation page: http://blake.bcm.edu/emanwiki/EMAN2/Parallel/PyDusa

When you have completed the above installation, return to this page to find out how to use it from within EMAN2.

Using MPI in EMAN2

Once you have EMAN2 and pydusa installed, usage should be straightforward. EMAN2 has a modular parallelism system, supporting several types of parallel computing, not just MPI, and all of them use a common syntax. For EMAN2 commands that can run in parallel, the basic syntax for MPI parallelism is:

--parallel=mpi:<nproc>:/path/to/scratch

for example:

e2refine.py ... --parallel=mpi:64:/scratch/stevel
  • The number of processors MUST match your request to the batch system (see below; the sketch after this list shows one way to keep the two in sync).
  • A node-local scratch directory is required!
    • That is, each compute node must have a locally attached hard drive where users can temporarily store scratch data.
    • Its location varies from cluster to cluster.
    • Some clusters don't have one. There is an experimental nocache option (--parallel=mpi:64:/tmp:nocache) which MAY help in that case. Contact sludtke@bcm.edu if you have problems.

  • Do NOT use 'runmpi' directly. It will be invoked for you by the program you pass the --parallel option to.
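
For PBS/Torque clusters, one way to keep the processor count in sync with your allocation is to derive it from the node file the scheduler provides. This is only a sketch: it assumes Torque writes one line per allocated processor slot to $PBS_NODEFILE, and the scratch path /localscratch/$USER is a placeholder for your cluster's actual node-local directory.

#!/bin/bash
#PBS -l nodes=10:ppn=12
#PBS -l walltime=120:00:00

# Torque writes one line per allocated processor slot to $PBS_NODEFILE,
# so counting its lines gives the number to pass to --parallel.
NPROC=$(wc -l < "$PBS_NODEFILE")

# /localscratch/$USER is a placeholder; substitute your node-local scratch path.
e2refine.py ... --parallel=mpi:$NPROC:/localscratch/$USER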

Batch Queuing systems

How to create a script to run jobs on the cluster

A cluster is a shared computing environment. Limited resources must be allocated for many different users and jobs at the same time. To run a job on most clusters, you must formulate a request in the form of a text file containing the details of your job: the number of processors you want, how long the job is expected to run, perhaps the amount of RAM you will need, and so on. This text file is called a 'batch script' and is submitted to the 'Batch Queuing System' (BQS) on your cluster. The BQS then decides when and where your job will run, and tells your job which specific processors to use when it is launched.

The BQS (which allocates resources and launches jobs) is independent of MPI (which runs jobs on the allocated resources). There are several common BQS systems you may encounter. We cannot cover every possibility here, so you will need to consult your local cluster documentation for details on how to submit jobs with your BQS. We provide examples for OpenPBS and SGE, two of the more common BQS systems. Even then, the details may vary a little from cluster to cluster; these examples just give you someplace to start.
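
To make that division of labor concrete, the sketch below shows which layer is responsible for what; test.pbs is simply the example file name used later on this page.

# The BQS allocates nodes and launches your batch script:
qsub test.pbs

# Inside that script, the EMAN2 command starts its own MPI workers on the
# allocated nodes via --parallel; you never invoke mpirun yourself.
e2refine.py ... --parallel=mpi:120:/path/to/scratch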

OpenPBS/Torque

Here is an example of a batch script for PBS-based systems:

#!/bin/bash
# All lines starting with "#PBS" are PBS commands
#
# The following line asks for 10 nodes, each of which has 12 processors, for a total of 120 CPUs. 
# The walltime is the maximum length of time your job will be permitted to run. If it is too small, your
# job will be killed before it's done. If it's too long, however, your job may have to wait a looong
# time before the cluster starts running it (depends on local policy).
#PBS -l nodes=10:ppn=12
#PBS -l walltime=120:00:00

# This prints the list of nodes your job is running on to the output file
cat $PBS_NODEFILE

# cd to your project directory
cd /home/stevel/data/myproject

# Now the actual EMAN2 command(s). Note the --parallel option at the end. The number of CPUs must match the number specified above
e2refine.py --input=bdb:sets#set-q5_phase_flipped --mass=1500 --apix=1.8 --automask3d=0.8,36,4,6,36 --iter=4 --sym=c4 --model=bdb:refine_05#threed_03 --path=refine_07 --orientgen=eman:delta=2.5:inc_mirror=0:perturb=1 --projector=standard --simcmp=frc:snrweight=1:zeromask=1:maxres=12 --simalign=rotate_translate_flip_iterative --simralign=refine --simraligncmp=ccc --simaligncmp=ccc  --classcmp=frc:snrweight=1:zeromask=1:maxres=12 --classalign=rotate_translate_flip_iterative --classralign=refine --classraligncmp=ccc --classaligncmp=ccc --classkeep=2.5 --classnormproc=normalize.edgemean --classaverager=mean --classiter=1 --sep=3 --m3diter=2 --m3dkeep=0.7 --recon=fourier --m3dpreprocess=normalize.edgemean --m3dpostprocess=filter.lowpass.gauss:cutoff_freq=.06 --pad=320 --classkeepsig --m3dsetsf --twostage=2 --parallel=mpi:120:/localscratch/stevel 

# Good idea to do this at the end
e2bdb.py -cF

If this file were called, for example, test.pbs, you would then submit the job to the cluster by saying

e2bdb.py -c
qsub test.pbs

There are additional options you can use with the qsub command as well. See your local cluster documentation for details on what is required/allowed. The e2bdb.py -c command is a good idea to make sure that the compute nodes will see any recent changes you've made to images in the project.
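
As an illustration, a few commonly supported commands are shown below; the queue name and job id are hypothetical, and option support varies by site, so check your cluster's documentation.

qsub -q somequeue test.pbs   # submit to a specific queue ('somequeue' is a placeholder)
qstat -u $USER               # list your own queued and running jobs
qdel 12345                   # cancel a job, using the id that qsub reported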

SGE (Sun Grid Engine)

This is another popular queuing system, which uses 'qsub' and 'qstat' commands much like OpenPBS/Torque does. Configuration, however, is completely different.

Here is an example of an SGE script to run a refinement with e2refine.py using mpich:

#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -N refine4
#$ -cwd
#$ -j y
#$ -pe mpich 40

e2refine.py --input=bdb:sets#set2-allgood_phase_flipped-hp --mass=1200.0 --apix=2.9 --automask3d=0.7,24,9,9,24 --iter=1 --sym=c1 --model=bdb:refine_02#threed_filt_05 --path=refine_sge --orientgen=eman:delta=3:inc_mirror=0 --projector=standard --simcmp=frc:snrweight=1:zeromask=1 --simalign=rotate_translate_flip --simaligncmp=ccc --simralign=refine --simraligncmp=frc:snrweight=1 --twostage=2 --classcmp=frc:snrweight=1:zeromask=1 --classalign=rotate_translate_flip --classaligncmp=ccc --classralign=refine --classraligncmp=frc:snrweight=1 --classiter=1 --classkeep=1.5 --classnormproc=normalize.edgemean --classaverager=ctf.auto --sep=5 --m3diter=2 --m3dkeep=0.9 --recon=fourier --m3dpreprocess=normalize.edgemean --m3dpostprocess=filter.lowpass.gauss:cutoff_freq=.1 --pad=256 --lowmem --classkeepsig --classrefsf --m3dsetsf -v 2 --parallel=mpi:40:/scratch/username

e2bdb.py -cF
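
Submission then works just as it does under PBS; assuming the script above were saved as refine.sge (a hypothetical file name), a typical session would be:

e2bdb.py -c      # flush cached images before handing the project to the cluster
qsub refine.sge  # submit the job to SGE
qstat            # monitor its progress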

Summary

  1. Make sure you have read the PLEASE NOTE warning at the top of this page

  2. Prepare the batch file appropriate for your cluster. Do not try to use 'mpirun' or 'mpiexec' on any EMAN programs. Instead, add the '--parallel=mpi:<n>:/path/to/scratch[:nocache]' option to an EMAN2 command like e2refine.py. Some commands do not support the --parallel option, and trying to run them using mpirun will not accomplish anything useful.

    • replace <n> with the total number of processors you have requested (these numbers must match exactly)

    • replace /path/to/scratch with the path to a scratch storage directory available on each node of the cluster. Note that this directory must be local storage on each node, not a directory shared between nodes. If you use the path to a shared directory, like $HOME/scratch, you will have very serious problems. You must use a filesystem local to the specific node (the sketch after this list shows one way to check). If you don't have this information, check your cluster documentation and/or consult your system administrator.
    • Make sure that after the last e2* command in your batch script you put an 'e2bdb.py -cF' command to make sure all of the output image files have been flushed to disk.
  3. Immediately before submitting your job, run 'e2bdb.py -c'. This will require you to exit all running EMAN2 jobs (if any) before proceeding. Do this.
  4. Submit your job.
  5. IMPORTANT : While the job is running, you have effectively ceded control of that specific project to the cluster nodes using MPI. You MUST NOT modify any of the files in that project in any way while the job is running, or you will risk a variety of bad things. While the bad things will not always happen, there is a large risk, and the bad things are VERY bad, including corruption of your entire project. Wait until the job is complete before you do anything that could possibly change files in that directory.

  6. When you run into problems (note I say when, not if), and you have exhausted any local MPI experts, please feel free to email me (sludtke@bcm.edu). Once you have things properly configured, you should be able to use MPI routinely, but getting there may be a painful process on some clusters. Don't get too discouraged.

  7. The 'nocache' option is new as of EMAN2.04, and allows you to rely on a shared filesystem rather than caching data on each node. If you are using a shared Lustre or similar filesystem to store your data, or if your cluster doesn't have any significant local scratch space on the compute nodes, this may be beneficial, but the option is still experimental.
  8. EMAN2 can make use of MPI very efficiently; however, as this type of image processing is VERY data intensive, in some situations your jobs may be limited by data transfer between nodes rather than by the computational capacity of the cluster. The inherent scalability of your job depends quite strongly on the parameters of your reconstruction. In general larger projects scale better than smaller projects, but projects can be 'large' in several different ways (e.g. large box size, large number of particles, low symmetry, ...). If your cluster administrator complains that your jobs aren't making sufficient use of the CPUs you have allocated, you can try running on fewer processors, which will increase efficiency (but also increase run-times), or you can refer them to me and I will explain the issues involved. We are also developing tools to help better measure how efficiently your jobs are using the CPUs you allocate, but this will be an ongoing process.
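
As mentioned in item 2, here is one way to check whether a candidate scratch path is actually node-local. This is a sketch assuming GNU df; /localscratch is a placeholder for the path you are probing, and it must be run on a compute node (e.g. from an interactive batch session or a short test script).

# Print the filesystem type of the candidate path. Network types such as
# nfs or lustre mean the directory is shared, NOT node-local.
df -PT /localscratch | tail -1 | awk '{print $2}'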
