esp « on: March 21, 2012, 02:17 »
I just got access to some machines with hundreds of cores and tons of memory. I want to make use of them but am having some difficulty installing ... do you know what this means?
/home/it1/patakye/ATK/atkpython/bin//atkpython: line 3: 26274 Killed PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
-------------------------- E.Pataky [  >-<

esp « Reply #1 on: March 22, 2012, 08:50 »
OK, I think I figured out what the issue was with the error I posted before: I had to submit "jobs" in a different way on these supercomputers. Now I have another question and would appreciate your help.
The link below has a few very specific examples of how to run ATK on these huge, powerful machines. If I am running things like LDOS or transmission calculations, how should I set them up per these examples? For example, I can specify how many nodes (up to 1000), processors (out of 8744), and memory per core. What would be best?
I set up a job for a transmission calculation that normally takes about 10-24 hours on my machines. On their machine I set it up with 8 nodes, 48 cores, and 1.5 GB memory per core. I want to make full use of these machines to make it run like the wind. How can I best do that with ATK?
https://www.msi.umn.edu/hardware/itasca/quickstart.html
Specifics of this machine:
1,086 compute nodes
2 interactive nodes
5 server nodes
8,744 total cores
26.184 TB total main memory
Suitable for: large MPI jobs
Each node:
Processors: two quad-core 2.8 GHz Intel Xeon X5560 "Nehalem EP"-class processors
Memory: 24 GB main memory

Anders Blom « Reply #2 on: March 22, 2012, 11:31 »
So that I don't waste time answering the wrong question: you are asking for recommendations about allocation (number of nodes/cores, etc.), and not (or not only) how to actually request them in your PBS script?

esp « Reply #3 on: March 22, 2012, 21:52 »
Yes, just the number of nodes, etc. I am not so familiar with this type of setup, so I don't know how it applies best to ATK. Also, I do not need scripts, but there are multiple methods of running parallel jobs, as you can see on the link I posted, and I do not know which is best: MPI, OpenMP, others ... the page has multiple short examples.

Anders Blom « Reply #4 on: March 22, 2012, 22:49 »
ATK can take advantage of both MPI and OpenMP (the latter to a lesser extent), but for your calculations I think all the benefit will lie in MPI. As a rule of thumb, the code will scale well up to roughly the number of k-points, NAxNB/2, for the self-consistent part at zero bias, whereas for finite bias you have a benefit up to 30-50 nodes due to the integration in the complex plane. The speed-up is, however, not linear, and you have to account for the likelihood of waiting very long in the queue if you request too many nodes. For analysis, like computing the LDOS or T(E), the scaling can easily be linear up to 100 nodes (the limit being, for instance, the number of energy points in T(E)).
I would recommend running on 16 MPI nodes; try 32 for some of the analysis.
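The rule of thumb above can be sketched as a small helper for picking a process count before submitting a job. This is only an illustration, not part of ATK: the function names, the rounding of NAxNB/2 up to a whole number, and the default finite-bias cap of 30 (the lower end of the quoted 30-50 range) are all assumptions made for the example.

```python
def scf_limit(na, nb, finite_bias=False, finite_bias_cap=30):
    """Rough upper bound on useful MPI processes for the self-consistent part.

    Zero bias: about NA x NB / 2 k-points (rounded up here, an assumption).
    Finite bias: the thread quotes a benefit up to 30-50 nodes; the cap used
    here is a hypothetical, conservative choice from that range.
    """
    if finite_bias:
        return finite_bias_cap
    return max(1, (na * nb + 1) // 2)


def analysis_limit(energy_points):
    """Rough upper bound for analysis such as T(E): one process per energy point."""
    return max(1, energy_points)


def capped(requested, limit):
    """Never request more processes than the estimated useful limit."""
    return max(1, min(requested, limit))


if __name__ == "__main__":
    # Hypothetical example: a 4x4 k-point grid at zero bias, then T(E) on 101 energies.
    print(capped(64, scf_limit(4, 4)))
    print(capped(256, analysis_limit(101)))
```

Since the speed-up is not linear and large requests wait longer in the queue, the result is best read as a ceiling, not a target.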

esp « Reply #5 on: March 22, 2012, 23:26 »
thank you

esp « Reply #6 on: March 22, 2012, 23:49 »
I had a few jobs that I ran this way sort of get stuck, and it seems like different nodes are trying to create the same files, and maybe the job died? So I have a question: I always use nlsave and nlprint, but I am seeing multiple printouts and multiple messages about trying to create the same file from within one of my scripts. This never happened before I moved to the MPI system. Shouldn't nlsave protect different nodes from saving the same file?

Anders Blom « Reply #7 on: March 23, 2012, 00:15 »
This is a well-known problem. It means your system uses OpenMPI (or similar) rather than MPICH2, which is required for ATK.
It is possible they have MPICH2 installed already (or a similar MPICH-compatible MPI library); if not, they will have to install it for you to run ATK.
I can see that Intel MPI is available on your system, and that will work. So you need to load that module.

esp « Reply #8 on: March 23, 2012, 22:42 »
Ahhh, thank you very much, I will try again ...

esp « Reply #9 on: March 23, 2012, 22:48 »
There is Intel ompi and Intel pmpi ... can I use either? Actually, last time I did use
module load pmpi/intel
and got the same error ... the file it says does not exist does exist.
Oh, they also have "impi": module load impi/intel
« Last Edit: March 23, 2012, 22:50 by esp »

Anders Blom « Reply #10 on: March 24, 2012, 12:56 »
There is a simple test to check if the parallelization is correctly set up. Enter the following into a script

    import socket
    if processIsMaster():
        print 'Master node:',
    else:
        print 'Slave node:',
    print socket.gethostname()

and execute it in parallel. Make sure to capture the output. It should print "Master" once and "Slave" N-1 times, where N is your -n N in mpiexec. This will be the signal that you have a proper MPI setup. For OpenMPI and its relatives it prints "Master" N times, however, and that tells you that all processes think they are masters, and will try to write to the NetCDF file, and this of course causes problems. For more information about running ATK in parallel, see the Parallel Tutorial.
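Checking the captured output by hand gets tedious for large N, so the counting can be sketched in plain Python. This is a hypothetical helper, not part of ATK: the function name and the sample hostnames are made up, and it assumes each process printed its line intact (with some MPI launchers, output from different processes can interleave).

```python
def mpi_setup_looks_correct(output, n):
    """Return True if the captured test output shows 1 master and n-1 slaves.

    `output` is the text captured from running the test script with -n n.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    masters = sum(1 for line in lines if line.startswith("Master node:"))
    slaves = sum(1 for line in lines if line.startswith("Slave node:"))
    return masters == 1 and slaves == n - 1


if __name__ == "__main__":
    # Hypothetical captured output from a correct 3-process run.
    good = "Master node: node001\nSlave node: node002\nSlave node: node003\n"
    # Hypothetical output where every process believes it is the master.
    bad = "Master node: node001\nMaster node: node002\nMaster node: node003\n"
    print(mpi_setup_looks_correct(good, 3))
    print(mpi_setup_looks_correct(bad, 3))
```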

esp « Reply #11 on: March 24, 2012, 22:27 »
OK, it is all working now. Thank you very much.

esp « Reply #12 on: March 26, 2012, 08:07 »
Hey, I finally ran a device and got some good results! I just had to post. This shows the on/off ratio and subthreshold slope for a graphene TFET, just one I picked randomly from a paper I read, but at least I got it to work now. These supercomputers sure are a luxury too, I must say. I am running 32 nodes with 8 processors each ... awesome.

esp « Reply #13 on: March 26, 2012, 08:10 »
and thank you guys for all the help!

Anders Blom « Reply #14 on: March 26, 2012, 09:44 »
Great! Parallel does help a lot, and once you get used to it you are hooked - you don't want to go back to serial.