Author Topic: Checkpoint handler and remote jobs  (Read 238 times)


Offline Sylvan

  • New ATK user
  • *
  • Posts: 5
  • Country: fr
  • Reputation: 0
    • View Profile
Checkpoint handler and remote jobs
« on: July 12, 2018, 15:50 »
Hello, I am working with VNL on a local Windows machine and running the calculations on a remote Linux cluster.
I haven't found clear instructions on how to use checkpoint files in a remote configuration.
So far, what I've tried is:
-   Create a script file specifying a local path for the checkpoint file (e.g. u'C:\User\[...]\Bi_nw\checkpointraw.hdf5')
-   Go into the script editor and change it to an absolute path on the server (e.g. u'/W/sb255620/VNL/checkpointraw.hdf5')
-   Create a blank “checkpoint.hdf5” file at the specified location on the server, and make it writable for all users (otherwise ATK throws an error when I start the calculation)
When I do this, the calculations finish, but once I download the result file to my local machine and try to read the band structure, I get the error shown in the picture below.
Apparently the results in the final hdf5 file depend on the checkpoint file.

What should I do instead? Should I use the same file for results and checkpoint? Or is there something to do with the I/O window in the Job Manager when I submit the job?
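
For readers who want to script the third step rather than do it by hand, here is a minimal sketch in plain Python (no ATK calls). The helper name create_blank_checkpoint is hypothetical, and the demo writes to a temporary directory instead of the cluster path; on the cluster you would pass the path given to the CheckpointHandler.

```python
import os
import tempfile


def create_blank_checkpoint(path):
    """Create an empty file at `path` (if missing) and make it read/write for all users."""
    # Append mode creates the file without truncating one that already exists.
    with open(path, "a"):
        pass
    # rw-rw-rw-, the equivalent of `chmod 666` on the cluster.
    os.chmod(path, 0o666)
    return path


if __name__ == "__main__":
    # Hypothetical location used only for this demonstration.
    demo_dir = tempfile.mkdtemp()
    checkpoint = create_blank_checkpoint(os.path.join(demo_dir, "checkpointraw.hdf5"))
    print(checkpoint, os.path.getsize(checkpoint))
```

This only reproduces the manual steps described above; it does not address the later problem of reading the result file on the local machine.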

Offline Petr Khomyakov

  • QuantumWise Staff
  • Supreme ATK Wizard
  • *****
  • Posts: 871
  • Country: dk
  • Reputation: 13
    • View Profile
Re: Checkpoint handler and remote jobs
« Reply #1 on: July 16, 2018, 10:01 »
Please post the actual Python script and log file for this calculation, as well as a text file with the full error details (only part of that text is visible in the image attached to your post).
« Last Edit: July 16, 2018, 10:04 by Petr Khomyakov »

Offline Sylvan

Re: Checkpoint handler and remote jobs
« Reply #2 on: July 16, 2018, 16:10 »
Of course, here is the Python script:

Code:
# -*- coding: utf-8 -*-
# -------------------------------------------------------------
# Bulk Configuration
# -------------------------------------------------------------

# Set up lattice
vector_a = [40.0, 0.0, 0.0]*Angstrom
vector_b = [0.0, 40.0, 0.0]*Angstrom
vector_c = [0.0, 0.0, 11.8619]*Angstrom
lattice = UnitCell(vector_a, vector_b, vector_c)

# Define elements
elements = [Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth,
            Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth,
            Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth,

[...]

            Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth,
            Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth,
            Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth, Bismuth]

# Define coordinates
fractional_coordinates = [[ 0.5           ,  0.171896319741,  0.192666573062],
                          [ 0.329512326692,  0.270327423819,  0.192666573062],
                          [ 0.443170775564,  0.270327423819,  0.192666573062],

[...]

                          [ 0.727316897743,  0.696862208155, -0.000000485839],
                          [ 0.443170775564,  0.795293312233, -0.000000485839],
                          [ 0.556829224436,  0.795293312233, -0.000000485839]]

# Set up configuration
bulk_configuration = BulkConfiguration(
    bravais_lattice=lattice,
    elements=elements,
    fractional_coordinates=fractional_coordinates
    )

# -------------------------------------------------------------
# Calculator
# -------------------------------------------------------------
#----------------------------------------
# Exchange-Correlation
#----------------------------------------
exchange_correlation = SOGGA.PBE

k_point_sampling = MonkhorstPackGrid(
    nc=7,
    force_timereversal=False,
    )
numerical_accuracy_parameters = NumericalAccuracyParameters(
    k_point_sampling=k_point_sampling,
    density_mesh_cutoff=220.0*Hartree,
    )

checkpoint_handler = CheckpointHandler(
    file_name=u'/W/sb255620/VNL/checkpointraw.hdf5',
    )

calculator = LCAOCalculator(
    exchange_correlation=exchange_correlation,
    numerical_accuracy_parameters=numerical_accuracy_parameters,
    checkpoint_handler=checkpoint_handler,
    )

bulk_configuration.setCalculator(calculator)
nlprint(bulk_configuration)
bulk_configuration.update()
nlsave('Rawcut_k=7.hdf5', bulk_configuration)

# -------------------------------------------------------------
# Bandstructure
# -------------------------------------------------------------
bandstructure = Bandstructure(
    configuration=bulk_configuration,
    route=['G', 'Z'],
    points_per_segment=50
    )
nlsave('Rawcut_k=7.hdf5', bandstructure)

And here is the log file (edited for brevity):
Code:
------------------------------------------------------
Allocation (N):    -n 8 -ppn 8
------------------------------------------------------
MPI nodes (NCPU):  1
Cores (NCORES):    8
Threads per process: automatic
------------------------------------------------------
Node list
s148
Core list
s148
s148
s148
s148
s148
s148
s148
s148
------------------------------------------------------
PBS: qsub is running on summer.cluster
PBS: originating queue is standard
PBS: executing queue is standard
PBS: execution mode is PBS_BATCH
PBS: current home directory is /home/sb255620
PBS: working directory is /W/sb255620/VNL/180711-oYoHx5Wc
PBS: job name is 180711-oYoHx5Wc
PBS: job identifier is 1833694.summer.cluster
PBS: PATH = /home/sb255620/QuantumATK/QuantumATK-2018.06/bin:/home/sb255620/QuantumWise/VNL-ATK-2017.12/bin:/bin:/sbin:/usr/X11R6/bin:/usr/sbin:/usr/bin:/etc:/home/cmd:/usr/ccs/bin:/usr/openwin/bin:/usr/dt/bin:/opt/sfw/bin:/home/prog/SUNWspro/bin:/home/prog/s1studio/ee/bin:/home/systeme/SUNWspro/bin:/home/systeme/SUNWste/bin:/usr/ucb:/usr/local/bin:.
PBS: node file is /var/spool/torque/aux//1833694.summer.cluster
------------------------------------------------------
+------------------------------------------------------------------------------+
|                                                                              |
| QuantumATK 2018.06[Build 1745fc0]                                            |
|                                                                              |
+------------------------------------------------------------------------------+
+----------------------------------------------------------+
| Bulk Bravais lattice                                     |
+----------------------------------------------------------+
Type:
UnitCell

Lattice constants:

Primitive vectors:
u_1 =     40.000000      0.000000      0.000000 Ang
u_2 =      0.000000     40.000000      0.000000 Ang
u_3 =      0.000000      0.000000     11.861900 Ang

+----------------------------------------------------------+
| Bulk: Cartesian (Angstrom) / fractional                  |
+----------------------------------------------------------+
182
Bulk
Bi    2.000000e+01  6.875853e+00  2.285392e+00    0.50000  0.17190  0.19267
Bi    1.318049e+01  1.081310e+01  2.285392e+00    0.32951  0.27033  0.19267
Bi    1.772683e+01  1.081310e+01  2.285392e+00    0.44317  0.27033  0.19267

[...]

Bi    2.909268e+01  2.787449e+01 -5.762974e-06    0.72732  0.69686 -0.00000
Bi    1.772683e+01  3.181173e+01 -5.762974e-06    0.44317  0.79529 -0.00000
Bi    2.227317e+01  3.181173e+01 -5.762974e-06    0.55683  0.79529 -0.00000
+------------------------------------------------------------------------------+
|                                                                              |
| DFT Calculation  [Started Wed Jul 11 16:22:46 2018]                          |
|                                                                              |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| CPU Information                                                              |
|                                                                              |
+------------------------------------------------------------------------------+
|  Process ID 0 at s148 (2 threads)                                            |
|  Process ID 1 at s148 (2 threads)                                            |
|  Process ID 2 at s148 (2 threads)                                            |
|  Process ID 3 at s148 (2 threads)                                            |
|  Process ID 4 at s148 (2 threads)                                            |
|  Process ID 5 at s148 (2 threads)                                            |
|  Process ID 6 at s148 (2 threads)                                            |
|  Process ID 7 at s148 (2 threads)                                            |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Kinetic Matrix : ==================================================

                            |--------------------------------------------------|
Calculating Nonlocal Part  : ==================================================

                            |--------------------------------------------------|
Calculating Nonlocal Part  : ==================================================

                            |--------------------------------------------------|
Calculating Nonlocal Part  : ==================================================
+------------------------------------------------------------------------------+
|                                                                              |
| SCF Loop Information                                                         |
|                                                                              |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| K-point grid: 1 x 1 x 7                                                      |
| Number of irreducible k-points: 7                                            |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Real space grid sampling is (505, 505, 150) in a, b, and c directions.       |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Memory requirements for the calculation                                      |
+------------------------------------------------------------------------------+
| Dense matrices: 157 MB per matrix [Matrix dimensions 4550 x 4550]            |
| Total memory required per k-point: 473 MB                                    |
|                                                                              |
| Storage of real-space orbitals: Disabled                                     |
| Storage requires 720 MB                                                      |
|                                                                              |
| Total memory required per real-space grid: 1.17 GB                           |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| SCF History                                                                  |
+------------------------------------------------------------------------------+
| Memory required to store SCF history: 5.86 GB                                |
| Number of history steps: 20                                                  |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Checkpoint Handler                                                           |
+------------------------------------------------------------------------------+
| Filename : /W/sb255620/VNL/checkpointraw.hdf5                                |
| Interval : 0.5 h                                                             |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Diagonalization solver parallelization report                                |
+------------------------------------------------------------------------------+
| Total number of processes: 8                                                 |
| Total number of k-points: 7                                                  |
| Processes per k-point: 1                                                     |
+------------------------------------------------------------------------------+
| Process occupation                                                           |
+------------------------------------------------------------------------------+
| Process 0: |===============================================================| |
| Process 1: |===============================================================| |
| Process 2: |===============================================================| |
| Process 3: |===============================================================| |
| Process 4: |===============================================================| |
| Process 5: |===============================================================| |
| Process 6: |===============================================================| |
| Process 7: |                                                               | |
+------------------------------------------------------------------------------+
| WARNING: Some processes are idle.                                            |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Eigenvalues    : ==================================================
Calculating Density Matrix : ==================================================

+------------------------------------------------------------------------------+
| Density Matrix Report                         DM[U]     DM[D]      DD        |
+------------------------------------------------------------------------------+
|   0  Bi   [  20.000 ,   6.876 ,   2.285 ]    7.94291   6.93789  -0.11920     |
|   1  Bi   [  13.180 ,  10.813 ,   2.285 ]    7.81446   7.11966  -0.06589     |
|   2  Bi   [  17.727 ,  10.813 ,   2.285 ]    7.75630   7.30821   0.06451     |
|   3  Bi   [  22.273 ,  10.813 ,   2.285 ]    7.75630   7.30821   0.06451     |

[...]

|  179  Bi   [  29.093 ,  27.874 ,  -0.000 ]    7.80852   7.14572  -0.04576    |
|  180  Bi   [  17.727 ,  31.812 ,  -0.000 ]    7.82164   7.13865  -0.03971    |
|  181  Bi   [  22.273 ,  31.812 ,  -0.000 ]    7.82164   7.13865  -0.03971    |
+------------------------------------------------------------------------------+
|   0 E = -2132.12 dE =  3.444100e+01 dH =  1.077753e-01                       |
+------------------------------------------------------------------------------+

[...]

+------------------------------------------------------------------------------+
| Calculation Converged in 29 steps                                            |
|                                                                              |
| Fermi Level  = -4.274652 eV                                                  |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| DFT Calculation  [Finished Thu Jul 12 13:25:20 2018]                         |
|                                                                              |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| Bandstructure Analysis                                                       |
|                                                                              |
+------------------------------------------------------------------------------+

                           |--------------------------------------------------|
Calculating Bandstructure : ==================================================

Timing:                          Total     Per Step        %

--------------------------------------------------------------------------------

Diagonalization         :   61221.98 s    2111.10 s      74.48% |=============|
Real Space Integral     :    5636.02 s     187.87 s       6.86% ||
Valence Density         :    2679.99 s      89.33 s       3.26% ||
Exchange-Correlation    :    1902.03 s      63.40 s       2.31% |
Hartree Potential       :     191.05 s       6.37 s       0.23% |
Core Density            :      91.54 s      91.54 s       0.11% |
Setting Density Matrix  :      88.10 s      88.10 s       0.11% |
Mixing                  :      69.65 s       2.40 s       0.08% |
Constant Terms          :      35.20 s      35.20 s       0.04% |
Difference Density      :      30.93 s       1.00 s       0.04% |
Loading Modules + MPI   :      18.39 s      18.39 s       0.02% |
Real Space Basis        :       9.26 s       9.26 s       0.01% |
Neutral Atom Potential  :       7.30 s       7.30 s       0.01% |
Hubbard Term            :       0.00 s       0.00 s       0.00% |
Fixed Spins Term        :       0.00 s       0.00 s       0.00% |
Basis Set Generation    :       0.00 s       0.00 s       0.00% |
--------------------------------------------------------------------------------
Total                   :   82194.93 s (22h49m54.93s)

The error message is as follows:
Code:
Traceback (most recent call last):
  File "zipdir\NL\GUI\MainWindow\LabFloor\LabFloorModel.py", line 327, in load
  File "C:\Program Files\QuantumATK\QuantumATK-2018.06\Lib\site-packages\AddOns\ATKDataReader\ATKDataReader.py", line 291, in load
    filename, object_id=object_id, read_state=read_full, lightweight=lightweight)[0]
  File "zipdir\NL\IO\NLSaveUtilities.py", line 847, in nlread
  File "zipdir\NL\IO\HDF5.py", line 490, in readHDF5
  File "zipdir\NL\IO\HDF5.py", line 576, in readHDF5Group
  File "zipdir\NL\IO\HDF5.py", line 537, in readHDF5Dict
  File "zipdir\NL\IO\HDF5.py", line 609, in readHDF5Group
  File "zipdir\NL\IO\HDF5.py", line 537, in readHDF5Dict
  File "zipdir\NL\IO\HDF5.py", line 597, in readHDF5Group
  File "zipdir\NL\IO\Serializable.py", line 318, in _fromVersionedData
  File "zipdir\NL\CommonConcepts\Calculator.py", line 68, in _createObject
  File "zipdir\NL\QuantumATK\ScopeExecuter.py", line 214, in scope_execute
NLScopeExecutionError: The checkpoint file /W/sb255620/VNL/checkpointraw.hdf5 is not writable. Please make sure that a writable directory is selected, or disable checkpoint_handler on the calculator.
(Do note that the reason for the error "The checkpoint file /W/sb255620/VNL/checkpointraw.hdf5 is not writable" is that this file is on the cluster and not on my local machine.)
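
To illustrate why a cluster path fails this kind of check on a local machine, here is a plain-Python sketch of a writability test. It is not ATK's actual implementation, which is not shown in this thread; the function name checkpoint_is_writable is hypothetical.

```python
import os
import tempfile


def checkpoint_is_writable(path):
    """Return True if `path` could be written on the machine running this code."""
    if os.path.exists(path):
        return os.access(path, os.W_OK)
    # A file that does not exist yet can only be created in an existing, writable directory.
    directory = os.path.dirname(path) or "."
    return os.path.isdir(directory) and os.access(directory, os.W_OK)


if __name__ == "__main__":
    local_dir = tempfile.mkdtemp()
    # A path inside an existing local directory passes the check.
    print(checkpoint_is_writable(os.path.join(local_dir, "checkpointraw.hdf5")))
    # A path whose directory does not exist on this machine fails it,
    # which is the situation of a cluster-only path opened locally.
    print(checkpoint_is_writable(os.path.join(local_dir, "no_such_dir", "checkpointraw.hdf5")))
```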

Offline Petr Khomyakov

Re: Checkpoint handler and remote jobs
« Reply #3 on: July 17, 2018, 16:53 »
-   Create a script file specifying a local path for the checkpoint file (i.e. u’C:\User\[...]\Bi_nw\checkpointraw.hdf5’)
-   Go into the script editor and change it to an absolute path on the server (i.e. u’/W/sb255620/VNL/checkpointraw.hdf5’)
You should use a relative (not absolute) path to the directory where the checkpoint file is supposed to be created, e.g., u'../tmp/checkpointraw.hdf5', if the 'tmp' directory is created in the working directory.

-   Create a blank “checkpoint.hdf5” file at the specified location on the server, and make it writable for all users (otherwise atk will throw an error when I start the calculation)
That is a bug, and you have found the work-around: first create a blank checkpoint file in the checkpoint file directory. Thank you for reporting it; we will fix it in the next release, so that it will be possible to set u'checkpoint.hdf5' in the Scripter and have the checkpoint file created directly in the working directory with all the other (py, log, hdf5) files.

When I do this, the calculations finish, but once I download the result file on my local machine and try to read the band structure, I get the error shown in the picture below.
Apparently the results in the final hdf5 file are dependent on the checkpoint file.
That is another bug related to the use of an absolute path to the checkpoint file directory. Please use relative paths as a work-around. We will fix this issue in the next release.
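
As a rough sketch of the relative-path work-around, the snippet below converts the absolute checkpoint path from the script into one relative to the job's working directory, using only the Python standard library. The helper name relative_checkpoint_path is hypothetical, and it is an assumption here that the job resolves relative paths against the PBS working directory shown in the log.

```python
import posixpath


def relative_checkpoint_path(checkpoint_path, working_dir):
    """Return `checkpoint_path` expressed relative to `working_dir` (POSIX paths)."""
    if not posixpath.isabs(checkpoint_path):
        # Already relative: only normalise it.
        return posixpath.normpath(checkpoint_path)
    return posixpath.relpath(checkpoint_path, start=working_dir)


if __name__ == "__main__":
    # Both paths are taken from the script and log posted earlier in the thread.
    working_dir = "/W/sb255620/VNL/180711-oYoHx5Wc"
    absolute = "/W/sb255620/VNL/checkpointraw.hdf5"
    print(relative_checkpoint_path(absolute, working_dir))
    # The result would then go into the ATK script, for example:
    # checkpoint_handler = CheckpointHandler(file_name=u'../checkpointraw.hdf5')
```

posixpath is used instead of os.path so that the conversion behaves the same when the script is prepared on a Windows machine.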

 

Offline Sylvan

Re: Checkpoint handler and remote jobs
« Reply #4 on: July 19, 2018, 15:05 »
Thank you Petr, I will try that.

Offline Taviouso

  • New ATK user
  • *
  • Posts: 1
  • Country: us
  • Reputation: 0
    • View Profile
Re: Checkpoint handler and remote jobs
« Reply #5 on: July 29, 2018, 03:25 »
The first two tips, especially the one about relative paths, helped me solve my own issue. Glad I opened this topic.

Offline Anders Blom

  • QuantumWise Staff
  • Supreme ATK Wizard
  • *****
  • Posts: 4958
  • Country: dk
  • Reputation: 78
    • View Profile
    • QuantumWise
Re: Checkpoint handler and remote jobs
« Reply #6 on: October 2, 2018, 22:37 »
This bug was fixed in release 2018.06-SP1-1.