
Yambo GW+U implementation


TODO:

  • merge with updated yambo version (5.2.3)
  • prepare test-suite on corvina
    • TESTS: hBN standard, hBN + U, hBN + Covariant dipoles
  • add diagonal-only Hbare calculation as flag to Hbare_from_scratch + warning if ib/=ibp, in that case compute the full one
  • Recreate a parallel structure for the WF in Hbare_from_scratch before loading, following the QP case
  • Possible memory leak + issue with covariant dipoles

Method

Yambo right now

Linearized one-shot QP equation from textbook:

$$E^{\rm QP}_{n\mathbf{k}} = \epsilon^{\rm KS}_{n\mathbf{k}} + Z_{n\mathbf{k}}\left[\Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - V^{\rm xc}_{n\mathbf{k}}\right]. \tag{1}$$

In Yambo, $V^{\rm xc}_{n\mathbf{k}}$ is obtained via the libxc libraries.
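
For completeness, $Z_{n\mathbf{k}}$ in Eq. (1) is the usual renormalization factor of the linearized QP equation,

$$Z_{n\mathbf{k}} = \left[\, 1 - \left.\frac{\partial \Sigma^{\rm xc}_{n\mathbf{k}}(\omega)}{\partial \omega}\right|_{\omega = \epsilon^{\rm KS}_{n\mathbf{k}}} \right]^{-1}.$$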

GW+U

We know that, when a Hubbard $U$ term is included, the KS eigenvalues decompose as

$$\epsilon^{\rm KS}_{n\mathbf{k}} = h^{0}_{n\mathbf{k}} + V^{\rm H}_{n\mathbf{k}} + V^{\rm xc}_{n\mathbf{k}} + U_{n\mathbf{k}}. \tag{2}$$

$V^{\rm H}_{n\mathbf{k}}$ can be computed by Yambo in real space. We have already implemented the reconstruction of $h^{0}_{n\mathbf{k}}$, but we do not have access to $U_{n\mathbf{k}}$.

Strategy

We define

$$\tilde V^{\rm xc}_{n\mathbf{k}} \equiv V^{\rm xc}_{n\mathbf{k}} + U_{n\mathbf{k}} \tag{3}$$

and we obtain it via subtraction as

$$\tilde V^{\rm xc}_{n\mathbf{k}} = \epsilon^{\rm KS}_{n\mathbf{k}} - h^{0}_{n\mathbf{k}} - V^{\rm H}_{n\mathbf{k}}, \tag{4}$$

and then calculate the correction using again Eq. (1), i.e.,

$$E^{\rm QP}_{n\mathbf{k}} = \epsilon^{\rm KS}_{n\mathbf{k}} + Z_{n\mathbf{k}}\left[\Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - \tilde V^{\rm xc}_{n\mathbf{k}}\right]. \tag{5}$$

For this we only need to compute $h^{0}$ and $V^{\rm H}$, and the form of the QP equation is unchanged.
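
Note that, by the definition in Eq. (3), using $\tilde V^{\rm xc}$ in Eq. (5) amounts to

$$\Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - \tilde V^{\rm xc}_{n\mathbf{k}} = \Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - V^{\rm xc}_{n\mathbf{k}} - U_{n\mathbf{k}},$$

i.e. the GW self-energy consistently replaces both the xc potential and the Hubbard correction contained in $\epsilon^{\rm KS}_{n\mathbf{k}}$.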

This strategy works at the level of XCo_driver.F and, in particular, replaces the subroutine XCo_local.F with a new one, XCo_from_H_minus_Hbare.F, which calculates $\tilde V^{\rm xc}$ by calling the subroutines for $h^{0}$ and $V^{\rm H}$, converting them to the QP basis, and then taking the difference with the KS eigenvalue.
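
As a minimal, standalone illustration of the subtraction of Eq. (4) on the diagonal of the QP basis (the array names and numbers below are placeholders, not the actual Yambo variables):

```fortran
program vxc_tilde_sketch
  ! Minimal sketch of Eq. (4) in the QP basis: Vxc~_nk = eps^KS_nk - h0_nk - VH_nk.
  ! Standalone illustration only: names and values are hypothetical.
  implicit none
  integer, parameter :: n_qp = 3
  real(8) :: eps_ks(n_qp), h0_qp(n_qp), vh_qp(n_qp), vxc_tilde(n_qp)
  integer :: i_qp
  eps_ks = [-1.10d0, -0.55d0,  0.20d0]   ! KS eigenvalues (Ry)
  h0_qp  = [-2.00d0, -1.30d0, -0.40d0]   ! <nk| h0 |nk>, as from Bare_Hamiltonian_from_Scratch
  vh_qp  = [ 1.15d0,  0.95d0,  0.75d0]   ! <nk| VH |nk>, Hartree potential in the QP basis
  do i_qp = 1, n_qp
    vxc_tilde(i_qp) = eps_ks(i_qp) - h0_qp(i_qp) - vh_qp(i_qp)   ! Eq. (4)
  end do
  print '(a,3f10.4)', ' Vxc~ (Ry): ', vxc_tilde
end program vxc_tilde_sketch
```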

From XCo_driver.F, present version:

```fortran
 !
 ! Vxc
 !=====
 !
 if (l_Vxc) then
   !
   call XCo_local(E,Xk)
   !
 endif
```

New XCo_driver.F proposed version:

```fortran
 !
 ! Vxc
 !=====
 !
 if (l_Vxc) then
   !
   if (l_use_libxc)     call XCo_local(E,Xk)
   if (l_use_qe_pseudo) call XCo_from_H_minus_Hbare(E,Xk,k,Dip)
   !
   !
 endif
```

This formulation has the advantages of (i) not relying on libxc and (ii) keeping the QP section of the code untouched.

The matrix elements of $h^{0}$ are computed in the Kohn-Sham basis in the subroutine Bare_Hamiltonian_from_Scratch. The required steps are: (i) obtaining the kinetic part (P_square) from the dipole routines; (ii) extracting the local part of the potential in $G$-space from QE, multiplying it by the structure factors and taking the FFT to real space; (iii) computing the contribution of the nonlocal part of the potential using the Kleinman-Bylander (KB) coefficients.
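
Schematically, and in our own notation (assuming standard norm-conserving pseudopotentials in Kleinman-Bylander form, not the code's internal variables), the three contributions are

$$h^{0}_{nn'\mathbf{k}} = \langle n\mathbf{k}|\hat T|n'\mathbf{k}\rangle + \langle n\mathbf{k}|V_{\rm loc}|n'\mathbf{k}\rangle + \langle n\mathbf{k}|\hat V_{\rm NL}|n'\mathbf{k}\rangle,$$

with

$$V_{\rm loc}(\mathbf{r}) = {\rm FFT}^{-1}\Big[\sum_{s} S_{s}(\mathbf{G})\, v^{s}_{\rm loc}(G)\Big], \qquad \hat V_{\rm NL} = \sum_{s,lm} |\beta^{s}_{lm}\rangle\, E^{\rm KB}_{s,l}\, \langle\beta^{s}_{lm}|,$$

where $\hat T$ is the kinetic operator obtained from P_square, $S_{s}(\mathbf{G})$ is the structure factor of atomic species $s$, $v^{s}_{\rm loc}$ its local pseudopotential, and $\beta^{s}_{lm}$, $E^{\rm KB}_{s,l}$ are the KB projectors and energies.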

Notes on implementation and running

This section collects implementation and running notes that are worth keeping track of.

  • We need the hamiltonian module of the code, which is defined only for yambo_rt or yambo_sc. We could define it for the yambo main executable, but we then encounter undefined variables (under #if defined _SC || _RT) in some subroutines. Thus, [!!!]so far the GW+U implementation is only accessible via yambo_sc[!!!].
  • The main subroutine XCo_from_H_minus_Hbare is called from XCo_driver: it performs some setup and allocations, then calls Bare_Hamiltonian_from_Scratch to get H0 (KS space) and Vh (r space); finally, it converts everything to QP space and obtains QP_Vxc.
  • The code now re-computes the G-shells in order to build vloc.
  • It is important to ensure that the G-shells and G-vectors are set up correctly during the initialization (pay attention to the G_thresh variable).
  • It is likely that we need to load the full density G-vectors from QE if we want very accurate results.
  • If we read a previously computed screening database, like ndb.pp, the report writes the error *ERR* G-shells even if the SAVE is the same and was initialised in the same way. This, however, does not stop the execution and the results are correct.
  • If high values of ecutwfc are used, the agreement in the one-electron energy (trace over occupied states) between Yambo-Hbare and QE worsens and may become of the order of meV (i.e., about meV/Nel per electronic level). This is due to the wave functions being loaded in single precision. By using double precision, the errors with respect to QE are restored to the order of 1e-5 meV. I don't believe this to be a huge problem for QP corrections, but running in double precision may be the safest option (see, however, the numerical tests below, which show that single precision seems OK for production runs).
  • We must be careful with nonlinear core corrections (NLCC): if they are used in the DFT calculation, the formulation $\tilde V^{\rm xc} = H - h^{0} - V^{\rm H}$ intrinsically includes them, which may be a problem.
  • At the moment, the self-energy MPI tasks q qp b must be 1 Nqp Nb, i.e. it is not possible to parallelize over q.

How to run

  • Use the devel branch named phys-gw-from-bareH on my fork. NOTE: it should be merged with the current yambo release and pass the test-suite.
  • Generate the SAVE using p2y -p, which produces the additional databases for the local potential and the KB projectors
  • Run with yambo_sc (see the example commands after this list)
  • The input file has several additions with respect to a standard GW input:
  1. Specific flags to turn on the H0 calculation and align the levels:

     ```
     BareHfromScratch
     BareHScaleFermi
     ```

  2. Specific dipole options, needed for the kinetic part of H0:

     ```
     dipoles
     % DipBands
       1 | Nb |                 # [DIP] Bands range for dipoles
     %
     DipBandsALL                # [DIP] Compute all bands range, not only valence and conduction
     ```

  3. [Optional] G-vectors maximised in testing for FFT and xc (but see the numerical tests below: standard production runs should not require this):

     ```
     FFTGvecs= Max_value        RL
     EXXRLvcs= Max_value        RL      # [XX] Exchange RL components
     VXCRLvcs= Max_value        RL      # [XC] XCpotential RL components
     ```
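
Putting the steps above together, a possible workflow is sketched below (input and job names are placeholders; `-F` and `-J` select the input file and the job string):

```
p2y -p                               # SAVE with the extra databases (local potential, KB projectors)
yambo_sc -F gw_plus_u.in -J gw_u     # GW run with Vxc obtained as H - H0 - VH
```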

Tests

We don't actually need a DFT+U system to test this: without a Hubbard term the new route must reproduce a standard GW calculation, so we compare against one.

  • System: bulk hBN, 12x12x1 kpts, 80 Ry cutoff, 40 bnds, no NLCC, double precision.
  • QPkrange: 1 | 19 | 1 | 12

Serial test

Timings

  • Reference calculation (libxc): [Time-Profile]: 01h-23m
```
 **XC_potential_driver** :     0.2666s CPU (   2 calls, 133.278 msec avg)
 **XCo_local** :               0.3704s CPU
 FFT_setup :                   0.5985s CPU (   2 calls, 299.243 msec avg)
 io_WF :                       0.1056s CPU ( 100 calls,   1.056 msec avg)
 WF_load_FFT :                 2.4818s CPU (  21 calls,   0.118 sec avg)
 io_DIPOLES :                  0.1282s CPU (  22 calls,   5.829 msec avg)
 DIPOLE_transverse :           2.1067s CPU
 Dipoles :                     3.2399s CPU
```

NB: HF + GW(ppa) is 1h-22m-58s

  • New implementation (qepseudo): [Time-Profile]: 01h-23m
```
 **Bare_Hamiltonian_from_Scratch** :   1.4728s CPU
 **XCo_from_H_minus_Hbare** :          2.2920s CPU
 **V_Hartree** :                       0.0053s CPU
 **PP_compute_becp** :                 0.0490s CPU (  19 calls,   2.577 msec avg)
 **PP_uspp_init** :                    0.1993s CPU (  21 calls,   9.491 msec avg)
 **V_real_space_to_H** :               0.8020s CPU (  38 calls,  21.106 msec avg)
 FFT_setup :                           0.7347s CPU (   4 calls, 183.685 msec avg)
 io_WF :                               0.1347s CPU ( 121 calls,   1.113 msec avg)
 WF_load_FFT :                         2.9919s CPU (  22 calls,   0.136 sec avg)
 io_DIPOLES :                          0.1803s CPU (  42 calls,   4.292 msec avg)
 DIPOLE_transverse :                   7.6371s CPU
 Dipoles :                             8.8087s CPU
```

NB: HF + GW(ppa) is 1h-22m-52s

Both XCo_from_H_minus_Hbare and the dipoles take more time than XCo_local and the regular dipoles. However, this overhead is negligible with respect to the HF+GW time.

See below for timing tests for parallel execution.

Comparison with QE quantities

  1. One-electron energy (i.e., $\mathrm{tr}_{v\mathbf{k}}\, h^{0}$)

| QE | Yambo TEST |
| --- | --- |
| -14.34038209260606 Ry | -14.34042613 Ry |

ERR: 0.599159485573194 meV*

*Fake error due to the limited number of significant digits in the manual Fermi-level shift and/or in the output precision; the real error is a couple of orders of magnitude lower. The meV errors are the differences of the Ry values (see the conversion check after this list).

  2. E_Hartree (i.e., $\mathrm{tr}_{v\mathbf{k}}\, V^{\rm H}$)

| QE | Yambo direct calc. | Yambo $H - h^{0} - V^{\rm xc}_{\rm libxc}$ |
| --- | --- | --- |
| 15.55811245 Ry | 15.558112447364758 Ry | 15.558111787762893 Ry |
| | ERR: 3.5854298254003755e-05 meV | ERR: 0.009010198112837818 meV* |

  3. V_xc energy (i.e., $\mathrm{tr}_{v\mathbf{k}}\, V^{\rm xc}$). V_xc is the central quantity computed in the new implementation, corresponding to QP_Vxc in the code.

| Yambo REF $V^{\rm xc}_{\rm libxc}$ | Yambo TEST $V^{\rm xc}_{\rm qepseudo}$ |
| --- | --- |
| -19.15953696907989 Ry | -19.159535649876165 Ry |

ERR: 0.017948640378714614 meV*

  4. E_xc

| Yambo REF $E^{\rm xc}_{\rm libxc}$ | QE REF |
| --- | --- |
| -14.63761 Ry | -14.63761011 Ry |

  5. KS-electron energy (i.e., $\mathrm{tr}_{v\mathbf{k}}\, H$)

| QE/Yambo REF | Yambo TEST $h^{0} + V^{\rm H} + V^{\rm xc}_{\rm libxc}$ |
| --- | --- |
| -8.292277439246057 Ry | -8.292276120042345 Ry |

ERR: 0.017948687387481878 meV*
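
The meV errors quoted in the tables above are simply the differences of the Ry values converted with the Rydberg energy; a quick check for the one-electron energy of the first table (using the CODATA value of the Rydberg in meV):

```fortran
program err_check
  ! Reproduce the quoted one-electron-energy error: |E_Yambo - E_QE| converted from Ry to meV.
  implicit none
  real(8), parameter :: Ry2meV  = 13605.693122994d0    ! 1 Ry in meV (CODATA)
  real(8), parameter :: e_qe    = -14.34038209260606d0 ! Ry, QE value
  real(8), parameter :: e_yambo = -14.34042613d0       ! Ry, Yambo TEST value
  print '(a,f16.12,a)', ' ERR = ', abs(e_yambo - e_qe)*Ry2meV, ' meV'   ! ~0.599159 meV
end program err_check
```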

Comparison with standard GW run

  • Comparison of QP energies

(Figure: comparison of QP energies. Blue: libxc run. Red: qepseudo run.)

Average deviation: 8.026315789952826e-04 meV
Max deviation: 2.3999999996249244e-02 meV

  • Dependence on G-vector cutoff

We change the $G$-dependent parameters together:

```
FFTGvecs= NGvcs        RL
EXXRLvcs= NGvcs        RL      # [XX] Exchange RL components
VXCRLvcs= NGvcs        RL      # [XC] XCpotential RL components
```

We first plot the dependence of the average error on the energy cutoff, defined as

$$\delta_1 E_{n\mathbf{k}} = E^{\rm qepseudo}_{n\mathbf{k}}[E_{\rm cutoff}] - E^{\rm libxc}_{n\mathbf{k}}[\max(E_{\rm cutoff})].$$

(Figure: QP-G_mean-error_test — average error vs G-vector cutoff.)

Max deviation at 44 Ry: 0.16 meV

The error grows quickly at intermediate cutoffs; however, it remains sub-meV and is comparable to the libxc error.

We can also check the discrepancy between libxc and qepseudo while reducing the cutoff; we plot the following error:

$$\delta_2 E_{n\mathbf{k}} = E^{\rm qepseudo}_{n\mathbf{k}}[E_{\rm cutoff}] - E^{\rm libxc}_{n\mathbf{k}}[E_{\rm cutoff}]$$

(Figure: QP-G_discrepancy_test — libxc vs qepseudo discrepancy as a function of the cutoff.)

  • Accuracy of single and double precision

Reference: libxc in double precision on QPkrange: 15 | 19 | 7 | 10 at the maximum cutoff of 160 Ry

| | libxc DP REF | qepseudo SP | qepseudo DP |
| --- | --- | --- | --- |
| Average error (meV) | (ref.) | 2.18e-05 | -7.00e-07 |
| Max error (meV) | (ref.) | 2.95e-04 | 2.10e-05 |

Single precision seems more than ok.

Parallel tests

  • Comparison of QP energies

System: same as for serial tests. Bulk hBN, 12x12x1 kpts, 80 Ry cutoff, 40 bnds, no NLCC, double precision.

QPkrange: 15 | 19 | 7 | 10

The SE roles are q qp b. Only qp appears as a loop in the new section in XCo_from_H_minus_Hbare. The loops over QP bands and QP kpoints in Bare_Hamiltonian_from_Scratch are not parallelised.
The MPI run works correctly when tasks are distributed over qp and/or b:

(Figure: QP_MPI1x4x2 — QP energies with MPI tasks distributed as 1x4x2.)

Average deviation: 7.000000000423334e-07 meV
Max deviation: 2.0999999999826713e-05 meV

Currently, q must be set to 1 because it creates issues with the parallelization scheme of the P_square dipoles:

 DIP_par_scheme="K"
 if(l_sc_run) DIP_par_scheme="SC"

 call DIPOLE_IO(k,E,Dip,'read',io_DIP_err,DIP_par_scheme)

I introduced a barrier in XCo_driver to avoid this case.

  • Parallel timings

MPI tests with

DIP_ROLEs= "k c v"
DIP_CPU= "1 **Ncpu** 1"
SE_ROLEs= "q qp b"
SE_CPU= "1 **Ncpu** 1"

Total timings are very similar, with the qepseudo implementation taking slightly more time.

(Figure: timings — total timings for the libxc and qepseudo implementations.)

The differences between the two implementations are mainly due to the time spent in Dipoles and in the Vxc subroutines.

Typically, the Vxc part takes around one second in all cases and is very slightly affected by the parallelisation. The longer Dipoles part takes up the majority of the additional time in the qepseudo implementation, and it is strongly affected by task distribution, having its own parallelization scheme.

Below, the ratios $(T_{\rm Vxc} + T_{\rm Dip})/T_{\rm tot}$ for the two cases.

(Figure: ratio — $(T_{\rm Vxc} + T_{\rm Dip})/T_{\rm tot}$ for the two cases.)

The ratio becomes larger as $T_{\rm tot}$ decreases because of the gains in the GW and HF parts, stabilising at about 7% and 14% in the libxc and qepseudo cases, respectively. The vertical lines in the plot mark the thresholds beyond which the Vxc subroutines (XCo_local and XCo_from_H_minus_Hbare) exhibit negative gains. This is not relevant for the overall timings, yet in the qepseudo case it already happens with 6 MPI tasks.

Validation

In order to validate the implementation, we have to compare it with existing GW@DFT+U calculations.

  • There is one on very complicated systems (perovskites), but at first we had better stick to oxides.
  • This paper (2012) by Patrick and Giustino contains results on TiO2-anatase.
  • This paper (2010) by Rinke, Scheffler et al. contains results on ScN, ZnS, MnO, FeO, CoO, and NiO.
tags: Yambo-devel GW GW+U