
Yambo GW+U implementation


TODO:

  • merge with updated yambo version (5.2.3)
  • prepare test-suite on corvina
    • TESTS: hBN standard, hBN + U, hBN + Covariant dipoles
  • add diagonal-only Hbare calculation as flag to Hbare_from_scratch + warning if ib/=ibp, in that case compute the full one
  • Recreate a parallel structure for the WF in Hbare_from_scratch before loading, following the QP case
  • Possible memory leak + issue with covariant dipoles

Method

Yambo right now

Linearized one-shot QP equation from textbook:

$$E^{\rm QP}_{n\mathbf{k}} = \epsilon^{\rm KS}_{n\mathbf{k}} + Z_{n\mathbf{k}}\left[\Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - V^{\rm xc}_{n\mathbf{k}}\right]. \tag{1}$$

In Yambo, $V^{\rm xc}_{n\mathbf{k}}$ is obtained via the libxc libraries.
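
For completeness, $Z_{n\mathbf{k}}$ in Eq. (1) is the usual renormalization factor of the linearized QP equation,

$$Z_{n\mathbf{k}} = \left[\, 1 - \left.\frac{\partial \Sigma^{\rm xc}_{n\mathbf{k}}(\omega)}{\partial \omega}\right|_{\omega = \epsilon^{\rm KS}_{n\mathbf{k}}} \right]^{-1}.$$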

GW+U

We know that, when a Hubbard $U$ term is included, the KS eigenvalues decompose as

$$\epsilon^{\rm KS}_{n\mathbf{k}} = h^{0}_{n\mathbf{k}} + V^{\rm H}_{n\mathbf{k}} + V^{\rm xc}_{n\mathbf{k}} + U_{n\mathbf{k}}. \tag{2}$$

$V^{\rm H}_{n\mathbf{k}}$ can be computed by Yambo in real space. We have already implemented the reconstruction of $h^{0}_{n\mathbf{k}}$, but we do not have access to $U_{n\mathbf{k}}$.

Strategy

We define

$$\tilde V^{\rm xc}_{n\mathbf{k}} \equiv V^{\rm xc}_{n\mathbf{k}} + U_{n\mathbf{k}} \tag{3}$$

and we obtain it via subtraction as

$$\tilde V^{\rm xc}_{n\mathbf{k}} = \epsilon^{\rm KS}_{n\mathbf{k}} - h^{0}_{n\mathbf{k}} - V^{\rm H}_{n\mathbf{k}}, \tag{4}$$

and then calculate the correction using again Eq. (1), i.e.,

$$E^{\rm QP}_{n\mathbf{k}} = \epsilon^{\rm KS}_{n\mathbf{k}} + Z_{n\mathbf{k}}\left[\Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - \tilde V^{\rm xc}_{n\mathbf{k}}\right]. \tag{5}$$

For this we only need to compute $h^{0}$ and $V^{\rm H}$, and the form of the QP equation is unchanged.
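
Note that, by the definition in Eq. (3), using $\tilde V^{\rm xc}$ in Eq. (5) amounts to

$$\Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - \tilde V^{\rm xc}_{n\mathbf{k}} = \Sigma^{\rm xc}_{n\mathbf{k}}(\epsilon^{\rm KS}_{n\mathbf{k}}) - V^{\rm xc}_{n\mathbf{k}} - U_{n\mathbf{k}},$$

i.e. the GW self-energy consistently replaces both the xc potential and the Hubbard correction contained in $\epsilon^{\rm KS}_{n\mathbf{k}}$.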

This strategy works at the level of XCo_driver.F and, in particular, replaces the subroutine XCo_local.F with a new one, XCo_from_H_minus_Hbare.F, which calculates $\tilde V^{\rm xc}$ by calling the subroutines for $h^{0}$ and $V^{\rm H}$, converting them to the QP basis, and then taking the difference with the KS eigenvalue.
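
As a minimal, standalone illustration of the subtraction of Eq. (4) on the diagonal of the QP basis (the array names and numbers below are placeholders, not the actual Yambo variables):

```fortran
program vxc_tilde_sketch
  ! Minimal sketch of Eq. (4) in the QP basis: Vxc~_nk = eps^KS_nk - h0_nk - VH_nk.
  ! Standalone illustration only: names and values are hypothetical.
  implicit none
  integer, parameter :: n_qp = 3
  real(8) :: eps_ks(n_qp), h0_qp(n_qp), vh_qp(n_qp), vxc_tilde(n_qp)
  integer :: i_qp
  eps_ks = [-1.10d0, -0.55d0,  0.20d0]   ! KS eigenvalues (Ry)
  h0_qp  = [-2.00d0, -1.30d0, -0.40d0]   ! <nk| h0 |nk>, as from Bare_Hamiltonian_from_Scratch
  vh_qp  = [ 1.15d0,  0.95d0,  0.75d0]   ! <nk| VH |nk>, Hartree potential in the QP basis
  do i_qp = 1, n_qp
    vxc_tilde(i_qp) = eps_ks(i_qp) - h0_qp(i_qp) - vh_qp(i_qp)   ! Eq. (4)
  end do
  print '(a,3f10.4)', ' Vxc~ (Ry): ', vxc_tilde
end program vxc_tilde_sketch
```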

From XCo_driver.F, present version:

```fortran
 !
 ! Vxc
 !=====
 !
 if (l_Vxc) then
   !
   call XCo_local(E,Xk)
   !
 endif
```

New XCo_driver.F proposed version:

```fortran
 !
 ! Vxc
 !=====
 !
 if (l_Vxc) then
   !
   if (l_use_libxc)     call XCo_local(E,Xk)
   if (l_use_qe_pseudo) call XCo_from_H_minus_Hbare(E,Xk,k,Dip)
   !
   !
 endif
```

This formulation has the advantages of (i) not relying on libxc and (ii) keeping the QP section of the code untouched.

The matrix elements of $h^{0}$ are computed in the Kohn-Sham basis in the subroutine Bare_Hamiltonian_from_Scratch. The required steps are: (i) obtaining the kinetic part (P_square) from the dipole routines; (ii) extracting the local part of the potential in $G$-space from QE, multiplying it by the structure factors and taking the FFT to real space; (iii) computing the contribution of the nonlocal part of the potential using the Kleinman-Bylander (KB) coefficients.
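
Schematically, and in our own notation (assuming standard norm-conserving pseudopotentials in Kleinman-Bylander form, not the code's internal variables), the three contributions are

$$h^{0}_{nn'\mathbf{k}} = \langle n\mathbf{k}|\hat T|n'\mathbf{k}\rangle + \langle n\mathbf{k}|V_{\rm loc}|n'\mathbf{k}\rangle + \langle n\mathbf{k}|\hat V_{\rm NL}|n'\mathbf{k}\rangle,$$

with

$$V_{\rm loc}(\mathbf{r}) = {\rm FFT}^{-1}\Big[\sum_{s} S_{s}(\mathbf{G})\, v^{s}_{\rm loc}(G)\Big], \qquad \hat V_{\rm NL} = \sum_{s,lm} |\beta^{s}_{lm}\rangle\, E^{\rm KB}_{s,l}\, \langle\beta^{s}_{lm}|,$$

where $\hat T$ is the kinetic operator obtained from P_square, $S_{s}(\mathbf{G})$ is the structure factor of atomic species $s$, $v^{s}_{\rm loc}$ its local pseudopotential, and $\beta^{s}_{lm}$, $E^{\rm KB}_{s,l}$ are the KB projectors and energies.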

Notes on implementation and running

This section collects implementation and running notes that are worth keeping track of.

  • We need the hamiltonian module of the code, which is defined only for yambo_rt or yambo_sc. We could define it for the yambo main executable, but we then encounter undefined variables (under #if defined _SC || _RT) in some subroutines. Thus, [!!!]so far the GW+U implementation is only accessible via yambo_sc[!!!].
  • The main subroutine XCo_from_H_minus_Hbare is called from XCo_driver: it performs some setup and allocations, then calls Bare_Hamiltonian_from_Scratch to get H0 (KS space) and Vh (r space); finally, it converts everything to QP space and obtains QP_Vxc.
  • The code now re-computes the G-shells in order to build vloc.
  • It is important to ensure that the G-shells and G-vectors are set up correctly during the initialization (pay attention to the G_thresh variable).
  • It is likely that we need to load the full density G-vectors from QE if we want very accurate results.
  • If we read a previously computed screening database, like ndb.pp, the report writes the error *ERR* G-shells even if the SAVE is the same and was initialised in the same way. This, however, does not stop the execution and the results are correct.
  • If high values of ecutwfc are used, the agreement in the one-electron energy (trace over occupied states) between Yambo-Hbare and QE worsens and may become of the order of meV (i.e., about meV/Nel per electronic level). This is due to the wave functions being loaded in single precision. By using double precision, the errors with respect to QE are restored to the order of 1e-5 meV. I don't believe this to be a huge problem for QP corrections, but running in double precision may be the safest option (see, however, the numerical tests below, which show that single precision seems OK for production runs).
  • We must be careful with nonlinear core corrections (NLCC): if they are used in the DFT calculation, the formulation $\tilde V^{\rm xc} = H - h^{0} - V^{\rm H}$ intrinsically includes them, which may be a problem.
  • At the moment, the self-energy MPI tasks q qp b must be 1 Nqp Nb, i.e. it is not possible to parallelize over q.

How to run

  • Use the devel branch named phys-gw-from-bareH on my fork. NOTE: it should be merged with the current yambo release and pass the test-suite.
  • Generate the SAVE using p2y -p, which produces the additional databases for the local potential and the KB projectors
  • Run with yambo_sc (see the example commands after this list)
  • The input file has several additions with respect to a standard GW input:
  1. Specific flags to turn on the H0 calculation and align the levels:

     ```
     BareHfromScratch
     BareHScaleFermi
     ```

  2. Specific dipole options, needed for the kinetic part of H0:

     ```
     dipoles
     % DipBands
       1 | Nb |                 # [DIP] Bands range for dipoles
     %
     DipBandsALL                # [DIP] Compute all bands range, not only valence and conduction
     ```

  3. [Optional] G-vectors maximised in testing for FFT and xc (but see the numerical tests below: standard production runs should not require this):

     ```
     FFTGvecs= Max_value        RL
     EXXRLvcs= Max_value        RL      # [XX] Exchange RL components
     VXCRLvcs= Max_value        RL      # [XC] XCpotential RL components
     ```
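
Putting the steps above together, a possible workflow is sketched below (input and job names are placeholders; `-F` and `-J` select the input file and the job string):

```
p2y -p                               # SAVE with the extra databases (local potential, KB projectors)
yambo_sc -F gw_plus_u.in -J gw_u     # GW run with Vxc obtained as H - H0 - VH
```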

Tests

We don't actually need a DFT+U system to test this: without a Hubbard term the new route must reproduce a standard GW calculation, so we compare against one.

  • System: bulk hBN, 12x12x1 kpts, 80 Ry cutoff, 40 bnds, no NLCC, double precision.
  • QPkrange: 1 | 19 | 1 | 12

Serial test

Timings

  • Reference calculation (libxc): [Time-Profile]: 01h-23m
```
 **XC_potential_driver** :     0.2666s CPU (   2 calls, 133.278 msec avg)
 **XCo_local** :               0.3704s CPU
 FFT_setup :                   0.5985s CPU (   2 calls, 299.243 msec avg)
 io_WF :                       0.1056s CPU ( 100 calls,   1.056 msec avg)
 WF_load_FFT :                 2.4818s CPU (  21 calls,   0.118 sec avg)
 io_DIPOLES :                  0.1282s CPU (  22 calls,   5.829 msec avg)
 DIPOLE_transverse :           2.1067s CPU
 Dipoles :                     3.2399s CPU
```

NB: HF + GW(ppa) is 1h-22m-58s

  • New implementation (qepseudo): [Time-Profile]: 01h-23m
```
 **Bare_Hamiltonian_from_Scratch** :   1.4728s CPU
 **XCo_from_H_minus_Hbare** :          2.2920s CPU
 **V_Hartree** :                       0.0053s CPU
 **PP_compute_becp** :                 0.0490s CPU (  19 calls,   2.577 msec avg)
 **PP_uspp_init** :                    0.1993s CPU (  21 calls,   9.491 msec avg)
 **V_real_space_to_H** :               0.8020s CPU (  38 calls,  21.106 msec avg)
 FFT_setup :                           0.7347s CPU (   4 calls, 183.685 msec avg)
 io_WF :                               0.1347s CPU ( 121 calls,   1.113 msec avg)
 WF_load_FFT :                         2.9919s CPU (  22 calls,   0.136 sec avg)
 io_DIPOLES :                          0.1803s CPU (  42 calls,   4.292 msec avg)
 DIPOLE_transverse :                   7.6371s CPU
 Dipoles :                             8.8087s CPU
```

NB: HF + GW(ppa) is 1h-22m-52s

Both XCo_from_H_minus_Hbare and the dipoles take more time than XCo_local and the regular dipoles. However, this overhead is negligible with respect to the HF+GW time.

See below for timing tests for parallel execution.

Comparison with QE quantities

  1. One-electron energy (i.e., $\mathrm{tr}_{v\mathbf{k}}\, h^{0}$)

| QE | Yambo TEST |
| --- | --- |
| -14.34038209260606 Ry | -14.34042613 Ry |

ERR: 0.599159485573194 meV*

*Fake error due to the limited number of significant digits in the manual Fermi-level shift and/or in the output precision; the real error is a couple of orders of magnitude lower. The meV errors are the differences of the Ry values (see the conversion check after this list).

  2. E_Hartree (i.e., $\mathrm{tr}_{v\mathbf{k}}\, V^{\rm H}$)

| QE | Yambo direct calc. | Yambo $H - h^{0} - V^{\rm xc}_{\rm libxc}$ |
| --- | --- | --- |
| 15.55811245 Ry | 15.558112447364758 Ry | 15.558111787762893 Ry |
| | ERR: 3.5854298254003755e-05 meV | ERR: 0.009010198112837818 meV* |

  3. V_xc energy (i.e., $\mathrm{tr}_{v\mathbf{k}}\, V^{\rm xc}$). V_xc is the central quantity computed in the new implementation, corresponding to QP_Vxc in the code.

| Yambo REF $V^{\rm xc}_{\rm libxc}$ | Yambo TEST $V^{\rm xc}_{\rm qepseudo}$ |
| --- | --- |
| -19.15953696907989 Ry | -19.159535649876165 Ry |

ERR: 0.017948640378714614 meV*

  4. E_xc

| Yambo REF $E^{\rm xc}_{\rm libxc}$ | QE REF |
| --- | --- |
| -14.63761 Ry | -14.63761011 Ry |

  5. KS-electron energy (i.e., $\mathrm{tr}_{v\mathbf{k}}\, H$)

| QE/Yambo REF | Yambo TEST $h^{0} + V^{\rm H} + V^{\rm xc}_{\rm libxc}$ |
| --- | --- |
| -8.292277439246057 Ry | -8.292276120042345 Ry |

ERR: 0.017948687387481878 meV*
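
The meV errors quoted in the tables above are simply the differences of the Ry values converted with the Rydberg energy; a quick check for the one-electron energy of the first table (using the CODATA value of the Rydberg in meV):

```fortran
program err_check
  ! Reproduce the quoted one-electron-energy error: |E_Yambo - E_QE| converted from Ry to meV.
  implicit none
  real(8), parameter :: Ry2meV  = 13605.693122994d0    ! 1 Ry in meV (CODATA)
  real(8), parameter :: e_qe    = -14.34038209260606d0 ! Ry, QE value
  real(8), parameter :: e_yambo = -14.34042613d0       ! Ry, Yambo TEST value
  print '(a,f16.12,a)', ' ERR = ', abs(e_yambo - e_qe)*Ry2meV, ' meV'   ! ~0.599159 meV
end program err_check
```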

Comparison with standard GW run

  • Comparison of QP energies

(Figure: comparison of QP energies. Blue: libxc run. Red: qepseudo run.)

Average deviation: 8.026315789952826e-04 meV
Max deviation: 2.3999999996249244e-02 meV

  • Dependence on G-vector cutoff

We change the $G$-dependent parameters together:

```
FFTGvecs= NGvcs        RL
EXXRLvcs= NGvcs        RL      # [XX] Exchange RL components
VXCRLvcs= NGvcs        RL      # [XC] XCpotential RL components
```

We first plot the dependence of the average error on the energy cutoff, defined as

$$\delta_1 E_{n\mathbf{k}} = E^{\rm qepseudo}_{n\mathbf{k}}[E_{\rm cutoff}] - E^{\rm libxc}_{n\mathbf{k}}[\max(E_{\rm cutoff})].$$

(Figure: QP-G_mean-error_test — average error vs G-vector cutoff.)

Max deviation at 44 Ry: 0.16 meV

The error grows quickly at intermediate cutoffs; however, it remains sub-meV and is comparable to the libxc error.

We can also check the discrepancy between libxc and qepseudo while reducing the cutoff; we plot the following error:

$$\delta_2 E_{n\mathbf{k}} = E^{\rm qepseudo}_{n\mathbf{k}}[E_{\rm cutoff}] - E^{\rm libxc}_{n\mathbf{k}}[E_{\rm cutoff}]$$

(Figure: QP-G_discrepancy_test — libxc vs qepseudo discrepancy as a function of the cutoff.)

  • Accuracy of single and double precision

Reference: libxc in double precision on QPkrange: 15 | 19 | 7 | 10 at the maximum cutoff of 160 Ry

| | libxc DP REF | qepseudo SP | qepseudo DP |
| --- | --- | --- | --- |
| Average error (meV) | (ref.) | 2.18e-05 | -7.00e-07 |
| Max error (meV) | (ref.) | 2.95e-04 | 2.10e-05 |

Single precision seems more than ok.

Parallel tests

  • Comparison of QP energies

System: same as for serial tests. Bulk hBN, 12x12x1 kpts, 80 Ry cutoff, 40 bnds, no NLCC, double precision.

QPkrange: 15 | 19 | 7 | 10

The SE roles are q qp b. Only qp appears as a loop in the new section in XCo_from_H_minus_Hbare. The loops over QP bands and QP kpoints in Bare_Hamiltonian_from_Scratch are not parallelised.
The MPI run works correctly when tasks are distributed over qp and/or b:

(Figure: QP_MPI1x4x2 — QP energies with MPI tasks distributed as 1x4x2.)

Average deviation: 7.000000000423334e-07 meV
Max deviation: 2.0999999999826713e-05 meV

Currently, q must be set to 1 because it creates issues with the parallelization scheme of the P_square dipoles:

 DIP_par_scheme="K"
 if(l_sc_run) DIP_par_scheme="SC"

 call DIPOLE_IO(k,E,Dip,'read',io_DIP_err,DIP_par_scheme)

I introduced a barrier in XCo_driver to avoid this case.

  • Parallel timings

MPI tests with

DIP_ROLEs= "k c v"
DIP_CPU= "1 **Ncpu** 1"
SE_ROLEs= "q qp b"
SE_CPU= "1 **Ncpu** 1"

Total timings are very similar, with the qepseudo implementation taking slightly more time.

(Figure: timings — total timings for the libxc and qepseudo implementations.)

The differences between the two implementations are mainly due to the time spent in Dipoles and in the Vxc subroutines.

Typically, the Vxc part takes around one second in all cases and is very slightly affected by the parallelisation. The longer Dipoles part takes up the majority of the additional time in the qepseudo implementation, and it is strongly affected by task distribution, having its own parallelization scheme.

Below, the ratios $(T_{\rm Vxc} + T_{\rm Dip})/T_{\rm tot}$ for the two cases.

(Figure: ratio — $(T_{\rm Vxc} + T_{\rm Dip})/T_{\rm tot}$ for the two cases.)

The ratio becomes larger as $T_{\rm tot}$ decreases because of the gains in the GW and HF parts, stabilising at about 7% and 14% in the libxc and qepseudo cases, respectively. The vertical lines in the plot mark the thresholds beyond which the Vxc subroutines (XCo_local and XCo_from_H_minus_Hbare) exhibit negative gains. This is not relevant for the overall timings, yet in the qepseudo case it already happens with 6 MPI tasks.

Validation

In order to validate the implementation, we have to compare it with existing GW@DFT+U calculations.

  • There is one on very complicated systems (perovskites), but at first we had better stick to oxides.
  • This paper (2012) by Patrick and Giustino contains results on TiO2-anatase.
  • This paper (2010) by Rinke, Scheffler et al. contains results on ScN, ZnS, MnO, FeO, CoO, and NiO.
tags: Yambo-devel GW GW+U