TODO:
Linearized one-shot QP equation from textbook:

$$E^{\rm QP}_{n\mathbf{k}} = \epsilon_{n\mathbf{k}} + Z_{n\mathbf{k}}\,\big[\,\Sigma_{n\mathbf{k}}(\epsilon_{n\mathbf{k}}) - V^{xc}_{n\mathbf{k}}\,\big] \qquad (1)$$

with $Z_{n\mathbf{k}} = \big[\,1 - \partial\Sigma_{n\mathbf{k}}(\omega)/\partial\omega\,\big|_{\omega=\epsilon_{n\mathbf{k}}}\big]^{-1}$.
In Yambo, $V^{xc}_{n\mathbf{k}}$ is obtained via the libxc libraries.
We know that, when a Hubbard U-term is included, the Hartree potential $V_H$ can still be computed by Yambo in real space. We have already implemented the reconstruction of the bare Hamiltonian $H_0$, but we don't have direct access to the full exchange-correlation potential $V^{xc}+V^{U}$.
We define $V^{xc}_{n\mathbf{k}}$ as the matrix element of the full exchange-correlation potential (including the Hubbard term), and we obtain it via subtraction. Since the KS states diagonalize the full Hamiltonian, $\langle n\mathbf{k}|H|n\mathbf{k}\rangle = \epsilon_{n\mathbf{k}}$, so

$$V^{xc}_{n\mathbf{k}} = \epsilon_{n\mathbf{k}} - \langle n\mathbf{k}|H_0|n\mathbf{k}\rangle - \langle n\mathbf{k}|V_H|n\mathbf{k}\rangle ,$$

and we then calculate the QP correction using again Eq. (1), i.e.,

$$E^{\rm QP}_{n\mathbf{k}} = \epsilon_{n\mathbf{k}} + Z_{n\mathbf{k}}\,\big[\,\Sigma_{n\mathbf{k}}(\epsilon_{n\mathbf{k}}) - V^{xc}_{n\mathbf{k}}\,\big].$$

For this we only need to compute $\langle n\mathbf{k}|H_0|n\mathbf{k}\rangle$ and $\langle n\mathbf{k}|V_H|n\mathbf{k}\rangle$, and we don't change the form of the QP equation.
This strategy works at the level of `XCo_driver.F` and in particular replaces the subroutine `XCo_local.F` with a new one, `XCo_from_H_minus_Hbare.F`, which calculates $V^{xc}_{n\mathbf{k}}$ by calling the subroutines for $H_0$ and $V_H$, converting them to the QP basis, and then taking the difference with the KS eigenvalue.
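To make the bookkeeping explicit, here is a minimal NumPy sketch of this subtraction (illustrative only, not the actual Fortran; array names and shapes are my assumptions):

```python
import numpy as np

def vxc_from_h_minus_hbare(eps, H0_ks, Vh_ks, qp_table):
    """
    V^xc_{nk} = eps_{nk} - <nk|H0|nk> - <nk|Vh|nk> for the states to be corrected.

    eps      : KS eigenvalues, shape (nbands, nkpts)
    H0_ks    : bare Hamiltonian in the KS basis, shape (nbands, nbands, nkpts)
    Vh_ks    : Hartree potential in the KS basis, shape (nbands, nbands, nkpts)
    qp_table : list of (band n, kpoint ik) pairs, i.e. the QPkrange states
    """
    QP_Vxc = np.empty(len(qp_table))
    for i, (n, ik) in enumerate(qp_table):
        # at the one-shot level the "QP basis" is just the KS states in the QP range,
        # so only the diagonal matrix elements are needed
        QP_Vxc[i] = eps[n, ik] - H0_ks[n, n, ik].real - Vh_ks[n, n, ik].real
    return QP_Vxc
```

The result plays the role of `QP_Vxc`, which then enters Eq. (1) exactly as in the standard libxc route.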
From `XCo_driver.F`, present version:

New proposed version of `XCo_driver.F`:
This formulation has the features of (i) not using `libxc` and (ii) keeping the QP section of the code untouched.
The matrix elements of $H_0$ are computed in Kohn-Sham space in the subroutine `Bare_Hamiltonian_from_Scratch`. The required steps are (i) obtaining the kinetic part (`P_square`) from the dipole routines; (ii) extracting the local part of the potential in G-space from QE, multiplying it by the structure factors and taking the FFT to real space; (iii) computing the contribution of the nonlocal part of the potential using the KB coefficients.
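As a rough illustration of steps (i)-(iii), here is a self-contained NumPy toy (not the Yambo routines; names, conventions and normalizations are my assumptions) assembling the diagonal matrix element of the bare Hamiltonian for a single band:

```python
import numpy as np

def h0_expectation(c_G, miller, kpt, b, vloc_G, tau, beta_G, dion, grid):
    """
    Toy <nk|H0|nk> = <T> + <V_loc> + <V_NL> for one band, one atomic species.
    c_G    : plane-wave coefficients of the band, normalized (sum |c_G|^2 = 1)
    miller : integer Miller indices of the G vectors, shape (nG, 3)
    kpt, b : k-point and reciprocal-lattice vectors (rows), cartesian
    vloc_G : local pseudopotential form factor V_loc(G), shape (nG,)
    tau    : atomic positions, shape (nat, 3)
    beta_G : Kleinman-Bylander projectors in the same G basis, shape (nG, nproj)
    dion   : KB coefficients D_ij, shape (nproj, nproj)
    grid   : FFT grid shape, e.g. (18, 18, 18)
    """
    miller = np.asarray(miller)
    gcart = miller @ np.asarray(b)                          # cartesian G vectors

    # (i) kinetic part: sum_G |c_G|^2 |k+G|^2 / 2
    kin = 0.5 * np.sum(np.abs(c_G) ** 2 * np.sum((np.asarray(kpt) + gcart) ** 2, axis=1))

    # (ii) local part: V_loc(G) times the structure factor, FFT to real space, <psi|V(r)|psi>
    struct = np.exp(-1j * gcart @ np.asarray(tau).T).sum(axis=1)
    Vbox = np.zeros(grid, dtype=complex)
    Pbox = np.zeros(grid, dtype=complex)
    idx = tuple((miller % np.array(grid)).T)                # place G components on the FFT grid
    Vbox[idx] = vloc_G * struct
    Pbox[idx] = c_G
    N = np.prod(grid)
    v_r, psi_r = N * np.fft.ifftn(Vbox), N * np.fft.ifftn(Pbox)
    loc = np.real(np.sum(np.abs(psi_r) ** 2 * v_r)) / N

    # (iii) nonlocal part in Kleinman-Bylander form: sum_ij <psi|beta_i> D_ij <beta_j|psi>
    p = beta_G.conj().T @ c_G
    nl = np.real(p.conj() @ dion @ p)

    return kin + loc + nl
```

In the actual implementation the kinetic term comes from the `P_square` dipoles, while the local potential and the KB data come from the additional databases generated by `p2y -p` (see the notes below).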
This section contains notes on the implementation that may be useful to write down:

- The implementation uses the `hamiltonian` module of the code, which is defined only for `yambo_rt` or `yambo_sc`. We could define it for the `yambo` main, but we then encounter undefined variables (under `#if defined _SC || _RT`) in some subroutines. Thus, **so far the GW+U implementation is only accessible via `yambo_sc`**.
- `XCo_from_H_minus_Hbare` is called from `XCo_driver`: it performs some setup and allocations, then calls `Bare_Hamiltonian_from_Scratch` to get `H0` (KS space) and `Vh` (real space); finally, it converts to QP space and gets `QP_Vxc`.
- `vloc` … (`G_thresh` variable).
- `ndb.pp`: the reports write the error `*ERR* G-shells` even if the `SAVE` is the same and was initialised in the same way. This, however, doesn't stop the execution and the results are correct.
- `ecutwfc`: the agreement in the one-electron energy (trace over occupied states) between Yambo-Hbare and QE worsens and may become of the order of meV (i.e., meV/Nel per electron level). This is due to the wave functions being loaded in single precision. Using double precision, the errors with respect to QE are restored to the order of 1e-5 meV. I don't believe this to be a huge problem for QP corrections, but running in double precision may be the safest option (but see the numerical tests below, which show that single precision seems OK for production runs). `V_xc = H - H0 - Vh` intrinsically includes them, which may be a problem.
- The SE roles `q qp b` must be set to `1 Nqp Nb`, i.e. it is not possible to parallelize over `q`.
- The code is in the branch `phys-gw-from-bareH` on my fork. NOTE: it should be merged with the current Yambo release and pass the test-suite.
- The `SAVE` must be generated using `p2y -p`, in order to create the additional databases for the local potential and the KB projectors.
- Run `yambo_sc` for the `H0` calculation and align the levels.
We don't actually need to use a DFT+U system to test this. We should compare to a standard GW calculation.
`QPkrange`: `1 | 19 | 1 | 12`
[Time-Profile]: 01h-23m
NB: HF + GW(ppa) is 1h-22m-58s
[Time-Profile]: 01h-23m
NB: HF + GW(ppa) is 1h-22m-52s
Both `XCo_from_H_minus_Hbare` and the dipoles take more time than `XCo_local` and the regular dipoles. However, this is negligible with respect to the HF+GW time.
See below for timing tests for parallel execution.
QE | Yambo TEST |
---|---|
-14.34038209260606 Ry | -14.34042613 Ry |
ERR: | 0.599159485573194 meV* |
*Fake error due to the limited significant digits in the manual Fermi-level shift and/or the output precision; the real error is a couple of orders of magnitude lower.
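For reference, the ERR rows are just the Ry difference converted to meV; checking the table above (1 Ry ≈ 13605.69 meV):

```python
RY_TO_MEV = 13_605.693123            # 1 Rydberg in meV
qe, yambo_test = -14.34038209260606, -14.34042613
print(abs(qe - yambo_test) * RY_TO_MEV)   # ~0.599 meV, matching the ERR row
```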
QE | Yambo direct calc. | Yambo |
---|---|---|
15.55811245 Ry | 15.558112447364758 Ry | 15.558111787762893 Ry |
ERR: | 3.5854298254003755e-05 meV | 0.009010198112837818 meV* |
This corresponds to `QP_Vxc` in the code.

Yambo REF | Yambo TEST |
---|---|
-19.15953696907989 Ry | -19.159535649876165 Ry |
ERR: | 0.017948640378714614 meV* |
Yambo REF | QE REF |
---|---|
-14.63761 Ry | -14.63761011 Ry |
QE/Yambo REF | Yambo TEST |
---|---|
-8.292277439246057 Ry | -8.292276120042345 Ry |
ERR: | 0.017948687387481878 meV* |
Blue: `libxc` run. Red: `qepseudo` run.
Average deviation: 8.026315789952826e-04 meV
Max deviation: 2.3999999996249244e-02 meV
We change the cutoff-dependent parameters together:
We first plot the dependence of the average error on the energy cutoff, defined like this:
Max deviation at 44 Ry: 0.16 meV
The error explodes at intermediate cutoffs; however, it still remains sub-meV and is comparable to the libxc error.
We can also check the discrepancy between `libxc` and `qepseudo` while reducing the cutoff; we plot the following error:
Reference: `libxc` in double precision on `QPkrange`: `15 | 19 | 7 | 10`, at the maximum cutoff of 160 Ry.
Error vs libxc DP REF | qepseudo SP | qepseudo DP |
---|---|---|
Average error (meV) | 2.18e-05 | -7.00e-07 |
Max error (meV) | 2.95e-04 | 2.10e-05 |
Single precision seems more than ok.
System: same as for serial tests. Bulk hBN, 12x12x1 kpts, 80 Ry cutoff, 40 bnds, no NLCC, double precision.
`QPkrange`: `15 | 19 | 7 | 10`
The SE roles are `q qp b`. Only `qp` appears as a loop in the new section in `XCo_from_H_minus_Hbare`. The loops over QP bands and QP k-points in `Bare_Hamiltonian_from_Scratch` are not parallelised.
The MPI run works correctly when tasks are distributed over `qp` and/or `b`:
Average deviation: 7.000000000423334e-07 meV
Max deviation: 2.0999999999826713e-05 meV
Currently, `q` must be set to 1 because it creates issues with the parallelization scheme of the `P_square` dipoles. I introduced a barrier in `XCo_driver` to avoid this case.
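The barrier amounts to refusing, at setup time, any SE task distribution that puts more than one task on `q`. A minimal sketch of such a check (Python pseudocode with illustrative names, not the actual Fortran; the strings mirror the SE roles/CPU entries of the input):

```python
def check_se_parallelism(se_cpu: str, se_roles: str = "q qp b") -> None:
    """Reject SE task distributions that parallelize over q, e.g. '2 2 4'."""
    counts = dict(zip(se_roles.split(), (int(n) for n in se_cpu.split())))
    if counts.get("q", 1) != 1:
        raise ValueError("GW from bare H: parallelization over q is not supported; "
                         "use 1 for q and distribute the tasks over qp and b.")

check_se_parallelism("1 4 2")    # fine: 4 tasks on qp, 2 on b
# check_se_parallelism("2 2 2")  # would raise
```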
MPI tests with an increasing number of tasks:
Total timings are very similar, with the `qepseudo` implementation taking slightly more time. The differences between the two implementations are mainly due to the time spent in the `Dipoles` and in the `Vxc` subroutines. Typically, the `Vxc` part takes around one second in all cases and is only very slightly affected by the parallelisation. The longer `Dipoles` part takes up the majority of the additional time in the `qepseudo` implementation, and it is strongly affected by the task distribution, having its own parallelization scheme. Below, the ratios for the two cases.
The ratio becomes bigger as the total time decreases, because of the gains in the GW and HF parts, stabilizing at 7% and 14% in the `libxc` and `qepseudo` cases, respectively. The vertical lines in the plot represent the thresholds after which the `Vxc` subroutines (`XCo_local` and `XCo_from_H_minus_Hbare`) exhibit negative gains. This is not relevant for the overall timings, yet in the `qepseudo` case this happens already with 6 MPI tasks.
In order to validate the implementation, we have to compare it with existing GW@DFT+U calculations.