# Yambo GW+U implementation
> [name=Fulvio] You can write comments like this.
**IMPORTANT NOTE FOR EXTERNAL USERS**
This feature of the yambo code is still under development. It does not presently support full-relativistic pseudopotentials for spinorial calculations, nor the Coulomb-cutoff truncation for 2D systems. In addition, it has not been tested extensively, and the way it is run will change in future updates leading up to the official release. So please be careful and, if possible, wait for the official release.
[TOC]
TODO:
- merge with updated yambo version (5.2.3)
- prepare test-suite on corvina
- TESTS: hBN standard, hBN + U, hBN + Covariant dipoles
- add diagonal-only Hbare calculation as flag to Hbare_from_scratch + warning if ib/=ibp, in that case compute the full one
- Recreate a parallel structure for the WF in Hbare_from_scratch before loading, following the QP case
- Possible memory leak + issue with covariant dipoles
Method
---
### Yambo right now
The starting point is the textbook linearized one-shot QP equation:
\begin{equation}\tag1
E^{QP}_{nk}=\epsilon^{KS}_{nk}+Z_{nk}\left[\Sigma^{xc}_{nk}(\epsilon^{KS}_{nk})-V_{nk}^{xc} \right].
\end{equation}
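Here $Z_{nk}$ is the usual renormalization factor coming from the linearization of the self-energy around the KS eigenvalue (standard definition, recalled here for reference):
$$
Z_{nk}=\left[1-\left.\frac{\partial\Sigma^{xc}_{nk}(\omega)}{\partial\omega}\right|_{\omega=\epsilon^{KS}_{nk}}\right]^{-1}.
$$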
In Yambo, $V_{nk}^{xc}$ is obtained via the `libxc` libraries.
### GW+U
We know that, when a Hubbard $U$-term is included,
\begin{equation}\tag2
\epsilon_{nk}^{KS}=h^0_{nk}+V^H_{nk}+V^{xc}_{nk}+U_{nk}.
\end{equation}
Yambo can compute $V^H_{nk}$ in real space, and we have already implemented the reconstruction of $h^0_{nk}$, but we do not have access to $U_{nk}$.
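For reference, $V^H_{nk}$ is simply the diagonal matrix element of the Hartree potential generated by the ground-state density $\rho$ (standard expression, in Hartree atomic units):
$$
V^H_{nk}=\langle nk|\,V_H\,|nk\rangle,\qquad V_H(\mathbf{r})=\int d\mathbf{r}'\,\frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}.
$$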
Strategy
---
We define
\begin{equation}\tag3
\tilde{V}^{xc}_{nk}\equiv V^{xc}_{nk}+U_{nk}
\end{equation}
and we obtain it via subtraction as:
\begin{equation}\tag4
\tilde{V}^{xc}_{nk}=\epsilon^{KS}_{nk}-h^0_{nk}-V^H_{nk}
\end{equation}
and then calculate the correction by using Eq. (1) again, i.e.,
\begin{equation}\tag5
E^{QP}_{nk}=\epsilon^{KS}_{nk}+Z_{nk}\left[\Sigma^{xc}_{nk}(\epsilon^{KS}_{nk})-\tilde{V}_{nk}^{xc} \right].
\end{equation}
For this we only need to compute $h^0$ and $V^H$, and we do not change the form of the QP equation.
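As a consistency check, substituting Eq. (3) into Eq. (5) gives
$$
E^{QP}_{nk}=\epsilon^{KS}_{nk}+Z_{nk}\left[\Sigma^{xc}_{nk}(\epsilon^{KS}_{nk})-V^{xc}_{nk}-U_{nk}\right],
$$
i.e. Eq. (1) with the Hubbard double-counting term $U_{nk}$ subtracted on the same footing as $V^{xc}_{nk}$.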
This strategy works at the level of `XCo_driver.F`: in particular, it replaces the subroutine `XCo_local.F` with a new one, `XCo_from_H_minus_Hbare.F`, which calculates $\tilde{V}^{xc}$ by calling the subroutines for $h^0$ and $V^H$, converting them to the QP basis, and then taking the difference with the KS eigenvalue.
From `XCo_driver.F`, present version:
```fortran=
 !
 ! Vxc
 !=====
 !
 if (l_Vxc) then
   !
   call XCo_local(E,Xk)
   !
   !
 endif
```
New `XCo_driver.F` proposed version:
```fortran=
 !
 ! Vxc
 !=====
 !
 if (l_Vxc) then
   !
   if (l_use_libxc)     call XCo_local(E,Xk)
   if (l_use_qe_pseudo) call XCo_from_H_minus_Hbare(E,Xk,k,Dip)
   !
   !
 endif
```
This formulation has two features: (i) it does not use `libxc`, and (ii) it leaves the QP section of the code untouched.
The matrix elements of $h^0$ are computed in Kohn-Sham space in the subroutine `Bare_Hamiltonian_from_Scratch`. The required steps are (i) obtaining the kinetic part (`P_square`) from the dipole routines; (ii) extracting the local part of the potential in $G$-space from QE, multiplying it by the structure factors and taking the FFT to real space; (iii) computing the contribution of the nonlocal part of the potential using the KB coefficients.
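Schematically, and assuming norm-conserving pseudopotentials in Kleinman-Bylander (KB) form (the index conventions below are only illustrative), the three steps correspond to
$$
\langle nk|\,\hat h^0\,|n'k\rangle=\underbrace{\langle nk|\,\tfrac{\hat{p}^2}{2}\,|n'k\rangle}_{\text{(i)}}+\underbrace{\langle nk|\,V_{\mathrm{loc}}\,|n'k\rangle}_{\text{(ii)}}+\underbrace{\sum_{I,lm}\langle nk|\beta^{I}_{lm}\rangle\,D^{I}_{l}\,\langle\beta^{I}_{lm}|n'k\rangle}_{\text{(iii)}},
$$
where $\beta^{I}_{lm}$ are the KB projectors of atom $I$ and $D^{I}_{l}$ the corresponding KB coefficients read from QE.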
Notes on implementation and running
---
This section collects implementation notes that are worth writing down.
- We need the `hamiltonian` module of the code, which is defined only for `yambo_rt` or `yambo_sc`. We could define it for the `yambo` main executable, but we would then encounter undefined variables (under `#if defined _SC || _RT`) in some subroutines. Thus, **so far the GW+U implementation is only accessible via `yambo_sc`**.
- The main subroutine `XCo_from_H_minus_Hbare` is called from `XCo_driver`: it performs some setup and allocations, then calls `Bare_Hamiltonian_from_Scratch` to get `H0` (KS space) and `Vh` (real space); finally, it converts them to the QP space and obtains `QP_Vxc`.
- The code now re-computes the G-shells in order to build `vloc`.
- It is important to ensure that the G-shells and G-vectors are set up correctly during the initialization (pay attention to the `G_thresh` variable).
- It is likely that we need to load the full density G-vectors from QE if we want very accurate results.
- If we read a previously computed screening database, like `ndb.pp`, the report prints the error `*ERR* G-shells`, even if the `SAVE` is the same and was initialised in the same way. However, this does not stop the execution and the results are correct.
- If high values of `ecutwfc` are used, the agreement in the one-electron energy (trace over occupied states) between Yambo-Hbare and QE worsens and may become of the order of `meV` (i.e., `meV/Nel` per electron level). This is due to the wave functions being loaded in single precision: by using double precision, the discrepancy with QE goes back to the order of `1e-5 meV`. I do not believe this to be a big problem for QP corrections, but running in double precision may be the safest option (see, however, the numerical tests below, which show that **single precision seems ok for production runs**).
- We must be careful with nonlinear core corrections (NLCC): if they are used in the DFT calculation, the formulation `V_xc=H-H0-Vh` intrinsically includes them, which [may be a problem](https://github.com/yambo-code/yambo-devel/issues/471) (see the sketch after this list).
- At the moment, the self-energy MPI tasks `q qp b` must be `1 Nqp Nb`, i.e. it is not possible to parallelize over `q`.
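To make the NLCC point explicit (a schematic relation, with $\rho_v$ the valence and $\rho_c$ the model core charge): when NLCC are active, the xc potential entering the KS Hamiltonian is evaluated on $\rho_v+\rho_c$, so the subtraction of Eq. (4) gives
$$
\tilde V^{xc}_{nk}=\epsilon^{KS}_{nk}-h^0_{nk}-V^H_{nk}=V^{xc}_{nk}[\rho_v+\rho_c]+U_{nk},
$$
whereas the GW self-energy is built from valence quantities only; this mismatch is presumably what the linked issue refers to.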
How to run
---
- Use the devel branch named `phys-gw-from-bareH` on [my fork](https://github.com/palful/yambo/tree/phys-gw-from-bareH). **NOTE:** it still has to be merged with the current yambo release and to pass the test-suite.
- Generate the `SAVE` using `p2y` (formerly `p2y -p`), which also produces the additional databases for the local potential and the KB projectors
- Run with `yambo_sc`
- The input file has several additions with respect to a standard GW input (a consolidated sketch is given after this list):
1. Specific flags to turn on the `H0` calculation and to align the levels:
```bash=
BareHfromScratch
BareHScaleFermi
```
2. Specific dipole options needed for the kinetic part of `H0`:
```bash=
dipoles
% DipBands
1 | Nb | # [DIP] Bands range for dipoles
%
DipBandsALL # [DIP] Compute all bands range, not only valence and conduction
```
3. [Optional] G-vectors maximized during testing for the FFT and xc components (but see the numerical tests below: **standard production runs should not require this**)
```bash=
FFTGvecs= Max_value RL
EXXRLvcs= Max_value RL # [XX] Exchange RL components
VXCRLvcs= Max_value RL # [XC] XCpotential RL components
```
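For convenience, here is a consolidated sketch of the GW+U-specific additions; it simply collects the flags listed above on top of an otherwise standard GW input (`Nb` and `Max_value` are placeholders, and the G-vector lines are only needed for the convergence tests below):
```bash=
BareHfromScratch               # Build H0 from scratch
BareHScaleFermi                # Align the H0 levels to the Fermi energy
dipoles
% DipBands
 1 | Nb |                      # [DIP] Bands range for dipoles
%
DipBandsALL                    # [DIP] Compute all bands range, not only valence and conduction
FFTGvecs= Max_value RL         # [Optional] only for convergence tests
EXXRLvcs= Max_value RL         # [XX] Exchange RL components
VXCRLvcs= Max_value RL         # [XC] XCpotential RL components
```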
Tests
---
We do not actually need a DFT+U system to test this: with $U=0$ the new route must reproduce a standard GW calculation, so we compare against one.
- System: bulk hBN, 12x12x1 kpts, 80 Ry cutoff, 40 bnds, no NLCC, double precision.
- `QPkrange`: `1 | 19 | 1 | 12`
### Serial test
#### Timings
- Reference calculation (libxc): `[Time-Profile]: 01h-23m`
```bash=
**XC_potential_driver** : 0.2666s CPU ( 2 calls, 133.278 msec avg)
**XCo_local** : 0.3704s CPU
FFT_setup : 0.5985s CPU ( 2 calls, 299.243 msec avg)
io_WF : 0.1056s CPU ( 100 calls, 1.056 msec avg)
WF_load_FFT : 2.4818s CPU ( 21 calls, 0.118 sec avg)
io_DIPOLES : 0.1282s CPU ( 22 calls, 5.829 msec avg)
DIPOLE_transverse : 2.1067s CPU
Dipoles : 3.2399s CPU
```
NB: HF + GW(ppa) is `1h-22m-58s`
- New implementation (qepseudo): `[Time-Profile]: 01h-23m`
```bash=
**Bare_Hamiltonian_from_Scratch** : 1.4728s CPU
**XCo_from_H_minus_Hbare** : 2.2920s CPU
**V_Hartree** : 0.0053s CPU
**PP_compute_becp** : 0.0490s CPU ( 19 calls, 2.577 msec avg)
**PP_uspp_init** : 0.1993s CPU ( 21 calls, 9.491 msec avg)
**V_real_space_to_H** : 0.8020s CPU ( 38 calls, 21.106 msec avg)
FFT_setup : 0.7347s CPU ( 4 calls, 183.685 msec avg)
io_WF : 0.1347s CPU ( 121 calls, 1.113 msec avg)
WF_load_FFT : 2.9919s CPU ( 22 calls, 0.136 sec avg)
io_DIPOLES : 0.1803s CPU ( 42 calls, 4.292 msec avg)
DIPOLE_transverse : 7.6371s CPU
Dipoles : 8.8087s CPU
```
NB: HF + GW(ppa) is `1h-22m-52s`
Both `XCo_from_H_minus_Hbare` and the dipoles take more time than `XCo_local` and the regular dipoles. However, this is negligible with respect to the HF+GW time.
See below for timing tests for parallel execution.
#### Comparison with QE quantities
1. One-electron energy (i.e., $\mathrm{tr}_{vk} H_0$)
| QE | Yambo TEST |
|-----|--------|
|-14.34038209260606 Ry | -14.34042613 Ry|
|ERR: |0.599159485573194 meV* |
*Fake error due to the limited number of significant digits in the manual Fermi-level shift and/or in the output precision; the real error is a couple of orders of magnitude lower
2. E_Hartree (i.e., $\mathrm{tr}_{vk} V_H$)
| QE | Yambo direct calc. | Yambo $H-H_0-V^{libxc}_{xc}$ |
|-----|--------|----|
|15.55811245 Ry |15.558112447364758 Ry| 15.558111787762893 Ry|
|ERR: |3.5854298254003755e-05 meV | 0.009010198112837818 meV*|
3. V_xc energy (i.e., $\mathrm{tr}_{vk} V_{xc}$). V_xc is the central quantity computed in the new implementation, corresponding to `QP_Vxc` in the code.
| Yambo REF $V^{libxc}_{xc}$ | Yambo TEST $V^{qepseudo}_{xc}$ |
|-----|--------|
|-19.15953696907989 Ry| -19.159535649876165 Ry|
|ERR: | 0.017948640378714614 meV*|
4. E_xc
| Yambo REF $E^{libxc}_{xc}$ | QE REF |
|-----|--------|
|-14.63761 Ry| -14.63761011 Ry|
5. KS-electron energy (i.e., $\mathrm{tr}_{vk} H$)
| QE/Yambo REF | Yambo TEST $H_0+V_H+V^{libxc}_{xc}$ |
|----------|----|
|-8.292277439246057 Ry |-8.292276120042345 Ry|
|ERR: |0.017948687387481878 meV* |
#### Comparison with standard GW run
- **Comparison of QP energies**
Blue: `libxc` run. Red: `qepseudo` run.

Average deviation: `8.026315789952826e-04 meV`
Max deviation: `2.3999999996249244e-02 meV`
- **Dependence on G-vector cutoff**
We change the $G$-dependent parameters together:
```bash=
FFTGvecs= NGvcs RL
EXXRLvcs= NGvcs RL # [XX] Exchange RL components
VXCRLvcs= NGvcs RL # [XC] XCpotential RL components
```
We first plot the dependence of the average error on the energy cutoff, with the error defined as:
$$
\delta_1 E_{nk} = E_{nk}^{qepseudo}[E_{cutoff}]-E_{nk}^{libxc}[\mathrm{max}(E_{cutoff})]
$$

Max deviation at 44 Ry: `0.16 meV`
The error increases rapidly at intermediate cutoffs; however, it remains sub-meV and is **comparable to the libxc error**.
We can also check the discrepancy between `libxc` and `qepseudo` while reducing the cutoff by plotting the following error:
$$
\delta_2 E_{nk} = E_{nk}^{qepseudo}[E_{cutoff}]-E_{nk}^{libxc}[E_{cutoff}]
$$

- **Accuracy of single and double precision**
Reference: `libxc` in double precision on `QPkrange`: `15 | 19 | 7 | 10` at the maximum cutoff of `160 Ry`
| Error w.r.t. `libxc` DP REF | `qepseudo` SP | `qepseudo` DP |
|----------|----|---|
| Average error (meV) | 2.18e-05 | -7.00e-07 |
| Max error (meV) | 2.95e-04 | 2.10e-05 |
Single precision seems more than ok.
### Parallel tests
- **Comparison of QP energies**
System: same as for serial tests. Bulk hBN, 12x12x1 kpts, 80 Ry cutoff, 40 bnds, no NLCC, double precision.
`QPkrange`: `15 | 19 | 7 | 10`
The SE roles are `q qp b`. Only `qp` appears as a loop in the new code in `XCo_from_H_minus_Hbare`. The loops over QP bands and QP k-points in `Bare_Hamiltonian_from_Scratch` are not parallelized.
The MPI run works correctly when tasks are distributed over `qp` and/or `b`:

Average deviation: `7.000000000423334e-07 meV`
Max deviation: `2.0999999999826713e-05 meV`
Currently, `q` must be set to 1, because otherwise it creates issues with the parallelization scheme of the `P_square` dipoles:
```fortran=
DIP_par_scheme="K"
if(l_sc_run) DIP_par_scheme="SC"
call DIPOLE_IO(k,E,Dip,'read',io_DIP_err,DIP_par_scheme)
```
I introduced a barrier in `XCo_driver` to avoid this case.
- **Parallel timings**
MPI tests with
```bash=
DIP_ROLEs= "k c v"
DIP_CPU= "1 **Ncpu** 1"
SE_ROLEs= "q qp b"
SE_CPU= "1 **Ncpu** 1"
```
Total timings are very similar, with the `qepseudo` implementation taking slightly more time.

The differences between the two implementations are mainly due to the time spent in `Dipoles` and in the `Vxc` subroutines.
Typically, the `Vxc` part takes around one second in all cases and is only slightly affected by the parallelization. The longer `Dipoles` part accounts for most of the additional time of the `qepseudo` implementation, and it is strongly affected by the task distribution, since it has its own parallelization scheme.
Below, the ratios $(T_{Vxc}+T_{Dip})/T_{tot}$ for the two cases.

The ratio becomes larger as $T_{tot}$ decreases because of the gains in the `GW` and `HF` parts, stabilizing at around 7% and 14% in the `libxc` and `qepseudo` cases, respectively. The vertical lines in the plot mark the thresholds beyond which the `Vxc` subroutines (`XCo_local` and `XCo_from_H_minus_Hbare`) exhibit negative gains. This is not relevant for the overall timings, yet in the `qepseudo` case it already happens with 6 MPI tasks.
Validation
---
In order to validate the implementation, we have to compare it with existing GW@DFT+U calculations.
- There is [one](https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.075003) on very complicated systems (perovskites), but at first we had better stick to simpler oxides.
- [This paper](https://iopscience.iop.org/article/10.1088/0953-8984/24/20/202201) (2012) by Patrick and Giustino contains results on [TiO2-anatase](https://en.wikipedia.org/wiki/Anatase)
- [This paper](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.82.045108) (2010) by Rinke, Scheffler, et al. contains results on [ScN](https://en.wikipedia.org/wiki/Scandium_nitride), [ZnS](https://en.wikipedia.org/wiki/Zinc_sulfide), [MnO](https://en.wikipedia.org/wiki/Manganese(II)_oxide), [FeO](https://en.wikipedia.org/wiki/Iron(II)_oxide), [CoO](https://en.wikipedia.org/wiki/Cobalt(II)_oxide), and [NiO](https://en.wikipedia.org/wiki/Nickel(II)_oxide).
###### tags: `Yambo-devel` `GW` `GW+U`