# **QP Self Consistent GW in Yambo**

Github: remote branch 'devel-qsgw'

## Updating $W(\omega)$ and dipoles

SC_driver calls X_dielectric_matrix, which uses X%whoami to determine which frequency range to use, so it should be fine to use SC_driver with a frequency-dependent $\chi$.

- DV: this should be easily done using the ppa keyword in the input file.
- SC_driver updates the dipoles by rotating them to the new QP states (using DIPOLE_SC_rotate). It does so every few cycles, as set by SC_up_W_iters (SCUpWIter in the input file). Then it updates $W$.
- When setting SC_up_W_iters>0, the dipoles are loaded at the beginning of SC_driver (otherwise they are not needed, because $W$ is already calculated and will not be updated). This causes a segfault.
- This looks like a problem internal to the coding of SC-COHSEX, and therefore hopefully solvable.
- It also fails when trying with the yambo-devel branch.

## Calculating the QSGW self-energy

Right now, SC_driver calls SC_add_XC, which itself calls QP_ppa_cohsex to calculate the part of the Hamiltonian corresponding to COHSEX. Replacing the call to QP_ppa_cohsex inside SC_add_XC by QP_driver would be overkill, because we do not need the solvers. Hence we want to write something simpler that only calls PPA, MPA, etc. We start by modifying only QP_ppa_cohsex.

### PPA implementation

1. The PPA calculation inside QP_ppa_cohsex only computes the diagonal elements of the self-energy, $\langle n \textbf{k} | \Sigma | n \textbf{k} \rangle$. We need to calculate $\langle n \textbf{k} | \Sigma | n' \textbf{k} \rangle$. In QP_ppa_cohsex, PPA depends only on $n$, while COHSEX depends on $n$ and $n'$: i_qp (the index of the QP loop) runs over $n$, $n'$ and $k$ for COHSEX, but only over $n$ and $k$ for PPA. The range of i_qp is given by QP_n_states, which is defined in QP_state_table_setup. The flag that changes QP_n_states in QP_state_table_setup is l_sc_run, so we leave this as it is and it should work when using the keyword scrun.
2. We added a new flag (l_sc_ppa) to use in QP_ppa_cohsex. It is triggered by HXC_Potential="PPA" in the input file.
3. For each pair of states $i$, $j$, we need to calculate $\langle i | \Sigma (\epsilon_i) | j \rangle$ and $\langle i | \Sigma (\epsilon_j) | j \rangle$ (see the expression after this list). For this, we modified l.491 in QP_ppa_cohsex by assigning W(iw) to $\epsilon_i$ and $\epsilon_j$, where $i$ and $j$ are stored in QP_table. For testing this, we will need to set l_ppa=true, i.e. to use the keyword "ppa" in the input file. We need to make sure that Yambo runs consistently with l_sc_run and l_ppa and that QP_dSC_steps=2.
4. l_ppa was not set to true when using the keyword ppa in the input file. This variable is set by the subroutine LOCAL_from_runlevels_to_logicals, defined in INIT.F. In each call to this subroutine the flags are computed twice (in a do loop), and there are 4 calls to it within INIT.F, which is why l_ppa is changed 8 times in total. The function that sets l_ppa within INIT.F is runlevel_is_on, defined in src/parser/mod_it_tools.F; it seems to simply check the array rstatus. These subsequent calls of LOCAL_from_runlevels_to_logicals were changing l_ppa from false to true and then back to false. This was solved by modifying INIT_barriers.F.
5. The calculation crashed because of a problem in the size of the dynamical matrix when io_X was called at l.273 of QP_ppa_cohsex. The error message said that dimension 3 of X_mat was accessed out of bounds; this dimension corresponds to frequency. This was due to the fact that the allocation for X was done with only one frequency. I modified l.222 to make sure that in the PPA case X is allocated with two frequencies (the plasmon-pole parameters are fitted on two frequencies).
6. The calculation was crashing at l.500 of QP_ppa_cohsex, which is where I added two lines to calculate W at the two KS energies. The index that the code gives to qp%E is out of bounds, because qp%E was not allocated. We added a piece of code to allocate qp inside SC_add_XC.
7. At this point the code was finally running, but the calculated PPA frequencies were the same for all states. This was probably because qp%E is used differently in QP_ppa_cohsex and in SC_add_XC. We ended up replacing qp%E by E%E at l.500 of QP_ppa_cohsex.
8. The calculation of the PPA frequencies was done within the loop over ib, which runs over QP_n_G_bands, meaning that we were repeating the same calculation many times. We moved it out of the loop, along with the calculation of the frequency for self-consistent COHSEX.
9. We faced an intermittent crash seemingly due to the allocation of qp%S_total, but we were not able to identify precisely where the crash occurred. This was corrected by reverting a previous change: not allocating qp at all, and going back to using qp_dummy.
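For reference, the two matrix elements evaluated in step 3 (at $\epsilon_i$ and $\epsilon_j$, hence QP_dSC_steps=2) are the ingredients of the usual QSGW effective, "Hermitianised" potential. A common prescription (van Schilfgaarde–Kotani–Faleev "mode A"; whether the branch adopts exactly this symmetrisation still has to be confirmed) is

$$\tilde{\Sigma}^{\mathrm{QSGW}}_{ij}(\mathbf{k}) \;=\; \tfrac{1}{2}\,\mathrm{Re}\!\left[\,\langle i\mathbf{k}|\Sigma(\epsilon_{i\mathbf{k}})|j\mathbf{k}\rangle \;+\; \langle i\mathbf{k}|\Sigma(\epsilon_{j\mathbf{k}})|j\mathbf{k}\rangle\,\right],$$

where Re denotes the Hermitian part. This is also the "Hermitianised" $\Sigma$ referred to in the Next steps below.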
## Tests of the PPA implementation

- We tested self-consistent Hartree-Fock against the one in Quantum Espresso by performing calculations on bulk hBN at Gamma, and obtained similar, although not identical, results.
- Self-consistent PPA converges when using few bands, but not when using many bands. This seems to be due to bands at low energies oscillating, and is reminiscent of a remark in Bruneval et al., PRB 74, 045102 (2006). For these same low-energy states, Re(Z) is very small and Im(Z) is large (see the definition of the renormalisation factor below).
- We found a breaking of degeneracy in both hBN and silicon, one order of magnitude larger than the one in G0W0. This also seems to be the case in self-consistent COHSEX.
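For clarity, the renormalisation factor quoted here, and again in the silicon tests below, is the standard one obtained from the linearisation of the quasiparticle equation,

$$Z_{n\mathbf{k}} \;=\; \left[\,1 - \left.\frac{\partial \langle n\mathbf{k}|\Sigma(\omega)|n\mathbf{k}\rangle}{\partial\omega}\right|_{\omega=\epsilon_{n\mathbf{k}}}\right]^{-1},$$

so a nearly vanishing Re(Z) together with a large Im(Z) signals a strongly frequency-dependent $\Sigma$ with little quasiparticle weight at that energy, consistent with the oscillating deep states observed below.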
## Next steps

1. It would be good to output $\Sigma$ at each k-point, as well as the whole QP_table, to do checks.
2. We need to make sure that setting l_sc_sex to true in XC_potentials.F is okay. For example, in QP_ppa_cohsex it triggers a specific setup of the bands.
3. Calculation of the whole $\Sigma$: the current implementation of SC-COHSEX does not recalculate the whole $\Sigma$ after convergence is reached. We will need to do this, because within the SC loop we only consider a "Hermitianised" $\Sigma$. This will be of particular interest when using MPA, as we could try to address lifetimes.
4. Check SC HF in Yambo vs QE; activate the random integrals for the Fock divergences (yambo -r) --> rim_cut (Randvec=1, RandQpts=2000000); in QE, force_symmorphic=.true.
5. Merge the current branch with one containing MPA, and start implementing QSGW with MPA.
6. Fix the degeneracy breaking in self-consistent COHSEX.

## DV: Sparse notes after the first tests

Relevant routines:

- SC_driver.F
- SC_add_XC.F

Here are some comments:

1. *HF scheme*, HXC_Potential="FOCK+HARTREE"
   - Build Ho=eps-V_hartree-Vxc
   - Add terms to Ho: H=Ho+V_hartree+Vfock
2. *PPA*, HXC_Potential="PPA": it should be H=Ho+V_hartree+Vfock+Sc(omega,ppa)
   - Build Ho=eps-V_hartree-Vxc: OK
   - Flags: SEX=T, Fock=F
   - It seems to calculate SEX and FOCK anyway (check why and what it is actually doing):
     HF@it1  |########
     SEX@it1 |##########
   - V_hartree is set to **False** and not calculated
3. *Conclusions and further questions*
   - Adding HARTREE to HXC_Potential seems to work and switches l_sc_hartree=**True**
   - The SEX label was reported, but PPA was actually computed (report fixed)
   - HF is calculated with MaxGvec (probably the default)
   - Hartree is computed if the input is set as HXC_Potential="HARTREE+PPA"
   - Setting HXC_Potential="HARTREE+FOCK+PPA" gives the same results as above and fixes the label in the report

**Some tests from scratch on Silicon 2x2x2**

- Fixed the generation of the input variables: yambo_sc -s -v qsgw
- G0W0: degeneracy at Gamma preserved.
- Hartree only: converges after tens of iterations (Mixing 0.1; see the note after this list). Degeneracy is preserved. The system became metallic; we do not know whether this makes sense or not.
- Hartree-Fock: fast convergence. Degeneracy is nearly preserved (up to 10^-5). Occupations are ok.
- COHSEX converges very fast. **Degeneracies at Gamma are heavily broken (300 meV).** Should this be investigated in depth?
- COHSEX with empty states converges very fast. Here too the degeneracies are broken, but more mildly (80 meV).
- qsgw: the first evaluation seems ok (compared with G0W0: **apart from a deep state with extremely small Z and a large imaginary component**), then it does not converge (stopped at 10 iterations); at iteration 5 something strange happens (jumps). Degeneracy is broken.
- Test: compare the result of 2 iterations with the one calculated in two separate steps (slightly different, but most probably because of the mixing algorithm).
- Convergence with respect to the eigenvalues is particularly odd, with a deep state oscillating.
- Something is not working with the deep eigenvalues (that is due to the Z factor being nearly zero).
- Eliminating the first state from the SC gives a segfault after the 1st iteration; in any case this cannot be done naively (e.g. computation of Hartree).
- The issue of very small Z in G0W0 is due to considering too few bands, but inserting many bands makes the qsgw calculation extremely heavy.
- The effect is milder when using a more converged W: the eigenvalues converge except for the first one at Gamma, which is oscillating (the one with extremely low Z factor).
- Test to try: reducing the mixing of the Hamiltonian (not the mixing of the wfs): the same issue happens, the first eigenvalue oscillates.
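For context, the Mixing parameter above presumably controls a simple linear mixing of the self-consistent Hamiltonian between iterations; schematically (the actual scheme used by SC_driver may be more elaborate, and separate mixing factors for the Hamiltonian and the wavefunctions are mentioned above),

$$H^{(n+1)}_{\rm in} \;=\; (1-\alpha)\,H^{(n)}_{\rm in} \;+\; \alpha\,H^{(n)}_{\rm out}, \qquad \alpha = 0.1 \ \text{in the runs above},$$

so a small $\alpha$ damps oscillations at the price of slower convergence, which is why reducing the mixing of the Hamiltonian was tried for the oscillating deep state.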
**MPA**

- First note that the current devel-qsgw was branched from a develop version where MPA was not yet merged. **Sync with bug-fixes needed.**
- Testing MPA on silicon: first notice that *yambo -gw0 m* does not show all the needed variables **to be fixed** --> **fixed** in the *bug-fixes* branch.
- Source merged with develop on 6/4/2024 by AF: all tests passed except for 2 tests related to MPA and Green's function plotting (need to check where the problem is, but it could well be already in develop and independent of us).

**DV: some tests on sc-MPA**

- Tests done with sc+mpa (by hand): **code crashing** as QP_solver is not defined (Sc_W(i_qp)%p(i_w) not allocated).
- Also when setting QP_solver="n" by hand, qp%n_states is zero and the variables are not defined. Note that qp%n_states is 0 in SC_add_XC also for PPA.
- In PPA, SC_W is used only in the presence of a terminator.
- The issue arises because qp%n_states is defined in QP_driver, which is not called. This was done on purpose, as we do not need the solver.
- **ISSUE SOLVED** by defining W(1) and W(2) by hand, without passing through Sc_W(i_qp)%p(i_w), as done in PPA.
- First vanilla run: the deep states seem better behaved than with the plasmon pole, but the degeneracy issue is still present. Systematic testing is needed.
- To do (a schematic form of the MPA screening is recalled after this list):
  - Define the MPA variables when generating the qsgw input. **Done**
  - Optimise the MPA subroutine (for the moment removing the rimw case). **Done**: the gain is not evident, but it now uses the minimal G dimension. To be further optimised --> see Andrea's changes in the tech branch.
  - Check the MPA diagonal Sc at the first iteration against a one-shot calculation.
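As a reminder (schematic form only; the exact parametrisation used by the Yambo MPA routines should be checked against the implementation), the multipole approximation represents the inverse dielectric matrix with $n_p$ poles, the plasmon-pole approximation being the $n_p=1$ case:

$$\epsilon^{-1}_{\mathbf{G}\mathbf{G}'}(\mathbf{q},\omega) \;\simeq\; \delta_{\mathbf{G}\mathbf{G}'} \;+\; \sum_{p=1}^{n_p} R^{(p)}_{\mathbf{G}\mathbf{G}'}(\mathbf{q}) \left[\frac{1}{\omega-\Omega^{(p)}_{\mathbf{G}\mathbf{G}'}(\mathbf{q})+i0^+} \;-\; \frac{1}{\omega+\Omega^{(p)}_{\mathbf{G}\mathbf{G}'}(\mathbf{q})-i0^+}\right].$$

The residuals $R^{(p)}$ and poles $\Omega^{(p)}$ are the quantities whose serial/parallel differences are tracked in the tests further below.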
**DV: On W updating**

For the moment, testing with a simple COHSEX input.

- Solved an issue with the dipoles allocation index in parallel_global_Selfenergy.F.
- The code runs when calculating the screening.
- The code crashes at the end of the 1st iteration if the screening is already calculated (to check). Even if it is recalculated, it crashes at the end of the last iteration (a different problem). **Solved**: see below.
- The crash happens in the rotation of the wfs, which at first glance seems to be useless (to be verified): i_wf = WFo%index(ib,ik,i_sp_pol) seems to be wrong. **Solved**: it was due to a missing assignment of the spin index.
- The idea behind reloading the wfs is that more bands are needed for the screening, and the ones not present in the SC cycle are shifted by the same amount as the last SC band (see the expression after this list). Anyway, at the moment the screening is not updated (to check). **Solved**
- Now the screening is recalculated after assigning the logical l_recalculate_X=.true., but there is a crash after the first iteration (to check).
- The crash happens in **XCo_Hartree_Fock.F**: PAR_IND_G_b%element_1D is not allocated in the second iteration (to be checked).
- It turns out not to be allocated after the call to X_dielectric_matrix.
- This is due to a PARALLEL_global_indexes in X_dielectric_matrix: a possible solution is to reassign par_index in SC_driver after the calculation of X_dielectric_matrix (to try). It is probably also needed to avoid the reallocation of the dipoles.
- PAR_INDEX reassigned: now HF does not crash anymore, but it crashes after the second COHSEX iteration (to be checked). The crash is in DIPOLE_SC_rotate: the dipoles (DIP_P) are not allocated. **Solved**
- Verify that self-consistency is indeed active --> the results differ. **To do**: write the correct wfs in the database description.
- Check with MPA: **not working**. Issue: (i) yambo stops because of a header problem in the dipoles.
- The problem is the following: *[DIP] Header not correct or missing while reading dipoles*. It happens when loading the dipoles in X_irredux. This is puzzling, as it happens for MPA and not for COHSEX: at this level the procedure should be the same, and the generated dipole database is also the same in the two cases. **To be solved**
- While debugging I inserted some writes of parameters in DIPOLE_IO.F and then removed them. Now the code runs smoothly. Anyway, there is a problem: **after the first iteration the code calculates the COHSEX static em1s and not MPA. em1s is calculated with all bands (50) and 1 RL (probably some default), and then the read of the dipoles fails: ERROR in: *ERR*DIP band range: 1 10, *ERR*DIP band range limits: 10 1. To check: the flags and X%whoami in the calculation of the screening inside the cycle!!**
- Besides that, looking at the flux there is something not correct. The flux is: SC iteration --> new wfs + new density --> if W is to be updated, load the dipoles, rotate the dipoles, re-evaluate the screening. In X_irredux the dipoles are read again, so the rotation does not take effect. From this consideration, only the new energies enter the new screening. The same is true for the wfs (used in scatterBamp?). **To think about**
- Suggestions by Claudio: (i) look whether the wfs are rotated inside WF_load. **Indeed it seems they are!!** (ii) Avoid rotations of the dipoles and call the covariant dipoles. (CM: there is already a comment in SC_driver saying that the dipole rotation should be replaced by covariant dipoles.)
- Changing the call to X_dielectric_matrix inside SC_driver removes the *ERR* in the report file. The code updates X once without trouble, but crashes the second time it tries to update it, after the call to X_dielectric_matrix, when it calls WF_rotate, because of an invalid pointer.
- The call to SC_driver in driver/yambo was with X(:4) (hence leaving out MPA). Fixing this changes the error slightly to "double free or corruption".
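Schematically, the shift applied to the bands outside the SC window (monitored in the tests further below as the scissor for nb > SC_bands) is presumably

$$\epsilon_{n\mathbf{k}} \;\to\; \epsilon_{n\mathbf{k}} + \Delta_{\mathbf{k}}, \qquad \Delta_{\mathbf{k}} = \epsilon^{\rm SC}_{N_{\rm SC}\mathbf{k}} - \epsilon^{0}_{N_{\rm SC}\mathbf{k}}, \qquad n > N_{\rm SC},$$

with $N_{\rm SC}$ the highest band included in the SC cycle (SC_bands in the notes); whether $\Delta$ is k-resolved or a single number should be confirmed in the code.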
**Tests to be done**

- Corentin's test is not running (allocation problem?)
- Check the dipole rotations (it seems to happen twice consecutively?? *This is most probably wrong*)
- Check which dipoles are effectively used in X_irredux
- Properly rotated dipoles (call the rotation once and check that they are not overwritten in X_irredux)
- Explore the covariant solution
- Check COHSEX with updated W (most of the issues are probably already there). **Note that COHSEX without empty bands shows degeneracy breaking, while when using empty bands the degeneracy is preserved.**

**More tests on W update**

- The code runs properly in serial, but crashes when running e.g. with 4 MPI tasks (the test is silicon 2x2x2). (*DV note: so does it happen also for COHSEX, or for MPA only?*)
- The problem is tracked to the rotation of the WFs that happens in SC_driver after X_dielectric_matrix/X_irredux are called.
- In particular: X_dielectric_matrix clears WF (which was loaded by the self-energy calculators), and the following WF_load seems to do something different for some reason.
- Moreover: WF_rotate has a weird interface (WFo enters, WF and H_rotation are taken from modules, possibly leading to a number of inconsistencies).
- Importantly: it seems that the WF rotations inside WF_load are switched off during our SC runs, making X_irredux work on un-rotated wfcs.
- Here, the WF_load in X_irredux also needs to be done with a parallel distribution of wfcs that is compatible with the rotations available.
- Possibly, this could be obtained by properly selecting the levels of parallelism for X during SC runs with W updates.
- DV: after Andrea's commit forcing the load and rotation of the wfs, I did a serial test for both COHSEX and MPA. At iteration 2 I get the following warnings/errors:
  [WARNING] [WF-X] WF-distribution turned off, not compatible with SC wave functions
  [WARNING] [WF-X] Forced 10 bands to be loaded
  [WARNING] Allocation of H_rotation failed with code 151
  [WARNING] allocatable array is already allocated
  [WARNING] Object was already allocated in H_rotation
  [ERROR] Allocation of H_rotation_ref failed with code 151
  The warnings/errors are emitted in load_sc_components.F, called by WF_load in the dielectric matrix: H_rotation turns out to be allocated before the call to the dielectric matrix despite WF_free(WF) being present before the call.
- Tried to fix this by inserting a call to load_SC_components('WF_clean') to deallocate H_rotation before the call to the dielectric matrix: it does not work. It results in found_SC_DB, compatible_SC_DB = F F, and then a segmentation fault after the calculation of the screening.

**Changing strategy**

1. Avoid reading the SC databases
2. Load more bands in X_irredux
3. X_irredux computes the screening and frees the WF
4. Load and rotate the wfs after X_irredux

**Verify that WF_load in X_irredux does not overwrite the wfs already rotated**

- *Serial* MPA_sc calculation: it runs (but does not converge after 20 iterations)
- *Parallel*, 4 cpus: it runs but the results are slightly different (by more than machine precision) from the serial ones
- *Parallel*, 6 cpus: it crashes in the PARALLEL distribution for Wave-Function states
- *Parallel*, 4 cpus and 2 cpus differ
- *All of this happens when updating W; otherwise it is correct*

**Debugging strategy: checks TODO**

We need to sort out the inconsistent dependence of the MPA-SC results on the parallel distribution used. In order to do this we can:

1. Identify a couple of parallel distributions (with 2 or 4 MPI tasks, 2 probably better) that clearly do not work.
2. Print out the Hamiltonian matrix elements for each Hamiltonian contribution (bare Ho, XC, etc.) in order to identify the issues and when they kick in (a minimal dump sketch follows this list). CM: I investigated this, and found that the matrix elements of the Hamiltonian differ between serial and 2 MPI tasks very early on, when H_nl_sc is built, before the electric field is added.
3. Since the problems appear only when updating W, it is very likely that the issues are in the XC part of the Hamiltonian.
4. TO BE CHECKED: is there any interference between the parallelism and the QP extrapolation used to correct the eigenvalues of the states needed by X? For instance: do we load all the QP corrections and then extrapolate, or does each processor only have a few (distributed) and extrapolate using them? This could explain some issues.
5. **IMPORTANT**: even running serially, I get different results (errors of the order of 10^-3 or more) in different runs (e.g. using mpirun -np 1 or not). I have not checked yet whether this is random noise or reproducible, but I have the impression we have some randomness around.
   - I tried a serial run and one with -np 1 and I could not find any difference: to be checked further.
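A minimal sketch of the kind of rank-tagged dump used for these serial-vs-parallel comparisons (the routine name, the assumed-shape interface and the way the iteration counter and myid are passed are illustrative assumptions, not Yambo code):

```fortran
module debug_dump
  implicit none
contains
  ! Write a complex matrix (e.g. H_nl_sc or QP_Sc) to a file tagged with the
  ! SC iteration and the MPI rank, so that serial and parallel runs can be
  ! compared element by element with a simple diff.
  subroutine dump_matrix(label, iter, myid, M)
    character(*), intent(in) :: label
    integer,      intent(in) :: iter, myid
    complex,      intent(in) :: M(:,:)
    character(64) :: fname
    integer :: i, j, iu
    write (fname,'(a,"_it",i0,"_rank",i0,".dat")') trim(label), iter, myid
    open (newunit=iu, file=fname, action='write', status='replace')
    do j = 1, size(M,2)
      do i = 1, size(M,1)
        write (iu,'(2i6,2es20.8)') i, j, real(M(i,j)), aimag(M(i,j))
      end do
    end do
    close (iu)
  end subroutine dump_matrix
end module debug_dump
```

Called, for instance, right before and after V_qp_basis_to_H, this produces the kind of per-iteration, per-rank files compared in the notes below.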
**DV: I've monitored the scissor applied to the bands nb > SC_bands for myid=0**

- The results are identical in serial and parallel if W is not updated (as they should be).
- When W is updated, serial and parallel are identical after the first iteration, and also identical to the case of no W update (correct). Then they differ.
- Changing the parallel strategy between "b" and "qp", the results change and both are different from the serial one.
- Interestingly, there are differences also when changing the X parallel strategy: "g" and "v" are equal, while "c" is different (order of meV), also after the 1st iteration, where the scissor does not play a role, so it is the first X. This seems to be another story, not related to what we are looking at.
- **IMPORTANT**: switching to COHSEX (both using and not using empty states), the results are independent of the number of processors $\rightarrow$ at this point I think the problem must be sought in the MPA self-energy (or the MPA screening).
- CM: The results are identical when running in serial but with different numbers of tasks and processors in the job file. It is therefore possible to run serial and parallel tests in the same job file.
- CM: The Hamiltonian H_nl_sc is identical in serial and parallel up to the call to V_qp_basis_to_H(ik,i_sp_pol,QP_Sc,H_nl_sc) in the second iteration of the self-consistent loop, where it starts differing.
- CM: Printing QP_Sc in QP_mpa at the end of the QP loop (starting at l.399 and ending at l.529) gives almost only zeros in the parallel run, already in the first iteration of the self-consistent loop.
- CM: In QP_mpa l.457, l_RIM_W_g should also be private. This was already modified in the yambo-devel main branch.
- Self-consistent COHSEX with updates of W gives the same results in serial and parallel.
- Self-consistent PPA and MPA with updates of W give differences in the energies of the order of 10^-4 at the second iteration of the self-consistent cycle.
- CM: Changing the values in X_and_IO_CPU, SE_CPU and DIP_CPU does not seem to change the inconsistencies between serial and parallel.
- AF: There is an additional part in QP_ppa_cohsex (l.481) which does not exist in QP_mpa and which is related to parallelisation.
- CM: I tested changing the number of tasks to 4 and 8, and there is no clear difference with respect to 2: there are differences between serial and parallel of the order of 10^-2 to 10^-3 in the energies, starting in the second self-consistent iteration. The problem seems to be related to serial vs parallel.
- Printing elements of DIP_P in DIPOLE_SC_rotate, which is called at the end of SC_driver, shows that the rotation causes differences between serial and parallel of the order of 10^-7, with a few differences as large as 10^-4. This could be because the rotation matrix H_rotation is different in serial and parallel. The Hamiltonian H_nl_sc just before adding the self-energy is identical in serial and parallel, and mostly similar just after adding the self-energy, but some changes appear on a very small set of elements. After diagonalisation, changes of the order of 10^-4 appear.
- CM: I reran the tests after recompiling with double precision to clarify the situation. Namely, I printed 8 digits of the elements of several matrices in serial and with two tasks:
  - **QP_Sc at the end of QP_mpa**: the serial elements are equal to the sum of the elements in the matrices of the two processors in the first SC loop. Afterwards, there are 10^-4 differences.
  - **DIP_P at the beginning and the end of DIPOLE_SC_rotate**: the serial and parallel matrices are all equal for the first 3 prints, i.e. until after the rotation in the second SC cycle. Afterwards, differences of the order of 10^-4 appear.
  - **H_nl_sc in SC_driver before and after V_qp_basis_to_H, and before and after diagonalisation**: the serial and parallel matrices are all equal (except around V_qp_basis_to_H, where it is equal to zero on the second processor) until after calling V_qp_basis_to_H in the second SC cycle. Afterwards, differences of the order of 10^-4 appear between the serial file and the others.
  - **H_rotation at the beginning of DIPOLE_SC_rotate**: the serial and parallel matrices are all equal in the first SC cycle; in the second SC cycle, differences of the order of 10^-4 appear between the serial file and the others.
  - **X_par at the end of X_dielectric_matrix**: X_dielectric_matrix is called first in yambo.F, then in SC_driver in each SC loop when updating W. The second processor does not print X_par, as if it were not allocated, but this also happens when running COHSEX. The serial and parallel (first processor) matrices are equal when X_dielectric_matrix is called from yambo.F, but differences of the order of 10^-4 appear in the first call from SC_driver, meaning at the end of the first SC cycle. These differences appear only on the first q-point if X_and_IO_ROLEs="c v" and X_and_IO_CPU="1 2", but on all q-points if X_and_IO_CPU="2 1".
- **Conclusion**: the differences in the self-energy (QP_Sc), the dipoles (DIP_P), the Hamiltonian (H_nl_sc) and the rotation matrix (H_rotation) all appear when using the susceptibility (X_par) calculated at the end of the first SC cycle, which is therefore the prime suspect. Moreover, these differences disappear when using COHSEX instead of PPA and MPA. The main difference there could be that X(2) is allocated differently than X(4) and X(5).

In the highlighted text below, the tests were done without using all the occupied bands in the SC loop. This produced several problems related to the FFT, the bands in HF and so on. Let us consider all the occupied bands and ignore the highlighted comments:

<mark style="background-color: lightblue">

- DV just observed a weird behaviour: even in DP **there are differences if the code runs in serial or with mpirun -n 1**. From now on, checking the difference serial vs mpirun -n 1.
- DV: looking at the dielectric matrix serial vs n1, there is a difference at the second iteration already in Xo irredux.
- DV: going backwards: difference in the dipoles after the rotation in the second iteration. This should mean the dipoles entering Xo in the 2nd iteration are ok.
- DV: the energies after the shift E%E are ok after the first iteration, then they differ. This should mean the energies entering Xo in the 2nd iteration are ok.
- DV: Xo is indeed ok in iteration 1 (after diago). Also X redux. Checked for all frequencies.
- DV: the Hamiltonian at iteration 2 is ok before adding QP_Sc. It differs slightly after adding QP_Sc.
- DV: QP_Sc after MPA in SC_add_XC.F differs at iteration 2. Debugging is needed in QP_mpa.
- QP_mpa: X_mat is ok at iteration 2. It then differs at iteration 3.
- QP_mpa: residuals and poles show some difference at iteration 2, then are different at iteration 3.
- **Differences between serial and mpirun -n 1 disappear when Nthreads=1.**
- Anyway, differences remain when changing the number of threads, of order 10^-3.
- Differences remain when changing the number of MPI tasks.
- Next step: check PPA (OMP and MPI).
- PPA with nthreads=1 seems to work, comparing serial and MPI with 2 and 4 tasks.
- PPA with nthreads 1, 8 and 16 slightly differs at the 2nd iteration.
- PPA with different nthreads: QP_Sc and the Hamiltonian are the same up to the 2nd iteration. Why are the energies different?
- OK: the differences arise after the addition of the Hartree term. **WARNING**: the Hartree potential strongly depends on the number of threads (while the density at the second iteration is ok). Notably, only a portion of V_Hartree is thread dependent. An FFT problem? fft_g_table seems ok.
- A problem was spotted when compiling with gfortran (seg. fault) in X_dielectric_matrix, probably in the parallelism.
- Still debugging OMP (intel) on hydra. In V_Hartree there are two OMP loops. The first one is ok, while the second is the source of the error (see also the toy example after this block):

```
do ig=1,QP_ng_SH
   vhtmp(fft_g_table(ig,1))=vhg(ig)
enddo
```

- The problem is the following: fft_size is reduced from max(g-g') to Xng in QP_ppa_cohsex. The problem arises because this is smaller than QP_ng_SH, which is wf_ng. Note that the two arrays are dimensioned differently: vhtmp(fft_size), vhg(QP_ng_SH).
- The FFT value changes in QP_ppa_cohsex:

```
if (((l_sc_srpa.or.l_cohsex.or.l_sc_coh).and..not.COHSEX_use_empties).or.l_GW_terminator) isc%ngrho=maxval(G_m_G)
call WF_load(WF,isc%ngrho,maxval(qindx_S(:,:,2)),(/1,n_WF_bands_to_load/),(/1,k%nibz/),title=trim(ch))
```

- It is set to the density cutoff max(g-g') when needed (e.g. COHSEX with empty states, etc.), while it is set to Xng otherwise. This is used for rhotw. Most probably it is convenient to leave it as it is and reset the fft_dim before the call to V_Hartree, since in PPA the G-G' elements are not used.
- **Besides the FFT size problem, the WFs are loaded in QP_ppa_cohsex and used for rhotw (scatterBamp): are they correctly rotated?**
- The FFT problem was solved with a poor-man solution before the V_Hartree call. Now threads=1,4 and MPI=1,2 give the same results. The results anyway seem weird (the shift of the Fermi level is not applied). Oddly, the code crashes when setting Wupdate=0, even for COHSEX, in HF at the second iteration. Possibly related to the rhotw dimension? To explore. Note: on corvina (gnu) it is still crashing!
- Seg fault on corvina: X_dielectric_matrix:elemental_IO crashes in io_X. The problem is in the netcdf call in io_connect on the file ndb.pp (the file exists from the previous iteration), which is opened in IO_WRITE_MODE, while at the 1st iteration it is in IO_CREATE_MODE. Same problem with COHSEX.
- The error is malloc(): smallbin double linked list corrupted. Debugger gdb: Program received signal SIGSEGV, Segmentation fault, 0x00002aaaac48ba2d in H5FL_reg_malloc() from libhdf5.so.200.
- This problem does not appear on lyra.
- **Another problem spotted: in SC HF the summation is not done over all the occupied states, due to the condition `if (.not.PAR_IND_G_b%element_1D(jb)) cycle`.** Is this correct? Probably not. In any case, the deep states are not included in the SC: how should they be handled?

</mark>
- **Setting all the bands in the SC loop solves most of the problems. For the moment this is the right way to proceed. The previous patch has been commented out. When using all the bands, the parallelism problems are solved in PPA; differences are still observed in MPA, with slightly different results depending on OMP and MPI. If the screening is not updated, it is ok.**
- **Testing OMP**: the first difference appears in QP_Sc at iteration 2. In QP_mpa: X_mat is ok, there are differences in a few residuals and poles, e.g. for q=1 most of the poles are the same.
- M.C., A.F.: We corrected SC_start_and_restart and WF_and_dipole_dimensions, where only X(2) was declared and used, to include X(4) and X(5).