# FxCorr vs. GCorr vs. Ewen tutorial
This tutorial provides the key elements for comparing manual and automated versions of the DiFX-based correlator deployment on target CPU-GPU architectures. The aim is replicability.
## Introduction
The manual version of the correlator can be found here:
`git clone git@github.com:XhrisPhillips/gcorr.git`
The core of DiFX works as follows:
1. Data Alignment: Since the telescopes are separated by large distances, the signals they record may experience delays due to the varying lengths of the paths the signals travel. The correlator aligns these signals in time to ensure that they correspond to the same moment in the observation.
1. Correlation: Once the data are aligned, the correlator compares the signals received by each telescope. It performs a mathematical operation known as correlation, which involves multiplying the signals received by pairs of telescopes and summing them up over a specific time interval.
1. Image Formation: The correlated data contain information about the brightness and structure of the observed astronomical object.
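
As a rough illustration of the alignment step, the sketch below evaluates a delay polynomial at a given time and splits the result into a whole-sample shift plus a fractional remainder. The function names, the coefficient ordering (constant term first), the units (seconds) and the numeric values are assumptions made for this example; they are not taken from the DiFX sources.

```python
import math

def delay_at(coeffs, t):
    """Evaluate c0 + c1*t + c2*t^2 + ... with Horner's rule (t in seconds)."""
    delay = 0.0
    for c in reversed(coeffs):
        delay = delay * t + c
    return delay

def split_delay(delay_s, sample_rate_hz):
    """Split a delay into whole samples and a fractional part (in samples)."""
    samples = delay_s * sample_rate_hz
    whole = math.floor(samples)
    return whole, samples - whole

# Hypothetical values: a 1 microsecond delay drifting by 2 ns per second,
# sampled at 32 Msamples/s and evaluated half a second into the scan.
coeffs = [1.0e-6, 2.0e-9]
whole, frac = split_delay(delay_at(coeffs, 0.5), 32e6)
print(whole, round(frac, 3))
```

The whole-sample part can be handled by shifting the read position in the data, while the fractional part has to be corrected separately.
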
FxCorr and GCorr are two implementations of the DiFX correlator: the first targets CPU architectures, the second GPU architectures. A simplified representation of the dataflow follows:

> **Data Acquisition**: Telescope data is read from disk. This data is assumed to be binary and to contain no headers (the data format is detailed in [FxCorr Design](https://github.com/XhrisPhillips/gcorr/tree/master/doc), Section *Packed Data*). The binary files (xx.bin) are read and packed into a 2D array (data per antenna); there are as many binary files as antennas. Subsequently, the delay between telescopes is computed as a polynomial from the configuration file (xx.conf).
> **Floating point conversion**: Raw data (encoded as integers) is converted to floating point numbers. The conversion creates complex numbers with a zero imaginary part (phase 0). The data is divided into independent channels at this stage and stored in a 3D array (data per polarization per antenna). An "offset" correction is also applied at this stage to correct for the geometric effects of signals received by the telescopes at different times.
> **Fringe rotation**: Each sample is 'fringe-rotated' to take account of the different speeds of the telescopes relative to each other (Doppler shift). This is achieved by applying a time-varying phase shift to each sample (more detail in [wiki: Doppler effect](https://en.wikipedia.org/wiki/Doppler_effect)).
> **FFT** of the samples: each group of "N" samples in time is transformed into "channelised" data, and this is repeated for every successive group of "N" samples. Moving from the time domain to the frequency domain makes it easier to extract the information afterwards.
> **X** Cross-correlation: For each FFT block, the individual frequency channels of each telescope are multiplied and then accumulated to form "visibilities". For N antennas, there are N(N-1)/2 unique baselines, and each of them must be formed. The data from a series of FFTs is accumulated to form a 'sub-integration'; typically, this covers between 10 and 100 milliseconds. The sub-integrations are finally averaged over approximately one second to form the final visibility integration.
> **Accumulation**: The cross-correlation values for each FFT block are added to those of the previous iteration in the "visibilities" table. These are the final visibility products required.
> At the end, the phase and amplitude for each frequency channel and each baseline are stored in a vis.out file.
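
The fringe rotation, FFT, cross-correlation and accumulation stages can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the FxCorr or GCorr code: the function names are hypothetical, the sign convention of the phase rotation is assumed, unpacking and delay correction are left out, and a naive DFT stands in for the FFT so that the example needs no external library.

```python
import cmath
import math
from itertools import combinations

def fringe_rotate(samples, rate_hz, sample_rate_hz):
    """Apply a time-varying phase shift exp(-2*pi*i*rate*t) to each sample."""
    return [s * cmath.exp(-2j * math.pi * rate_hz * n / sample_rate_hz)
            for n, s in enumerate(samples)]

def dft(block):
    """Naive O(N^2) discrete Fourier transform (a real correlator uses an FFT)."""
    n = len(block)
    return [sum(x * cmath.exp(-2j * math.pi * k * m / n)
                for m, x in enumerate(block))
            for k in range(n)]

def correlate(antennas, nfft):
    """Accumulate X[a] * conj(X[b]) per baseline and channel over all blocks."""
    baselines = list(combinations(range(len(antennas)), 2))
    vis = {bl: [0j] * nfft for bl in baselines}
    nblocks = len(antennas[0]) // nfft
    for blk in range(nblocks):
        spectra = [dft(a[blk * nfft:(blk + 1) * nfft]) for a in antennas]
        for a, b in baselines:
            for ch in range(nfft):
                vis[(a, b)][ch] += spectra[a][ch] * spectra[b][ch].conjugate()
    return vis

# Three antennas observing the same tone, placed in channel 2 of an 8-point
# transform; a fringe rate of zero leaves the samples unchanged.
nfft = 8
tone = [cmath.exp(2j * math.pi * 2 * n / nfft) for n in range(4 * nfft)]
antennas = [fringe_rotate(tone, 0.0, 1.0) for _ in range(3)]
vis = correlate(antennas, nfft)
peak = max(range(nfft), key=lambda ch: abs(vis[(0, 1)][ch]))
amplitude, phase = cmath.polar(vis[(0, 1)][peak])
print(len(vis), peak, round(amplitude, 3), round(phase, 3))
```

With three antennas the dictionary holds 3(3-1)/2 = 3 baselines, and `cmath.polar` gives the amplitude and phase of one channel of one baseline.
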
## Input file
As stated above, before doing anything you need to generate the input files, i.e. the binary file for each antenna and the telescope configuration file.
### Binary generation
```
cd generator
make
#This will generate a Test.vdf file containing spectrum data.
./generateSpectrum <args>
Usage: generateSpectrum [options]
-w/-bandwidth <BANDWIDTH> Channel bandwidth in MHz (64)
-b/-nbits <N> Number of bits/sample (default 2)
-C/-channels <N> Number of IF channels (default 1)
-c/-complex Generate complex data
-f/-float Save data as floats
-l/-duration <DURATION> Length of output, in seconds
-T/-tone <TONE> Frequency of tone (MHz)
-2/-tone2 <TONE> Frequency of second tone (MHz)
-a/-amp <amp> Amplitude of tone
-A/-amp2 <amp2>d Amplitude of second tone
-n/-noise Include (correlated) noise
-ntap <TAPS> Number of taps for FIR filter to create band shape
./generateSpectrum -w 16 -bits 2 -channels 2 -duration 10 -tone2 5 -noise -amp2 0.01 aa.bin
→ aa.bin
./generateSpectrum -w 16 -bits 8 -complex -channels 2 -duration 10 -tone2 5 -noise -amp2 0.01 aaa.bin
→ aaa.bin
#autoSpec computes and plots the spectrum of a generated binary file
Usage: autoSpec [options]
-C/-chan <n> Channel to correlate (can specify multiple times)
-n/-npoint <n> # spectral channels
-N/-init <val> Number of ffts to average per integration
-device <pgdev> Pgplot device to plot to
-s/-skip <n> Skip this many bytes at start of file
-h/-help This list
./autoSpec -w 16 -b 2 -n 512 -C 2 -d /xs aa.bin
→ aa.Spec
```

> autoSpec output: two plots, one per channel, each showing the tone set at 5 MHz (2-bit real data).
```
./autoSpec -w 16 -b 8 -complex -n 512 -C 2 -d /xs aaa.bin
→ aaa.Spec
```

> autoSpec output: two plots, one per channel, each showing the tone set at 5 MHz (8-bit complex data).
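
To know where the tone should appear in such a plot, the small sketch below converts a tone frequency into a spectral channel index. It assumes the channels evenly span the band from 0 to the bandwidth and that the tone frequency is given relative to the lower band edge; `tone_channel` is a hypothetical helper, not part of autoSpec.

```python
def tone_channel(tone_mhz, bandwidth_mhz, npoint):
    """Index of the spectral channel that contains the tone."""
    if not 0 <= tone_mhz < bandwidth_mhz:
        raise ValueError("tone lies outside the band")
    return int(tone_mhz / bandwidth_mhz * npoint)

# Values from the commands above: 5 MHz tone, 16 MHz bandwidth, 512 channels
print(tone_channel(5, 16, 512))
```
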
<div style="background-color:#e6f7ff; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:gray;">Tips:</em></strong><br>
<em style="color:gray;">
If you are setting up a 6 station test (for example), run generateSpectrum 6 times using a different file name each time. <br><br>
You can also fake your binary: <br>
touch aa.bin <br>
echo 01 > aa.bin <br>
touch Test-8bit-complex.bin <br>
echo 01 > Test-8bit-complex.bin <br>
</em>
</div>
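
For a multi-station test, the repeated calls can be scripted. The sketch below builds one command line per station, reusing the options of the 2-bit example above; the helper names and the station file names are placeholders chosen for this illustration.

```python
import subprocess

def station_commands(names, duration=10):
    """Build one generateSpectrum command line per station file."""
    return [["./generateSpectrum", "-w", "16", "-bits", "2", "-channels", "2",
             "-duration", str(duration), "-tone2", "5", "-noise",
             "-amp2", "0.01", f"{name}.bin"]
            for name in names]

def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

commands = station_commands(["aa", "bb", "cc", "dd", "ee", "ff"])
for cmd in commands:
    print(" ".join(cmd))
# Call run_all(commands) from the generator/ directory to create the files.
```
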
### Configuration file
To run the GPU kernels, make sure that your configuration file (e.g. test.conf) follows these rules (default NTHREADS = 256):
```
NBIT 8 (=2 + real or =8 + complex)
NPOL 2 (no matter, it's forced at 2)
COMPLEX 1
NCHAN 256 (<=NTHREADS or divisible ([256; 1024], otherwise you see nothing at all))
LO 1650000000 (no matter)
BANDWIDTH 16000000 (>0)
NUMFFTS 256 (<=NTHREADS or divisible and div(8))
NANT 4 (as many as "aa" binary file)
Aa aa.bin 0 0 0 0 0
Bb bb.bin 0 0 0 0 0
Cc cc.bin 0 0 0 0 0
Dd dd.bin 0 0 0 0 0
```
<div style="background-color:#e6f7ff; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:gray;">Tips:</em></strong><br>
<em style="color:gray;">
If numFFT increases, parallelism increases, but so do the memory requirements and the volume of data transfers. Since GPUs have less memory than CPU hosts and transfers to and from the device are costly, the GPU may then no longer be the best choice. <br>
</em>
</div>
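
The configuration rules above can be checked and the file generated with a short script. This is a sketch under assumptions: "divisible" is read here as "a multiple of NTHREADS", the parenthesised annotations are treated as comments that do not belong in the real file, and `check_size` and `make_conf` are hypothetical helpers.

```python
NTHREADS = 256  # default thread count quoted above

def check_size(name, value, nthreads=NTHREADS, multiple_of=1):
    """Return a list of problems for NCHAN or NUMFFTS under the assumed rules."""
    problems = []
    if value > nthreads and value % nthreads != 0:
        problems.append(f"{name}={value} is neither <= {nthreads} "
                        f"nor a multiple of it")
    if value % multiple_of != 0:
        problems.append(f"{name}={value} is not a multiple of {multiple_of}")
    return problems

def make_conf(nchan, numffts, files, nbit=8, is_complex=1,
              lo=1650000000, bandwidth=16000000):
    """Render a configuration file in the layout shown above."""
    lines = [f"NBIT {nbit}", "NPOL 2", f"COMPLEX {is_complex}",
             f"NCHAN {nchan}", f"LO {lo}", f"BANDWIDTH {bandwidth}",
             f"NUMFFTS {numffts}", f"NANT {len(files)}"]
    for f in files:
        station = f.split(".")[0].capitalize()   # aa.bin -> Aa
        lines.append(f"{station} {f} 0 0 0 0 0")
    return "\n".join(lines) + "\n"

problems = check_size("NCHAN", 256) + check_size("NUMFFTS", 256, multiple_of=8)
conf = make_conf(256, 256, ["aa.bin", "bb.bin", "cc.bin", "dd.bin"])
print(problems or "configuration looks consistent")
print(conf)
```

Writing `conf` to test.conf gives a file with one station line per binary file and a matching NANT value.
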
## Running FxCorr
First install and configure the dependencies (Intel IPP):
```
cd fxkernel/
#install ipp [Intel Integrated Performance Primitives for Linux ,2021.11.0,19 MB,Online,Mar. 27, 2024]
chmod +x l_ipp_oneapi_p_2021.11.0.532.sh
./l_ipp_oneapi_p_2021.11.0.532.sh
#set env var:
nano ~/.bashrc
#at the end
export INTELROOT=~/intel
export IPPROOT=~/intel/ipp
export CPATH=path/to/ipp/include
→ctrl + o, enter, ctrl+x
source ~/.bashrc
#in genipppc add l.122 "or major==2021"
python3 genipppc ~/intel/oneapi 2021.11
#copy ipp.pc to the pkg-config directory
sudo cp ~/DiFX/gcorr/fxkernel/ipp.pc /lib/pkgconfig/
autoreconf --install
./configure
make
```
The IPP installer can be downloaded [here](https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html#ipp).
Then run:
```
cd src
./bench_fxkernel test.conf
```
<div style="background-color:#300A24; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:white;">Prompt:</em></strong><br>
<em style="color:gray;">
→ Got COMPLEX 0<br>
Got NBIT 2<br>
Got NPOL 2<br>
Got NCHAN 1024<br>
Got LO 1.65e+09<br>
Got BANDWIDTH 3.2e+07<br>
Got NUMFFTS 3200<br>
Got NANT 4<br>
Subint time is 102.4 msec<br>
Processing 1.024 sec <br>
Allocating 3 MB per antenna per subint<br>
12 MB total<br>
Initialising data to random values<br>
Launching Threads<br>
Go<br>
Run time was 1334 milliseconds<br>
785.819 Mbps<br>
</em>
</div>
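
The sub-integration length and memory figures in this output can be reproduced from the configuration values. The formulas below are an assumption that happens to match the printed numbers for real 2-bit data (the complex case is extrapolated from it); they are not copied from the bench_fxkernel source, and `subint_stats` is a hypothetical name.

```python
def subint_stats(nchan, numffts, bandwidth_hz, nbit, npol, is_complex):
    """Estimate sub-integration length (ms) and packed bytes per antenna."""
    cfactor = 2 if is_complex else 1
    fft_samples = 2 * nchan // cfactor        # time samples consumed per FFT
    sample_rate = 2 * bandwidth_hz / cfactor  # Nyquist rate for real sampling
    subint_samples = fft_samples * numffts
    subint_ms = subint_samples / sample_rate * 1e3
    subint_bytes = subint_samples * cfactor * nbit * npol // 8
    return subint_ms, subint_bytes

# NCHAN 1024, NUMFFTS 3200, BANDWIDTH 3.2e+07, NBIT 2, NPOL 2, COMPLEX 0
ms, nbytes = subint_stats(1024, 3200, 32e6, 2, 2, False)
nant = 4
print(f"{ms:.1f} ms, {nbytes // 2**20} MB per antenna, "
      f"{nant * nbytes // 2**20} MB total")
```

Doubling NUMFFTS doubles both the sub-integration length and the memory per antenna, which is the trade-off described in the tip of the configuration section.
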
```
./testfxkernel test.conf
```
<div style="background-color:#300A24; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:white;">Prompt:</em></strong><br>
<em style="color:gray;">
→ Got COMPLEX 0<br>
Got NBIT 2<br>
Got NPOL 2<br>
Got NCHAN 1024<br>
Got LO 1.65e+09<br>
Got BANDWIDTH 3.2e+07<br>
Got NUMFFTS 3200<br>
Got NANT 4<br>
Aa:aa.bin<br>
Bb:bb.bin<br>
Cc:cc.bin<br>
Dd:dd.bin<br>
Allocating 3 MB per antenna per subint<br>
12 MB total<br>
</em>
</div>
Or:
```
cd bench
./runall.sh
```

## Running GCorr
### Running on your laptop
```
#GPU check (Should be NVIDIA)
lspci | grep -i nvidia
→ 3D controller: NVIDIA Corporation GA107M [GeForce RTX 2050] (rev a1)
```
Install the NVIDIA CUDA drivers and the CUDA Toolkit for your system; see NVIDIA's [tutorial](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
```
#Check CUDA compiler & CUDA install with
nvcc -V
→ nvcc: NVIDIA (R) Cuda compiler driver
→ Copyright (c) 2005-2024 NVIDIA Corporation
→ Built on Thu_Mar_28_02:18:24_PDT_2024
→ Cuda compilation tools, release 12.4, V12.4.131
→ Build cuda_12.4.r12.4/compiler.34097967_0
#check your GPU architecture
nvidia-smi
```
<div style="background-color:#300A24; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:white;">Prompt:</em></strong><br>
<em style="color:gray;">
→ =========================================+======================+======================|<br>
| 0 Tesla P100-PCIE-16GB On | 00000000:04:00.0 Off | 0 |<br>
...<br>
+-----------------------------------------+----------------------+----------------------+<br>
| 1 Tesla P100-PCIE-16GB On | 00000000:82:00.0 Off | <br> ...<br>
</em>
</div>
> e.g. Tesla P100-PCIE-16GB => Pascal architecture, NVIDIA GeForce RTX 2080 Ti => Turing, Quadro RTX 6000 => Turing, ... (for other models, look up the architecture in NVIDIA's documentation).

| Fermi† | Kepler† | Maxwell‡ | Pascal | Volta | Turing | Ampere | Ada | Hopper | Blackwell |
| ------ | ------- | -------- | ------ | ----- | ------ | ------ | --- | ------ | --------- |
| sm_20 | sm_30 | sm_50 | sm_60 | sm_70 | sm_75 | sm_80 | sm_89 | sm_90 | ??? |
| | sm_35 | sm_52 | sm_61 | sm_72 | | sm_86 | | sm_90a | |
| | sm_37 | sm_53 | sm_62 | | | sm_87 | | | |
> †: Fermi is deprecated from CUDA 9 onwards and Kepler from CUDA 11 onwards; ‡: deprecated from CUDA 11.6 onwards.

Then configure and build GCorr:
```
cd gcorr/
aclocal
autoconf
autoheader
automake --add-missing
./configure
make
```
> The Makefile is created in src/.

In it, adapt the line `NVCC = nvcc -O3 -arch=sm_86 -lineinfo -maxrregcount 64` to your architecture, for example `NVCC = nvcc -O3 -arch=sm_60 -lineinfo -maxrregcount 64` for a Pascal GPU. Otherwise the correlation fails with the error "no kernel image is available for execution on the device".
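
As a convenience, the line can be generated from the architecture name. The mapping below keeps only the first `sm_` value of each column of the table above; the dictionary and the helper name are illustrative, and some cards need one of the other values listed in their column.

```python
SM_BY_ARCH = {
    "Pascal": "sm_60", "Volta": "sm_70", "Turing": "sm_75",
    "Ampere": "sm_80", "Ada": "sm_89", "Hopper": "sm_90",
}

def nvcc_line(architecture):
    """Return the NVCC line of the Makefile for a given GPU architecture."""
    sm = SM_BY_ARCH[architecture]
    return f"NVCC = nvcc -O3 -arch={sm} -lineinfo -maxrregcount 64"

print(nvcc_line("Pascal"))
```
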
To run the tests :
```
cd gcorr/bench
./runall.sh
```

### Running on grid5000 Nvidia GPU
Open a [Grid'5000 account](https://www.grid5000.fr/w/Grid5000:Get_an_account).
*The examples here are for the Rennes site.*
Choose the node you need based on its characteristics (number of GPUs, memory ...) presented in [Rennes:Hardware - grid5000](https://www.grid5000.fr/w/Rennes:Hardware).
Check the availability of your `chosen` node on [Rennes:node](https://intranet.grid5000.fr/oar/Rennes/drawgantt-svg/) or [Rennes:node(production)](https://intranet.grid5000.fr/oar/Rennes/drawgantt-svg-prod/)
(more status [here](https://www.grid5000.fr/w/Status))
```
# ssh connect (replace orenaud with your own login)
ssh orenaud@access.grid5000.fr
ssh rennes
#reserve one abacus node interactively (they host NVIDIA GPUs)
oarsub -q production -p abacus1 -I
#copy the folder (run this from your local machine)
scp -r ~/path/gcorr orenaud@access.grid5000.fr:rennes
#Check CUDA compiler & CUDA install with
nvcc -V
cd gcorr/gcorr
aclocal
autoconf
autoheader
automake --add-missing
./configure
make
```
```
./benchmark_gxkernel test.conf
```
<div style="background-color:#300A24; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:white;">Prompt:</em></strong><br>
<em style="color:gray;">
→ fftsamples = 4096 , numffts is 2048<br>
BENCHMARK PROGRAM STARTS<br>
Each unpacking test will run with 512 threads, 8 x 2048 blocks<br>
nsamples = 8388608<br>
nantennas = 6<br>
==== TIMER: calculateDelaysAndPhases ====<br>
Iterations | Average time | Min time | Max time | Data time | Speed up |<br>
100 | 0.017 ms | 0.014 ms | 0.041 ms | 0.066 s | 3793.364 |<br>
...<br>
</em>
</div>
```
./testgpukernel test.conf
```
<div style="background-color:#300A24; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:white;">Prompt:</em></strong><br>
<em style="color:gray;">
→ reading configuration file <br> test.conf <br>
running 10 loops<br>
will output text data<br>
Subintsamples= 131072<br>
Subint = 2.048 msec<br>
Allocate Memory<br>
Allocating host data<br>
Allocating 0 MB per antenna per subint<br>
0 MB total<br>
Allocating GPU data<br>
Alloc 65536 complex output values per baseline<br>
Allocated 50.891 Mb on GPU<br>
Reading data
</em>
</div>
```
# Compare GPU correlation engine to CPU version
./validate_xcorr test.conf
```
<div style="background-color:#300A24; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:white;">Prompt:</em></strong><br>
<em style="color:gray;">
→ ***CPU***<br>
1 1: (-1.7135,2.39079) ...<br>
***CrossCorr***<br>
Maximum difference = 1106.0758 (147587.14%)<br>
Average difference = 150.8477 (2130.07%)<br>
***CrossCorrAccumHoriz***<br>
Maximum difference = 0.0000 (0.00%)<br>
Average difference = 0.0000 (0.00%)<br>
***CCAH2***<br>
Maximum difference = 0.0000 (0.00%)<br>
Average difference = 0.0000 (0.00%)<br>
**CCAH3**<br>
Maximum difference = 0.0000 (0.00%)<br>
Average difference = 0.0000 (0.00%)<br>
</em>
</div>
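
The kind of comparison reported here can be sketched as follows. How validate_xcorr actually computes its percentages is not documented in this tutorial, so the definition used below (difference relative to the magnitude of the reference value) is an assumption, and `compare` is a hypothetical helper.

```python
def compare(reference, candidate):
    """Return (max_abs, max_pct, mean_abs, mean_pct) for two complex lists.

    Both lists must be non-empty, of equal length, and the reference must
    contain at least one non-zero value.
    """
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    pcts = [100.0 * d / abs(r) for d, r in zip(diffs, reference) if abs(r) > 0]
    return (max(diffs), max(pcts),
            sum(diffs) / len(diffs), sum(pcts) / len(pcts))

# Toy data: the first value is taken from the CPU output above, the second
# pair is made up to produce a visible difference.
cpu = [complex(-1.7135, 2.39079), complex(2.0, 0.0)]
gpu = [complex(-1.7135, 2.39079), complex(2.2, 0.0)]
max_abs, max_pct, mean_abs, mean_pct = compare(cpu, gpu)
print(f"Maximum difference = {max_abs:.4f} ({max_pct:.2f}%)")
print(f"Average difference = {mean_abs:.4f} ({mean_pct:.2f}%)")
```
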
```
#run
cd gcorr/bench
./runall.sh
```
<div style="background-color:#300A24; padding:10px; border-radius: 5px;font-size: 12px">
<strong><em style="color:white;">Prompt:</em></strong><br>
<em style="color:gray;">
→ ***CHANNELS***<br>
NCHAN= 64 <br>***ANTENNA***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16 <br>***ANTENNA 16384***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16 <br>***CHANNELS 8bit***<br>
NCHAN= 64 <br>***ANTENNA 8bit***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16 <br> ***ANTENNA 16384***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16 <br> *****************<br>
******HALF*******<br>
*****************<br>
***CHANNELS***<br>
NCHAN= 64 <br>***ANTENNA***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16 <br>***ANTENNA 16384***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16 <br> ***CHANNELS 8bit***<br>
NCHAN= 64 ***ANTENNA 8bit***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16 <br> ***ANTENNA 16384***<br>
NANT= 4 NANT= 6 NANT= 8 NANT= 10 NANT= 12 NANT= 16
</em>
</div>
## Output file
You can visualize your data with the notebook available [here](https://colab.research.google.com/drive/1wEkoTKZVg0fBC_8KcRdCKvL0_GrnKbjb#scrollTo=JAQfFu9OwXy2).
Import your vis.out file and run the code.

> Display of the 4 average characteristics per baseline (detailed in the notebook).
> TODO: change the link to GitHub
## References
[[1] E. Michel, O. Renaud, A. Deller, K. Desnos, C. Phillips, J.-F. Nezan, Static Dataflow Synthesis for Heterogeneous CPU-GPU systems, IETR, , Swinburne, CSIRO, 202_](https://fr.overleaf.com/project/660670ce5cf576bdb096dcd4).
>TODO: change the link to pdf version once published
[[2] A.T. Deller, S.J. Tingay, M. Bailes, & C. West, DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments, Swinburne, 2007](https://arxiv.org/abs/astro-ph/0702141).
[[3] A.T. Deller, W.F. Brisken, C.J. Phillips, J. Morgan, W. Alef, R. Cappallo, E. Middelberg, J. Romney, H. Rottmann, S.J. Tingay, R. Wayth, DiFX2: A more flexible, efficient, robust and powerful software correlator, Swinburne, 2011](https://arxiv.org/abs/1101.0885).
[GCorr repository (XhrisPhillips/gcorr)](https://github.com/XhrisPhillips/gcorr)
[PREESM applications repository (preesm/preesm-apps)](https://github.com/preesm/preesm-apps)
## Acknowledgement
*This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 873120.*