# Scale-up PINN to Real Data
This note summarizes the current performance issues in scaling up the PINN method to real-world data. We are targeting about $10^8$ collocation points.
The settings:
- We assume $10^8$ collocation points, the same number as data points.
- We use a model that is larger than our typical ones: 8 hidden layers with 128 nodes each, plus a final output layer with 4 nodes (see the sketch below).
- We run the experiments with $2\times10^3$ iterations and scale the result to estimate $10^5$ iterations, which is also larger than what we typically use.
- The optimizer we use is Adam. L-BFGS is roughly 2x slower than Adam.
- Tests run on Della, using the IceShelf2D project at commit [172caf](https://github.com/YaoGroup/IceShelf2D/tree/172caf06bb30de3528f09f87fd4cb7984636e15f).
We consider these estimates larger than the actual time required, but close enough to serve as a guide.
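For reference, a minimal sketch of the benchmark model described above, assuming a plain Keras MLP with tanh activations and 2D inputs (the actual IceShelf2D implementation may differ):
```python
import tensorflow as tf

# Hypothetical reconstruction of the benchmark architecture:
# 8 hidden layers of 128 nodes each, plus a 4-node output layer
# (~1.2e5 trainable parameters).
def build_model(n_inputs=2, n_hidden=8, width=128, n_outputs=4):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(n_inputs,)))
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(width, activation="tanh"))
    model.add(tf.keras.layers.Dense(n_outputs))
    return model

model = build_model()
model.summary()
```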
### GPU Result:
|# of Collocation/Data Points|Raw Time<br/>(secs)|Unit Time<br/>(secs per pt per $10^3$ iters)|
|-|-|-|
|1024 pts (32 * 32) | 31.6597| 0.0154|
|4096 pts (64 * 64) | 32.8476| 0.0040|
|16384 pts (128 * 128) | 75.0162| 0.0022|
|65536 pts (256 * 256) | 185.3456| 0.0014|
|262144 pts (512 * 512) | OOM| -|
### CPU Result:
|# of Collocation/Data Points,<br/>CPU Cores|Raw Time<br/>(secs)|Unit Time<br/>(secs per pt per $10^3$ iters)|
|-| -| -|
|1024 pts, 1 core| 835.4102| 0.4079|
|1024 pts, 8 cores| 473.1401| 0.2310|
|1024 pts, 16 cores| 439.2834| 0.2145|
|4096 pts, 8 cores| 1308.3128| 0.1597|
|4096 pts, 16 cores| -| -|
|16384 pts, 8 cores| 4141.0066| 0.1264|
|16384 pts, 16 cores| 3293.7275| 0.1005|
|65536 pts, 16 cores| 11379.4185| 0.0868|
|262144 pts, 16 cores| -| -|
Ray's runs show that the GPU is significantly faster (>50x) than the CPU. The CPU/GPU gap is significantly larger than on our workstation. **Ray suspects the CPU time can be reduced; he will run more tests.**
Note: raw time = wall-clock time for Adam with $2\times10^3$ iterations; unit time = raw time $/\,(\text{points} \times \text{iterations}/10^3)$.
### Estimated Total Run Time
We take the best unit time from each table above and multiply by $10^8$ points and $10^5 / 10^3 = 10^2$ iteration units to get the **estimated total computing time (ETCT)** for the task:
- ETCT If all jobs using GPU ~ **$3.9\times10^3$ Hrs**
- ETCT If all jobs using CPU, 8 cores ~ **$3.5\times10^5$ Hrs**
Finally, we have to consider the resources available on the [Della system](https://researchcomputing.princeton.edu/systems/della). The **total run time (TRT)** depends heavily on our assumptions about the resources we can gather. Ray's experience is that basically nobody except us uses the GPUs. CPU cores are relatively hard to acquire, depending on the number of cores requested: with 16 cores per node, we can get 10 nodes within an hour; with 8 cores, we can likely acquire 20 nodes almost immediately.
To make a conservative estimate, we assume we can only get 16 GPU nodes (2 GPUs each) and 40 CPU nodes (8 cores each). Assuming no gaps between jobs, we can compute the **estimated total run time (ETRT)**:
- GPU: $16 \times 2 = 32$ instances, each completing $\frac{1}{3.9\times10^3}$ of the task per hour.
- CPU: 40 instances (one 8-core job each), each completing $\frac{1}{3.5\times10^5}$ of the task per hour.

Solving $\frac{32x}{3.9\times10^3} + \frac{40x}{3.5\times10^5} = 1$ for the wall-clock time $x$, we get $x \approx 120$ hrs ~ 5 days.
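For reference, the arithmetic above in a few lines of Python (only the table values and the assumed resource counts go in):
```python
# Best unit times from the tables (secs per pt per 1e3 iterations).
unit_gpu = 0.0014   # 65536 pts on one GPU
unit_cpu = 0.1264   # 16384 pts on 8 CPU cores

points, iters = 1e8, 1e5

# ETCT on a single instance, converted from seconds to hours.
etct_gpu = unit_gpu * points * (iters / 1e3) / 3600   # ~3.9e3 hrs
etct_cpu = unit_cpu * points * (iters / 1e3) / 3600   # ~3.5e5 hrs

# ETRT with 32 GPU instances and 40 CPU instances running in parallel.
n_gpu, n_cpu = 16 * 2, 40
etrt = 1 / (n_gpu / etct_gpu + n_cpu / etct_cpu)      # ~120 hrs
print(f"ETCT GPU: {etct_gpu:.3g} h, ETCT CPU: {etct_cpu:.3g} h, ETRT: {etrt:.0f} h")
```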
### Factors Affecting ETRT
#### Is the Server Running Ceaselessly?
The above assumes the given resources run 24/7 non-stop. Since we already discounted the available resources, that assumption is likely to hold. If, say, only 16 hrs are available per day, the estimate goes from 120 hrs to $120 \times 24/16 \approx 180$ hrs.
#### Is $10^8$ Collocation Points a Good Estimate?
Since evaluating the equation residual is the most compute-intensive part of the program, the computing time is roughly linearly proportional to the number of collocation points and depends relatively little on the number of data points. If $10^8$ is not a good estimate, the ETRT may rise (or fall, if $10^8$ is an overestimate).
#### Can We Use 65536 Points Per Job?
As one can see from the tables above, the parallel-computing power of the GPU shines when we feed it enough data. The 120 hrs ETRT assumes 65536 points per job. If we reduce that to 4096 points, the unit time roughly triples (0.0040 vs 0.0014 secs), and so does the ETRT (~360 hrs).
#### Model Size and Number of Iterations
The model size and the number of iterations are also important factors: the ETRT is roughly linearly proportional to both. We expect our ETRT to be overestimated on both factors, so 5 days should be a safe bet.
## Other Relevant Topics
### Acceleration with One GPU/CPU
**Note: The results in [Summary](/BLBm6v2CRHGe5xsfPGPN_g?view#Summary) already include this boost.**
We can have an almost free speed-up by using XLA:
```python
# Adding the following lines at the very beginning of the script
# (before importing TensorFlow) makes it 1.3x - 2x faster!
import os
os.environ["TF_XLA_FLAGS"]="--tf_xla_enable_xla_devices --tf_xla_auto_jit=2"
os.environ["XLA_FLAGS"]="--xla_gpu_cuda_data_dir=/usr/local/cuda"
```
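A related option (our note, not part of the benchmark runs) is to request XLA per function through `tf.function`'s `jit_compile` argument instead of the global environment flags; a minimal sketch with a placeholder loss:
```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="tanh"),
                             tf.keras.layers.Dense(4)])
optimizer = tf.keras.optimizers.Adam()

# jit_compile=True compiles the whole training step with XLA
# (named experimental_compile in older TF 2.x releases).
@tf.function(jit_compile=True)
def train_step(x):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x)))  # placeholder loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

print(train_step(tf.random.uniform((1024, 2))))
```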
References:
[TensorFlow Official Site on XLA](https://www.tensorflow.org/xla)
[TensorFlow Talk on XLA](https://www.youtube.com/watch?v=cPAD9vLKE0c)
#### Jax
...
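Nothing has been benchmarked here yet. As a placeholder, a minimal sketch of what the JAX route might look like (all names below are illustrative, not from IceShelf2D):
```python
import jax
import jax.numpy as jnp

def init_params(key, sizes=(2, 128, 128, 4)):
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) * 0.1, jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def mlp(params, x):
    # Simple tanh MLP forward pass, analogous to the TF model above.
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

# In JAX, XLA compilation is the default execution model; jax.jit
# plays the role that TF_XLA_FLAGS plays for TensorFlow above.
fast_mlp = jax.jit(mlp)
y = fast_mlp(init_params(jax.random.PRNGKey(0)), jnp.ones((1024, 2)))
```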
### Acceleration with Data Parallelism (with working code)
Generally, we divide our domain into smaller grids and run separate jobs over those grids to achieve fully parallel execution (this is called [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel)); a minimal sketch of this tiling follows. In that case, with N GPUs fully loaded, we achieve an N-times speedup. However, this approach has a limitation: we cannot train a single model on a larger domain without sacrificing the density of collocation points.
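For concreteness, here is the tiling in plain NumPy (the 2 x 2 split and per-tile point count are illustrative):
```python
import numpy as np

def tile_collocation_points(x_range, y_range, nx, ny, pts_per_tile, seed=0):
    """Split the 2D domain into nx * ny tiles and sample collocation
    points per tile; each tile would be handed to a separate job."""
    xs = np.linspace(*x_range, nx + 1)
    ys = np.linspace(*y_range, ny + 1)
    rng = np.random.default_rng(seed)
    return [rng.uniform(low=(xs[i], ys[j]), high=(xs[i + 1], ys[j + 1]),
                        size=(pts_per_tile, 2))
            for i in range(nx) for j in range(ny)]

jobs = tile_collocation_points((0.0, 1.0), (0.0, 1.0), 2, 2, 65536)
```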
We can mitigate this limitation via data parallelism.
Data parallelism replicates the same model on every GPU, splits each batch of collocation points across the replicas, and averages the gradients (an all-reduce) before every update, so all replicas share one set of weights while the effective number of points grows with the number of GPUs.
It introduces an overhead of inter-GPU communication for the gradient all-reduce at every iteration.
:::success
**This method does not increase training time per GPU.** Fully separate jobs achieve the best per-GPU training time, since they incur no communication overhead.
:::
Sample code is ...
Experiment results ...
#### Horovod
...
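As a starting point (not yet tested on Della), a minimal sketch assuming TensorFlow 2 and `horovod.tensorflow`, with a placeholder loss standing in for the real PINN residual:
```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="tanh"),
                             tf.keras.layers.Dense(4)])
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(points, first_batch):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(points)))  # placeholder loss
    # All-reduce the gradients across workers; this is the only
    # per-iteration communication overhead data parallelism adds.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Start all workers from identical weights and optimizer state.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss

# Each worker trains on its own shard of the collocation points.
shard = tf.random.uniform((65536 // hvd.size(), 2))
for step in range(10):
    loss = train_step(shard, step == 0)
```
Launched with, e.g., `horovodrun -np 4 python train.py`.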