HP-H users are currently facing many performance issues that affect system usability and prevent them from working with the system.

## Problem Statement:

Users of HP-H are currently experiencing significant performance issues, which are affecting system usability and hindering work processes.

![](https://hackmd.io/_uploads/rygEOGQxp.png)

## Detailed Issues:

- Slow loading and occasional failure of Digital Lab tiles. A direct connection shows a slight improvement, but reliability remains a concern.
- Frequent errors caused by failed connections to the cloud service (tiles-hub), which is responsible for serving tiles to users over the internet.

## What we did:

#### Server Resource Assessment:

- Internet speed: adequate (download: 3.5 Gbps, upload: 1.3 Gbps).
- CPU and RAM: appear to be sufficient. Digital Lab consumes 12 GB of RAM for caching.

![](https://hackmd.io/_uploads/rks_sMQxp.png)

#### Scale-Up cloud services:

Although the services were not showing any unhealthy status, we have doubled the resources allocated to our tiles-hub service to ensure high performance.

#### Load Testing

We conducted a load test on HP-H with 10 concurrent users over 20 seconds, resulting in a rate of 368 tiles served per second via the Digital Lab server.

> The test was performed over the HP-H VPN during working hours.

![](https://hackmd.io/_uploads/B1HaszmlT.png)

#### Analyse logs and monitoring:

**Agent logs**

Many errors are caused by failing connections to the cloud service (tiles-hub). They indicate a network connection error between the on-premise service (Agent) and the cloud service (tiles-hub). Since we can't see these failing requests in our logs, the cause might be requests being blocked by the HP-H network infrastructure.

```
2023-09-27 10:33:27.670 +02:00 [ERR] HubConnection reconnecting due to an error.
System.TimeoutException: Server timeout (5000,00ms) elapsed without receiving a message from the server.
2023-09-27 10:38:13.957 +02:00 [ERR] HubConnection reconnecting due to an error.
System.TimeoutException: Server timeout (5000,00ms) elapsed without receiving a message from the server.
2023-09-27 10:38:14.010 +02:00 [ERR] HubConnection reconnecting due to an error.
System.TimeoutException: Server timeout (5000,00ms) elapsed without receiving a message from the server.
2023-09-27 10:38:14.010 +02:00 [ERR] Failed to serve a tile
```

**Browser logs (direct connection):**

Clients' browsers report many errors indicating that requests are not being received by the on-premise Digital Lab server at HP-H (https://patho-hph.depot.pathozoom.com).

![](https://hackmd.io/_uploads/ryuO9MmgT.png)

**Indexing logs**

There is a potential correlation between the performance issues and the indexing of new slides. This could happen for the following reasons:

- Network bandwidth being consumed.
- NAS resources being utilized.

![](https://hackmd.io/_uploads/HJvKRfme6.png)

## What we are doing:

- Reduce the load on the NAS.
  - Better source directory enumeration (short-term fix).
  - Implement incoming folder (long-term solution).
- Better handling for failing tiles: when tile serving from the Agent fails, it produces many errors and exceptions. This might slightly affect availability, but it shouldn't be the main factor behind the performance issues.
- More monitoring.
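

As one possible starting point for the additional monitoring, the sketch below counts `[ERR]` entries per minute in Agent log lines shaped like the excerpt above. It is only an illustration: the function name `count_errors_per_minute` and the `SAMPLE` constant are hypothetical, the line format is assumed from the excerpt alone, and the real Agent logs may contain other layouts that this pattern would skip.

```python
import re
from collections import Counter

# Assumed line layout, taken from the Agent log excerpt:
#   2023-09-27 10:33:27.670 +02:00 [ERR] <message>
# The "ts" group keeps the timestamp only up to the minute.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3} "
    r"[+-]\d{2}:\d{2} \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def count_errors_per_minute(lines):
    """Return a Counter keyed by (minute, message) for [ERR] entries.

    Lines that do not start with a timestamp (for example the
    System.TimeoutException continuation lines) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERR":
            counts[(match.group("ts"), match.group("msg"))] += 1
    return counts


# Sample input copied from the Agent log excerpt in this note.
SAMPLE = """\
2023-09-27 10:33:27.670 +02:00 [ERR] HubConnection reconnecting due to an error.
System.TimeoutException: Server timeout (5000,00ms) elapsed without receiving a message from the server.
2023-09-27 10:38:13.957 +02:00 [ERR] HubConnection reconnecting due to an error.
System.TimeoutException: Server timeout (5000,00ms) elapsed without receiving a message from the server.
2023-09-27 10:38:14.010 +02:00 [ERR] HubConnection reconnecting due to an error.
System.TimeoutException: Server timeout (5000,00ms) elapsed without receiving a message from the server.
2023-09-27 10:38:14.010 +02:00 [ERR] Failed to serve a tile
"""
```

Calling `count_errors_per_minute(SAMPLE.splitlines())` groups the reconnect errors and the failed tile by minute. Run against a full log file, this could help check whether error bursts line up with the slide indexing windows.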