# Best practices for Git repositories consisting of a very large number of TEI XML files

This document: https://hackmd.io/@UPPMAX/git-large-tei-xml

## Possible workflow on Bianca

- clone / partial clone to scratch (node-local I/O operations are faster)
- commit locally
- push to the central repo
- commit less often

## Challenges with Git version control for TEI XML files

- inconsistent indentation, line wrapping, and attribute reordering produce noisy diffs
- ensure validation and schema compliance during commits and pull requests
  - e.g., use tools like `xmllint` and integrate them with pre-commit hooks

## Git version control for repositories with a very large number of files

Ensure you have a recent version of Git, as some of the options are new.

- organize files into subdirectories
- consider:
  - submodules
  - subtrees
  - `sparse-checkout` and partial clone (per-user narrow clones)
  - or a combination of the above
- add to `.gitignore` everything that does not need version control
- configure Git:

  **CAUTION**: these options need to be tested to see whether they improve performance for this particular repo. Make sure you understand what each setting does before applying it.

  - `git config feature.manyFiles true` lets Git know it is dealing with a large repo
  - `git update-index --index-version 4` performs a simple pathname compression that reduces index size by 30%-50% on large repositories
  - performance tweaks:
    - `git config --global core.preloadIndex true` preloads the index in memory, speeds up `git status` and `git diff`
    - `git config --global core.deltaBaseCacheLimit 512m` affects delta compression performance
    - `git config --global core.fsync none` reduces I/O latency; safe for scratch/non-critical repos, so it would be suitable for local work before pushing to the central repo
    - `git config --global core.packedGitLimit 512m` the larger, the better, but it needs to fit in available RAM
    - `git config --global core.packedGitWindowSize 64m` useful during compression/cloning/fetching; the value may need tuning, since large window sizes increase memory usage
    - `git config --global core.untrackedCache true` useful for repos with many untracked/generated files
  - `git status` speed-up and index optimization:
    - `git config status.showUntrackedFiles no`
    - `git config diff.renameLimit 0` (note: Git treats a value of 0 as unlimited, so test whether this actually helps)
    - ~~`git config core.fsmonitor true`~~ speeds up `git status`; update: does not seem to work, unsure about the suitable argument for `fsmonitor`
  - disable Git garbage collection (`git config gc.auto 0`); reason: automatic garbage collection is too heavy in a shared environment
  - maybe???
    - `git config pack.threads 1` sets the number of pack threads to 1

Example of `.gitconfig` (does not include all options above):

```
[core]
	preloadIndex = true
	deltaBaseCacheLimit = 512m
	packedGitLimit = 512m
	packedGitWindowSize = 64m
	fsync = none
	ignoreStat = true
	untrackedCache = true
[status]
	showUntrackedFiles = no
[gc]
	auto = 0
[pack]
	threads = 1
```

References:

- https://www.git-tower.com/blog/git-performance/
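
The scratch workflow and the `sparse-checkout` suggestion above can be tried out end to end without touching a real repository. The following is a minimal sketch, not a tested Bianca recipe: the "central" repo is a throwaway bare repository in a temporary directory, and the directory names `letters/` and `diaries/`, the file names, and the commit identity are all hypothetical. Against a real server you would clone from its URL instead and might add `--filter=blob:none` for a partial clone.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway stand-ins (assumptions): a bare "central" repo, a seed repo that
# fills it, and a "scratch" directory playing the role of node-local storage.
work="$(mktemp -d)"
central="$work/central.git"
seed="$work/seed"
scratch="$work/scratch"

# Helper: commit with an explicit identity so the sketch runs anywhere.
commit() {
    git -C "$1" -c user.name="Example User" -c user.email=user@example.org \
        commit --quiet -m "$2"
}

# 1. Create the central repo and seed it with two subdirectories of TEI files.
git init --quiet --bare "$central"
git init --quiet "$seed"
mkdir -p "$seed/letters" "$seed/diaries"
printf '# TEI corpus\n' > "$seed/README.md"
printf '<TEI/>\n' > "$seed/letters/a.xml"
printf '<TEI/>\n' > "$seed/diaries/b.xml"
git -C "$seed" add .
commit "$seed" "Seed corpus"
git -C "$seed" push --quiet "$central" HEAD

# 2. Clone to scratch with sparse-checkout enabled, then narrow the working
#    tree to the one subdirectory this user works on.
git clone --quiet --sparse "$central" "$scratch"
git -C "$scratch" sparse-checkout set letters

# 3. Edit and commit locally on scratch.
printf '<TEI><!-- edited --></TEI>\n' > "$scratch/letters/a.xml"
git -C "$scratch" add letters/a.xml
commit "$scratch" "Edit letter a"

# 4. Push the local work back to the central repo.
git -C "$scratch" push --quiet origin HEAD
echo "pushed $(git -C "$scratch" rev-parse --short HEAD) to $central"
```

After step 2 only `letters/` (plus, depending on the Git version, top-level files) is present in the working tree, so `git status` has far fewer files to scan than in a full checkout.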
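
The suggestion to integrate `xmllint` with pre-commit hooks could look roughly like the sketch below. It installs a hook into a throwaway repository (the `demo` path is an assumption for illustration) and only checks well-formedness with `xmllint --noout`; schema validation against a TEI schema would need additional options and is not shown. The hook checks the working-tree copy of each staged file and assumes file names without spaces, which is a simplification.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway repository used only for this demonstration (assumption).
demo="$(mktemp -d)"
git init --quiet "$demo"
mkdir -p "$demo/.git/hooks"

# Install a pre-commit hook that checks staged XML files for well-formedness.
cat > "$demo/.git/hooks/pre-commit" <<'HOOK'
#!/bin/sh
# Skip the check gracefully when xmllint is not installed.
if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint not found, skipping XML check" >&2
    exit 0
fi

status=0
# Only look at XML files added, copied, or modified in this commit.
# Simplification: file names containing whitespace are not handled.
for f in $(git diff --cached --name-only --diff-filter=ACM -- '*.xml'); do
    xmllint --noout "$f" || status=1
done

# A non-zero exit status makes Git abort the commit.
exit $status
HOOK
chmod +x "$demo/.git/hooks/pre-commit"
echo "installed hook in $demo/.git/hooks/pre-commit"
```

Because hooks in `.git/hooks` are not versioned, each clone needs the hook installed separately; checking only the staged files keeps the hook fast even when the repository contains a very large number of XML files.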