# Best practices for Git repositories consisting of a very large number of TEI XML files
This document: https://hackmd.io/@UPPMAX/git-large-tei-xml
## Possible workflow on Bianca
- clone (or partial clone) to scratch, since node-local I/O is faster
- commit locally
- push to central repo
- commit less often
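
The workflow above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe for Bianca: the function names are made up for illustration, and it assumes the node-local scratch directory is exposed as `$SNIC_TMP` (falling back to `$TMPDIR` or `/tmp`). The repository URL in the usage comments is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: clone to node-local scratch, commit there, push once at the end.
# All function names are hypothetical helpers, not part of Git.
set -euo pipefail

# clone_to_scratch URL [SCRATCH_DIR]
# Makes a blobless partial clone in a fresh directory under scratch
# and prints the path of the new working copy.
clone_to_scratch() {
    local url="$1"
    # Assumption: $SNIC_TMP points at node-local scratch inside a job.
    local scratch="${2:-${SNIC_TMP:-${TMPDIR:-/tmp}}}"
    local workdir
    workdir="$(mktemp -d "$scratch/tei-work.XXXXXX")"
    git clone --quiet --filter=blob:none "$url" "$workdir"
    echo "$workdir"
}

# commit_batch WORKDIR MESSAGE
# Stages everything that changed and records it as one local commit.
commit_batch() {
    git -C "$1" add --all
    git -C "$1" commit --quiet -m "$2"
}

# push_to_central WORKDIR
# Pushes the current branch back to the repository it was cloned from.
push_to_central() {
    git -C "$1" push --quiet origin HEAD
}

# Example use (placeholder URL):
#   work="$(clone_to_scratch git@example.org:project/tei-corpus.git)"
#   ... edit many files under "$work" ...
#   commit_batch "$work" "Normalise headers in corpus A"
#   push_to_central "$work"
```

Grouping many edits into one `commit_batch` call is what "commit less often" amounts to here: the index of a very large repository is rewritten once per commit, not once per file.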
## Challenges with Git version control for TEI XML files
- inconsistent indentation, line wrapping, attribute reordering
- ensure validation and schema compliance during commits and pull requests
    - e.g., use a tool like `xmllint` and integrate it with pre-commit hooks
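
One possible shape for such a pre-commit check is sketched below. It only tests well-formedness with `xmllint --noout`; validating against a TEI schema would need extra options and a schema file, which are not covered here. The function name `check_staged_xml` is hypothetical, and the sketch assumes `xmllint` is on the `PATH`.

```shell
#!/usr/bin/env bash
# Sketch of a .git/hooks/pre-commit helper: reject staged XML that is not well-formed.
set -euo pipefail

# check_staged_xml [REPO_DIR]
# Pipes the staged version of every added/copied/modified *.xml file
# through xmllint and returns non-zero if any of them fails to parse.
check_staged_xml() {
    local repo="${1:-.}" status=0 file
    while IFS= read -r -d '' file; do
        # ":path" names the blob in the index, i.e. what would be committed.
        if ! git -C "$repo" show ":$file" | xmllint --noout -; then
            echo "not well-formed: $file" >&2
            status=1
        fi
    done < <(git -C "$repo" diff --cached --name-only -z --diff-filter=ACM -- '*.xml')
    return "$status"
}

# In an actual hook file, the last line would be something like:
#   check_staged_xml || { echo "commit aborted" >&2; exit 1; }
```

Checking the staged blob instead of the file on disk means the hook judges exactly what will be committed. Only staged files are checked, so the cost per commit stays proportional to the size of the change, not the size of the repository.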
## Git version control for repositories with a very large number of files
Ensure you have a recent version of Git as some of the options are new.
- organize files into subdirectories
- consider:
    - submodules
    - subtrees
    - `sparse-checkout` and partial clone (per-user narrow clones)
    - or a combination of the above
- add everything that does not need version control to `.gitignore`
- Configure Git. **CAUTION**: test each of these options to see whether it improves performance for this particular repo, and make sure you understand what a setting does before applying it:
    - `git config feature.manyFiles true` opts in to defaults meant for repos with many files (such as index version 4 and the untracked cache)
    - `git update-index --index-version 4` performs a simple pathname compression that reduces index size by 30%-50% on large repositories
    - performance tweaks:
        - `git config --global core.preloadIndex true` checks index entries against the working tree in parallel, which speeds up `git status` and `git diff` (already the default in current Git versions)
        - `git config --global core.deltaBaseCacheLimit 512m` sets the per-thread cache for delta base objects; affects performance when many deltified objects share a base
        - `git config --global core.fsync none` skips fsync calls and so reduces I/O latency, at the risk of losing data after a crash; acceptable for scratch/non-critical repos, so it would suit local work before pushing to the central repo
        - `git config --global core.packedGitLimit 512m` caps how much pack data is memory-mapped at once; the larger, the better, as long as it fits in available RAM (note that on 64-bit platforms Git's built-in default is already much larger than this)
        - `git config --global core.packedGitWindowSize 64m` useful during compression/cloning/fetching; the value may need tuning, and larger window sizes increase memory usage
        - `git config --global core.untrackedCache true` useful for repos with many untracked/generated files
    - `git status` speed-up and index optimization:
        - `git config status.showUntrackedFiles no` stops `git status` from scanning for untracked files
        - `git config diff.renameLimit 0` changes the cap on rename detection; note that Git treats `0` as unlimited, so verify that this helps rather than hurts
        - ~~`git config core.fsmonitor true`~~ would speed up `git status`; update: does not seem to work here, and the right value for `core.fsmonitor` is unclear (the built-in monitor enabled by `true` may not be available on Linux)
    - disable automatic garbage collection (`git config gc.auto 0`); reason: auto garbage collection is too heavy in a shared environment
    - possibly also:
        - `git config pack.threads 1` limits packing to a single thread
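
The "per-user narrow clone" idea from the list above (partial clone combined with `sparse-checkout`) might look like this. It is a sketch under assumptions: `narrow_clone` is a hypothetical helper, the URL and directory names in the usage comment are placeholders, and the server has to support partial clone for `--filter` to have any effect.

```shell
#!/usr/bin/env bash
# Sketch: clone without file contents, then check out only selected directories.
set -euo pipefail

# narrow_clone URL TARGET DIR...
# --filter=blob:none  fetches commits and trees now, file contents on demand
# --sparse            starts with only the top-level files checked out
# sparse-checkout set then adds the listed directories to the working tree.
narrow_clone() {
    local url="$1" target="$2"
    shift 2
    git clone --quiet --filter=blob:none --sparse "$url" "$target"
    git -C "$target" sparse-checkout set "$@"
}

# Example use (placeholder URL and directory names):
#   narrow_clone git@example.org:project/tei-corpus.git ~/tei-narrow corpus-a schemas
```

Each user then only pays, in both disk space and `git status` time, for the directories they actually work on.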
Example of `.gitconfig` (does not include all options above):
```
[core]
preloadIndex = true
deltaBaseCacheLimit = 512m
packedGitLimit = 512m
packedGitWindowSize = 64m
fsync = none
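# ignoreStat is not in the list above: Git then assumes tracked files are unchanged, so edits made outside Git must be staged explicitly
ignoreStat = true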
untrackedCache = true
[status]
showUntrackedFiles = no
[gc]
auto = 0
[pack]
threads = 1
```
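
Since the options need to be tested on the actual repository, a small timing helper can make the comparison less ad hoc. The sketch below is one way to do it: `time_status` is a hypothetical function that runs `git status` once with the given settings applied only to that invocation (through `git -c`), so nothing is written to any config file.

```shell
#!/usr/bin/env bash
# Sketch: time `git status` with and without candidate settings.
set -euo pipefail

# time_status REPO [KEY=VALUE ...]
# Prints the wall-clock seconds of one `git status` run in REPO.
time_status() {
    local repo="$1"
    shift
    local kv args=()
    for kv in "$@"; do
        args+=(-c "$kv")
    done
    # %R makes bash's `time` keyword print only the elapsed seconds.
    local TIMEFORMAT='%R'
    { time git ${args[@]+"${args[@]}"} -C "$repo" status --porcelain >/dev/null 2>&1; } 2>&1
}

# Example use (placeholder path):
#   time_status /path/to/repo
#   time_status /path/to/repo status.showUntrackedFiles=no
```

A single run mostly measures the state of the filesystem cache, so it is worth repeating each measurement a few times and comparing the later runs. Settings that change what is stored in the index, such as `core.untrackedCache`, also only show their effect from the second run onwards.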
References:
- https://www.git-tower.com/blog/git-performance/