
Best practices for Git repositories consisting of a very large number of TEI XML files

This document: https://hackmd.io/@UPPMAX/git-large-tei-xml

Possible workflow on Bianca

  • clone or partial-clone to scratch (node-local I/O is faster)
  • commit locally
  • push to central repo
  • commit less often
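
The workflow above could look roughly like the following sketch. The function name scratch_clone, the variable SCRATCH_DIR, the fallback to $TMPDIR, and the example URL in the usage comments are assumptions made for illustration; substitute the node-local scratch path your cluster provides and the real URL of the central repository.

```shell
#!/usr/bin/env bash
# Sketch only: scratch_clone and SCRATCH_DIR are hypothetical names,
# not part of Git or of the cluster environment.

# Clone a central repository into node-local scratch space and print the path.
# Usage: scratch_clone <central-repo-url> [target-dir-name]
scratch_clone() {
    local url="$1"
    local name="${2:-work}"
    # Assumption: SCRATCH_DIR points at node-local storage; fall back to TMPDIR or /tmp.
    local scratch="${SCRATCH_DIR:-${TMPDIR:-/tmp}}"

    # --filter=blob:none requests a partial clone: file contents are fetched on demand.
    # Git's own messages are sent to stderr so that only the path reaches stdout.
    git clone --filter=blob:none "$url" "$scratch/$name" >&2 || return 1
    printf '%s\n' "$scratch/$name"
}

# Possible usage (the URL is a placeholder):
#   cd "$(scratch_clone https://example.org/tei-corpus.git)"
#   ... edit files, commit locally in larger batches ...
#   git add -A && git commit -m "Batch of TEI edits"
#   git push origin HEAD
```

Printing the clone path on stdout keeps the helper easy to combine with cd; whether a partial clone actually helps depends on the repository and should be measured.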

Challenges with Git version control for TEI XML files

  • inconsistent indentation, line wrapping, attribute reordering
  • ensure validation and schema compliance during commits and pull requests
    • e.g., use tools like xmllint and integrate them with pre-commit hooks
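
As a minimal sketch of the pre-commit idea above, the helper below runs xmllint on every staged XML file and fails if any of them is not well-formed. The function name check_staged_xml is hypothetical, and the sketch only checks well-formedness, not TEI schema compliance.

```shell
#!/usr/bin/env bash
# Sketch of a pre-commit check; check_staged_xml is a hypothetical helper name.

# Run xmllint on every staged *.xml file; return non-zero if any file fails.
check_staged_xml() {
    local status=0 file
    # List added/copied/modified staged files, NUL-separated to survive odd file names.
    while IFS= read -r -d '' file; do
        # --noout suppresses the serialized document; only errors are reported.
        # Note: this checks the working-tree copy, which may differ from the staged blob.
        if ! xmllint --noout "$file"; then
            echo "XML check failed: $file" >&2
            status=1
        fi
    done < <(git diff --cached --name-only --diff-filter=ACM -z -- '*.xml')
    return "$status"
}
```

To try it as a hook, one option is to save the function in .git/hooks/pre-commit, add a final line calling check_staged_xml, and make the file executable. For schema validation, xmllint also accepts --relaxng followed by a schema file; which schema to use is project-specific.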

Git version control for repositories with a very large number of files

Ensure you have a recent version of Git as some of the options are new.

  • organize files into subdirectories
  • consider:
    • submodules
    • subtrees
    • sparse-checkout and partial clone (per-user narrow clones)
    • or a combination of the above
  • add everything that does not need version control to .gitignore
  • Configure Git. CAUTION: test each of these options to see whether it improves performance for this particular repo, and make sure you understand what a setting does before applying it:
    • git config feature.manyFiles true enables defaults tuned for repositories with many files (such as index.version=4 and core.untrackedCache=true)
    • git update-index --index-version 4 performs a simple pathname compression that reduces index size by 30%-50% on large repositories
    • performance tweaks:
    • git config --global core.preloadIndex true preloads index in memory, speeds up git status and git diff
    • git config --global core.deltaBaseCacheLimit 512m sets the memory reserved for caching delta base objects, which affects how often they must be unpacked again
    • git config --global core.fsync none reduces I/O latency; safe for scratch/non-critical repos, so suitable for local work before pushing to the central repo
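    • git config --global core.packedGitLimit 512m limits how many bytes of pack files are mapped into memory at once; the larger, the better, but it needs to fit in available RAM (check the documented default for your platform first, since on 64-bit systems it is already very large and 512m would act as a cap)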
    • git config --global core.packedGitWindowSize 64m useful during compression/cloning/fetching; the value may need tuning, and large window sizes increase memory usage
    • git config --global core.untrackedCache true useful for repos with many untracked/generated files
    • git status speed-up and index optimization:
      • git config status.showUntrackedFiles no
      • git config diff.renameLimit 0
    • git config core.fsmonitor true is meant to speed up git status; update: it does not seem to work here, and the suitable value for core.fsmonitor is unclear
    • disable git garbage collection (git config gc.auto 0), reason: auto garbage collection is too heavy in a shared environment
    • maybe???
      • git config pack.threads 1 sets the pack threads to 1
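
The per-user narrow clones mentioned in the list above (sparse-checkout combined with partial clone) might be set up along these lines. The helper name narrow_clone and the directory names in the usage comment are hypothetical; the Git options themselves require a reasonably recent Git.

```shell
#!/usr/bin/env bash
# Sketch only: narrow_clone is a hypothetical helper name.

# Create a partial, sparse clone that checks out only the listed directories.
# Usage: narrow_clone <repo-url> <target-dir> <subdir> [<subdir>...]
narrow_clone() {
    local url="$1" target="$2"
    shift 2
    # --filter=blob:none: fetch file contents lazily.
    # --sparse: start with only the top-level files checked out.
    git clone --filter=blob:none --sparse "$url" "$target" >&2 || return 1
    # Cone mode restricts the working tree to whole directories.
    # (Newer Git versions deprecate `init` in favour of `set --cone`,
    # but still accept it.)
    git -C "$target" sparse-checkout init --cone || return 1
    git -C "$target" sparse-checkout set "$@"
}

# Possible usage (URL and directory names are placeholders):
#   narrow_clone https://example.org/tei-corpus.git corpus texts/1850 texts/1851
```

In cone mode the working tree contains the requested directories plus the files that sit directly in their parent directories, so each user only pays for the part of the corpus they work on.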

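Example of .gitconfig (does not include all options above; note that core.ignoreStat is not covered in the list above and should be tested with the same caution):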

[core]
    preloadIndex = true
    deltaBaseCacheLimit = 512m
    packedGitLimit = 512m
    packedGitWindowSize = 64m
    fsync = none
    ignoreStat = true
    untrackedCache = true

[status]
    showUntrackedFiles = no

[gc]
    auto = 0

[pack]
    threads = 1
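
Since each option needs to be tested on the repository in question, one possible approach is to apply settings to a single repository rather than to ~/.gitconfig and to time git status before and after. The sketch below does this for a subset of the options; the function names apply_large_repo_config and time_status are hypothetical, and the choice of settings is only an example.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical helpers.

# Apply a subset of the settings discussed above to ONE repository
# (repository-local config) instead of the global ~/.gitconfig.
apply_large_repo_config() {
    local repo="$1"
    git -C "$repo" config core.preloadIndex true &&
    git -C "$repo" config core.untrackedCache true &&
    git -C "$repo" config status.showUntrackedFiles no &&
    git -C "$repo" config gc.auto 0 &&
    # Rewrite the index in the version 4 format (pathname compression).
    git -C "$repo" update-index --index-version 4
}

# Print the wall-clock time in seconds taken by `git status` in a repository.
time_status() {
    local TIMEFORMAT='%R'   # bash: report only elapsed real time
    { time git -C "$1" status >/dev/null 2>&1; } 2>&1
}

# Possible usage:
#   time_status /path/to/repo            # baseline
#   apply_large_repo_config /path/to/repo
#   time_status /path/to/repo            # compare
```

Keeping the settings repository-local means that options such as gc.auto 0 do not silently affect other repositories of the same user; a single timing is noisy, so several runs are worth comparing.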

References: