Gitifying I2P: How to make git clone resumable

2018-09-07 20:50:45 +0200 - Written by Mikal Villa

Have you ever cloned a git repository over a bad internet connection? I have, and it doesn’t work. When I lived in the Philippines for a year and a half, I had to clone repositories to a server of mine in Norway and then share the repo via torrent so I could download it to my laptop in the Philippines. Judging by what I’ve found when googling the topic, many others have asked for this feature or for workarounds. Luckily, there is hope for git and resumable clones and fetches.

This year, I’m lucky enough to work on I2P full-time. I2P is a mix network, somewhat like Tor. This link compares the two. Working for and with I2P is beyond awesome, but we have a minor problem: our version control system, Monotone, which I bet almost none of you have ever used. It’s actually not that bad in itself, but it has a couple of issues:

  • No one afaik uses it besides us (the I2P project).
  • There is very little documentation; if you google it, you won’t find anything.
  • It hasn’t been updated since 2011.
  • It’s hard to get into, even with knowledge of other VCSs.

Monotone was chosen before the git revolution, and it was quite a good tool because it supported cryptographic signing of commits and resuming after network errors, which is perfect for I2P.

However, many years later, Git is familiar to almost every developer, and it supports both signed commits and recovery/resume out of the box! And no, I’m not talking about the archive or bundle features.

This started with research into git internals and git remote helpers. After a careful look through all of the git binaries and scripts, I found an option in a rather hidden git subcommand named http-fetch. You can run man git-http-fetch or read the manual online on git’s website to learn more about it.

For my test over I2P, I decided to repack the server’s git repository with the command git --git-dir=bare-test-repo.git repack --threads 8 --max-pack-size 1m -A -d -f -F, which creates a lot of pack files with a maximum size of 1 MB each instead of one big 300 MB pack. Note also that just by tweaking parameters and repacking/unpacking objects, I have grown and shrunk the Git repo between 646 MB, 384 MB and 148 MB without adding or removing any content, so there is room for performance tuning and optimization here. By default, Git keeps everything in one packfile because that is best for performance. Finally, I served the Git repository over HTTP with grack and pointed an I2P tunnel at it.
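The repack step is easier to read as a small script. This is only a sketch: the function names split_packs and count_packs are made up for illustration, and the only git flags used are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helpers; only the git repack flags come from the text above.

# Repack a bare repository into packs of at most roughly 1 MB each.
split_packs() {
  git --git-dir="$1" repack --threads 8 --max-pack-size 1m -A -d -f -F
}

# Count the packfiles in a bare repository directory.
count_packs() {
  n=0
  for p in "$1"/objects/pack/*.pack; do
    # If the glob matches nothing, $p is the literal pattern and -f fails.
    [ -f "$p" ] && n=$((n + 1))
  done
  echo "$n"
}

# Usage (assumes bare-test-repo.git exists):
#   split_packs bare-test-repo.git
#   count_packs bare-test-repo.git
```

Comparing the count before and after the repack is a quick way to confirm that the repository really was split into many small packs.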

I also noticed that http-fetch always tries loose objects first, so you can either split your pack files into chunks that fit your needs or just keep all objects loose. The internals section of the Git book can teach you more about packfiles and loose objects here.

Here is a script to unpack packfiles to loose objects in a git repository:

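#!/bin/sh
# Unpack every packfile in the current repository into loose objects.

tmp=/tmp/tmpgit.$$

# Expand the glob into the positional parameters; if nothing matches,
# $1 is the literal pattern and the -f test fails.
set -- .git/objects/pack/*.pack
if [ -f "$1" ]; then
  mkdir "$tmp"
  GIT_DIR="$tmp" git init

  for pack in "$@"; do
    if ! GIT_DIR="$tmp" git unpack-objects < "$pack"; then
      echo "Unpack of $pack failed, aborting"
      exit 1
    fi
  done

  # Note: --delete removes everything in .git/objects that is not in the
  # temporary copy, including the original packs.
  rsync -a --info=progress2 --delete "$tmp/objects/" .git/objects/

  rm -fr "$tmp"
else
  echo "No packs to unpack"
  exit 1
fi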

Name it git-unpack and place it in your PATH, and you can run it as git unpack inside the root directory of your Git repository :)

A sample from access log produced by grack:

127.0.0.1 - - [07/Sep/2018:21:33:34 +0200] "GET /i2p.git/objects/info/packs HTTP/1.1" 200 28303 0.0024
127.0.0.1 - - [07/Sep/2018:21:33:37 +0200] "GET /i2p.git/objects/74/275493be573bc4917f0dec38f141e6d3d1dec3 HTTP/1.1" 404 - 0.0008
127.0.0.1 - - [07/Sep/2018:21:33:39 +0200] "GET /i2p.git/objects/69/78829554290491439f4515943b88f740a8fdca HTTP/1.1" 404 - 0.0009
127.0.0.1 - - [07/Sep/2018:21:33:41 +0200] "GET /i2p.git/objects/fa/05f88bfe4a8e1ae4b07822ccd0d7176615c420 HTTP/1.1" 404 - 0.0010
127.0.0.1 - - [07/Sep/2018:21:33:43 +0200] "GET /i2p.git/objects/d0/9e885133c6a23773a040562ec96e66f7e00c29 HTTP/1.1" 404 - 0.0014
127.0.0.1 - - [07/Sep/2018:21:33:45 +0200] "GET /i2p.git/objects/ea/d1f93d7c91c660c2ec71dd0076d4c61e44dfbb HTTP/1.1" 404 - 0.0009
127.0.0.1 - - [07/Sep/2018:21:33:46 +0200] "GET /i2p.git/objects/pack/pack-27414fbc03964eeff8b3922ef6bea0b1ade4fa6e.pack HTTP/1.1" 200 1047379 0.0825
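The 404 lines in the log are presumably http-fetch probing for loose objects before it falls back to a pack. Git stores a loose object under a directory named after the first two hex characters of its hash, with the remaining 38 characters as the file name. A small sketch of that mapping (the function name loose_object_path is mine, not part of Git):

```shell
#!/bin/sh
# Illustrative helper: map an object hash to its loose-object path
# relative to the repository URL or git directory.
loose_object_path() {
  hash=$1
  rest=${hash#??}          # everything after the first two characters
  prefix=${hash%"$rest"}   # the first two characters
  echo "objects/$prefix/$rest"
}

# Usage:
#   loose_object_path 74275493be573bc4917f0dec38f141e6d3d1dec3
```

For the hash in the first 404 line of the log, this yields the same path that was requested from grack.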

So to actually clone a Git repository over I2P, you have to replace clone with three commands, which of course can be combined into a script. git-http-fetch expects to be run inside a repo, so we can’t use it with the clone subcommand. Instead, take a look below.

mkdir i2p.git
cd i2p.git
git init # Do not add anything to it, no commits, leave it empty.
# The next command lists all refs (branches/tags) with the hashes they point to.
http_proxy=127.0.0.1:4444 curl -v http://4grg5bjtr5m6fpat2wpwfq7axpfhkuvmsna5z4iqlthu72k3bv5a.b32.i2p/i2p.git/info/refs
# This can return something like:
# 4ead982831f4097ff058a847fdb2b89cc97d2ad7	refs/heads/master
# The hash starting with 4ead9 is needed in the next command.
# --
# The command below can be placed in a loop that checks its exit code,
# exits the loop when it is 0, and retries otherwise.
# This is the resumable magic :D
http_proxy=127.0.0.1:4444 git http-fetch --recover -a 4ead982831f4097ff058a847fdb2b89cc97d2ad7 http://4grg5bjtr5m6fpat2wpwfq7axpfhkuvmsna5z4iqlthu72k3bv5a.b32.i2p/i2p.git

With the commands above, I was able to “clone” the whole I2P codebase (a git export from Monotone with full commit history) hosted on one I2P router to my laptop via another I2P router connecting to the git-host router.
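The retry loop mentioned in the comments, together with picking the hash out of info/refs, could be sketched like this. The function names ref_hash and retry_until_ok and the RETRY_DELAY variable are my own inventions for illustration. The sketch also assumes each info/refs line is a hash, a tab, and a ref name, as in the sample output above.

```shell
#!/bin/sh
# Sketch only: ref_hash, retry_until_ok and RETRY_DELAY are illustrative names.

# Read an info/refs listing on stdin and print the hash for one ref.
# Each line is expected to look like "<hash><TAB><refname>".
ref_hash() {
  awk -v ref="$1" '$2 == ref { print $1; exit }'
}

# Run a command repeatedly until it exits 0, pausing between attempts.
# RETRY_DELAY is the pause in seconds (assumed default: 5).
retry_until_ok() {
  until "$@"; do
    echo "command failed, retrying in ${RETRY_DELAY:-5}s" >&2
    sleep "${RETRY_DELAY:-5}"
  done
}

# Usage, inside the empty repository created by git init:
#   export http_proxy=127.0.0.1:4444
#   url=http://4grg5bjtr5m6fpat2wpwfq7axpfhkuvmsna5z4iqlthu72k3bv5a.b32.i2p/i2p.git
#   hash=$(curl -s "$url/info/refs" | ref_hash refs/heads/master)
#   retry_until_ok git http-fetch --recover -a "$hash" "$url"
```

The loop retries forever by design, since the point is to survive an unreliable link. Add an attempt counter if you would rather give up after a while.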

Another option is to use wget or a similar tool and pipe the result into Git’s pack-parsing utilities. To get a list of packs to download, do a GET /i2p.git/objects/info/packs on the server. Basically, you append /objects/info/packs to the repository URL.
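A rough sketch of that wget route follows. It is untested over I2P, and the function names list_packs and fetch_packs are hypothetical. It assumes the objects/info/packs file lists one pack per line as "P pack-<hash>.pack". Instead of piping, it saves each pack to disk so that wget -c can resume a partial download, and then builds the index with git index-pack.

```shell
#!/bin/sh
# Sketch only: list_packs and fetch_packs are illustrative names.

# Read objects/info/packs on stdin and print one pack file name per line.
# Pack entries are assumed to look like "P pack-<hash>.pack".
list_packs() {
  sed -n 's/^P \(pack-[0-9a-f]*\.pack\)$/\1/p'
}

# Download every pack of a repository URL into the current repository
# and index it. wget -c continues a partially downloaded file.
fetch_packs() {
  url=$1
  wget -q -O - "$url/objects/info/packs" | list_packs |
  while read -r pack; do
    wget -c -P .git/objects/pack "$url/objects/pack/$pack" || return 1
    git index-pack ".git/objects/pack/$pack" || return 1
  done
}

# Usage, inside an empty repository created by git init:
#   export http_proxy=127.0.0.1:4444
#   fetch_packs http://4grg5bjtr5m6fpat2wpwfq7axpfhkuvmsna5z4iqlthu72k3bv5a.b32.i2p/i2p.git
```

This only brings in the objects. No ref points at them yet, so you would still need to create a branch from the hash listed in info/refs, for example with git update-ref.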

Short conclusion: Git continued where it stopped, both after a network read error and after I killed the process, so http-fetch can indeed recover with the --recover flag. I’m not saying there aren’t better options out there for resumable Git, but if there are, I haven’t found them yet :)

I’ve collected some links I’ve looked over and some manuals in a gist which anyone can look into. Since this webpage is also available on I2P as 0xcc.i2p, and GitHub isn’t, I’ll embed the markdown below.

# Git in depth

## How it works

* https://git-scm.com/book/en/v2/Git-Internals-Packfiles
* https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protocols
* https://rovaughn.github.io/2015-2-9.html
* https://git-scm.com/docs
* http://shafiulazam.com/gitbook/7_how_git_stores_objects.html
* https://git.wiki.kernel.org/index.php/Git_FAQ
* https://mirrors.edge.kernel.org/pub/software/scm/git/docs/gitremote-helpers.html
* https://docs.google.com/document/d/1X5SnleaX4qpLCc4QMAMWdvrA5QRsUO-YxXKjhSZRPpY/edit#
* http://marklodato.github.io/visual-git-guide/index-en.html
* http://gitready.com/beginner/2009/02/17/how-git-stores-your-data.html
* https://chromium.googlesource.com/chromium/src/+/master/docs/git_cookbook.md
* https://book.git-scm.com/
* https://github.com/peritus/git-remote-couch/blob/master/src/git_remote_couch/__init__.py
* https://mirrors.edge.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt
* http://repo.or.cz/w/git.git?a=tree;f=Documentation/technical;hb=HEAD
* https://www.atlassian.com/blog/git/tear-apart-repository-git-way
* https://github.com/potherca-bash/git-split-file
* https://githubengineering.com/counting-objects/
* https://msdn.microsoft.com/en-us/magazine/mt493250.aspx
* https://hackernoon.com/when-you-git-in-trouble-a-version-control-story-97e6421b5c0e

## Commands of special interest

* `git verify-pack -v .git/objects/pack/pack-*.idx` - Output data about the indexes
* `git show-index < .git/objects/pack/pack-*.idx` - Output data about the indexes (more and other format than above)
* `git rev-list REFS` - list all revs in a ref
* `find .git/objects -type f`  - get all object files
* `find .git/refs -type f` - list all refs 
* `git count-objects -v` - count loose and packed objects and show their disk usage


## Manuals worth looking into

Many commands, even hidden ones, can be found with `git help -a`

* git-cat-file
* git-ls-tree
* git-show-ref
* git-rev-list
* git-upload-pack
* git-show-index
* git-unpack-objects
* git-unpack-file
* git-index-pack
* git-fetch-pack
* git-gc
* git-fast-import
* git-ls-files
* git-pack-refs
* git-http-fetch
