The "Partial Clone" feature is a performance optimization for Git that allows Git to function without having a complete copy of the repository. The goal of this work is to allow Git to better handle extremely large repositories.

During clone and fetch operations, Git downloads the complete contents and history of the repository. This includes all commits, trees, and blobs for the complete life of the repository. For extremely large repositories, clones can take hours (or days) and consume 100+GiB of disk space.

Often in these repositories there are many blobs and trees that the user does not need, such as:

  1. files outside of the user’s work area in the tree. For example, in a repository with 500K directories and 3.5M files in every commit, we can avoid downloading many objects if the user only needs a narrow "cone" of the source tree.

  2. large binary assets. For example, in a repository where large build artifacts are checked into the tree, we can avoid downloading all previous versions of these non-mergeable binary assets and only download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects in advance during clone and fetch operations and thereby reduce download times and disk usage. Missing objects can later be "demand fetched" if/when needed.

Use of partial clone requires that the user be online and the origin remote be available for on-demand fetching of missing objects. This may or may not be problematic for the user. For example, if the user can stay within the pre-selected subset of the source tree, they may not encounter any missing objects. Alternatively, the user could try to pre-fetch various objects if they know that they are going offline.
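The behavior described above (a pre-selected subset downloaded up front, missing objects demand fetched while online, and optional pre-fetching before going offline) can be illustrated with a toy model. This is only a sketch and not Git's implementation: `PartialStore`, `OfflineError`, the `online` flag, and the short object IDs are all hypothetical names invented for illustration.

```python
class OfflineError(RuntimeError):
    """Raised when a missing object is needed but the remote is unreachable."""


class PartialStore:
    """Toy object store: some objects are local, the rest stay on the remote."""

    def __init__(self, remote_objects, wanted):
        # Hypothetical stand-in for the origin remote: OID -> content.
        self.remote = dict(remote_objects)
        # Only the pre-selected subset is downloaded up front.
        self.local = {oid: self.remote[oid] for oid in wanted}
        self.online = True
        self.fetches = 0  # number of objects fetched after the initial clone

    def read(self, oid):
        """Return an object's content, fetching it on demand if it is missing."""
        if oid not in self.local:
            if not self.online:
                raise OfflineError(f"object {oid} is missing and the remote is unavailable")
            self.local[oid] = self.remote[oid]
            self.fetches += 1
        return self.local[oid]

    def prefetch(self, oids):
        """Fetch objects ahead of time, for example before going offline."""
        for oid in oids:
            self.read(oid)


remote = {"a1": "src/main.c", "b2": "docs/guide.txt", "c3": "assets/big.bin"}
store = PartialStore(remote, wanted=["a1"])

store.read("a1")        # already local, no fetch needed
store.read("b2")        # missing, so it is demand fetched
store.online = False
try:
    store.read("c3")    # missing while offline
    offline_failed = False
except OfflineError:
    offline_failed = True
```

In this model a user who only ever reads `a1` never notices that the other objects are absent, which mirrors the case of staying within the pre-selected subset of the source tree.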

Non-Goals

Partial clone is a mechanism to limit the number of blobs and trees downloaded within a given range of commits. It is therefore independent of, and not intended to conflict with, existing DAG-level mechanisms to limit the set of requested commits (i.e. shallow clone, single branch, or fetch <refspec>).

Design Overview

Partial clone logically consists of the following parts:

Design Details

Handling Missing Objects

Fetching Missing Objects

Current Limitations

Future Work

Non-Tasks

Footnotes

[a] expensive-to-modify list of missing objects: Earlier in the design of partial clone we discussed the need for a single list of missing objects. This would essentially be a sorted linear list of OIDs that were omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup. It would need to be read, updated, and re-written (like the .git/index) on every explicit "git fetch" command and on any dynamic object fetch.

The cost to read, update, and write this file could add significant overhead to every command if there are many missing objects. For example, if there are 100M missing blobs, this file would be at least 2 GB on disk.
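The size estimate can be checked with simple arithmetic. The sketch below assumes the list stores raw 20-byte SHA-1 object IDs and nothing else; the function name `missing_list_size` is hypothetical, and any per-entry metadata or a longer hash would only increase the result.

```python
OID_BYTES = 20  # assumption: raw (binary) SHA-1 object IDs, no per-entry metadata


def missing_list_size(num_missing, oid_bytes=OID_BYTES):
    """Lower bound, in bytes, for a flat list of missing object IDs."""
    return num_missing * oid_bytes


size = missing_list_size(100_000_000)
print(f"{size / 10**9:.2f} GB")    # decimal gigabytes
print(f"{size / 2**30:.2f} GiB")   # binary gibibytes
```

Under these assumptions the list alone is 2,000,000,000 bytes, all of which would have to be read, and rewritten after any fetch.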

With the "promisor" concept, we infer a missing object based upon the type of packfile that references it.
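A toy model may help show how inference can replace an explicit list. It assumes that packfiles received from the promisor remote carry a marker, and that any object referenced from such a packfile but absent locally is treated as a promised (missing but fetchable) object. The names `Pack`, `is_promisor`, `promised_missing`, and `classify` are hypothetical and do not correspond to Git's internal API.

```python
from dataclasses import dataclass, field


@dataclass
class Pack:
    # Hypothetical flag: True if this pack was received from the promisor remote.
    is_promisor: bool
    # OID -> list of OIDs that the object references.
    objects: dict = field(default_factory=dict)


def local_oids(packs):
    """All object IDs that are physically present in some pack."""
    return {oid for pack in packs for oid in pack.objects}


def promised_missing(packs):
    """OIDs referenced from promisor packs that are not present locally."""
    referenced = {
        ref
        for pack in packs if pack.is_promisor
        for refs in pack.objects.values()
        for ref in refs
    }
    return referenced - local_oids(packs)


def classify(oid, packs):
    """Classify an OID without consulting any stored list of missing objects."""
    if oid in local_oids(packs):
        return "present"
    if oid in promised_missing(packs):
        return "promised"
    return "unexpectedly missing"


packs = [
    Pack(True, {"commit1": ["tree1"], "tree1": ["blobA", "blobB"], "blobA": []}),
    Pack(False, {"commit2": ["tree2"]}),
]
print(classify("blobB", packs))   # referenced from a promisor pack, absent locally
print(classify("tree2", packs))   # referenced only from a non-promisor pack
```

The point of the sketch is that nothing needs to be rewritten when objects are fetched: the set of promised objects is derived from the packs that are already on disk.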

[0] https://crbug.com/git/2
Bug#2: Partial Clone

[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
Subject: [RFC] Add support for downloading blobs on demand
Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/
Subject: Proposal for missing blob support in Git repos
Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
Subject: [PATCH 00/10] RFC Partial Clone and Fetch
Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/
Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
Date: Fri, 14 Jul 2017 09:26:50 -0400