repository: Add ObjectsSize RPC to calculate fine-grained objects size (6cc0e03f) · Commits · gitlab-org / Gitaly

Commit 6cc0e03f authored 1 year ago by Patrick Steinhardt

repository: Add ObjectsSize RPC to calculate fine-grained objects size

In order to calculate a repository's size we provide multiple different
functions. All of these have in common that they return the on-disk size
of various data structures in varying degrees of detail. But none of them
provide the caller with the means to calculate the size of objects which
are reachable from a starting set of revisions. This results in multiple
problems:

    - It is hard to calculate the size of subsets of the object graph,
      e.g. of newly pushed objects only, or while excluding references
      that are purely internal.

    - While `RepositorySize()` discerns normal objects from those which
      are currently waiting to be pruned via cruft packs, this metric is
      lagging behind significantly as cruft packs are only updated every
      few days.

    - Objects that exist in multiple packfiles, or both as a packed and
      as a loose object, are counted multiple times.

    - It is impossible to figure out whether a subset of objects is
      deduplicated via object pools.

This information can be quite important in certain contexts though, e.g.
when trying to calculate storage size quotas.
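
The double-counting problem from the list above can be illustrated with a
small Go sketch. It is only a toy model: the `entry` type, the container
names and the sizes are made up for the example, and it assumes that an
object has the same size in every container it appears in.

```go
package main

import "fmt"

// entry is a hypothetical record of one object as stored in one container
// (a packfile or the loose object directory).
type entry struct {
	oid       string
	size      uint64
	container string
}

// naiveSize sums the size of every stored copy, which is roughly what
// adding up per-container on-disk sizes amounts to.
func naiveSize(entries []entry) uint64 {
	var total uint64
	for _, e := range entries {
		total += e.size
	}
	return total
}

// dedupedSize counts every object ID only once, using the first copy seen.
func dedupedSize(entries []entry) uint64 {
	seen := make(map[string]struct{})
	var total uint64
	for _, e := range entries {
		if _, ok := seen[e.oid]; ok {
			continue
		}
		seen[e.oid] = struct{}{}
		total += e.size
	}
	return total
}

func main() {
	entries := []entry{
		{oid: "a", size: 100, container: "pack-1"},
		{oid: "a", size: 100, container: "pack-2"}, // same object, second pack
		{oid: "b", size: 50, container: "loose"},
	}
	fmt.Println("naive:", naiveSize(entries))
	fmt.Println("deduplicated:", dedupedSize(entries))
}
```

A graph walk visits each reachable object once, so it behaves like
`dedupedSize` rather than `naiveSize`.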

Implement a new `ObjectsSize()` RPC that calculates the size of objects
reachable from a given set of (pseudo-)revisions via git-rev-list(1).
This is as accurate as we can get and allows for determining the size of
objects for various use cases:

    - The size of a single branch (`refs/heads/master`).

    - The size of all references (`--all`) or branches (`--branches`).

    - The size of new objects in a push (`$new_tips --not --all`).

    - The size of objects which are not deduplicated in an object
      deduplication network (`--all --not --alternate-refs`).

    - The size of objects which are deduplicated in an object
      deduplication network (`--alternate-refs`).

This RPC is thus as accurate as possible while also being quite
flexible. It comes with the downside though that doing the graph walk to
figure out reachable objects is quite expensive depending on both the
number of references and objects. This cannot really be helped though:
the caller needs to choose between either getting fast but coarse or
slow but accurate results.

Changelog: added

parent 4d6d4bb6


Showing with 2081 additions and 1264 deletions
