Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

git/{odb,githistory}: don't write unchanged objects #2541

Merged
merged 6 commits into from Sep 7, 2017

Conversation

Projects
None yet
3 participants
Owner

ttaylorr commented Aug 30, 2017 edited

This pull request teaches a typesafe Equal() to *Blob's, *Commit's, and *Tree's, and uses that function in the git/githistory package to avoid writing unchanged objects to disk.

Previously, the git lfs migrate command would write the contents of every object that it examined in the migration, regardless of whether that object's contents changed. This is sub-optimal, because it involves not only marshaling the contents of unchanged objects to a buffer and performing an expensive save, but also because it effectively has the behavior of unpacking all objects in a repository. That's no good.

To address this, this pull request opts for the following option that I originally proposed in #2359:

Teach either:

  1. [...]
  2. Packfile APIs in order to write either a) packed objects, or b) skip writing packed objects if the object already exists in the database (loose or packed).

We can detect whether objects already exist in the object database in either a loose or packed format by determining whether or not the contents of the blob/tree/commit changed when rewriting it. In order to open the object, it must exist, so therefore, if it is unchanged, the object is already saved to the database.

By defining (t *<T>) Equal(<T> other) on each object type in package git/odb, and comparing the "rewritten" instances of each object against the original, we can avoid writing objects to the database (which involves several filesystem-level operations, open(), write(), a syscall to move, etc.) and instead return early, avoiding all of the previously mentioned operations.

This yields a significant performance boost. When examining the git-lfs/git-lfs repository prior to this change, it takes about 38.12 seconds to examing all ~6,000 commits:

~/g/git-lfs (master) $ time git lfs migrate info
migrate: Sorting commits: ..., done
migrate: Rewriting commits: 100% (5840/5840), done
# ...
git lfs migrate info  36.30s user 19.80s system 147% cpu 38.127 total

Including this change, however, the operation takes only 18.14 seconds:

~/g/git-lfs (master) $ time git lfs migrate info
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (5840/5840), done
# ...
git lfs migrate info  23.74s user 5.71s system 162% cpu 18.144 total

Closes: #2359.


/cc @git-lfs/core

ttaylorr added some commits Aug 30, 2017

@ttaylorr ttaylorr added the review label Aug 30, 2017

@ttaylorr ttaylorr added this to the v2.3.0 milestone Aug 30, 2017

@ttaylorr ttaylorr requested a review from technoweenie Aug 30, 2017

This looks like a great optimization. I have 1 question about the *Blob implementation though.

+ }
+
+ if b != nil {
+ return b.Contents == other.Contents &&
@technoweenie

technoweenie Aug 30, 2017

Owner

I'm not sure comparing io.Reader is the right solution here. I'd expect a test like this to pass:

func TestBlobEqualReturnsTrueWithUnchangedButReadContents(t *testing.T) {
	c1 := strings.NewReader("Hello, world!")
	c2 := strings.NewReader("Hello, world!")

	b1 := &Blob{Size: int64(c1.Len()), Contents: c1}
	b2 := &Blob{Size: int64(c2.Len()), Contents: c2}

	assert.True(t, b1.Equal(b2))
}

Based on the way that the blob rewrite function usually returns a new *Blob, I think this will always fail the Equal() check, unless both Contents are pointers to the same reader.

BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) {
if filepath.Base(path) == ".gitattributes" {
return b, nil
}
var buf bytes.Buffer
if _, err := clean(&buf, b.Contents, path, b.Size); err != nil {
return nil, err
}
if ext := filepath.Ext(path); len(ext) > 0 {
exts.Add(fmt.Sprintf("*%s filter=lfs diff=lfs merge=lfs -text", ext))
}
return &odb.Blob{
Contents: &buf, Size: int64(buf.Len()),
}, nil
},

I suppose this works out in practice, since any call to BlobFn means that blob is definitely being converted.

@ttaylorr

ttaylorr Aug 30, 2017

Owner

Great question -- you're absolutely right. Two io.Reader's that yield the same content but are instantiated differently will fail the equality check.

That's fine in practice here, since as you pointed out, all calls to the BlobFn mean the blob is definitely being converted. That said, there are cases in the 'info' migrator where the contents of the blob does not change:

BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) {
ext := fmt.Sprintf("*%s", filepath.Ext(path))
if len(ext) > 1 {
entry := exts[ext]
if entry == nil {
entry = &MigrateInfoEntry{Qualifier: ext}
}
entry.Total++
entry.BytesTotal += b.Size
if b.Size > int64(migrateInfoAbove) {
entry.TotalAbove++
entry.BytesAbove += b.Size
}
exts[ext] = entry
}
return b, nil
},

A more complete implementation might involve some buffering elsewhere, and keeping track of the SHA as it changes, but I think that this will do for now.

@ttaylorr ttaylorr merged commit a2981c1 into master Sep 7, 2017

5 checks passed

GitHub CLA @ttaylorr has accepted the GitHub Contributor License Agreement.
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/branch AppVeyor build succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ttaylorr ttaylorr deleted the migrate-no-write-unchanged branch Sep 7, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment