Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

commands: teach 'delay' capability to `git-lfs-filter-process(1)` #2511

Merged
merged 20 commits into from Aug 24, 2017

Conversation

Projects
None yet
4 participants
Owner

ttaylorr commented Aug 15, 2017 edited

This pull requests teaches the git-lfs-filter-process(1) command the 'delay' capability that was proposed by @larsxschneider in #1632 (comment) and added to Git in git/git@51b8aec.

This implementation allows the process filter to respond with a status=delayed when asked to smudge the contents of a file that it does not already have in the local .git/lfs/objects cache.

If any items were delayed, Git will later ask for blobs that were available with the list_available_blobs command, which Git LFS will either respond with the immediately available blobs, wait until at least one becomes available, or respond with an empty set indicating that the transfer is complete.

To implement this, I added a new delayedSmudge helper function, which behaves the same way as smudge(), but instead adds the pointer to *tq.TransferQueue given as an argument.

Here are some benchmarks when running git clone on GitHub.com and a local copy of GitHub:

lfs.concurrenttransfers=1 ...=3 ...=8 ...=16 ...=32
v2.13.2
github.com 29.2297 - - - -
github.dev 18.4657 - - - -
v2.14.1.next
github.com 45.6183 28.6867 14.9877 13.8893 13.7767
github.dev 9.02767 7.41867 7.70833 7.33933 7.71067

For a delta of:

lfs.concurrenttransfers=1 ...=3 ...=8 ...=16 ...=32
github.com -43.3866 0.543 14.242 15.3404 15.453
github.dev 9.43803 11.04703 10.75737 11.12637 10.75503

Excluding outliers:

v2.13.2 v2.14.1.next delta delta %
github.com 29.2297 17.8351 11.4s faster 164% faster
github.dev 18.4657 7.840934 10.6s faster 236% faster

There were three big performance changes that impact this improvement:

  1. #2483: Infinitely buffering the tq.(*TransferQueue).Add() function, which allows the TransferQueue to accept new items sent by Git, instead of waiting until the first batch of items were transferred.
  2. #2506: Increasing the default number of concurrently running transfers from 3 to 8 allows LFS to take advantage of more network bandwidth and process items faster.
  3. Not blocking on list_available_blobs until tq.Wait() returns. Instead, read from tq.Watch() until a read becomes blocking (via select { case <-tq.Wait(): ... default: }. This allows Git to perform the checkout while the tq.(*TransferQueue) is still transferring items.

Following this, I'd like to deprecate the git lfs clone command, since git clone is now as fast or faster than git lfs clone as of the latest tip of git.git.

EDIT: the each number is the average of 5 runs of a given command, with the outliers removed. The benchmark command is written as: benchmark.sh. The average-timings command is written as: average-timings.sh.

Closes: #2466.


/cc @git-lfs/core
/cc @larsxschneider

ttaylorr added some commits Aug 12, 2017

@ttaylorr ttaylorr added the review label Aug 15, 2017

@ttaylorr ttaylorr added this to the v2.3.0 milestone Aug 15, 2017

@ttaylorr ttaylorr requested review from technoweenie and larsxschneider Aug 15, 2017

Looking great. I pointed out a few things below:

commands/command_filter_process.go
+ }
+ }
+ case "list_available_blobs":
+ closeOnce.Do(func() { go q.Wait() })
@technoweenie

technoweenie Aug 16, 2017

Owner

I could use some clarification here. It's my understanding that a transfer queue can accept new transfers until Wait() is called. If I'm interpreting the protocol docs correctly, then the process should look like this:

  1. Git sends all pointers through smudge with can-delay=1
  2. Git sends list_available_blobs. The transfer queue is closed.
  3. Git sends more smudge commands for each available blob, never calling list_available_blobs again.
  4. Profit!!
@ttaylorr

ttaylorr Aug 16, 2017

Owner

🤔 not quite -- I think the confusion has to do with the readAvailable() func. readAvailable does as many reads as possible before <-tq.Watch() becomes blocking, responding to a list_available_blobs command with any blobs that it currently has ready, or blocking until at least one blob is ready.

The process looks like this:

  1. Git sends all pointers through command=smudge with can-delay=1.
  2. Git issues a list_available_blobs command, indicating that no more new blobs will be sent as above. The transfer queue is accordingly closed.
  3. We read as many currently available blobs as we can before blocking, or block until a) at least one blob becomes available, or b) the transfer queue has no more blobs to send, in which case we send an empty list, and the process is complete.
  4. Git send a command=smudge with can-delay=0 for each blob in the list above.
  5. Git issues another list_available_blobs command.
  6. See step 3.

I did my best to clarify this in 8816463.

commands/command_filter_process.go
+ status = delayedStausFromErr(err)
+ } else {
+ status = statusFromErr(err)
+ }
@technoweenie

technoweenie Aug 16, 2017

Owner

IMO, this nested if is complicated. Would something like this work? It looks like calling Flush() on nil is a no-op any way.

if delayed {
  s.WriteStatus(delayedStausFromErr(err))
} else if ferr := w.Flush(); ferr != nil {
  s.WriteStatus(statusFromErr(ferr))
} else {
  s.WriteStatus(statusFromErr(err))
}

Continuing to declare status outside of that scope is fine too, but it didn't look like it was needed anywhere else.

BTW delayedStausFromErr has a typo :)

@ttaylorr

ttaylorr Aug 16, 2017

Owner

Way better -- thanks for this suggestion. Changed things up in 1001e1c (and I got that typo!)

commands/command_smudge.go
+//
+// delayedSmudge returns the number of bytes written, whether the checkout was
+// delayed, the *lfs.Pointer that was smudged, and an error, if one occurred.
+func delayedSmudge(s *git.FilterProcessScanner, to io.Writer, from io.Reader, tq *tq.TransferQueue, filename string, skip bool, filter *filepathfilter.Filter) (int64, bool, *lfs.Pointer, error) {
@technoweenie

technoweenie Aug 16, 2017

Owner

The tq parameter clobbers the tq package name. Not critical of course, but may be worth changing.

@ttaylorr

ttaylorr Aug 16, 2017

Owner

Good call -- I think we should be fine for now, since we don't use any other exported members from the tq package, but this will be a good way to prevent an ambiguous diff later on if that changes. Covered this one in: a9942cd.

ttaylorr added some commits Aug 16, 2017

+ if !ok {
+ return ts
+ }
+ return append(ts, t)
@technoweenie

technoweenie Aug 16, 2017

Owner

Does this part need to exist? If ts is empty, the for should start another iteration that'll attempt to read from the channel. Very possible there's some detail about select that I'm forgetting...

@ttaylorr

ttaylorr Aug 17, 2017

Owner

I think so, though the duplication is unfortunate. Specifically, this return append, could be turned into an append only, but the blocking <-ch read causes the enclosing for loop to not spin.

@technoweenie

technoweenie Aug 17, 2017

Owner

Jeez, the return makes all the difference in the world. Thanks.

ttaylorr added some commits Aug 17, 2017

test/test-track.sh
@@ -29,6 +29,10 @@ begin_test "track"
mkdir -p a/b
+ if [ ! -d .git/info ]; then
+ mkdir .git/info
+ fi
@technoweenie

technoweenie Aug 17, 2017

Owner

FWIW I think you could replace this with mkdir -p .git/info.

@ttaylorr

ttaylorr Aug 17, 2017

Owner

Good call -- I got this one in: bc376b8.

Really excited to see this land!

ttaylorr added some commits Aug 17, 2017

Contributor

larsxschneider commented Aug 20, 2017

As suggested by @ttaylorr, I added a progress indicator for filtering content in Git and posted the patch to the Git mailing list.

If you clone a repo it would look like this:

$ git clone https://server/some/repo.git
Cloning into 'repo'...
remote: Counting objects: 6077, done.
remote: Total 6077 (delta 0), reused 0 (delta 0), pack-reused 6077
Receiving objects: 100% (6077/6077), 6.00 MiB | 1.24 MiB/s, done.
Resolving deltas: 100% (4177/4177), done.
Filtering content: 100% (62/62), 7.63 MiB | 304.00 KiB/s, done.

If you checkout a commit that references many LFS files that you don't have locally on your machine yet, then you would see something like this:

$ git checkout mybranch
Filtering content: 100% (62/62), 7.63 MiB | 312.00 KiB/s, done.
Switched to branch 'mybranch'

A nice side effect is that users can easily see the bandwidth performance of their LFS servers 🙂

Contributor

larsxschneider commented Aug 20, 2017

There might be a bug in the current (7115b57) delay implementation. Try these steps:

(1) Checkout a branch of a repo with files in LFS
(2) Delete all files in that repo except for the .git folder
(3) Run git checkout -f HEAD

The checkout operation should be blazingly fast as all objects should be available locally already. However, it looks like as if Git LFS downloads all objects, again.

Owner

ttaylorr commented Aug 20, 2017 edited

The checkout operation should be blazingly fast as all objects should be available locally already. However, it looks like as if Git LFS downloads all objects, again.

@larsxschneider great catch, this was definitely a bug. With c04745c:

~/D/git-lfs-test (master) $ find .git/lfs/objects -type f | wc -l
     187

~/D/git-lfs-test (master) $ rm -rf ./000 ./001 ./002 ./003 ./004 ./005 ./006 ./007 ./008 ./009 ./010 ./script

~/D/git-lfs-test (master!) $ time git checkout -f HEAD
Your branch is up-to-date with 'origin/master'.
git checkout -f HEAD  0.05s user 0.07s system 84% cpu 0.146 total

EDIT: the above commit had a bug where it would not smudge pointers for which the local object was cached. I fixed that in: 50ecd67.

commands/command_smudge.go
@@ -63,6 +63,15 @@ func delayedSmudge(s *git.FilterProcessScanner, to io.Writer, from io.Reader, q
q.Add(filename, path, ptr.Oid, ptr.Size)
return 0, true, ptr, nil
}
+
+ // Write 'statusFromErr(nil)', even though 'perr != nil', since
+ // we are about to write non-delayed smudged contents to "to".
@larsxschneider

larsxschneider Aug 20, 2017

Contributor

The fix seems to work! Thank you 👍! However, I still try to decipher the comment 😄. perr should be nil here, no? TBH I don't really understand the same comment a few lines above either. Can you give me a hint?

@larsxschneider

larsxschneider Aug 20, 2017

Contributor

Plus, could it be that this was the reason for the degraded performance in the lfs.concurrenttransfers=1 case?

@ttaylorr

ttaylorr Aug 21, 2017

Owner

Ah, good question -- I think the comment is out of date but the code is correct here: this case occurs when we have the contents locally and can smudge without delay. It is certainly possible that this could affect cases where lfs.concurrenttransfers=1, but I think this would be negligible. I fixed this up in 4cefa24.

Contributor

larsxschneider commented Aug 20, 2017

Cloning repositories via SSH with a large number of LFS files seems to consistently result in a "batch response: Bad credentials" error:

$ git clone git@github.com:larsxschneider/lfstest-manyfiles.git testrepo
Cloning into 'testrepo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.57 MiB/s, done.
Downloading 1/test11791.bin (6 B)00), 17.69 KiB | 0 bytes/s
Error downloading object: 1/test11791.bin (c5a39d8): Smudge error: Error downloading 1/test11791.bin (c5a39d8045fd604bdcf307bbe06e7e41dd59a2399dc2a1807e4a6e3dfef5148b): batch response: Bad credentials

Errors logged to /Users/lars/Temp/lfstest/testrepo/.git/lfs/objects/logs/20170820T190408.392912345.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: 1/test11791.bin: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

Plus, in case of an error Git LFS prints Downloading 1/test11791.bin (6 B) over the Filtering content: 100% (62/62), 7.63 MiB | 312.00 KiB/s progress.
Can/should we disable the Downloading... output in Git LFS?

Owner

technoweenie commented Aug 20, 2017

Cloning repositories via SSH with a large number of LFS files seems to consistently result in a "batch response: Bad credentials" error:

Hmm, I wouldn't think this PR would affect that. It's probably an unrelated issue. Can you try with an HTTPS remote (just to confirm whether this is a general transfer queue issue, an ssh auth issue, or a delay filter issue)?

As for the SSH thing, my hunch is that the expiration timeout is not being respected. Is this happening at around the 10 minute mark?

$ ssh git@github.com git-lfs-authenticate github/git-lfs-test download
{
  "href": "https://lfs.github.com/github/git-lfs-test",
  "header": {
    "Authorization": "RemoteAuth booya"
  },
  "expires_at": "2017-08-20T18:10:12Z",
  "expires_in": 599
}

Two potential spots that I can think of:

  • Does the Batch API try to refresh the Authorization header after it expires? I thought it recognizes that and automatically retries. Maybe there's a race condition, and it should refresh the token if expires_in is < 5 or something.
  • Does the SSH cred cacher respect the expires_in? It should be falling back to a real ssh call after expires_in seconds have passed.

Can/should we disable the Downloading... output in Git LFS?

I think we still need some output for legacy smudges. The filter should disable them (or even implement its own progress output).

Contributor

larsxschneider commented Aug 20, 2017

@technoweenie it seems to work without issues for HTTPS remotes and the SSH error happens exactly at the 10min mark.

Owner

technoweenie commented Aug 20, 2017

it seems to work without issues for HTTPS remotes and the SSH error happens exactly at the 10min mark.

Got it. For this PR, let's focus on the delay filter itself if possible. I've opened up #2519 as a place to discuss and fix the SSH timeout issue you experienced. Thanks for bringing it up :)

commands/command_filter_process.go
- s.WriteStatus(statusFromErr(nil))
+ if !(req.Header["command"] == "smudge" && req.Header["can-delay"] == "1") && !(req.Header["command"] == "list_available_blobs") {
+ s.WriteStatus(statusFromErr(nil))
+ }
@larsxschneider

larsxschneider Aug 24, 2017

Contributor

Nit: I feel this makes the protocol a bit hard to read.
I would rather add s.WriteStatus(statusFromErr(nil)) in the switch statement twice below.

@ttaylorr

ttaylorr Aug 24, 2017

Owner

I agree, changed in: 8c9a180.

+// 2. There is one item available, or ...
+// 3. The 'tq.TransferQueue' is completed.
+func readAvailable(ch <-chan *tq.Transfer) []*tq.Transfer {
+ ts := make([]*tq.Transfer, 0, 100)
@larsxschneider

larsxschneider Aug 24, 2017

Contributor

I am curious, why do we initalize the array with 100 items?

@ttaylorr

ttaylorr Aug 24, 2017

Owner

Since the transfer queue has a batch size of 100, and Watch() is buffered at 100 items, all 100 could be ready at once. This is done to prevent growing the array.

@larsxschneider

larsxschneider Aug 25, 2017

Contributor

Wouldn't it make sense to use a constant here?

commands/command_filter_process.go
+ if err != nil && err != io.EOF {
+ return "error"
+ }
+ return "delayed"
@larsxschneider

larsxschneider Aug 24, 2017

Contributor

nit: I wouldn't duplicate the error code here. how about:

func delayedStatusFromErr(err error) string {
    status := statusFromErr(err)
    if status == "success" {
        return "delayed"
    }
    return status
}
@ttaylorr

ttaylorr Aug 24, 2017

Owner

Good call. I was hesitant to write this in the way you suggested originally, because it binds the implementation of this function to the protocol having the value "success". That said, I think this change is well motivated, and I am going to follow up with a future pull request to make these values non protocol-specific.

+//
+// delayedSmudge returns the number of bytes written, whether the checkout was
+// delayed, the *lfs.Pointer that was smudged, and an error, if one occurred.
+func delayedSmudge(s *git.FilterProcessScanner, to io.Writer, from io.Reader, q *tq.TransferQueue, filename string, skip bool, filter *filepathfilter.Filter) (int64, bool, *lfs.Pointer, error) {
@larsxschneider

larsxschneider Aug 24, 2017

Contributor

nit: line break?

@ttaylorr

ttaylorr Aug 24, 2017

Owner

Even though this is a long line, I can't think of any other function signatures that we wrap over multiple lines in LFS. For the sake of consistency, I'm going to leave this here. This long line has the added benefit of pointing out a smell that should be fixed.

test/test-malformed-pointers.sh
@@ -83,7 +83,9 @@ begin_test "empty pointers"
pushd .. >/dev/null
clone_repo "$reponame" "$reponame-assert"
- [ "0" -eq "$(grep -c "empty.dat" clone.log)" ]
+ # The below assertion is known to be broken on Git v2.14.next.
@larsxschneider

larsxschneider Aug 24, 2017

Contributor

Why is that? Does this require further investigation?

@larsxschneider

larsxschneider Aug 24, 2017

Contributor

Plus, we could switch back to test master, again, no?

@ttaylorr

ttaylorr Aug 24, 2017

Owner

@larsxschneider I'm not sure, but it appears that this is fixed on 'master', whatever was causing it, so I removed the assertion skip in 8bd7aac.

@ttaylorr

ttaylorr Aug 24, 2017

Owner

I'll follow up in a separate PR to remove the test on the 'next' branch.

I added a few nits that you can address if you want (most relevant from my point of view is the delayedStatusFromErr suggestion).

I am happy to see that merged 👍

ttaylorr added some commits Aug 24, 2017

Owner

ttaylorr commented Aug 24, 2017

🙇 thanks for the review, everybody!

@ttaylorr ttaylorr merged commit ef961a1 into master Aug 24, 2017

4 checks passed

GitHub CLA @ttaylorr has accepted the GitHub Contributor License Agreement.
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ttaylorr ttaylorr deleted the delay branch Aug 24, 2017

jjgod commented Sep 14, 2017 edited

What git version do I need for this?

~$ git lfs version
git-lfs/2.3.0 (GitHub; darwin amd64; go 1.8.3; git 4dd2bf73)
~$ git --version
git version 2.14.1

Yet it doesn't seem to work, no concurrent transfers when I do git clone .

For cloning the same repository:

git lfs clone:

real	0m28.256s
user	0m24.853s
sys	0m9.224s

git clone:

real	2m5.616s
user	0m15.168s
sys	0m10.439s
Contributor

larsxschneider commented Sep 15, 2017

@jjgod You need the upcoming Git version 2.15 😄
See https://twitter.com/kit3bus/status/908579232422486016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment