Submodule optimization by novalis · Pull Request #4016 · libgit2/libgit2 · GitHub
Skip to content

Submodule optimization#4016

Merged
ethomson merged 3 commits into
libgit2:masterfrom
novalis:submodule-optimization
Jan 21, 2017
Merged

Submodule optimization#4016
ethomson merged 3 commits into
libgit2:masterfrom
novalis:submodule-optimization

Conversation

@novalis

@novalis novalis commented Nov 24, 2016

Copy link
Copy Markdown
Contributor

The bulk of this work was done by @bpeabody. I just made some minor adjustments. This fixes #3756.

@novalis

novalis commented Nov 28, 2016

Copy link
Copy Markdown
Contributor Author

@novalis novalis force-pushed the submodule-optimization branch from 5d3de99 to beeaf92 Compare November 28, 2016 20:05
@novalis

novalis commented Nov 28, 2016

Copy link
Copy Markdown
Contributor Author

I just did a no-op amend and now Travis passes. AppVeyor is still bad, but as I noted, it frequently fails this way.

@novalis novalis force-pushed the submodule-optimization branch from beeaf92 to 89d0deb Compare November 29, 2016 16:13
@novalis

novalis commented Nov 29, 2016

Copy link
Copy Markdown
Contributor Author

Actually, after talking to some colleagues, I decided to rename the functions. No substantive changes.

@novalis novalis force-pushed the submodule-optimization branch from 89d0deb to deac747 Compare December 1, 2016 16:27
@novalis

novalis commented Dec 2, 2016

Copy link
Copy Markdown
Contributor Author

BTW this fixes #3756

@ethomson

ethomson commented Dec 4, 2016

Copy link
Copy Markdown
Member

I mentioned in slack that this is probably not in scope for this release, which we're just finishing up.

There's not a lot of documentation around how one might want to use this - when would I, as a consumer, need to clear the cache? Would I then benefit from repopulating it?

I think that @carlosmn asked in #3756 - why can this not be a thing that the library handles for you? Why would I, as a consumer, have better knowledge of when caching should happen than the library itself?

@bpeabody

bpeabody commented Dec 4, 2016

Copy link
Copy Markdown

@ethomson I can think of a couple of ways the library could handle it for you.

  • The library could do caching automatically, invalidating it as necessary, etc.
  • Library operations (such as the diff that happens in the beginning of a rebase) that will access all submodules could be changed to load them in a batch operation so that caching isn't necessary.

@bpeabody

bpeabody commented Dec 4, 2016

Copy link
Copy Markdown

Additionally, we could probably separate into a different pull request the performance fixes that don't involve a cache (e.g., s.t. all_submodules isn't O(N^2)).

@novalis

novalis commented Dec 5, 2016

Copy link
Copy Markdown
Contributor Author

Looking at the automatic caching idea:

  1. The idea of manual caching is that the user sometimes knows better than libgit2 when it is safe to use the cache.

In the absence of the cache, each git_submodule_lookup checks the index, HEAD, and the working directory. The index check does not usually involve any i/o (because it is cached). The HEAD and wd checks do. This is somewhat inconsistent: why do we assume that HEAD could have changed out from under us, but not make that assumption about the index?

User code (e.g our git-meta, via nodegit) might know that it's going to open a repository once, do a thing, and then close it. So it's OK for that code to turn on caching as soon as it opens the module.

But the Atom editor (for instance) might not be wiling to make that assumption; maybe at some point you go in and checkout a new branch from the command-line. (I don't actually know if Atom uses this code -- it's just an example). And at that point, the cache needs invalidation.

  1. More generally, git_submodule_lookup is a user-visible function. Imagine that it did do automatic caching, with 100% correct invalidation (modulo the index issue mentioned above). This would mean most every call to `git_submodule_lookup would have to do six iops (open HEAD, read HEAD, open refs/heads/master, read refs/heads/master, stat .gitmodules (to check whether it's clean), stat the submodule's wd path). The only thing the cache would be saving would be parsing .gitmodules, which admittedly is somewhat expensive. Which is more expensive depends on the size of the gitmodules file and the ratio of disk speed to cpu/memory speed.

  2. We could also have git_submodule_lookup not expose a user-visible way of manipulating the cache -- instead, ordinary user calls to git_submodule_lookup would be uncached. But internal calls (e.g. from git_diff or git_checkout_index could first initialize the cache, then do their multiple calls, then tear down the cache. But this would not help users who do multiple submodule lookups on their own.

@novalis

novalis commented Dec 15, 2016

Copy link
Copy Markdown
Contributor Author

My conclusion from the above, btw, is that automatic caching doesn't make sense and that we ought to just merge this as-is.

@ethomson

Copy link
Copy Markdown
Member

Thanks for the writeup @novalis . I think that makes sense. I have some review comments to make but I think the overall strategy makes sense.

Comment thread src/submodule.c Outdated
location, NULL, NULL, NULL, sm, GIT_SUBMODULE_IGNORE_ALL);
}

int git_repository_submodule_cache_all(git_repository *repo)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't put git_repository functions in a file that is not repository.c. Either: 1) these should either go in repository.c, or 2) they should be git_submodule_cache_all and git_submodule_cache_clear. The latter is obviously more brief, but I think (1) makes more sense.

Comment thread include/git2/repository.h Outdated
*
* @param repo the repository whose submodules will be cached.
*/
GIT_EXTERN(int) git_repository_submodule_cache_all(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no indication of when somebody should call cache_clear. You need to describe what operations would cause a cache coherency problem; it's not at all obvious to the reader and I suspect it's wildly dangerous to call cache_all and then (say) check out a new branch that would update a submodule? I'm just guessing, I don't know, and I'd like this to expand upon that.

I also think that you should move this into include/git2/sys/repository.h. This seems to me to be a cache that you have to manage rather carefully, and the sys/ directory is for the things that will hurt you badly if you get wrong.

Comment thread src/submodule.c Outdated
static void free_submodule_names(git_strmap *names)
{
git_buf *name = 0;
if (0 == names)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't reverse these tests. I know that many people have gotten into a habit of this and find it a good practice, but please follow our existing code style.

Please don't use 0 for pointers.

This should be if (names == NULL).

Comment thread src/submodule.c Outdated

assert(repo && name);

if (0 != repo->submodule_cache) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (repo->submodule_cache != NULL), please.

Comment thread src/submodule.c Outdated
{
git_submodule *sm;
assert(repo);
if (0 == repo->submodule_cache) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (repo->submodule_cache == NULL)...

Comment thread tests/submodule/lookup.c Outdated
git_repository_submodule_cache_all(g_repo);
cl_assert(0 == git_submodule_lookup(&sm, g_repo, "sm_unchanged"));
cl_assert(0 == git_submodule_lookup(&sm2, g_repo, "sm_unchanged"));
cl_assert(sm == sm2);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using cl_assert is not ideal here, since it just will say "an assertion failed". We have a variety of helper functions that provide more explicit messages, which are helpful in debugging and understanding test failures:

For git functions, please use cl_git_pass, which will show the error messages properly when a function fails and returns nonzero. (eg, cl_git_pass(git_submodule_lookup...))

To compare two pointers, please use cl_assert_equal_p which will show more information in a nice format about the actual and expected values.

Comment thread tests/submodule/lookup.c Outdated

/* and that we get new objects again after clearing the cache. */
git_repository_submodule_cache_clear(g_repo);
cl_assert(0 == git_submodule_lookup(&sm2, g_repo, "sm_unchanged"));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, too, this should be cl_git_pass(...).

@ethomson

Copy link
Copy Markdown
Member

I have some comments. Note that I haven't looked at the submodule logic to know if that was just changed whitespace or if there are substantive changes, too. I'll have to do that before we merge, unless you revert it. (Mixing whitespace changes with actual code changes is very hard to follow in many diff tools.)

@novalis

novalis commented Dec 16, 2016

Copy link
Copy Markdown
Contributor Author

I have a separate commit for the whitespace changes. Are you seeing other whitespace changes outside that commit?

@novalis novalis force-pushed the submodule-optimization branch from deac747 to 1f5c2d0 Compare December 16, 2016 17:56
@ethomson

Copy link
Copy Markdown
Member

I have a separate commit for the whitespace changes.

Yep, thanks, that's fine then; I was looking at this in the GitHub web view and didn't realize that.

@ethomson ethomson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, after giving this a final look, I noticed two more very minor issues. I hope you don't mind fixing these real quick so that I can merge it; apologies that I didn't catch these the first time around. Thanks!

Comment thread src/submodule.c Outdated
/* refresh the HEAD OID */
if (submodule_update_head(sm) < 0)
return -1;
if (0 == sm->repo->submodule_cache) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also reverse this test?

Comment thread src/submodule.c Outdated
} lfc_data;

static int all_submodules(git_repository *repo, git_strmap *map)
int all_submodules(git_repository *repo, git_strmap *map)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this is public (within the library anyway), this should have a name that indicates it's internal to the library. Something like git_submodule__all or git_submodule__map, whichever makes more sense to you.

@novalis novalis force-pushed the submodule-optimization branch from 1f5c2d0 to 4e9aca8 Compare January 18, 2017 20:21
novalis and others added 3 commits January 20, 2017 17:33
Signed-off-by: David Turner <dturner@twosigma.com>
Added `git_repository_submodule_cache_all` to initialze a cache of
submodules on the repository so that operations looking up N
submodules are O(N) and not O(N^2).  Added a
`git_repository_submodule_cache_clear` function to remove the cache.

Also optimized the function that loads all submodules as it was itself
O(N^2) w.r.t the number of submodules, having to loop through the
`.gitmodules` file once per submodule.  I changed it to process the
`.gitmodules` file once, into a map.

Signed-off-by: David Turner <dturner@twosigma.com>
`git_submodule_status` is very slow, bottlenecked on
`git_repository_head_tree`, which it uses through `submodule_update_head`.  If
the user has requested submodule caching, assume that they want this status
cached too and skip it.

Signed-off-by: David Turner <dturner@twosigma.com>
@novalis novalis force-pushed the submodule-optimization branch from 4e9aca8 to 673dff8 Compare January 20, 2017 22:34
@novalis

novalis commented Jan 21, 2017

Copy link
Copy Markdown
Contributor Author

OK, I think I've addressed all comments; there are no merge conflicts; the tests are green.

@ethomson

Copy link
Copy Markdown
Member

Thanks for doing this @bpeabody and @novalis - and thank you especially for putting up with my laziness and bikeshedding. 😅

@ethomson ethomson merged commit 98f5387 into libgit2:master Jan 21, 2017
@bpeabody

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

git_submodule_lookup is O(N) w.r.t. number of submodules

3 participants