
Loosening the branching constraint #3

Open
christopher-beckham opened this issue Apr 8, 2019 · 2 comments
@christopher-beckham

Hi,

I have a question about how branching works here, and whether it is possible to play around with this constraint. I'll give an example.

Suppose I have some experiment experiment1 with the hash aaabbb. When I run this experiment, it periodically dumps checkpoints and output files to, say, $ROOT/aaabbb/results. Suppose I have already trained this experiment for 50 epochs (so the results folder contains a checkpoint file for epoch 50).

Now suppose over the next few days I commit various changes to the code. These changes could affect experiment aaabbb very little or not at all (e.g. I add new files irrelevant to that experiment, or change the way metrics are logged). If I try to resume experiment aaabbb and train it for another 50 epochs, kleio won't let me, since the environment has changed. If I instead resume aaabbb with something like --allow-any-change, it will simply branch off that experiment (say the branched experiment is called cccddd) and start over. The experiment is then actually run again from scratch, since cccddd is definitely not going to inherit the checkpoint file from aaabbb. This leaves us with a few options:

  • (1) Allow the user to sin and let them resume aaabbb without branching (but maybe, for peace of mind, somehow flag these as 'dirty' experiments so the user can separate them from the truly reproducible ones ;) )
  • (2) Force the user to resume aaabbb outside of kleio. But then you have to account for the fact that kleio sets the working directory to $ROOT/<id> instead of $ROOT. (I don't really like this one.)
  • (3) Have the branching code copy the contents of $ROOT/<id> to $ROOT/<id_branched>, though this might use a lot of disk space if $ROOT/<id> already contains a ton of stuff (like big checkpoint files).
  • (4) Modify the command line of the branched experiment to add something like --resume_from=$ROOT/aaabbb/epoch_50.pkl. Maybe this is a reasonable one? I assume it is already possible. Though (1) would definitely be the most convenient for me (I may not always want to spread my experiment over multiple branches).
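To make option (4) concrete, here is a minimal sketch of what such a flag could look like on the user side, using argparse. The flag name --resume_from and the checkpoint path are purely illustrative, not an existing kleio feature:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag for option (4); nothing in kleio defines this today.
parser.add_argument('--resume_from', type=str, default=None,
                    help='checkpoint file to warm-start the branched experiment from')

# Simulated command line for the branched experiment cccddd.
args = parser.parse_args(['--resume_from', '$ROOT/aaabbb/epoch_50.pkl'])
if args.resume_from is not None:
    print('Would resume from: {}'.format(args.resume_from))
```

The training script would then load the given checkpoint before its first epoch, instead of initializing from scratch.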

Thanks!

@bouthilx
Member

bouthilx commented Apr 8, 2019

Oh, but reusing the checkpoint is precisely one of the main goals of the branching structure. How do you save your checkpoint files? You should use Kleio to do that, so you can seamlessly load a branched trial and work on it as if it were a single trial from time 0 to T. For example, say you saved your checkpoint in aaabbb with:

import io

import torch

from kleio.client.logger import kleio_logger

file_like_object = io.BytesIO()
torch.save(dict(model=model.state_dict(), optimizer=optimizer.state_dict()),
           file_like_object)
file_like_object.seek(0)
# Here `epoch` and `iteration` are just metadata saved with the checkpoint to help
# querying later. You can add any key that will be saved as metadata, as long as
# it is lightweight and serializable by bson. The large object is `file_like_object`.
kleio_logger.log_artifact(filename, file_like_object, epoch=epoch, iteration=iteration)

This means the file used for the checkpoint is indexed in the trial as an artifact, and you can later access it using kleio. It also means that any trial branched from this one will inherit this artifact, provided it is branched after the artifact is logged (you can branch a trial at any time, not just at the end of execution). Say you now execute it in a different environment using --allow-any-change, resulting in a new trial with id cccddd. From this trial, your code should attempt resumption by querying the artifacts. Since it is a branched trial, this query has access to any artifact of the prior trial aaabbb logged before execution time t.

artifacts = kleio_logger.load_artifacts(filename, {})

# Iterate over it to get the last artifact without explicitly building a list
artifact = None
for artifact in artifacts:
    continue

if artifact is None:
    print("No artifacts found")
    return None  # this snippet is assumed to live inside a resume function

# `epoch`, `iteration` and the other keys are fetched as metadata here
file_like_object, metadata = artifact
# Use `.download()` to actually read the local file (or download from mongodb.gridfs)
state_dict = torch.load(file_like_object.download())
model.load_state_dict(state_dict['model'])
optimizer.load_state_dict(state_dict['optimizer'])
print('Resuming from epoch: {}'.format(metadata['epoch']))
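The file-like-object mechanics in the two snippets above can be mimicked with the standard library alone; here is a sketch using pickle in place of torch.save, with no Kleio or torch involved, just to show the save/seek/load round trip:

```python
import io
import pickle

# Stand-in for a model/optimizer state; torch.save and torch.load work the
# same way on file-like objects.
state = dict(model={'weight': [1.0, 2.0]}, optimizer={'lr': 0.01})

buffer = io.BytesIO()
pickle.dump(state, buffer)   # serialize into the in-memory buffer
buffer.seek(0)               # rewind before handing the buffer to a reader

restored = pickle.load(buffer)
assert restored == state     # the round trip preserves the state dict
```

The seek(0) is the easy part to forget: without it, the reader starts at the end of the buffer and finds nothing to load.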

Note that if aaabbb were later executed again in its original context (implying no new branch), then any new artifact logged in aaabbb would not be accessible from cccddd. This is because the latter was branched at time t of aaabbb's execution and can only access logs from before that time.

One nice thing about this is that you could, for instance, revert your code and run aaabbb for a few more epochs to compare against cccddd and verify that the modifications had no impact, all without rerunning anything from scratch. That is only possible, however, if your code is well seeded and thus perfectly reproducible.
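That reproducibility caveat boils down to seeding every RNG before training starts. A minimal stdlib-only sketch of the idea (in a torch project you would additionally call torch.manual_seed(seed), seed numpy, and set torch.backends.cudnn.deterministic = True):

```python
import random

def seed_everything(seed: int) -> None:
    # Seed Python's RNG; a real torch project would also seed numpy and torch
    # here so that model init, dropout, and data shuffling are all repeatable.
    random.seed(seed)

seed_everything(42)
run_a = [random.random() for _ in range(3)]
seed_everything(42)
run_b = [random.random() for _ in range(3)]
assert run_a == run_b  # identical draws, so a rerun of aaabbb is comparable
```

With seeding like this in place, rerunning aaabbb after a revert should trace exactly the same trajectory, making the comparison against cccddd meaningful.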

@christopher-beckham
Author

Oh I see! You already take care of this issue using artifacts! Ok, thanks, I'll try this out and get back to you.
