What do I need K.clear_session() and del model for (Keras with Tensorflow-gpu)?

Python | Tensorflow | Memory Management | Keras

Python Problem Overview


What I am doing
I am training and using a convolutional neural network (CNN) for image classification, using Keras with Tensorflow-gpu as the backend.

What I am using

  • PyCharm Community 2018.1.2
  • Python 2.7 and 3.5 (but not both at the same time)
  • Ubuntu 16.04
  • Keras 2.2.0
  • Tensorflow-GPU 1.8.0 as backend

What I want to know
In a lot of code, I see people using

from keras import backend as K 

# Do some code, e.g. train and save model

K.clear_session()

or deleting the model after using it:

del model

Regarding clear_session, the Keras documentation says: "Destroys the current TF graph and creates a new one. Useful to avoid clutter from old models / layers." - https://keras.io/backend/

What is the point of doing that, and should I do it as well? When I load or create a new model, my model gets overwritten anyway, so why bother?

Python Solutions


Solution 1 - Python

K.clear_session() is useful when you're creating multiple models in succession, such as during hyperparameter search or cross-validation. Each model you train adds nodes (potentially numbering in the thousands) to the graph. TensorFlow executes the entire graph whenever you (or Keras) call tf.Session.run() or tf.Tensor.eval(), so your models will become slower and slower to train, and you may also run out of memory. Clearing the session removes all the nodes left over from previous models, freeing memory and preventing slowdown.
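A minimal sketch of that pattern (a hypothetical hyperparameter search; x_train, y_train, x_val and y_val are assumed to already exist, using the Keras 2.x / TF 1.x API from the question):

from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

for units in [32, 64, 128]:  # try several candidate layer sizes
    model = Sequential([
        Dense(units, activation='relu', input_shape=(100,)),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x_train, y_train, epochs=5, verbose=0)
    loss = model.evaluate(x_val, y_val, verbose=0)
    print('units={}: validation loss {:.4f}'.format(units, loss))
    del model          # drop the Python reference to the model
    K.clear_session()  # remove the leftover graph nodes before the next model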


Edit 21/06/19:

TensorFlow is lazily evaluated by default. TensorFlow operations aren't evaluated immediately: creating a tensor or applying operations to it creates nodes in a dataflow graph. The results are calculated by evaluating the relevant parts of the graph in one go when you call tf.Session.run() or tf.Tensor.eval(). This is so TensorFlow can build an execution plan that assigns operations which can be performed in parallel to different devices. It can also fold adjacent nodes together or remove redundant ones (e.g. if you concatenated two tensors and later split them apart again unchanged). For more details, see https://www.tensorflow.org/guide/graphs
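A tiny illustration of that lazy evaluation, using the TF 1.x graph API the answer describes:

import tensorflow as tf

a = tf.constant([1.0, 2.0])
b = tf.constant([3.0, 4.0])
c = a * b  # nothing is computed yet; this only adds a node to the graph

with tf.Session() as sess:
    print(sess.run(c))  # the graph is evaluated here, printing [3. 8.]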

All of your TensorFlow models are stored in the graph as a series of tensors and tensor operations. The basic operation of machine learning is the tensor dot product: the output of a neural network is the dot product of the input matrix and the network weights. If you have a single-layer perceptron and 1,000 training samples, then each epoch creates at least 1,000 tensor operations. If you have 1,000 epochs, then your graph contains at least 1,000,000 nodes at the end, before taking into account preprocessing, postprocessing, and more complex models such as recurrent nets, encoder-decoders, attention models, etc.
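You can watch the nodes pile up with tf.get_default_graph().get_operations(); here is a contrived TF 1.x sketch where each loop iteration adds ops to the default graph that are never cleaned up:

import tensorflow as tf

g = tf.get_default_graph()
x = tf.placeholder(tf.float32, shape=(None, 3))
for i in range(3):
    y = tf.reduce_sum(x * 2.0)  # each call adds fresh nodes to the graph
    print('after iteration {}: {} ops'.format(i + 1, len(g.get_operations())))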

The problem is that eventually the graph would be too large to fit into video memory (6 GB in my case), so TF would shuttle parts of the graph from video to main memory and back. Eventually it would even get too large for main memory (12 GB) and start moving between main memory and the hard disk. Needless to say, this made things incredibly, and increasingly, slow as training went on. Before developing this save-model/clear-session/reload-model flow, I calculated that, at the per-epoch rate of slowdown I experienced, my model would have taken longer than the age of the universe to finish training.

> Disclaimer: I haven't used TensorFlow in almost a year, so this might have changed. I remember there being quite a few GitHub issues around this so hopefully it has since been fixed.

Solution 2 - Python

del deletes a variable in Python, and since model is a variable, del model removes it, but the TF graph is unchanged (TF is your Keras backend). That said, K.clear_session() destroys the current TF graph and creates a new one. Creating a new model may look like an independent step, but don't forget the backend :)
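A short sketch of the difference (assuming Keras 2.x with the TF 1.x backend, where Keras builds its layers in TF's default graph):

import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(4, input_shape=(8,))])
print(len(tf.get_default_graph().get_operations()))  # > 0: the model's ops live in the graph

del model  # removes the Python reference only
print(len(tf.get_default_graph().get_operations()))  # unchanged: the graph still holds the ops

K.clear_session()  # destroys the graph itself and starts a fresh one
print(len(tf.get_default_graph().get_operations()))  # 0: new, empty graph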

Solution 3 - Python

During cross-validation, I wanted to run number_of_replicates folds (a.k.a. replicates) to get an average validation loss as a basis for comparison to another algorithm. So I needed to perform cross-validation for two separate algorithms, and I have multiple GPUs available, so I figured this would not be a problem.

Unfortunately, I started seeing layer names getting things like _2, _3, etc. appended to them in my loss logs. I also noticed that if I ran through the replicates (a.k.a. folds) sequentially using a loop in a single script, I ran out of memory on the GPUs.

This strategy worked for me; I have been running for hours on end now in tmux sessions on an Ubuntu lambda machine, sometimes seeing memory leaks, but the leaking processes are killed off by a timeout. It requires estimating the length of time each cross-validation fold/replicate could take to complete; in the code below that number is timeEstimateRequiredPerReplicate (it is best to double the number of trips through the loop in case half of them get killed off):

from multiprocessing import Process

# establish target for process workers
def machine():
    import tensorflow as tf
    from tensorflow.keras.backend import clear_session

    from tensorflow.python.framework.ops import disable_eager_execution
    import gc

    clear_session()

    disable_eager_execution()  
    nEpochs = 999 # set lower if not using tf.keras.callbacks.EarlyStopping in callbacks
    callbacks = ... # establish early stopping, logging, etc. if desired

    algorithm_model = ... # define layers, output(s), etc.
    opt_algorithm = ... # choose your optimizer
    loss_metric = ... # choose your loss function(s) (in a list for multiple outputs)
    algorithm_model.compile(optimizer=opt_algorithm, loss=loss_metric)

    trainData = ... # establish which data to train on (for this fold/replicate only)
    validateData = ... # establish which data to validate on (same caveat as above)
    algorithm_model.fit(
        x=trainData,
        steps_per_epoch=len(trainData),
        validation_data=validateData,
        validation_steps=len(validateData),
        epochs=nEpochs,
        callbacks=callbacks
    )

    del algorithm_model  # drop the reference before collecting garbage
    gc.collect()

    return


# establish main loop to start each process
def main_loop():
    # replicatesDesired, replicatesCompleted and timeEstimateRequiredPerReplicate
    # are assumed to be defined earlier in the script
    for replicate in range(replicatesDesired - replicatesCompleted):
        print(
            '\nStarting cross-validation replicate {} '.format(
                replicate +
                replicatesCompleted + 1
            ) +
            'of {} desired:\n'.format(
                replicatesDesired
            )
        )
        p = Process(target=machine)  # target is the worker function defined above
        p.start()
        # join() always returns None, so it cannot report an exit code;
        # wait up to the time estimate, then kill the worker if it is
        # still running (e.g. stuck after a memory leak)
        p.join(timeEstimateRequiredPerReplicate)
        if p.is_alive():
            p.terminate()
            p.join()
        print('\n\nSubprocess exited with code {}.\n\n'.format(p.exitcode))
    return


# enable running of this script from command line
if __name__ == "__main__":
    main_loop()
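A note on why the subprocess approach helps: when the child process exits (or is terminated by the timeout), the operating system reclaims all of its memory, including the GPU memory held by its CUDA context, so nothing can leak from one fold into the next.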

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | benjamin | View Question on Stackoverflow
Solution 1 - Python | Chris Swinchatt | View Answer on Stackoverflow
Solution 2 - Python | user10595337 | View Answer on Stackoverflow
Solution 3 - Python | brethvoice | View Answer on Stackoverflow