[FIXED] Accuracy of same validation dataset differs between last epoch and after fit

Issue

The following code gives a log ending with

Epoch 19/20
1/1 [==============================] - 0s 473ms/step - loss: 1.4018 - accuracy: 0.8750 - val_loss: 1.8656 - val_accuracy: 0.8900
Epoch 20/20
1/1 [==============================] - 0s 444ms/step - loss: 0.5904 - accuracy: 0.8750 - val_loss: 2.1255 - val_accuracy: 0.8700
get_dataset: validation
Found 1000 files belonging to 2 classes.
Using 100 files for validation.
4/4 [==============================] - 1s 81ms/step
eval acc: 0.81

My question is:

Why is the val_accuracy after the last epoch (0.87) different from the eval acc (0.81) after the fit?

In my code, I try to use the same dataset both for the per-epoch validation during fit and for the additional evaluation afterwards.

[Update 1, 2022-07-19:

  1. Obviously, the two accuracy calculations don’t really use the same data. How can I debug which data is actually used? (See the sketch after this list.)
    [Update 3, 2022-07-20: I have followed the data into TensorFlow. The last thing I see is that the x.filenames are equal in Model.evaluate (during fit) and in Model.predict. I did not manage to debug much further, because quick_execute soon evaluates __inference_test_function_248219 and __inference_predict_function_231438, respectively, outside of Python, and their arguments are tensors with dtype=resource, whose contents I cannot see.]
  2. I have deliberately removed my class balancing code to keep my example small. I know that this makes the accuracies less useful, but I don’t care about that for now.
  3. Note that get_dataset('validation') is only called once at the beginning of the fit, not at each epoch.
  4. I have now also set max_queue_size=0, use_multiprocessing=False, workers=0 (as seen here, found via this related SO question about TensorFlow 1), but this did not make the accuracies equal.

]
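
One way to check which data the two calculations actually see (a minimal sketch, assuming the get_dataset helper from the code below) is to iterate the validation dataset twice and compare the label order; with shuffle=True, a tf.data pipeline may be reshuffled on every pass:

import tensorflow as tf

# Iterate the same validation dataset twice and compare the label order.
# If the pipeline is reshuffled between iterations, the two passes differ
# even though the underlying files are identical.
val_dataset = get_dataset('validation')
labels_pass_1 = tf.concat([y for _, y in val_dataset], axis=0)
labels_pass_2 = tf.concat([y for _, y in val_dataset], axis=0)
print('same label order in both passes:',
      bool(tf.reduce_all(labels_pass_1 == labels_pass_2).numpy()))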

Code:

import tensorflow as tf
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing import image_dataset_from_directory
    
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_output = base_model(inputs)
base_model.trainable = False
out = Flatten(name='flat')(base_output)
out = Dense(1, activation='sigmoid')(out)
model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

def get_dataset(subset):
    print('get_dataset:', subset)
    return image_dataset_from_directory(
        'data-nodup-1000',
        labels="inferred",
        label_mode='binary',
        color_mode="rgb",
        image_size=(224, 224),
        shuffle=True,
        seed=1,
        validation_split=0.1,
        subset=subset,
        crop_to_aspect_ratio=False,
    )

model.fit(
    get_dataset('training'),
    steps_per_epoch=1,
    epochs=20,
    validation_data=get_dataset('validation'),
    max_queue_size=0,
    use_multiprocessing=False,
    workers=0,
)

val_dataset = get_dataset('validation')
true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))

[Update 2, 2022-07-19:
I can also reproduce the behavior with the deprecated ImageDataGenerator, using

from tensorflow.keras.applications.resnet50 import preprocess_input
from keras_preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    validation_split=0.1,
)

def get_dataset(subset):
    print('get_dataset:', subset)
    return datagen.flow_from_directory(
        'data-nodup-1000',
        class_mode='binary',
        target_size=(224, 224),
        shuffle=True,
        seed=1,
        subset=subset,
    )

and

true_class = val_dataset.labels

]

[Update 4, 2022-07-21: Note that deactivating shuffling of the validation data by setting shuffle=(subset == 'training') makes the two validation accuracies equal. This is not a usable workaround, however, because the validation set then consists only of class 1, since flow_from_directory doesn’t do stratification. (A quick way to check the class balance is sketched below.)
]
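
A quick way to check the class balance of the unshuffled validation split (a sketch, assuming the image_dataset_from_directory variant of get_dataset; for the ImageDataGenerator variant, np.bincount(val_dataset.labels) gives the same information):

import numpy as np

# Count how many samples of each class end up in the validation split.
val_labels = np.concatenate([y.numpy().ravel() for _, y in get_dataset('validation')])
print('class counts in validation split:', np.bincount(val_labels.astype(int)))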

My environment:

  • I am using up-to-date versions of all libraries, such as TensorFlow 2.9.1 and scikit-learn 1.1.1 (kept current via pip-compile -U).
  • The folder data-nodup-1000 contains one subfolder with 113 files of class 0, and one subfolder with 887 files of class 1.

Solution

I have now found out that in TensorFlow 2.9.1, model.predict uses the second iteration of the dataset, which is shuffled differently from the first iteration!
It even uses the second iteration when I directly call model.predict(get_dataset('validation'))!

Therefore, the entries of true_class and pred do not match.
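
This matches the default tf.data behavior: a shuffled pipeline is reshuffled on every iteration (reshuffle_each_iteration=True). A minimal toy illustration:

import tensorflow as tf

# A shuffled tf.data pipeline is reshuffled on each pass by default,
# so two iterations over the "same" dataset yield different orders.
ds = tf.data.Dataset.range(10).shuffle(10, seed=1)
print(list(ds.as_numpy_iterator()))  # first iteration
print(list(ds.as_numpy_iterator()))  # second iteration, typically a different order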

Switching to TensorFlow 2.10.0-rc3 and its tf.keras.utils.split_dataset makes the accuracies equal.

Here’s the updated code:

import tensorflow as tf
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing import image_dataset_from_directory
    
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_output = base_model(inputs)
base_model.trainable = False
out = Flatten(name='flat')(base_output)
out = Dense(1, activation='sigmoid')(out)
model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

dataset = image_dataset_from_directory(
    'data-synthetic',
    labels="inferred",
    label_mode='binary',
    color_mode="rgb",
    image_size=(224, 224),
    shuffle=True,
    seed=1,
    crop_to_aspect_ratio=False,
)
train_dataset, val_dataset = tf.keras.utils.split_dataset(dataset, right_size=0.1)

model.fit(
    train_dataset,
    steps_per_epoch=1,
    epochs=20,
    validation_data=val_dataset,
    max_queue_size=0,
    use_multiprocessing=False,
    workers=0,
)

true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))

which correctly yields:

Epoch 19/20
1/1 [==============================] - 0s 438ms/step - loss: 0.4426 - accuracy: 0.9062 - val_loss: 0.4658 - val_accuracy: 0.8800
Epoch 20/20
1/1 [==============================] - 0s 444ms/step - loss: 2.1619 - accuracy: 0.8438 - val_loss: 0.5886 - val_accuracy: 0.8900
4/4 [==============================] - 1s 87ms/step
eval acc: 0.89
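
For reference, a single-pass evaluation (a sketch, assuming model and val_dataset as defined above) also keeps labels and predictions aligned, because both are collected from the same iteration of the dataset:

import tensorflow as tf
from sklearn.metrics import accuracy_score

# Collect labels and predictions in one pass over the dataset, so both
# come from the same (possibly reshuffled) iteration.
true_batches, pred_batches = [], []
for x, y in val_dataset:
    true_batches.append(y)
    pred_batches.append(model(x, training=False))
true_class = tf.concat(true_batches, axis=0)
pred_class = tf.concat(pred_batches, axis=0) >= .5
print('eval acc:', accuracy_score(true_class, pred_class))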

Answered By – Robert Pollak
