[FIXED] Dataset.batch doesn't work as expected with a zipped dataset

Issue

I have a dataset like this:

import tensorflow as tf

# range(16) yields 0..15, which matches the 16 pairs printed below
a = tf.data.Dataset.range(16)
b = tf.data.Dataset.range(16, 32)
zipped = tf.data.Dataset.zip((a, b))
list(zipped.as_numpy_iterator())

# output: 
[(0, 16),
 (1, 17),
 (2, 18),
 (3, 19),
 (4, 20),
 (5, 21),
 (6, 22),
 (7, 23),
 (8, 24),
 (9, 25),
 (10, 26),
 (11, 27),
 (12, 28),
 (13, 29),
 (14, 30),
 (15, 31)]

When I apply batch(4) to it, the expected result is an array of batches, where each batch contains four tuples:

[[(0, 16), (1, 17), (2, 18), (3, 19)],
 [(4, 20), (5, 21), (6, 22), (7, 23)],
 [(8, 24), (9, 25), (10, 26), (11, 27)],
 [(12, 28), (13, 29), (14, 30), (15, 31)]]

But this is what I receive instead:

batched = zipped.batch(4)
list(batched.as_numpy_iterator())

# Output:
[(array([0, 1, 2, 3]), array([16, 17, 18, 19])), 
 (array([4, 5, 6, 7]), array([20, 21, 22, 23])), 
 (array([ 8,  9, 10, 11]), array([24, 25, 26, 27])), 
 (array([12, 13, 14, 15]), array([28, 29, 30, 31]))]

I’m following this tutorial, whose author performs the same steps but somehow gets the output I expected.


Update: according to the documentation this is the intended behavior:

The components of the resulting element will have an additional outer dimension, which will be batch_size

But this doesn’t make sense to me. As I understand it, a dataset is a list of pieces of data. Whatever the shape of those pieces, batching should combine the elements into batches, so it should always insert the new dimension in the second position: (length, a, b, c) -> (length', batch_size, a, b, c).

So my questions are: what is the purpose of implementing batch() this way, and what alternative does what I described?

Solution
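
On the "why" part of the question: each element of the zipped dataset is a tuple of two components, and the documentation quoted above says batch() adds the outer dimension to each component, not to the tuple as a whole. One plausible reading is that this keeps components such as features and labels in separate tensors. The pure-Python sketch below models that behaviour with lists. batch_components is a hypothetical helper written for illustration only; it is not part of TensorFlow and says nothing about how tf.data is implemented internally.

```python
def batch_components(elements, batch_size):
    """Model of component-wise batching: yield one tuple of lists per batch."""
    for start in range(0, len(elements), batch_size):
        chunk = elements[start:start + batch_size]
        # zip(*chunk) regroups the chunk by component instead of by element
        yield tuple(list(component) for component in zip(*chunk))


# The same 16 pairs as the zipped dataset in the question
elements = list(zip(range(16), range(16, 32)))
batches = list(batch_components(elements, 4))

print(batches[0])  # ([0, 1, 2, 3], [16, 17, 18, 19])

# The row-wise view asked for in the question is a transpose of each batch
rows = list(zip(*batches[0]))
print(rows)  # [(0, 16), (1, 17), (2, 18), (3, 19)]
```

The last two lines show that the layout the question expects is a transpose of each component-wise batch, which is what the tf.transpose snippets below do.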

One thing you can try is transposing each batch after batching:

import tensorflow as tf

a = tf.data.Dataset.range(16)
b = tf.data.Dataset.range(16, 32)
zipped = tf.data.Dataset.zip((a, b)).batch(4).map(lambda x, y: tf.transpose([x, y]))

list(zipped.as_numpy_iterator())

# Output:
[array([[ 0, 16],
        [ 1, 17],
        [ 2, 18],
        [ 3, 19]]), 
 array([[ 4, 20],
        [ 5, 21],
        [ 6, 22],
        [ 7, 23]]), 
 array([[ 8, 24],
        [ 9, 25],
        [10, 26],
        [11, 27]]), 
 array([[12, 28],
        [13, 29],
        [14, 30],
        [15, 31]])]

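These are arrays rather than tuples, though. Alternatively, unstack each batch into a tuple of rows (num=4 assumes every batch holds exactly four elements):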

zipped = tf.data.Dataset.zip((a, b)).batch(4).map(lambda x, y: tf.unstack(tf.transpose([x, y]), num=4))

list(zipped.as_numpy_iterator())

# Output:
[(array([ 0, 16]), array([ 1, 17]), array([ 2, 18]), array([ 3, 19])),
 (array([ 4, 20]), array([ 5, 21]), array([ 6, 22]), array([ 7, 23])),
 (array([ 8, 24]), array([ 9, 25]), array([10, 26]), array([11, 27])),
 (array([12, 28]), array([13, 29]), array([14, 30]), array([15, 31]))]
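
Another option, sketched here as an assumption and not part of the original answer, is to merge the two components into a single tensor before batching, so that batch() has only one component to add a dimension to. tf.stack and Dataset.map are standard TensorFlow calls. The final conversion to Python tuples happens outside the pipeline, because a tensor cannot hold tuples.

```python
import tensorflow as tf

a = tf.data.Dataset.range(16)
b = tf.data.Dataset.range(16, 32)

# Merge each (x, y) pair into one tensor of shape (2,) first, then batch.
# Every batch is then a single tensor of shape (4, 2).
stacked = tf.data.Dataset.zip((a, b)).map(lambda x, y: tf.stack([x, y])).batch(4)

# Convert every batch to a list of plain Python tuples for inspection.
batches = [[tuple(pair) for pair in batch.tolist()]
           for batch in stacked.as_numpy_iterator()]

print(batches[0])  # [(0, 16), (1, 17), (2, 18), (3, 19)]
```

Stacking only works when both components share a dtype and shape, as they do here. For components that differ, such as images and labels, the component-wise batches are probably the more practical form.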

Answered By – AloneTogether

Answer Checked By – Timothy Miller (Easybugfix Admin)
