I have a dataset for an object detection algorithm containing pictures (.jpg) and corresponding .xml files containing bounding boxes.

I want to write a script that randomly splits the dataset into a train set and a test set, which means I have to make sure each .jpg ends up in the same directory as its corresponding .xml file.

How should I edit the following code to achieve this?

Also, is this the "best" way of doing this, or is it better to split the dataset after the xml-to-csv conversion or after the csv-to-tfrecords conversion?

import shutil, os, glob, random

# List all files in a directory using os.listdir
basepath = '/home/bis/hans/bis/workspace/images/Synced_dataset'
filenames = []

for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        filenames.append(entry)

filenames.sort()  # make sure that the filenames have a fixed order before shuffling
random.shuffle(filenames)  # shuffles the ordering of filenames (reproducible only if random.seed(...) is called first)

split = int(0.8 * len(filenames))
train_filenames = filenames[:split]
test_filenames = filenames[split:]


The best option, in my view, is to create two lists of files (filenames for the .jpg files and xmlnames for the .xml files) in matching order, so that filenames[i] and xmlnames[i] belong together, plus one list of indices: indices = [i for i in range(len(filenames))].
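
One way to build those lists is sketched below. It assumes that each image and its annotation share the same base name (for example img1.jpg and img1.xml) and that extensions are lowercase; the helper name paired_lists is made up for illustration. Images without a matching .xml are skipped.

```python
import os


def paired_lists(basepath):
    """Return (filenames, xmlnames, indices) with filenames[i] matching xmlnames[i].

    Assumption: a .jpg and its .xml share the same base name.
    """
    # Keep only the base names that have both a .jpg and an .xml in basepath.
    stems = sorted(
        os.path.splitext(entry)[0]
        for entry in os.listdir(basepath)
        if entry.endswith(".jpg")
        and os.path.isfile(os.path.join(basepath, os.path.splitext(entry)[0] + ".xml"))
    )
    filenames = [stem + ".jpg" for stem in stems]
    xmlnames = [stem + ".xml" for stem in stems]
    indices = list(range(len(filenames)))
    return filenames, xmlnames, indices
```

Calling filenames, xmlnames, indices = paired_lists(basepath) gives three lists of equal length, sorted by base name, so the same index always refers to the same image/annotation pair.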

Then you can shuffle your indices list:

random.shuffle(indices)

Finally, you create your train and test sets for both your jpg and xml files:

split = int(0.8 * len(filenames))
file_train = [filenames[idx] for idx in indices[:split]]
file_test = [filenames[idx] for idx in indices[split:]]
xml_train = [xmlnames[idx] for idx in indices[:split]]
xml_test = [xmlnames[idx] for idx in indices[split:]]
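
To actually place the files in separate directories, something like the following could work. It is only a sketch: instead of an index list it shuffles the shared base names directly, which keeps each pair together in the same way. The function name split_dataset, the train_dir/test_dir arguments and the seed parameter are my own additions, and it again assumes that a .jpg and its .xml share a base name. Files are copied rather than moved, so the original directory stays intact.

```python
import os
import random
import shutil


def split_dataset(basepath, train_dir, test_dir, train_fraction=0.8, seed=None):
    """Copy each .jpg together with its same-named .xml into train_dir or test_dir."""
    # Base names that have both an image and an annotation file.
    stems = sorted(
        os.path.splitext(entry)[0]
        for entry in os.listdir(basepath)
        if entry.endswith(".jpg")
        and os.path.isfile(os.path.join(basepath, os.path.splitext(entry)[0] + ".xml"))
    )
    # Sorting first and passing a fixed seed makes the split reproducible.
    random.Random(seed).shuffle(stems)

    split = int(train_fraction * len(stems))
    train_stems, test_stems = stems[:split], stems[split:]

    for target, subset in ((train_dir, train_stems), (test_dir, test_stems)):
        os.makedirs(target, exist_ok=True)
        for stem in subset:
            # Copy the image and its annotation into the same directory.
            for ext in (".jpg", ".xml"):
                shutil.copy(os.path.join(basepath, stem + ext), target)
    return train_stems, test_stems
```

For example, split_dataset(basepath, os.path.join(basepath, "train"), os.path.join(basepath, "test"), seed=42) would put roughly 80% of the pairs in train and the rest in test; the directory names and seed value here are placeholders.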

