Reading the data
Count examples in a CSV file
import tensorflow as tf
filename_queue = tf.train.string_input_producer(["file.csv"], num_epochs=1)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
col1, col2 = tf.decode_csv(value, record_defaults=[[0], [0]])

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    tf.train.start_queue_runners()
    num_examples = 0
    try:
        while True:
            c1, c2 = sess.run([col1, col2])
            num_examples += 1
    except tf.errors.OutOfRangeError:
        print("There are", num_examples, "examples")
num_epochs=1 makes string_input_producer close the queue after processing each file on the list once. This leads to raising an OutOfRangeError, which is caught in the try: block. By default, string_input_producer produces the filenames infinitely.

tf.local_variables_initializer() is a tensorflow Op which, when executed, initializes the num_epochs local variable inside string_input_producer.

tf.train.start_queue_runners() starts extra threads that handle adding data to the queues asynchronously.
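In longer-running scripts it is common to pair the queue runners with a tf.train.Coordinator (used later in this document as well) so the reading threads can be shut down cleanly. A sketch of the same counting loop with coordinated shutdown, assuming the graph defined above:

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    num_examples = 0
    try:
        while not coord.should_stop():
            sess.run([col1, col2])
            num_examples += 1
    except tf.errors.OutOfRangeError:
        print("There are", num_examples, "examples")
    finally:
        coord.request_stop()
        coord.join(threads)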
Read & Parse TFRecord file
TFRecord files are the native tensorflow binary format for storing data (tensors). To read a file you can use code similar to the CSV example:
import tensorflow as tf
filename_queue = tf.train.string_input_producer(["file.tfrecord"], num_epochs=1)
reader = tf.TFRecordReader()
key, serialized_example = reader.read(filename_queue)
Then you need to parse the examples from the serialized_example queue. You can do it either using tf.parse_example, which requires batching beforehand but is faster, or using tf.parse_single_example:
batch = tf.train.batch([serialized_example], batch_size=100)
parsed_batch = tf.parse_example(batch, features={
    "feature_name_1": tf.FixedLenFeature(shape=[1], dtype=tf.int64),
    "feature_name_2": tf.FixedLenFeature(shape=[1], dtype=tf.float32)
})
tf.train.batch joins consecutive values of the given tensors of shape [x, y, z] into tensors of shape [batch_size, x, y, z].
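To make the shape bookkeeping concrete, here is a hypothetical illustration (the zero tensor is made up; the static shapes are already available at graph-construction time):

example = tf.zeros([2, 3])                           # one example of shape [2, 3]
batched = tf.train.batch([example], batch_size=100)
print(batched.get_shape())                           # (100, 2, 3)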
The features dict maps names of the features to tensorflow's definitions of features. You use parse_single_example in a similar way:
parsed_example = tf.parse_single_example(serialized_example, {
    "feature_name_1": tf.FixedLenFeature(shape=[1], dtype=tf.int64),
    "feature_name_2": tf.FixedLenFeature(shape=[1], dtype=tf.float32)
})
tf.parse_example and tf.parse_single_example return a dictionary mapping feature names to the tensors with the values.
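Individual tensors can then be looked up by feature name (using the hypothetical feature names from above):

feature_1 = parsed_example["feature_name_1"]  # int64 tensor of shape [1]
feature_2 = parsed_example["feature_name_2"]  # float32 tensor of shape [1]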
To batch examples coming from parse_single_example you should extract the tensors from the dict and use tf.train.batch as before:

parsed_batch = dict(zip(parsed_example.keys(),
                        tf.train.batch(list(parsed_example.values()), batch_size=100)))
You read the data as before, passing the list of all the tensors to evaluate to sess.run:

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    tf.train.start_queue_runners()
    try:
        while True:
            data_batch = sess.run(list(parsed_batch.values()))
            # process data
    except tf.errors.OutOfRangeError:
        pass
Randomly shuffling the examples
To randomly shuffle the examples, you can use the tf.train.shuffle_batch function instead of tf.train.batch, as follows:

parsed_batch = tf.train.shuffle_batch([serialized_example],
                                      batch_size=100, capacity=1000,
                                      min_after_dequeue=200)
tf.train.shuffle_batch (as well as tf.train.batch) creates a tf.Queue and keeps adding serialized_examples to it.

capacity measures how many elements can be stored in the queue at one time. Bigger capacity leads to bigger memory usage, but lower latency caused by threads waiting to fill it up.

min_after_dequeue is the minimum number of elements present in the queue after getting elements from it. The shuffle_batch queue does not shuffle the elements perfectly uniformly: it is designed with huge data that does not fit in memory in mind. Instead, it reads between min_after_dequeue and capacity elements, stores them in memory and randomly chooses a batch of them. After that it enqueues some more elements, to keep their number between min_after_dequeue and capacity. Thus, the bigger the value of min_after_dequeue, the more random the elements are: the choice of batch_size elements is guaranteed to be made from at least min_after_dequeue consecutive elements, but the bigger capacity has to be, and the longer it takes to fill the queue initially.
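The tensorflow documentation suggests a rule of thumb for sizing the queue; the num_threads and safety margin values below are illustrative choices, not requirements:

batch_size = 100
min_after_dequeue = 200
num_threads = 4     # threads filling the queue in parallel (an assumption)
safety_margin = 3   # extra batches of slack (an assumption)
capacity = min_after_dequeue + (num_threads + safety_margin) * batch_size

parsed_batch = tf.train.shuffle_batch([serialized_example],
                                      batch_size=batch_size,
                                      capacity=capacity,
                                      min_after_dequeue=min_after_dequeue,
                                      num_threads=num_threads)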
Reading data for n epochs with batching
Assume your data examples are already loaded into a Python variable and you would like to read them n times, in batches of a given size:
import numpy as np
import tensorflow as tf
data = np.array([1, 2, 3, 4, 5])
n = 4
To merge the data in batches, possibly with random shuffling, you can use tf.train.batch or tf.train.shuffle_batch, but you need to pass them a tensor that produces the whole data n times:

limited_tensor = tf.train.limit_epochs(data, n)
batch = tf.train.shuffle_batch([limited_tensor], batch_size=3, enqueue_many=True,
                               capacity=4, min_after_dequeue=1)
limit_epochs converts the numpy array to a tensor under the hood and returns a tensor producing it n times, throwing an OutOfRangeError afterwards.

The enqueue_many=True argument passed to shuffle_batch denotes that each tensor in the tensor list [limited_tensor] should be interpreted as containing a number of examples. Note that the capacity of the batching queue can be smaller than the number of examples in the tensor. (Note also that min_after_dequeue is a required argument of shuffle_batch, so it is supplied above.)
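To see what enqueue_many changes, compare the resulting static shapes in this hypothetical snippet (the zero tensor is made up):

x = tf.zeros([5, 2])  # with enqueue_many=True: 5 examples of shape [2]

b1 = tf.train.batch([x], batch_size=3)                     # shape (3, 5, 2)
b2 = tf.train.batch([x], batch_size=3, enqueue_many=True)  # shape (3, 2)
print(b1.get_shape(), b2.get_shape())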
One can process the data as usual:

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    tf.train.start_queue_runners()
    try:
        while True:
            data_batch = sess.run(batch)
            # process data
    except tf.errors.OutOfRangeError:
        pass
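One detail worth knowing: the 20 enqueued values (5 elements times n=4 epochs) do not divide evenly into batches of 3, and by default the final incomplete batch is discarded when the queue closes. If you need it, both tf.train.batch and tf.train.shuffle_batch accept allow_smaller_final_batch; one possible variation of the snippet above:

batch = tf.train.shuffle_batch([limited_tensor], batch_size=3, enqueue_many=True,
                               capacity=4, min_after_dequeue=1,
                               allow_smaller_final_batch=True)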
How to load images and labels from a TXT file
It is not explained in the Tensorflow documentation how to load images and labels directly from a TXT file. The code below illustrates how I achieved it. However, that does not mean it is the best way to do it, or that this way will help in further steps.

For instance, I'm loading the labels as one single integer value {0,1}, while the documentation uses a one-hot vector [0,1].
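If you prefer the one-hot representation, an integer label can be converted with tf.one_hot; a minimal sketch (depth=2 matches the {0,1} labels used here):

label = tf.constant(1)                       # an integer label in {0, 1}
one_hot_label = tf.one_hot(label, depth=2)   # -> [0., 1.]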
# Learning how to import images and labels from a TXT file
#
# TXT file format
#
# path/to/imagefile_1 label_1
# path/to/imagefile_2 label_2
# ... ...
#
# where label_X is either {0,1}

# Importing libraries
import os
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes

# File containing the paths to the images and the labels [path/to/image label]
filename = '/path/to/List.txt'

# Lists in which to store the paths and labels
filenames = []
labels = []

# Reading the file and extracting the paths and labels
with open(filename, 'r') as File:
    infoFile = File.readlines()  # reading all the lines from File
    for line in infoFile:  # reading line-by-line
        words = line.split()  # splitting lines into words using space as the separator
        filenames.append(words[0])
        labels.append(int(words[1]))

NumFiles = len(filenames)

# Converting filenames and labels into tensors
tfilenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
tlabels = ops.convert_to_tensor(labels, dtype=dtypes.int32)

# Creating a queue which contains the list of files to read and the values of the labels
filename_queue = tf.train.slice_input_producer([tfilenames, tlabels], num_epochs=10,
                                               shuffle=True, capacity=NumFiles)

# Reading the image files and decoding them
rawIm = tf.read_file(filename_queue[0])
decodedIm = tf.image.decode_png(rawIm)  # png or jpg decoder

# Extracting the labels queue
label_queue = filename_queue[1]

# Initializing global and local variables so we avoid warnings and errors
init_op = tf.group(tf.local_variables_initializer(), tf.global_variables_initializer())

# Creating an InteractiveSession so we can run in iPython
sess = tf.InteractiveSession()

with sess.as_default():
    sess.run(init_op)

    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(NumFiles):  # length of your filenames list
        nm, image, lb = sess.run([filename_queue[0], decodedIm, label_queue])

        print(image.shape)
        print(nm)
        print(lb)

        # Showing the current image
        plt.imshow(image)
        plt.show()

    coord.request_stop()
    coord.join(threads)
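If you later want to feed these images to a model in batches, note that tf.train.batch needs a known static shape for its inputs, which a freshly decoded PNG does not have. A possible follow-up sketch, assuming every image is resized to an arbitrary fixed size of 224x224 with 3 channels (if your images may be grayscale, decode with tf.image.decode_png(rawIm, channels=3) instead):

# Resize so every image has the same static shape (224x224x3 is an assumption)
resizedIm = tf.image.resize_images(decodedIm, [224, 224])
resizedIm.set_shape([224, 224, 3])

# Batch images together with their labels
image_batch, label_batch = tf.train.batch([resizedIm, label_queue], batch_size=32)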