Understanding Keras LSTMs: Role of Batch-size and Statefulness

I am still not sure what the correct approach is for my task regarding statefulness and choosing batch_size. I have about 1000 independent time series (samples), each about 600 days long (timesteps). The lengths actually vary, but I thought about trimming the data to a constant timeframe. Each timestep has 8 features (input_dim); some of the features are identical across all samples, others are individual per sample.

Input shape = (1000, 600, 8)

One of the features is the one I want to predict, while the others are (supposed to be) supportive for the prediction of this one “master feature”. I will do that for each of the 1000 time series.

Output shape = (1000, 600, 1)

What would be the best strategy to model this problem?
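As a minimal sketch of the trimming step described above, the snippet below stacks variable-length series into a single 3D array. The data is random stand-in data with only 5 short series, and keeping the most recent timesteps when trimming is an assumption; with the real data the resulting shape would be (1000, 600, 8).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 series of varying length, 8 features each.
n_features = 8
lengths = [7, 9, 6, 8, 10]
series = [rng.normal(size=(n, n_features)) for n in lengths]

# Trim every series to the length of the shortest one. Keeping the most
# recent timesteps is an assumption; trimming from the other end works too.
timesteps = min(len(s) for s in series)

# Stack along a new leading axis to get (samples, timesteps, features).
X = np.stack([s[-timesteps:] for s in series])

print(X.shape)
```

Padding the shorter series instead of trimming the longer ones would be an alternative way to reach a constant length.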

What is a Batch?

Keras uses fast symbolic mathematical libraries as a backend, such as TensorFlow and Theano. A downside of using these libraries is that the shape and size of your data must be defined once up front and held constant regardless of whether you are training your network or making predictions. […] This does become a problem when you wish to make fewer predictions than the batch size. For example, you may get the best results with a large batch size, but are required to make predictions for one observation at a time on something like a time series or sequence problem.

This sounds to me like a “batch” would be splitting the data along the timesteps dimension. However, [3] states that:

Said differently, whenever you train or test your LSTM, you first have to build your input matrix X of shape nb_samples, timesteps, input_dim where your batch size divides nb_samples. For instance, if nb_samples=1024 and batch_size=64, it means that your model will receive blocks of 64 samples, compute each output (whatever the number of timesteps is for every sample), average the gradients and propagate it to update the parameters vector.

When looking deeper into the examples of [1] and [4], Jason always splits his time series into several samples that contain only 1 timestep (the predecessor, which in his example fully determines the next element in the sequence). So I think the batches are really split along the samples axis. (However, his approach to splitting time series doesn’t make sense to me for a long-term dependency problem.)

Conclusion

So let’s say I pick batch_size=10. That means that during one epoch the weights are updated 1000 / 10 = 100 times, each time with 10 randomly picked, complete time series containing 600 x 8 values. When I later want to make predictions with the model, I’ll always have to feed it batches of 10 complete time series (or use solution 3 from [4], copying the weights to a new model with a different batch_size).

I understand the principles of batch_size, but I still don’t know what a good value for batch_size would be or how to determine it.
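The arithmetic above can be checked with a small sketch in plain Python. It only mimics how sample indices would be grouped into batches along the samples axis; it is not Keras' actual implementation, and the shuffling is an assumption that matches the "randomly picked" wording.

```python
import math
import random

n_samples = 1000   # independent time series
batch_size = 10

# Each epoch visits every sample once, in shuffled order.
indices = list(range(n_samples))
random.seed(0)
random.shuffle(indices)

# Split along the samples axis: every batch holds whole series,
# i.e. all 600 timesteps and all 8 features of each selected sample.
batches = [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]

# One weight update per batch.
updates_per_epoch = math.ceil(n_samples / batch_size)
print(len(batches), updates_per_epoch)
```

Within one epoch each series appears in exactly one batch, so the 10 series per update are drawn without replacement.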

Statefulness

The Keras documentation tells us:

You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch.
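To make the quoted definition concrete, here is a toy illustration with a scalar recurrent cell written in plain Python. The function run_rnn and its weights are hypothetical and have nothing to do with Keras internals; the point is only to show what carrying the state from one chunk to the next changes.

```python
import math

def run_rnn(xs, h=0.0, w=0.5, u=0.9):
    """Toy scalar recurrent cell: h_t = tanh(w * x_t + u * h_{t-1})."""
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h

sequence = [0.1, 0.4, -0.2, 0.3, 0.8, -0.5]
first, second = sequence[:3], sequence[3:]

# Reference: the whole sequence processed in one pass.
full = run_rnn(sequence)

# "Stateful": the final state of the first chunk seeds the second chunk.
stateful = run_rnn(second, h=run_rnn(first))

# "Stateless": the state is reset to zero before the second chunk.
stateless = run_rnn(second)

print(full == stateful, full == stateless)
```

If a sequence is fed in one piece, as with complete 600-step series, there is no chunk boundary at which state could be lost, so carrying state across batches would only matter when one series is split over several batches.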

If I’m splitting my time series into several samples (like in the examples of [1] and [4]) so that the dependencies I’d like to model span several batches, or the batch-spanning samples are otherwise correlated with each other, I may need a stateful net; otherwise not. Is that a correct and complete conclusion?

So for my problem I suppose I won’t need a stateful net. I’d build my training data as a 3D array of shape (samples, timesteps, features) and then call model.fit with a batch_size yet to be determined. Sample code could look like:

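    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    batch_size = 10  # placeholder value; still to be determined

    model = Sequential()
    # Stacked LSTM layers need 3D input, so each one returns its full sequence.
    model.add(LSTM(32, return_sequences=True, input_shape=(600, 8)))  # (timesteps, features)
    model.add(LSTM(32, return_sequences=True))
    model.add(LSTM(32, return_sequences=True))
    # Also kept on the last LSTM so the output has one value per timestep: (samples, 600, 1).
    model.add(LSTM(32, return_sequences=True))
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    # X has shape (1000, 600, 8) and y has shape (1000, 600, 1).
    model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)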