
Sequences

Data streams are everywhere in our lives. Weather station sensor data arrive in streams indexed by time, as do financial trading data and, of course, the text we process in reading comprehension; one can think of many other examples. We are interested in fitting sequential data with a model, and therefore we need a hypothesis set that is rich enough for tasks that require some form of memory. In the following we use $t$ as the index variable without necessarily implying any time semantics. Dynamical systems are such rich models, where the recurrent state evolution can be represented as:

$$\mathbf{s}_t = f_t(\mathbf{s}_{t-1}, \mathbf{a}_t ; \boldsymbol{\theta}_t)$$

where $\mathbf{s}$ is the evolving state, $\mathbf{a}$ is an external action or control, and $\boldsymbol{\theta}$ is a set of parameters that specify the state evolution model $f$. This innocent-looking equation can capture quite a lot of complexity:
  1. The state space, which is the set of possible states, can depend on $t$.
  2. The action space can similarly depend on $t$.
  3. Finally, the function that maps previous states and actions to a new state can also depend on $t$.
So the dynamical system above indeed offers very profound modeling flexibility, as the sketch below illustrates.
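To make this flexibility concrete, here is a minimal sketch in Python of a state evolution in which the transition function, its parameters, and the actions all change with $t$. The linear-plus-tanh transition, the parameter names `A_t` and `B_t`, and the randomly drawn actions are illustrative assumptions, not details from the text.

```python
import numpy as np

# Sketch of the general dynamical system s_t = f_t(s_{t-1}, a_t; theta_t),
# where the transition function, its parameters, and the action may all
# depend on the step t. (Hypothetical example for illustration only.)

def f_t(s_prev, a_t, theta_t):
    """Time-dependent state transition: a linear map squashed by tanh."""
    A_t, B_t = theta_t  # theta_t bundles the step-specific parameters
    return np.tanh(A_t @ s_prev + B_t @ a_t)

rng = np.random.default_rng(0)
state_dim, action_dim, T = 3, 2, 5

s = np.zeros(state_dim)  # initial state s_0
for t in range(1, T + 1):
    a_t = rng.normal(size=action_dim)                      # external action/control at step t
    theta_t = (rng.normal(size=(state_dim, state_dim)),    # parameters are re-drawn each step,
               rng.normal(size=(state_dim, action_dim)))   # i.e. f_t itself changes with t
    s = f_t(s, a_t, theta_t)
    print(t, s)
```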

RNN Architecture

The RNN architecture is a constrained implementation of the above dynamical system:

$$\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t ; \boldsymbol{\theta})$$

RNNs implement the same function (parametrized by $\boldsymbol{\theta}$) across the sequence $1:\tau$. The state is latent and is denoted $\mathbf{h}$ to match the notation we used earlier for the hidden layers of DNNs. The parameters $\boldsymbol{\theta}$ do not depend on $t$, which means the network shares parameters across the sequence. We have seen parameter sharing in CNNs as well, but there the sharing was over the relatively small span of the filter. The most striking difference between CNNs and RNNs, however, is the recursion itself.

*Figure: Recursive state representation in RNNs.*

In a CNN the hidden activations $\mathbf{h}$ are not a function of previous hidden states, and this means the network cannot remember earlier inputs in the classification or regression task it tries to solve. This is perhaps the most distinguishing element of the RNN architecture: its ability to remember via the hidden state, whose dimension is chosen according to the task at hand.

There is a way, using sliding windows, to allow DNNs to remember past inputs, as shown in the figure below for an NLP application.

*Figure: DNNs can create models from sequential data, such as the language modeling use case shown here. At each step $t$ the network, with a sliding window of span $\tau=3$ that acts as memory, concatenates the word embeddings and uses a hidden layer $\mathbf{h}$ to predict the next element in the sequence.*

However, notice that (a) the span is limited and fixed, and (b) words such as "the ground" appear in multiple sliding windows, forcing the network to learn two different patterns for the same constituent ("in the ground", "the ground there").

RNNs come in a wide variety of architectures.

*Figure: From left to right: CNNs, Image Captioning, Sentiment Classification, Machine Translation, Multi-Object Tracking. RNNs are able to work with variable-size input and output sequences.*
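As a companion to the recurrence above, here is a minimal sketch of a vanilla RNN cell in Python, using the common tanh parametrization; the weight names (`W_hh`, `W_xh`, `b`) and the initialization scheme are illustrative assumptions rather than details from the text. The point to notice is that the same parameters $\boldsymbol{\theta}$ are reused at every step, and the loop handles any sequence length $\tau$.

```python
import numpy as np

# Sketch of the constrained recurrence h_t = f(h_{t-1}, x_t; theta)
# with a tanh nonlinearity. Parameter names are hypothetical.

class VanillaRNNCell:
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # h_{t-1} -> h_t
        self.W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # x_t     -> h_t
        self.b = np.zeros(hidden_dim)

    def step(self, h_prev, x_t):
        # The same parameters theta = (W_hh, W_xh, b) are applied at every step:
        # this is the parameter sharing across the sequence 1:tau.
        return np.tanh(self.W_hh @ h_prev + self.W_xh @ x_t + self.b)

    def forward(self, xs):
        h = np.zeros(self.b.shape)  # initial hidden state h_0
        hs = []
        for x_t in xs:              # works for any sequence length tau
            h = self.step(h, x_t)
            hs.append(h)
        return hs

# Usage: a sequence of tau=4 input vectors of dimension 3
cell = VanillaRNNCell(input_dim=3, hidden_dim=5)
hs = cell.forward(np.random.default_rng(1).normal(size=(4, 3)))
print(len(hs), hs[-1].shape)  # 4 (5,)
```

In a framework such as PyTorch this computation corresponds to what `torch.nn.RNNCell` provides; it is written out explicitly here only to make the parameter sharing and the recursion visible.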