(520|600).666 Information Extraction Homework #6
Due Thursday, April 27, 2023.
Connectionist Temporal Classification
Consider the task of recognizing an $M$-length sequence of tokens, $y_1^M$, from a $T$-length input $x_1^T$. The CTC objective function is one objective for such sequence transduction tasks; it works by converting $y_1^M$ into an alignment sequence $s_1^T$ of the same length as $x_1^T$, using an additional blank symbol $\varnothing$ to fill the extra space. Note that this assumes $T \geq M$. $\beta^{-1}(y_1^M)$ represents the set of all possible $T$-length alignments, $s_1^T$, for the sequence $y_1^M$.
\[
\log p\left(y_1^M \mid x_1^T\right) = \log \sum_{s_1^T \in \beta^{-1}(y_1^M)} \prod_{t=1}^{T} p\left(s_t \mid x_1^T\right) \tag{1}
\]
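As an aside, Eq. (1) can be checked by brute force for small $T$: enumerate every length-$T$ alignment, keep those that collapse to $y_1^M$ under the standard CTC mapping $\beta$ (merge repeated labels, then delete blanks), and sum their path scores. A minimal Python sketch, assuming per-frame log posteriors are already available; the names collapse and ctc_log_prob are illustrative, not part of the assignment.

import itertools
import numpy as np

def collapse(alignment, blank=0):
    # The CTC mapping beta: merge repeated labels, then delete blanks.
    out = []
    for i, s in enumerate(alignment):
        if s != blank and (i == 0 or s != alignment[i - 1]):
            out.append(s)
    return tuple(out)

def ctc_log_prob(log_probs, y, blank=0):
    # Brute-force Eq. (1): log-sum over all alignments that collapse to y.
    # log_probs: T x |S| array of per-frame log p(s_t | x_1^T).
    T, S = log_probs.shape
    total = -np.inf
    for alignment in itertools.product(range(S), repeat=T):
        if collapse(alignment, blank) == tuple(y):
            path = sum(log_probs[t, s] for t, s in enumerate(alignment))
            total = np.logaddexp(total, path)
    return total

This enumeration is exponential in $T$ and is only useful as a reference against the efficient forward-algorithm computation used later.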
We will consider that $p\left(s_t \mid x_1^T\right)$ is obtained by normalizing the outputs of a neural network, $\phi$. The element $\phi_{t s_t}$, with $e^{\phi_{t s_t}} \propto p\left(s_t \mid x_1^T\right)$, of the matrix $\phi$ can be used to compute the CTC objective, where

\[
p\left(s_t \mid x_1^T\right) = \frac{e^{\phi_{t s_t}}}{\sum_{s'=1}^{|S|} e^{\phi_{t s'}}} \tag{2}
\]

and $|S|$ is the number of output units in the neural network.

1. WFST Representation of the CTC Objective
Consider the task of modeling the word “all” using characters as the tokens. In this case $M = 3$. For this question, we will model words using both the HMM and CTC objective functions and examine the different unweighted (i.e., the weight on each arc is 1.0) WFST topologies used for each. Remember that final states are indicated by using a double circle instead of a single circle around the state label. Mark all input and output labels.
(a) Draw the WFST corresponding to a 1-state HMM for each letter with uniform state transitions. Assume the output labels could be fed into a pronunciation lexicon to recover the word, i.e., a - l - l → all.
(b) Draw the WFST corresponding to the CTC topology. Recall that it should be able to accept strings starting or ending with $\varnothing$.
(c) Enumerate all possible length-5 alignments of the word “all” accepted by the HMM topology, where the HMMs for each letter are strung together with null arcs.
(d) Enumerate all possible length-5 alignments of the word “all” accepted by the CTC topology (a brute-force check for this part is sketched after part (f)).
(e) Draw the HMM trellis, often called a lattice, corresponding to all possible length-5 sequences as a WFST. (It should still look very similar to a normal trellis.) Ignore the weights on the arcs. Do not draw any unnecessary (unused) nodes. Please include input and output labels.
(f) Repeat (e), but for the CTC trellis corresponding to all possible length-5 sequences as a WFST.
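For part (d), the set of valid CTC alignments can be brute-forced in the same spirit as the sketch above: generate all length-5 strings over $\{\varnothing, a, l\}$ and keep those that collapse to “all”. A minimal sketch ('_' stands in for $\varnothing$); the HMM case in (c) is not covered here, since it depends on the HMM topology rather than the CTC mapping.

from itertools import product

SYMBOLS = ['_', 'a', 'l']   # '_' stands in for the blank

def collapses_to(alignment, target, blank='_'):
    # True if the alignment collapses to `target` under the CTC mapping.
    out = []
    for i, s in enumerate(alignment):
        if s != blank and (i == 0 or s != alignment[i - 1]):
            out.append(s)
    return ''.join(out) == target

ctc_alignments = [''.join(a) for a in product(SYMBOLS, repeat=5)
                  if collapses_to(a, 'all')]
print(ctc_alignments)   # all length-5 strings accepted by the CTC topology for "all"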
2. WFST Composition
The outputs of a neural network, $\phi$, can themselves be represented as a WFST. Imagine the neural network has 4 outputs; in other words, it outputs vectors $\phi_t \in \mathbb{R}^{1 \times 4}$. The first output corresponds to the $\varnothing$ symbol, the second to a, the third to b, and the fourth to l.
For this problem consider the matrix of scores, $\phi$, where each column corresponds to a time index and each row corresponds to one of the network outputs. These are analogous to the observation probabilities in HMMs. We can draw this matrix as a WFST with 6 states corresponding to the length of the neural network output (5+1, where the +1 is for a start state). Between consecutive states there are as many arcs as there are output symbols (i.e., rows in the matrix). The input and output labels for this WFST will be the same (i.e., it is a WFSA), and the weights on the arcs correspond to the scores of those symbols at that position in time. Use the following matrix.
[Eq. (3): the $4 \times 5$ score matrix $\phi$; rows $\varnothing$, a, b, l, columns $t = 1, \dots, 5$.]
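Since each arc of this acceptor simply reads one symbol at one time step, it can be built mechanically from the matrix. A minimal sketch with a hypothetical stand-in matrix (the actual values are those of Eq. (3)); states $0, \dots, 5$, with state 0 initial and state 5 final:

import numpy as np

# Hypothetical placeholder for the score matrix of Eq. (3).
# Rows: blank, a, b, l; columns: t = 1..5.
phi = np.array([[0.1, 0.3, 0.2, 0.4, 0.5],
                [0.6, 0.2, 0.1, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.1, 0.1],
                [0.2, 0.4, 0.6, 0.4, 0.3]])
SYMBOLS = ['_', 'a', 'b', 'l']   # '_' stands in for the blank

# Phi as a list of arcs (src, dst, label, weight): one arc per
# output symbol between consecutive time states.
arcs = [(t, t + 1, SYMBOLS[k], float(phi[k, t]))
        for t in range(phi.shape[1])
        for k in range(phi.shape[0])]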
(a) Draw this WFST. If this process seems tedious, feel free to use some logical shorthand notation. We will call it $\Phi$.
(b) Use the WFST $\Phi$ from part (a) and compose it with the CTC WFST from problem (1.), which we will call $T$, using the log semiring, i.e., $\Phi \circ T$. Assume the unweighted arcs all had a weight of $\bar{1}$. Recall that in the log semiring, $a \oplus b = \log\left(e^a + e^b\right)$, $a \otimes b = a + b$, $\bar{1} = 0$, and $\bar{0} = -\infty$. Show each step of the composition by denoting states with the pairs of labels of the states used in $\Phi$ and $T$, like $(\Phi_i, T_j)$. Feel free to use any computational aid, i.e., a program or calculator, to help. Remember to remove any branches that do not end in final states. A final state occurs when both states in the pair of WFSTs are final.
(c) Compare the resulting WFST with the CTC trellis computed in problem (1.f) and comment.
(d) Use the forward algorithm on the result of part (b) to compute the CTC objective for this utterance. Again, feel free to use any computational aid; a log-semiring forward-pass sketch follows this part.
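For part (d), the forward algorithm in the log semiring is dynamic programming over the trellis with $\oplus = \mathrm{logaddexp}$ and $\otimes = +$. A minimal sketch over the arc-list representation used above; it assumes the composed machine is acyclic and that state numbering follows topological order (true for a trellis indexed by time):

import numpy as np

def forward_log_semiring(arcs, start, finals):
    # Total log-semiring weight of all start-to-final paths.
    # arcs: iterable of (src, dst, label, weight), weights in the log domain.
    alpha = {start: 0.0}                     # bar-1 of the log semiring
    for src, dst, _, w in sorted(arcs):      # topological if states are numbered in order
        if src in alpha:
            score = alpha[src] + w           # otimes = +
            alpha[dst] = np.logaddexp(alpha.get(dst, -np.inf), score)  # oplus
    return np.logaddexp.reduce([alpha.get(f, -np.inf) for f in finals])

Applied to the result of part (b), the returned value should match the CTC objective for this utterance.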
3. Prove that the posterior distribution modeled by CTC is globally normalized, i.e., that the probability of a sequence is computed as the score of the sequence normalized by the sum of scores over all possible sequences, and can be expressed as

\[
p\left(y_1^M \mid x_1^T\right) = \frac{e^{\,\mathrm{score}\left(y_1^M,\, x_1^T\right)}}{\mathbb{E}_{p(y')}\!\left[e^{\,\mathrm{score}\left(y',\, x_1^T\right)}\right]}
\]

Make sure to state any assumptions used.
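A possible starting point (assuming the per-frame softmax of Eq. (2), and that $\beta$ assigns each alignment to exactly one label sequence, so the alignments partition by label sequence): the product of the per-frame normalizers is itself a sum over all length-$T$ alignments,

\[
p\left(y \mid x_1^T\right)
= \sum_{s_1^T \in \beta^{-1}(y)} \prod_{t=1}^{T} \frac{e^{\phi_{t s_t}}}{\sum_{s'} e^{\phi_{t s'}}}
= \frac{\sum_{s_1^T \in \beta^{-1}(y)} e^{\sum_t \phi_{t s_t}}}{\sum_{s_1^T} e^{\sum_t \phi_{t s_t}}},
\]

and the denominator can then be regrouped over label sequences $y'$.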
4. Using the result of problem (3.), show that maximizing the CTC objective maximizes a lower bound on the mutual information, $I(X; Y)$, between the input and output sequences.
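A possible starting identity for relating the result of problem (3.) to mutual information:

\[
I(X; Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(y \mid x)}{p(y)}\right],
\]

into which the globally normalized form of $p(y \mid x)$ from problem (3.) can be substituted.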