Hence, there are no more than 2·m iterations of the while-loop
⇒ failure function can be computed in O(m) time
COMP9024 22T2 ♢Dynamic Data Structures♢ [36/66]
❖ Boyer-Moore vs KMP
Boyer-Moore algorithm
decides how far to jump ahead based on the mismatched character in the text
works best on large alphabets and natural language texts (e.g. English)
Knuth-Morris-Pratt algorithm
uses information embodied in the pattern to determine where the next match could begin
works best on small alphabets (e.g. A,C,G,T)
For the keen: The article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the time is inversely proportional to the size of the alphabet
❖ Word Matching With Tries
❖ Preprocessing Strings
Preprocessing the pattern speeds up pattern matching queries
After preprocessing P, the KMP algorithm performs pattern matching in time proportional to the length of the text
If the text is large, immutable and searched for often (e.g., works by Shakespeare)
we can preprocess the text instead of the pattern
❖ Preprocessing Strings (cont)
A trie …
is a compact data structure for representing a set of strings
e.g. all the words in a text, a dictionary etc.
supports pattern matching queries in time proportional to the pattern size
Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
❖ Tries
Tries are trees organised using parts of keys (rather than whole keys)
[Diagram:Pic/trie-example.png]
❖ Tries (cont)
Each node in a trie …
contains one part of a key (typically one character)
may have up to 26 children
may be tagged as a "finishing" node
but even "finishing" nodes may have children
Depth d of trie = length of longest key value
Cost of searching is O(d) (independent of n, the number of keys)
❖ Tries (cont)
Possible trie representation:
#include <stdbool.h>

#define ALPHABET_SIZE 26
typedef struct Node *Trie;
typedef struct Node {
bool finish; // last char in key?
Item data; // no Item if !finish
Trie child[ALPHABET_SIZE];
} Node;
typedef char *Key;
❖ Tries (cont)
Note: Can also use BST-like nodes for a more space-efficient implementation of tries
[Diagram:Pic/trie-as-bst.png]
❖ Trie Operations
Basic operations on tries:
search for a key
insert a key
❖ Trie Operations (cont)
[Diagram:Pic/trie-search.png]
❖ Trie Operations (cont)
Traversing a path, consuming the key character by character:
find(trie,key):
| Input trie, key
| Output pointer to element in trie if key found
| NULL otherwise
|
| node=trie
| for each char in key do
| | if node.child[char] exists then
| | node=node.child[char] // move down one level
| | else
| | return NULL
| | end if
| end for
| if node.finish then // "finishing" node reached?
| return node
| else
| return NULL
| end if
❖ Trie Operations (cont)
Insertion into Trie:
insert(trie,item,key):
| Input trie, item with key of length m
| Output trie with item inserted
|
| if trie is empty then
|    t=new trie node
| else
|    t=trie
| end if
| if m=0 then
| t.finish=true, t.data=item
| else
| t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
| end if
| return t
❖ Exercise: Trie Insertion
Insert cat, cats and carer into this trie:
[Diagram:Pic/trie1.png]
[Diagram:Pic/trie2.png]
❖ Exercise: Trie Insertion (cont)
Analysis of standard tries:
O(n) space
insertion and search in time O(m)
n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
m … size of the string parameter of the operation (the "key")
❖ Word Matching With Tries
❖ Word Matching with Tries
Preprocessing the text:
Insert all searchable words of a text into a trie
Each finishing node stores the occurrence(s) of the associated word in the text
❖ Word Matching with Tries (cont)
Example text and corresponding trie of searchable words:
[Diagram:Pic/text-trie-example.png]
❖ Text Compression
❖ Text Compression
Problem: Efficiently encode a given string X by a smaller string Y
Applications:
Save memory and/or bandwidth
Huffman's algorithm
computes frequency f(c) for each character c
encodes high-frequency characters with short code
no code word is a prefix of another code word
uses optimal encoding tree to determine the code words
❖ Text Compression (cont)
Code … mapping of each character to a binary code word
Prefix code … binary code such that no code word is prefix of another code word
prefix code: each code word can be recognised as soon as its last bit is read, so characters can be decoded without any marker/delimiter between them
e.g., set {A, B, C, D}. A prefix code for this set might be:
A = 00
B = 01
C = 10
D = 111
Then: 010011110 decodes uniquely as BADC (01·00·111·10)
Encoding tree …
represents a prefix code
each leaf stores a character
code word given by the path from the root to the leaf (0 for left child, 1 for right child)
❖ Text Compression (cont)
Example:
[Diagram:Pic/encoding-tree.png]
❖ Text Compression (cont)
Text compression problem
Given a text T, find a prefix code that yields the shortest encoding of T
short codewords for frequent characters
long code words for rare characters
❖ Text Compression (cont)
Example: T = abracadabra
[Diagram:Pic/encoding-trees.png]
T1 requires 29 bits to encode text T,
T2 requires 24 bits
01011011010000101001011011010 vs 001011000100001100101100
❖ Text Compression (cont)
Huffman's algorithm
computes frequency f(c) for each character
successively combines pairs of lowest-frequency characters to build encoding tree "bottom-up"
Example: abracadabra
[Diagram:Pic/huffman-tree.png]
❖ Huffman Code
Huffman's algorithm using priority queue:
HuffmanCode(T):
| Input string T of size n
| Output optimal encoding tree for T
|
| compute frequency array
| Q=new priority queue
| for all characters c do
| | t=new single-node tree storing c
| | join(Q,t) with frequency(c) as key
| end for
| while |Q|≥2 do
| | f1=Q.minKey(), T1=leave(Q)
| | f2=Q.minKey(), T2=leave(Q)
| | t=new tree node with subtrees T1 and T2
| | join(Q,t) with f1+f2 as key
| end while
| return leave(Q)
❖ Exercise: Huffman Code
Construct a Huffman tree for: a fast runner need never be afraid of the dark
[Diagram:Pic/larger-huffman-tree-array.png]
[Diagram:Pic/larger-huffman-tree.png]
❖ Exercise: Huffman Code (cont)
Analysis of Huffman's algorithm:
O(n+d·log d) time
n … length of the input text T
d … number of distinct characters in T
❖ Summary
Alphabets and words
Pattern matching
Boyer-Moore, Knuth-Morris-Pratt
Tries
Text compression
Huffman code
Suggested reading (optional):
tries … Sedgewick, Ch. 15.2
approximation … Moffat, Ch. 9.4
Produced: 13 Apr 2023