Hence, there are no more than 2·m iterations of the while-loop
⇒ failure function can be computed in O(m) time
COMP9024 22T2 ♢Dynamic Data Structures♢ [36/66]
❖ Boyer-Moore vs KMP
Boyer-Moore algorithm
decides how far to jump ahead based on the mismatched character in the text
works best on large alphabets and natural language texts (e.g. English)
Knuth-Morris-Pratt algorithm
uses information embodied in the pattern to determine where the next match could begin
works best on small alphabets (e.g. A,C,G,T)
For the keen: The article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the time is inversely proportional to the size of the alphabet
❖ Word Matching With Tries
❖ Preprocessing Strings
Preprocessing the pattern speeds up pattern matching queries
After preprocessing P, the KMP algorithm performs pattern matching in time proportional to the length of the text
If the text is large, immutable and searched for often (e.g., works by Shakespeare)
we can preprocess the text instead of the pattern
❖ Preprocessing Strings (cont)
A trie …
is a compact data structure for representing a set of strings
e.g. all the words in a text, a dictionary etc.
supports pattern matching queries in time proportional to the pattern size
Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
❖ Tries
Tries are trees organised using parts of keys (rather than whole keys)
[Diagram:Pic/trie-example.png]
❖ Tries (cont)
Each node in a trie …
contains one part of a key (typically one character)
may have up to 26 children
may be tagged as a "finishing" node
but even "finishing" nodes may have children
Depth d of trie = length of longest key value
Cost of searching is O(d) (independent of n, the number of keys)
❖ Tries (cont)
Possible trie representation:
#include <stdbool.h>

#define ALPHABET_SIZE 26
typedef struct Node *Trie;
typedef struct Node {
bool finish; // last char in key?
Item data; // no Item if !finish
Trie child[ALPHABET_SIZE];
} Node;
typedef char *Key;
❖ Tries (cont)
Note: Can also use BST-like nodes for a more space-efficient implementation of tries
[Diagram:Pic/trie-as-bst.png]
❖ Trie Operations
Basic operations on tries:
search for a key
insert a key
❖ Trie Operations (cont)
[Diagram:Pic/trie-search.png]
❖ Trie Operations (cont)
Traversing a path, consuming the key character by character:
find(trie,key):
| Input trie, key
| Output pointer to element in trie if key found
| NULL otherwise
|
| node=trie
| for each char in key do
| | if node.child[char] exists then
| | node=node.child[char] // move down one level
| | else
| | return NULL
| | end if
| end for
| if node.finish then // "finishing" node reached?
| return node
| else
| return NULL
| end if
❖ Trie Operations (cont)
Insertion into Trie:
insert(trie,item,key):
| Input trie, item with key of length m
| Output trie with item inserted
|
| if trie is empty then
|    t=new trie node
| else
|    t=trie
| end if
| if m=0 then
| t.finish=true, t.data=item
| else
| t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
| end if
| return t
❖ Exercise: Trie Insertion
Insert cat, cats and carer into this trie:
[Diagram:Pic/trie1.png]
[Diagram:Pic/trie2.png]
❖ Exercise: Trie Insertion (cont)
Analysis of standard tries:
O(n) space
insertion and search in time O(m)
n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
m … size of the string parameter of the operation (the "key")
❖ Word Matching With Tries
❖ Word Matching with Tries
Preprocessing the text:
Insert all searchable words of a text into a trie
Each finishing node stores the occurrence(s) of the associated word in the text
❖ Word Matching with Tries (cont)
Example text and corresponding trie of searchable words:
[Diagram:Pic/text-trie-example.png]
❖ Text Compression
❖ Text Compression
Problem: Efficiently encode a given string X by a smaller string Y
Applications:
Save memory and/or bandwidth
Huffman's algorithm
computes frequency f(c) for each character c
encodes high-frequency characters with short code
no code word is a prefix of another code word
uses optimal encoding tree to determine the code words
❖ Text Compression (cont)
Code … mapping of each character to a binary code word
Prefix code … binary code such that no code word is prefix of another code word
prefix code: each code word can be recognised as soon as its last bit is read, so characters can be decoded without any marker/delimiter between them
e.g., set {A, B, C, D}. A prefix code for this set might be:
A = 00
B = 01
C = 10
D = 111
Then: 010011110 decodes uniquely as BADC (01·00·111·10)
Encoding tree …
represents a prefix code
each leaf stores a character
code word given by the path from the root to the leaf (0 for left child, 1 for right child)
❖ Text Compression (cont)
Example:
[Diagram:Pic/encoding-tree.png]
❖ Text Compression (cont)
Text compression problem
Given a text T, find a prefix code that yields the shortest encoding of T
short codewords for frequent characters
long code words for rare characters
❖ Text Compression (cont)
Example: T = abracadabra
[Diagram:Pic/encoding-trees.png]
T1 requires 29 bits to encode text T,
T2 requires 24 bits
01011011010000101001011011010 vs 001011000100001100101100
❖ Text Compression (cont)
Huffman's algorithm
computes frequency f(c) for each character
successively combines pairs of lowest-frequency characters to build encoding tree "bottom-up"
Example: abracadabra
[Diagram:Pic/huffman-tree.png]
❖ Huffman Code
Huffman's algorithm using priority queue:
HuffmanCode(T):
| Input string T of size n
| Output optimal encoding tree for T
|
| compute frequency array
| Q=new priority queue
| for all characters c do
| | t=new single-node tree storing c
| | join(Q,t) with frequency(c) as key
| end for
| while |Q|≥2 do
| | f1=Q.minKey(), T1=leave(Q)
| | f2=Q.minKey(), T2=leave(Q)
| | t=new tree node with subtrees T1 and T2
| | join(Q,t) with f1+f2 as key
| end while
| return leave(Q)
❖ Exercise: Huffman Code
Construct a Huffman tree for: a fast runner need never be afraid of the dark
[Diagram:Pic/larger-huffman-tree-array.png]
[Diagram:Pic/larger-huffman-tree.png]
❖ Exercise: Huffman Code (cont)
Analysis of Huffman's algorithm:
O(n+d·log d) time
n … length of the input text T
d … number of distinct characters in T
❖ Summary
Alphabets and words
Pattern matching
Boyer-Moore, Knuth-Morris-Pratt
Tries
Text compression
Huffman code
Suggested reading (optional):
tries … Sedgewick, Ch. 15.2
approximation … Moffat, Ch. 9.4
Produced: 13 Apr 2023