Understanding what a subsequence is in a given text or string.
Consider the following example
Given sequence : [ A, G, O, R, T ]
We could have subsequences : ( [ A ], [ A, G ], [ A, G, R, T ], [ G, R, T ], …). In all these subsequences, the relative order of the letters remain similar to that of the original sequence [ A, G, O, R, T ].
If there are ‘n’ elements in a sequence, then the number of subsequences that can be obtained are 2 n. n C 0 + n C 1 + n C 2 + .. n C n = 2 n
Finding the longest common subsequence from two given sequences.
Consider the sequences of letters : Seq A : [ A, G, O, R, T ], Seq B : [ B, G, P, O, A, T ]
Sequence A | Sequence B | Common subsequence | Length |
---|---|---|---|
[ A, G, O, R, T ] | [ B, G, P, O, A, T ] | [ A ] | 1 |
[ A, G, O, R, T ] | [ B, G, P, O, A, T ] | [ G ] | 1 |
[ A, G, O, R, T ] | [ B, G, P, O, A, T ] | [ O ] | 1 |
[ A, G, O, R, T ] | [ B, G, P, O, A, T ] | [ A, T ] | 2 |
[ A, G, O, R, T ] | [ B, G, P, O, A, T ] | [ G, O ] | 2 |
[ A, G, O, R, T ] | [ B, G, P, O, A, T ] | [ G, T ] | 2 |
[ A, G, O, R, T ] | [ B, G, P, O, A, T ] | [ G, O, T ] | 3 |
Since the length of common subsequence [ G, O, T ] is 3, it is the longest common subsequence.
Recursive Approach
LCS ( seq_A , seq_B )
If the length of either sequence is 0, then
The length of LCS would be 0
If the last characters of the sequences seq_A & seq_B match, then
Example : [ A, G, O, R, T ] and [ B, G, P, O, A, T ] ( for every matching letter the length of the common subsequence increases by 1 )
The length of LCS would be 1 + LCS ( [ A, G, O, R ], [ B, G, P, O, A ] )
Length = 1 + LCS ( seq_A - 1, seq_B - 1 )
If the last characters of the sequences do not match, then
Example : [ A, G, O, R ] and [ B, G, P, O, A ]
The length of LCS would be max [ LCS ( [ A, G, O, R ], [ B, G, P, O ] ), LCS ( [ A, G, O ], [ B, G, P, O, A ] ) ]
Length = max [ LCS ( seq_A, seq_B - 1 ), LCS ( seq_A - 1, seq_B ) ]
Example, consider sequences [ A, G, O, R, T ] and [ B, G, P, O, A, T ]
The recursive algorithm runs in exponential time as identical subproblems are solved as we go down the recursion tree. The complexity of recursive algorithm reaches 2n in worst case when none of the characters in the two subsequences match.
Dynamic Programming Approach
Consider two subsequences A [ 0 .. a - 1 ] of length a and B [ 0 .. b - 1 ] of length b.
Note : The index of the first character of the two subsequences is 0.
We try to construct a table with rows = [ length (Sequence B) + 1 ] and columns = [ length (Sequence A) + 1 ].
Example : Seq A = [ A, G, O, R, T ], Seq B = [ B, G, P, O, A, T ].
We fill the table in bottom-up manner, i.e we try to
fill in the table for smaller sequences in Seq A and Seq B.
Base Case: Empty sequence All elements of R0 equal 0 indicating empty sequence A. All elements of C0 equal 0 indication empty sequence B.
Based on the recurrence relation described above
Case: Characters in the sequences match.
If the character in sequence A at position a-1 matches the character in sequence B at position b-1
then increase the length of LCS
i.e LCS length at location [ a ] [ b ] = 1 + LCS length at location[ a - 1 ] [ b - 1 ]
We see from the table that ( [ a - 1 ] [ b - 1 ] is the diagonal element )
Example : Consider subsequence [ A G ] in Seq A and subsequence [ B G ] in Seq B.
As the last character G in [ B G ] & [ A G ] match, we try to fill in table [ 2 ] [ 2 ] i.e we are finding the length of LCS in
subsequence [ B G ] ( of length 2 ) and subsequence [ A G ] ( of length 2 ).
Thus length of LCS in ( [ B G ] , [ A G ] ) = 1 + length of LCS in ( [ B ] ( of length 1 ), [ A ] ( of length 1 ).
i.e when the last character match we have LCS [ 2 ] [ 2 ] = 1 + LCS [ 1 ] [ 1 ]
Case: Characters in the sequences do not match.
If the character in sequence A at position a - 1 does not match the character in sequence B at position b - 1
then the length of LCS remains the same (i.e the earlier found length of LCS).
i.e LCS at location [ a ] [ b ] = max ( LCS at location [ a - 1 ] [ b ], LCS at location [ a ] [ b - 1 ] )
location [ a - 1 ] [ b ] is the element above and location [ a ] [ b - 1 ] is the element on the left
Example : Consider subsequence [ A G O ] in Seq A and subsequence [ B G ] in Seq B.
As the last character G in [ B G ] does not match with the last character O in [ A G O ] , we try to fill in table [ 2 ] [ 3 ] i.e we are finding the length of LCS in
subsequence [ B G ] ( of length 2 ) and subsequence [ A G O ] ( of length 3 ).
Thus length of LCS in ( [ B G ] , [ A G O ] ) = max length of LCS in ( [ B G ] ( length 2 ), [ A G ] ( length 2 ) or [ B ] ( length 1 ), [ A G O ] (of length 3) )
i.e when the last character does not match we have LCS [ 2 ] [ 3 ] = max ( LCS [ 2 ] [ 2 ], LCS [ 1 ] [ 3 ] ) = max ( 1, 0 ) = 1
Time complexity : O ( N * M ), where N and M are the lengths of the two sequences.
Longest common subsequence implementation
from typing import List # For annotations
# A slower recursive function to get the length of the longest common subsequence
def LCSLength (seq1 : str, seq2 : str) -> int :
sz1 = len(seq1)
sz2 = len(seq2)
if (sz1 == 0 or sz2 == 0) :
return 0
elif (seq1[sz1 - 1] == seq2[sz2 - 1]) : # Last characters in the two sequences match
return 1 + LCSLength ( seq1[:-1], seq2[:-1] )
else : # Last characters in the two sequences do not match
return max ( LCSLength (seq1, seq2[:-1]), LCSLength (seq1[:-1], seq2) )
# A faster recursive function to get the length of the longest common subsequence
def LCSLength_Faster (seq1 : str, seq2 : str, table : List[List[int]]) -> int :
sz1 = len(seq1)
sz2 = len(seq2)
if (sz1 == 0 or sz2 == 0) :
return 0
# A subproblem with sz1 and sz2 has been solved already. Return the result
if (table[sz1][sz2] != -1) :
return table[sz1][sz2]
result = 0
if (seq1[sz1 - 1] == seq2[sz2 - 1]) : # Last characters in the two sequences match
result = 1 + LCSLength_Faster (seq1[:-1], seq2[:-1], table)
else : # Last characters in the two sequences do not match
result = max ( LCSLength_Faster (seq1, seq2[:-1], table),
LCSLength_Faster (seq1[:-1], seq2, table) )
table[sz1][sz2] = result
return result
# The below function uses dynamic programming to fetch the length of the LCS
def LCSLength_DP (seq1 : str, seq2 : str) :
sz1 = len(seq1)
sz2 = len(seq2)
length_lcs = 0
# Create a DP table of size [sz1 + 1][sz2 + 1]
LCSTable = [0] * ( sz1 + 1 )
for i in range ( sz1 + 1 ) :
LCSTable[i] = [0] * ( sz2 + 1 )
# Finding the length of LCS
for a in range (sz1 + 1) :
for b in range (sz2 + 1) :
if (a == 0 or b == 0) :
LCSTable[a][b] = 0
elif ( seq1[a-1] == seq2[b-1] ) :
LCSTable[a][b] = 1 + LCSTable[a-1][b-1]
else :
LCSTable[a][b] = max ( LCSTable[a-1][b], LCSTable[a][b-1] )
length_lcs = LCSTable[sz1][sz2]
# Constructing the LCS
i = sz1
j = sz2
lcs = ""
while ( i > 0 and j > 0 ) :
if ( seq1[i-1] == seq2[j-1] ) :
lcs += seq1[i-1] # Move diagonally towards top left
i -= 1
j -= 1
elif ( LCSTable[i-1][j] > LCSTable[i][j-1] ) :
i -= 1 # Move upwards
else :
j -= 1 # Move towards left
# Reverse lcs to find the actual sequence
lcs = lcs[::-1]
return length_lcs, lcs
list_seq_a = [ "ATPLBCCXWKQ", "AGORT" ]
list_seq_b = [ "FTCMXACWZYKQ", "BGPOAT" ]
for i in range(len(list_seq_a)) :
# Initialize result table with '-1'
rows = len(list_seq_a[i])
cols = len(list_seq_b[i])
table = [-1] * ( rows + 1 )
for r in range ( rows + 1 ) :
table[r] = [-1] * ( cols + 1 )
print ("\nSequence 1 : " + list_seq_a[i] + " Sequence 2 : " + list_seq_b[i])
print ("[Slower Recursion] LCS Length : " + str(LCSLength (list_seq_a[i], list_seq_b[i])))
print ("[Faster Recursion] LCS Length : " + str(LCSLength_Faster (list_seq_a[i], list_seq_b[i], table)))
length, lcs = LCSLength_DP (list_seq_a[i], list_seq_b[i])
print("[DP] LCS Length : " + str(length) + " Sequence : " + lcs)
Output
Sequence 1 : ATPLBCCXWKQ Sequence 2 : FTCMXACWZYKQ
[Slower Recursion] LCS Length : 6
[Faster Recursion] LCS Length : 6
[DP] LCS Length : 6 Sequence : TCXWKQ
Sequence 1 : AGORT Sequence 2 : BGPOAT
[Slower Recursion] LCS Length : 3
[Faster Recursion] LCS Length : 3
[DP] LCS Length : 3 Sequence : GOT
#include<iostream>
#include<vector>
#include<string>
#include<algorithm>
using namespace std;
/* A slower recursive function to get the length of the longest common subsequence */
int LCSLength (string seq1, string seq2) {
int sz1 = seq1.size();
int sz2 = seq2.size();
if (sz1 == 0 or sz2 == 0) {
return 0;
} else if (seq1.at(sz1 - 1) == seq2.at(sz2 - 1)) { // Last characters in the two sequences match
return 1 + LCSLength ( seq1.substr(0, sz1-1), seq2.substr(0, sz2-1) );
} else { // Last characters in the two sequences do not match
return max ( LCSLength (seq1.substr(0, sz1), seq2.substr(0, sz2-1)),
LCSLength (seq1.substr(0, sz1-1), seq2.substr(0, sz2)));
}
}
/* A faster recursive function to get the length of the longest common subsequence */
int LCSLength_Faster (string seq1, string seq2, vector<vector<int>> & table) {
int sz1 = seq1.size();
int sz2 = seq2.size();
if (sz1 == 0 or sz2 == 0) {
return 0;
}
// A subproblem with sz1 and sz2 has been solved already. Return the result
if (table[sz1][sz2] != -1) {
return table[sz1][sz2];
}
int result;
if (seq1.at(sz1 - 1) == seq2.at(sz2 - 1)) { // Last characters in the two sequences match
result = 1 + LCSLength_Faster (seq1.substr(0, sz1-1), seq2.substr(0, sz2-1), table);
} else { // Last characters in the two sequences do not match
result = max ( LCSLength_Faster (seq1.substr(0, sz1), seq2.substr(0, sz2-1), table),
LCSLength_Faster (seq1.substr(0, sz1-1), seq2.substr(0, sz2), table));
}
table[sz1][sz2] = result;
return result;
}
/* The below function uses dynamic programming to fetch the length of the LCS */
int LCSLength_DP (string seq1, string seq2, string& lcs) {
int sz1 = seq1.size();
int sz2 = seq2.size();
int length_lcs = 0;
int LCSTable[sz1+1][sz2+1];
// Finding the length of LCS
for (int a = 0; a <= sz1; a++) {
for (int b = 0; b <= sz2; b++) {
if (a == 0 or b == 0) {
LCSTable[a][b] = 0;
} else if (seq1.at(a-1) == seq2.at(b-1)) {
LCSTable[a][b] = 1 + LCSTable[a-1][b-1];
} else {
LCSTable[a][b] = max (LCSTable[a-1][b], LCSTable[a][b-1]);
}
}
}
length_lcs = LCSTable[sz1][sz2];
// Finding the LCS
int i = sz1, j = sz2;
while (i > 0 and j > 0) {
if (seq1[i-1] == seq2[j-1]) {
lcs += seq1[i-1]; // Move diagonally towards top left
i--, j--;
} else if (LCSTable[i-1][j] > LCSTable[i][j-1]) {
i--; // Move upwards
} else {
j--; // Move towards left
}
}
reverse (lcs.begin(), lcs.end());
return length_lcs;
}
int main() {
vector<string> vector_a = { "ATPLBCCXWKQ", "AGORT" };
vector<string> vector_b = { "FTCMXACWZYKQ", "BGPOAT" };
for (int i = 0; i < 2; i++) {
string lcs = "";
// Initialize result table with '-1'
vector<vector<int>> table (vector_a[i].size()+1, vector<int>(vector_b[i].size()+1, -1));
cout << "Sequence 1 : " << vector_a[i] << " Sequence 2 : " << vector_b[i] << endl;
cout << "[Slower Recursion] LCS Length : " << LCSLength (vector_a[i], vector_b[i]) << endl;
cout << "[Faster Recursion] LCS Length : " << LCSLength_Faster (vector_a[i], vector_b[i], table) << endl;
cout << "[DP] LCS Length : " << LCSLength_DP (vector_a[i], vector_b[i], lcs) << endl;
cout << "Longest Common Subsequence (LCS) : " << lcs << endl << endl;
}
return 0;
}
Output
Sequence 1 : ATPLBCCXWKQ Sequence 2 : FTCMXACWZYKQ
[Slower Recursion] LCS Length : 6
[Faster Recursion] LCS Length : 6
[DP] LCS Length : 6
Longest Common Subsequence (LCS) : TCXWKQ
Sequence 1 : AGORT Sequence 2 : BGPOAT
[Slower Recursion] LCS Length : 3
[Faster Recursion] LCS Length : 3
[DP] LCS Length : 3
Longest Common Subsequence (LCS) : GOT
import java.util.Arrays;
import java.util.List;
import java.lang.StringBuilder;
class LCS {
/* A slower recursive implementation to fetch the length of longest common subsequence */
int GetLength_LCS (String seq1, String seq2) {
int sz1 = seq1.length();
int sz2 = seq2.length();
if (sz1 == 0 || sz2 == 0) {
return 0;
} else if (seq1.charAt(sz1 - 1) == seq2.charAt(sz2 - 1)) { // Last characters in the two sequences match
return 1 + GetLength_LCS ( seq1.substring(0, sz1-1), seq2.substring(0, sz2-1) );
} else { // Last characters in the two sequences do not match
return Math.max (GetLength_LCS (seq1.substring(0, sz1), seq2.substring(0, sz2-1)),
GetLength_LCS (seq1.substring(0, sz1-1), seq2.substring(0, sz2)));
}
}
/* A faster recursive implementation to fetch the length of longest common subsequence */
int GetLength_FasterLCS (String seq1, String seq2, int table[][]) {
int sz1 = seq1.length();
int sz2 = seq2.length();
if (sz1 == 0 || sz2 == 0) {
return 0;
}
// A sub-problem with sz1 and sz2 has been solved already. Return the result
if (table[sz1][sz2] != -1) {
return table[sz1][sz2];
}
int result;
if (seq1.charAt(sz1 - 1) == seq2.charAt(sz2 - 1)) { // Last characters in the two sequences match
result = 1 + GetLength_FasterLCS (seq1.substring(0, sz1-1), seq2.substring(0, sz2-1), table);
} else { // Last characters in the two sequences do not match
result = Math.max(GetLength_FasterLCS (seq1.substring(0, sz1), seq2.substring(0, sz2-1), table),
GetLength_FasterLCS (seq1.substring(0, sz1-1), seq2.substring(0, sz2), table));
}
table[sz1][sz2] = result;
return result;
}
/* A dynamic programming implementation to fetch the length of longest common subsequence (LCS) */
int GetLength_LCSDP (String seq1, String seq2, StringBuilder lcs) {
int sz1 = seq1.length();
int sz2 = seq2.length();
int length_lcs = 0;
int LCSTable[][] = new int[sz1+1][sz2+1];
// Finding the length of LCS
for (int a = 0; a <= sz1; a++) {
for (int b = 0; b <= sz2; b++) {
if (a == 0 || b == 0) {
LCSTable[a][b] = 0;
} else if (seq1.charAt(a-1) == seq2.charAt(b-1)) {
LCSTable[a][b] = 1 + LCSTable[a-1][b-1];
} else {
LCSTable[a][b] = Math.max(LCSTable[a-1][b], LCSTable[a][b-1]);
}
}
}
length_lcs = LCSTable[sz1][sz2];
// Finding the LCS
int i = sz1, j = sz2;
while (i > 0 && j > 0) {
if (seq1.charAt(i-1) == seq2.charAt(j-1)) {
lcs.append(seq1.charAt(i-1)); // Move diagonally towards top left
i--;
j--;
} else if (LCSTable[i-1][j] > LCSTable[i][j-1]) {
i--; // Move upwards
} else {
j--; // Move towards left
}
}
lcs.reverse();
return length_lcs;
}
public static void main ( String[] args ) {
List<String> list_a = List.of("ATPLBCCXWKQR", "AGORTRE");
List<String> list_b = List.of("FTCMXACWZYKQR", "BGPOATRT");
LCS obj = new LCS();
for (int i = 0; i < 2; i++) {
StringBuilder lcs = new StringBuilder();
int table[][] = new int[list_a.get(i).length()+1][list_b.get(i).length()+1];
// Initialize 2 dimensional array with '-1'
for (int[] arr : table) {
Arrays.fill(arr, -1);
}
System.out.println("Sequence 1 : " + list_a.get(i) + ", Sequence 2 : " + list_b.get(i));
System.out.println("[Slower Recursion] LCS length : " + obj.GetLength_LCS (list_a.get(i), list_b.get(i)));
System.out.println("[Faster Recursion] LCS length : " + obj.GetLength_FasterLCS (list_a.get(i), list_b.get(i), table));
System.out.println("[DP] LCS Length : " + obj.GetLength_LCSDP (list_a.get(i), list_b.get(i), lcs));
System.out.println("Longest Common Subsequence (LCS) : " + lcs + "\n");
}
}
}
Output
Sequence 1 : ATPLBCCXWKQR, Sequence 2 : FTCMXACWZYKQR
[Slower Recursion] LCS length : 7
[Faster Recursion] LCS length : 7
[DP] LCS Length : 7
Longest Common Subsequence (LCS) : TCXWKQR
Sequence 1 : AGORTRE, Sequence 2 : BGPOATRT
[Slower Recursion] LCS length : 4
[Faster Recursion] LCS length : 4
[DP] LCS Length : 4
Longest Common Subsequence (LCS) : GOTR