Longest Common Subsequence

Understanding what a subsequence is in a given text or string.

  • A subsequence is a sequence of characters or letters obtained from a given sequence by excluding a number of elements.
  • The relative order of elements in a subsequence remains the same as that of the original sequence.

Consider the following example
Given sequence : [ A, G, O, R, T ]
We could have subsequences : ( [ A ], [ A, G ], [ A, G, R, T ], [ G, R, T ], …). In all these subsequences, the relative order of the letters remain similar to that of the original sequence [ A, G, O, R, T ].

If there are ‘n’ elements in a sequence, then the number of subsequences that can be obtained are 2 n.
n C 0 + n C 1 + n C 2 + .. n C n = 2 n

Finding the longest common subsequence from two given sequences.
Consider the sequences of letters : Seq A : [ A, G, O, R, T ], Seq B : [ B, G, P, O, A, T ]

Sequence A Sequence B Common subsequence Length
[ A, G, O, R, T ] [ B, G, P, O, A, T ] [ A ] 1
[ A, G, O, R, T ] [ B, G, P, O, A, T ] [ G ] 1
[ A, G, O, R, T ] [ B, G, P, O, A, T ] [ O ] 1
[ A, G, O, R, T ] [ B, G, P, O, A, T ] [ A, T ] 2
[ A, G, O, R, T ] [ B, G, P, O, A, T ] [ G, O ] 2
[ A, G, O, R, T ] [ B, G, P, O, A, T ] [ G, T ] 2
[ A, G, O, R, T ] [ B, G, P, O, A, T ] [ G, O, T ] 3

Since the length of common subsequence [ G, O, T ] is 3, it is the longest common subsequence.



Recursive Approach

LCS ( seq_A , seq_B )
If the length of either sequence is 0, then
      The length of LCS would be 0
If the last characters of the sequences seq_A & seq_B match, then
      Example : [ A, G, O, R, T ] and [ B, G, P, O, A, T ] ( for every matching letter the length of the common subsequence increases by 1 )
      The length of LCS would be 1 + LCS ( [ A, G, O, R ], [ B, G, P, O, A ] )
      Length = 1 + LCS ( seq_A - 1, seq_B - 1 )
If the last characters of the sequences do not match, then
      Example : [ A, G, O, R ] and [ B, G, P, O, A ]
      The length of LCS would be max [ LCS ( [ A, G, O, R ], [ B, G, P, O ] ), LCS ( [ A, G, O ], [ B, G, P, O, A ] ) ]
      Length = max [ LCS ( seq_A, seq_B - 1 ), LCS ( seq_A - 1, seq_B ) ]

Example, consider sequences [ A, G, O, R, T ] and [ B, G, P, O, A, T ]
LCS Recursive

The recursive algorithm runs in exponential time as identical subproblems are solved as we go down the recursion tree. The complexity of recursive algorithm reaches 2n in worst case when none of the characters in the two subsequences match.



Dynamic Programming Approach

Consider two subsequences A [ 0 .. a - 1 ] of length a and B [ 0 .. b - 1 ] of length b.
Note : The index of the first character of the two subsequences is 0.

We try to construct a table with rows = [ length (Sequence B) + 1 ] and columns = [ length (Sequence A) + 1 ].
Example : Seq A = [ A, G, O, R, T ], Seq B = [ B, G, P, O, A, T ].
We fill the table in bottom-up manner, i.e we try to fill in the table for smaller sequences in Seq A and Seq B.

Base Case: Empty sequence
      All elements of R0 equal 0 indicating empty sequence A.
      All elements of C0 equal 0 indication empty sequence B.

Based on the recurrence relation described above

Case: Characters in the sequences match.
      If the character in sequence A at position a-1 matches the character in sequence B at position b-1
      then increase the length of LCS
      i.e LCS length at location [ a ] [ b ] = 1 + LCS length at location[ a - 1 ] [ b - 1 ]
      We see from the table that ( [ a - 1 ] [ b - 1 ] is the diagonal element )

      Example : Consider subsequence [ A G ] in Seq A and subsequence [ B G ] in Seq B.
      As the last character G in [ B G ] & [ A G ] match, we try to fill in table [ 2 ] [ 2 ] i.e we are finding the length of LCS in
      subsequence [ B G ] ( of length 2 ) and subsequence [ A G ] ( of length 2 ).
      Thus length of LCS in ( [ B G ] , [ A G ] ) = 1 + length of LCS in ( [ B ] ( of length 1 ), [ A ] ( of length 1 ).
      i.e when the last character match we have LCS [ 2 ] [ 2 ] = 1 + LCS [ 1 ] [ 1 ]

Case: Characters in the sequences do not match.
      If the character in sequence A at position a - 1 does not match the character in sequence B at position b - 1
      then the length of LCS remains the same (i.e the earlier found length of LCS).
      i.e LCS at location [ a ] [ b ] = max ( LCS at location [ a - 1 ] [ b ], LCS at location [ a ] [ b - 1 ] )
      location [ a - 1 ] [ b ] is the element above and location [ a ] [ b - 1 ] is the element on the left

      Example : Consider subsequence [ A G O ] in Seq A and subsequence [ B G ] in Seq B.
      As the last character G in [ B G ] does not match with the last character O in [ A G O ] , we try to fill in table [ 2 ] [ 3 ] i.e we are finding the length of LCS in
      subsequence [ B G ] ( of length 2 ) and subsequence [ A G O ] ( of length 3 ).
      Thus length of LCS in ( [ B G ] , [ A G O ] ) = max length of LCS in ( [ B G ] ( length 2 ), [ A G ] ( length 2 ) or [ B ] ( length 1 ), [ A G O ] (of length 3) )
      i.e when the last character does not match we have LCS [ 2 ] [ 3 ] = max ( LCS [ 2 ] [ 2 ], LCS [ 1 ] [ 3 ] ) = max ( 1, 0 ) = 1

Longest_Common_Subsequence_DP

Time complexity : O ( N * M ), where N and M are the lengths of the two sequences.



Longest common subsequence implementation

from typing import List # For annotations

# A slower recursive function to get the length of the longest common subsequence
def LCSLength (seq1 : str, seq2 : str) -> int :

    sz1 = len(seq1)
    sz2 = len(seq2)

    if (sz1 == 0 or sz2 == 0) :
       return 0
    elif (seq1[sz1 - 1] == seq2[sz2 - 1]) : # Last characters in the two sequences match
       return 1 + LCSLength ( seq1[:-1], seq2[:-1] )
    else : # Last characters in the two sequences do not match
       return max ( LCSLength (seq1, seq2[:-1]), LCSLength (seq1[:-1], seq2) )

# A faster recursive function to get the length of the longest common subsequence
def LCSLength_Faster (seq1 : str, seq2 : str, table : List[List[int]]) -> int :

    sz1 = len(seq1)
    sz2 = len(seq2)

    if (sz1 == 0 or sz2 == 0) :
       return 0

    # A subproblem with sz1 and sz2 has been solved already. Return the result
    if (table[sz1][sz2] != -1) :
       return table[sz1][sz2]

    result = 0

    if (seq1[sz1 - 1] == seq2[sz2 - 1]) : # Last characters in the two sequences match
       result = 1 + LCSLength_Faster (seq1[:-1], seq2[:-1], table)
    else : # Last characters in the two sequences do not match
       result = max ( LCSLength_Faster (seq1, seq2[:-1], table),
                      LCSLength_Faster (seq1[:-1], seq2, table) )

    table[sz1][sz2] = result
    return result

# The below function uses dynamic programming to fetch the length of the LCS
def LCSLength_DP (seq1 : str, seq2 : str) :

    sz1 = len(seq1)
    sz2 = len(seq2)
    length_lcs = 0

    # Create a DP table of size [sz1 + 1][sz2 + 1]
    LCSTable = [0] * ( sz1 + 1 )
    for i in range ( sz1 + 1 ) :
        LCSTable[i] = [0] * ( sz2 + 1 )

    # Finding the length of LCS
    for a in range (sz1 + 1) :
        for b in range (sz2 + 1) :
            if (a == 0 or b == 0) :
                LCSTable[a][b] = 0
            elif ( seq1[a-1] == seq2[b-1] ) :
                LCSTable[a][b] = 1 + LCSTable[a-1][b-1]
            else :
                LCSTable[a][b] = max ( LCSTable[a-1][b], LCSTable[a][b-1] )

    length_lcs = LCSTable[sz1][sz2]

    # Constructing the LCS
    i = sz1
    j = sz2
    lcs = ""
    while ( i > 0 and j > 0 ) :
       if ( seq1[i-1] == seq2[j-1] ) :
          lcs += seq1[i-1]  # Move diagonally towards top left 
          i -= 1
          j -= 1
       elif ( LCSTable[i-1][j] > LCSTable[i][j-1] ) :
          i -= 1 # Move upwards
       else :
          j -= 1 # Move towards left

    # Reverse lcs to find the actual sequence  
    lcs = lcs[::-1]

    return length_lcs, lcs


list_seq_a = [ "ATPLBCCXWKQ", "AGORT" ]
list_seq_b = [ "FTCMXACWZYKQ", "BGPOAT" ]

for i in range(len(list_seq_a)) :

    # Initialize result table with '-1'
    rows = len(list_seq_a[i])
    cols = len(list_seq_b[i])

    table = [-1] * ( rows + 1 )
    for r in range ( rows + 1 ) :
        table[r] = [-1] * ( cols + 1 )

    print ("\nSequence 1 : " + list_seq_a[i] + " Sequence 2 : " + list_seq_b[i])
    print ("[Slower Recursion] LCS Length : " + str(LCSLength (list_seq_a[i], list_seq_b[i])))
    print ("[Faster Recursion] LCS Length : " + str(LCSLength_Faster (list_seq_a[i], list_seq_b[i], table)))
    length, lcs = LCSLength_DP (list_seq_a[i], list_seq_b[i])
    print("[DP] LCS Length : " + str(length) + " Sequence : " + lcs)

Output

Sequence 1 : ATPLBCCXWKQ Sequence 2 : FTCMXACWZYKQ
[Slower Recursion] LCS Length : 6
[Faster Recursion] LCS Length : 6
[DP] LCS Length : 6 Sequence : TCXWKQ

Sequence 1 : AGORT Sequence 2 : BGPOAT
[Slower Recursion] LCS Length : 3
[Faster Recursion] LCS Length : 3
[DP] LCS Length : 3 Sequence : GOT
#include<iostream>
#include<vector>
#include<string>
#include<algorithm>

using namespace std;

/* A slower recursive function to get the length of the longest common subsequence */
int LCSLength (string seq1, string seq2) {

    int sz1 = seq1.size();
    int sz2 = seq2.size();

    if (sz1 == 0 or sz2 == 0) {
       return 0;
    } else if (seq1.at(sz1 - 1) == seq2.at(sz2 - 1)) { // Last characters in the two sequences match
       return 1 + LCSLength ( seq1.substr(0, sz1-1), seq2.substr(0, sz2-1) );
    } else { // Last characters in the two sequences do not match
       return max ( LCSLength (seq1.substr(0, sz1), seq2.substr(0, sz2-1)),
                    LCSLength (seq1.substr(0, sz1-1), seq2.substr(0, sz2)));
    }
}

/* A faster recursive function to get the length of the longest common subsequence */
int LCSLength_Faster (string seq1, string seq2, vector<vector<int>> & table) {

    int sz1 = seq1.size();
    int sz2 = seq2.size();

    if (sz1 == 0 or sz2 == 0) {
       return 0;
    }

    // A subproblem with sz1 and sz2 has been solved already. Return the result
    if (table[sz1][sz2] != -1) {
       return table[sz1][sz2];
    }

    int result;

    if (seq1.at(sz1 - 1) == seq2.at(sz2 - 1)) { // Last characters in the two sequences match
       result = 1 + LCSLength_Faster (seq1.substr(0, sz1-1), seq2.substr(0, sz2-1), table);
    } else { // Last characters in the two sequences do not match
       result = max ( LCSLength_Faster (seq1.substr(0, sz1), seq2.substr(0, sz2-1), table),
                      LCSLength_Faster (seq1.substr(0, sz1-1), seq2.substr(0, sz2), table));
    }

    table[sz1][sz2] = result;
    return result;
}

/* The below function uses dynamic programming to fetch the length of the LCS */
int LCSLength_DP (string seq1, string seq2, string& lcs) {

    int sz1 = seq1.size();
    int sz2 = seq2.size();
    int length_lcs = 0;

    int LCSTable[sz1+1][sz2+1];

    // Finding the length of LCS
    for (int a = 0; a <= sz1; a++) {
        for (int b = 0; b <= sz2; b++) {
            if (a == 0 or b == 0) {
                LCSTable[a][b] = 0;
            } else if (seq1.at(a-1) == seq2.at(b-1)) {
                LCSTable[a][b] = 1 + LCSTable[a-1][b-1];
            } else {
                LCSTable[a][b] = max (LCSTable[a-1][b], LCSTable[a][b-1]);
            }
        }
    }

    length_lcs = LCSTable[sz1][sz2];

    // Finding the LCS
    int i = sz1, j = sz2;
    while (i > 0 and j > 0) {
       if (seq1[i-1] == seq2[j-1]) {
          lcs += seq1[i-1];  // Move diagonally towards top left 
          i--, j--;
       } else if (LCSTable[i-1][j] > LCSTable[i][j-1]) {
          i--; // Move upwards
       } else {
          j--; // Move towards left
       }
    }

    reverse (lcs.begin(), lcs.end());
    return length_lcs;
}

int main() {

    vector<string> vector_a = { "ATPLBCCXWKQ", "AGORT" };
    vector<string> vector_b = { "FTCMXACWZYKQ", "BGPOAT" };

    for (int i = 0; i < 2; i++) {

        string lcs   = "";
        // Initialize result table with '-1'
        vector<vector<int>> table (vector_a[i].size()+1, vector<int>(vector_b[i].size()+1, -1));

        cout << "Sequence 1 : " << vector_a[i] << " Sequence 2 : " << vector_b[i] << endl;
        cout << "[Slower Recursion] LCS Length : " << LCSLength (vector_a[i], vector_b[i]) << endl;
        cout << "[Faster Recursion] LCS Length : " << LCSLength_Faster (vector_a[i], vector_b[i], table) << endl;
        cout << "[DP] LCS Length : " << LCSLength_DP (vector_a[i], vector_b[i], lcs) << endl;
        cout << "Longest Common Subsequence (LCS) : " << lcs << endl << endl;
    }

    return 0;
}

Output

Sequence 1 : ATPLBCCXWKQ Sequence 2 : FTCMXACWZYKQ
[Slower Recursion] LCS Length : 6
[Faster Recursion] LCS Length : 6
[DP] LCS Length : 6
Longest Common Subsequence (LCS) : TCXWKQ

Sequence 1 : AGORT Sequence 2 : BGPOAT
[Slower Recursion] LCS Length : 3
[Faster Recursion] LCS Length : 3
[DP] LCS Length : 3
Longest Common Subsequence (LCS) : GOT
import java.util.Arrays;
import java.util.List;
import java.lang.StringBuilder;

class LCS {

    /* A slower recursive implementation to fetch the length of longest common subsequence */
    int GetLength_LCS (String seq1, String seq2) {

        int sz1 = seq1.length();
        int sz2 = seq2.length();

        if (sz1 == 0 || sz2 == 0) {
           return 0;
        } else if (seq1.charAt(sz1 - 1) == seq2.charAt(sz2 - 1)) { // Last characters in the two sequences match
           return 1 + GetLength_LCS ( seq1.substring(0, sz1-1), seq2.substring(0, sz2-1) );
        } else { // Last characters in the two sequences do not match
           return Math.max (GetLength_LCS (seq1.substring(0, sz1), seq2.substring(0, sz2-1)),
                            GetLength_LCS (seq1.substring(0, sz1-1), seq2.substring(0, sz2)));
        }
    }

    /* A faster recursive implementation to fetch the length of longest common subsequence */
    int GetLength_FasterLCS (String seq1, String seq2, int table[][]) {

        int sz1 = seq1.length();
        int sz2 = seq2.length();

        if (sz1 == 0 || sz2 == 0) {
           return 0;
        }

        // A sub-problem with sz1 and sz2 has been solved already. Return the result
        if (table[sz1][sz2] != -1) {
           return table[sz1][sz2];
        }

        int result;

        if (seq1.charAt(sz1 - 1) == seq2.charAt(sz2 - 1)) { // Last characters in the two sequences match
           result = 1 + GetLength_FasterLCS (seq1.substring(0, sz1-1), seq2.substring(0, sz2-1), table);
        } else { // Last characters in the two sequences do not match
           result = Math.max(GetLength_FasterLCS (seq1.substring(0, sz1), seq2.substring(0, sz2-1), table),
                             GetLength_FasterLCS (seq1.substring(0, sz1-1), seq2.substring(0, sz2), table));
        }

        table[sz1][sz2] = result;
        return result;
    }

    /* A dynamic programming implementation to fetch the length of longest common subsequence (LCS) */
    int GetLength_LCSDP (String seq1, String seq2, StringBuilder lcs) {

        int sz1 = seq1.length();
        int sz2 = seq2.length();
        int length_lcs = 0;

        int LCSTable[][] = new int[sz1+1][sz2+1];

        // Finding the length of LCS
        for (int a = 0; a <= sz1; a++) {
            for (int b = 0; b <= sz2; b++) {
                if (a == 0 || b == 0) {
                    LCSTable[a][b] = 0;
                } else if (seq1.charAt(a-1) == seq2.charAt(b-1)) {
                    LCSTable[a][b] = 1 + LCSTable[a-1][b-1];
                } else {
                    LCSTable[a][b] = Math.max(LCSTable[a-1][b], LCSTable[a][b-1]);
                }
            }
        }

        length_lcs = LCSTable[sz1][sz2];

        // Finding the LCS
        int i = sz1, j = sz2;
        while (i > 0 && j > 0) {
           if (seq1.charAt(i-1) == seq2.charAt(j-1)) {
              lcs.append(seq1.charAt(i-1));  // Move diagonally towards top left 
              i--;
              j--;
           } else if (LCSTable[i-1][j] > LCSTable[i][j-1]) {
              i--; // Move upwards
           } else {
              j--; // Move towards left
           }
        }

        lcs.reverse();
        return length_lcs;
    }

    public static void main ( String[] args ) {

        List<String> list_a = List.of("ATPLBCCXWKQR", "AGORTRE");
        List<String> list_b = List.of("FTCMXACWZYKQR", "BGPOATRT");

        LCS obj = new LCS();

        for (int i = 0; i < 2; i++) {

            StringBuilder lcs =  new StringBuilder();
            int table[][] = new int[list_a.get(i).length()+1][list_b.get(i).length()+1];

            // Initialize 2 dimensional array with '-1'
            for (int[] arr : table) {
                Arrays.fill(arr, -1);
            }

            System.out.println("Sequence 1 : " + list_a.get(i) + ", Sequence 2 : " + list_b.get(i));
            System.out.println("[Slower Recursion] LCS length : " + obj.GetLength_LCS (list_a.get(i), list_b.get(i)));
            System.out.println("[Faster Recursion] LCS length : " + obj.GetLength_FasterLCS (list_a.get(i), list_b.get(i), table));
            System.out.println("[DP] LCS Length : " + obj.GetLength_LCSDP (list_a.get(i), list_b.get(i), lcs));
            System.out.println("Longest Common Subsequence (LCS) : " + lcs + "\n");
        }
    }
}

Output

Sequence 1 : ATPLBCCXWKQR, Sequence 2 : FTCMXACWZYKQR
[Slower Recursion] LCS length : 7
[Faster Recursion] LCS length : 7
[DP] LCS Length : 7
Longest Common Subsequence (LCS) : TCXWKQR

Sequence 1 : AGORTRE, Sequence 2 : BGPOATRT
[Slower Recursion] LCS length : 4
[Faster Recursion] LCS length : 4
[DP] LCS Length : 4
Longest Common Subsequence (LCS) : GOTR



Copyright (c) 2019-2024, Algotree.org.
All rights reserved.