Generate all unique substrings for given string

AlgorithmLanguage Agnostic

Algorithm Problem Overview


Given a string s, what is the fastest method to generate a set of all its unique substrings?

Example: for str = "aba" we would get substrs={"a", "b", "ab", "ba", "aba"}.

The naive algorithm would be to traverse the entire string generating substrings in length 1..n in each iteration, yielding an O(n^2) upper bound.

Is a better bound possible?

(this is technically homework, so pointers-only are welcome as well)

Algorithm Solutions


Solution 1 - Algorithm

As other posters have said, there are potentially O(n^2) substrings for a given string, so printing them out cannot be done faster than that. However there exists an efficient representation of the set that can be constructed in linear time: the suffix tree.

Solution 2 - Algorithm

There is no way to do this faster than O(n2) because there are a total of O(n2) substrings in a string, so if you have to generate them all, their number will be n(n + 1) / 2 in the worst case, hence the upper lower bound of O(n2) Ω(n2).

Solution 3 - Algorithm

First one is brute force which has complexity O(N^3) which could be brought down to O(N^2 log(N)) Second One using HashSet which has Complexity O(N^2) Third One using LCP by initially finding all the suffix of a given string which has the worst case O(N^2) and best case O(N Log(N)).

First Solution:-

import java.util.Scanner;

public class DistinctSubString {
  public static void main(String[] args) {
    Scanner in = new Scanner(System.in);
    System.out.print("Enter The string");
    String s = in.nextLine();
    long startTime = System.currentTimeMillis();
    int L = s.length();
    int N = L * (L + 1) / 2;
    String[] Comb = new String[N];
    for (int i = 0, p = 0; i < L; ++i) {
      for (int j = 0; j < (L - i); ++j) {
        Comb[p++] = s.substring(j, i + j + 1);
      }
    }
    /*
     * for(int j=0;j<N;++j) { System.out.println(Comb[j]); }
     */
    boolean[] val = new boolean[N];
    for (int i = 0; i < N; ++i)
      val[i] = true;
    int counter = N;
    int p = 0, start = 0;
    for (int i = 0, j; i < L; ++i) {
      p = L - i;
      for (j = start; j < (start + p); ++j) {
        if (val[j]) {
          //System.out.println(Comb[j]);
          for (int k = j + 1; k < start + p; ++k) {
            if (Comb[j].equals(Comb[k])) {
              counter--;
              val[k] = false;
            }
          }
        }

      }

      start = j;
    }
    System.out.println("Substrings are " + N
        + " of which unique substrings are " + counter);
    long endTime = System.currentTimeMillis();
    System.out.println("It took " + (endTime - startTime) + " milliseconds");
  }
}

Second Solution:-

import java.util.*;

public class DistictSubstrings_usingHashTable {

  public static void main(String args[]) {
    // create a hash set

    Scanner in = new Scanner(System.in);
    System.out.print("Enter The string");
    String s = in.nextLine();
    int L = s.length();
    long startTime = System.currentTimeMillis();
    Set<String> hs = new HashSet<String>();
    // add elements to the hash set
    for (int i = 0; i < L; ++i) {
      for (int j = 0; j < (L - i); ++j) {
        hs.add(s.substring(j, i + j + 1));
      }
    }
    System.out.println(hs.size());
    long endTime = System.currentTimeMillis();
    System.out.println("It took " + (endTime - startTime) + " milliseconds");
  }
}

Third Solution:-

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;

public class LCPsolnFroDistinctSubString {

  public static void main(String[] args) throws IOException {

    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    System.out.println("Enter Desired String ");
    String string = br.readLine();
    int length = string.length();
    String[] arrayString = new String[length];
    for (int i = 0; i < length; ++i) {
      arrayString[i] = string.substring(length - 1 - i, length);
    }

    Arrays.sort(arrayString);
    for (int i = 0; i < length; ++i)
      System.out.println(arrayString[i]);

    long num_substring = arrayString[0].length();

    for (int i = 0; i < length - 1; ++i) {
      int j = 0;
      for (; j < arrayString[i].length(); ++j) {
        if (!((arrayString[i].substring(0, j + 1)).equals((arrayString)[i + 1]
            .substring(0, j + 1)))) {
          break;
        }
      }
      num_substring += arrayString[i + 1].length() - j;
    }
    System.out.println("unique substrings = " + num_substring);
  }

}

Fourth Solution:-

  public static void printAllCombinations(String soFar, String rest) {
    if(rest.isEmpty()) {
        System.out.println(soFar);
    } else {
        printAllCombinations(soFar + rest.substring(0,1), rest.substring(1));
        printAllCombinations(soFar , rest.substring(1));
    }
}

Test case:-  printAllCombinations("", "abcd");

Solution 4 - Algorithm

For big oh ... Best you could do would be O(n^2)

No need to reinvent the wheel, its not based on a strings, but on a sets, so you will have to take the concepts and apply them to your own situation.

Algorithms

Really Good White Paper from MS

In depth PowerPoint

Blog on string perms

Solution 5 - Algorithm

well, since there is potentially n*(n+1)/2 different substrings (+1 for the empty substring), I doubt you can be better than O(n*2) (worst case). the easiest thing is to generate them and use some nice O(1) lookup table (such as a hashmap) for excluding duplicates right when you find them.

Solution 6 - Algorithm

class SubstringsOfAString {
	public static void main(String args[]) {

		String string = "Hello", sub = null;

		System.out.println("Substrings of \"" + string + "\" are :-");

		for (int i = 0; i < string.length(); i++) {
			for (int j = 1; j <= string.length() - i; j++) {
				sub = string.substring(i, j + i);
				System.out.println(sub);
			}
		}
	}
}

Solution 7 - Algorithm

class program
{

        List<String> lst = new List<String>();
        String str = "abc";
        public void func()
        {
          
            subset(0, "");
            lst.Sort();
            lst = lst.Distinct().ToList();
            
            foreach (String item in lst)
            {
                Console.WriteLine(item);
            }
        }
        void subset(int n, String s)
        {
            for (int i = n; i < str.Length; i++)
            {
                lst.Add(s + str[i].ToString());
                subset(i + 1, s + str[i].ToString());
            }
        }
}

Solution 8 - Algorithm

This prints unique substrings. https://ideone.com/QVWOh0

def uniq_substring(test):
    lista=[]
    [lista.append(test[i:i+k+1]) for i in range(len(test)) for k in
    range(len(test)-i) if test[i:i+k+1] not in lista and 
    test[i:i+k+1][::-1] not in lista]
    print lista

uniq_substring('rohit')
uniq_substring('abab')

['r', 'ro', 'roh', 'rohi', 'rohit', 'o', 'oh', 'ohi', 'ohit', 'h',   'hi', 'hit', 'i', 'it', 't']
['a', 'ab', 'aba', 'abab', 'b', 'bab']

Solution 9 - Algorithm

Many answers that include 2 for loops and a .substring() call claim O(N^2) time complexity. However, it is important to note that the worst case for a .substring() call in Java (post update 6 in Java 7) is O(N). So by adding a .substring() call in your code, the order of N has increased by one.

Therefore, 2 for loops and a .substring() call within those loops equals an O(N^3) time complexity.

Solution 10 - Algorithm

It can only be done in o(n^2) time as total number of unique substrings of a string would be n(n+1)/2.

Example:

string s = "abcd"

pass 0: (all the strings are of length 1)

a, b, c, d = 4 strings

pass 1: (all the strings are of length 2)

ab, bc, cd = 3 strings

pass 2: (all the strings are of length 3)

abc, bcd = 2 strings

pass 3: (all the strings are of length 4)

abcd = 1 strings

Using this analogy, we can write solution with o(n^2) time complexity and constant space complexity.

The source code is as below:

#include<stdio.h>

void print(char arr[], int start, int end)
{
	int i;
	for(i=start;i<=end;i++)
	{
		printf("%c",arr[i]);
	}
	printf("\n");
}


void substrings(char arr[], int n)
{
	int pass,j,start,end;
	int no_of_strings = n-1;
	
	for(pass=0;pass<n;pass++)
	{
		start = 0;
		end = start+pass;
		for(j=no_of_strings;j>=0;j--)
		{
			print(arr,start, end);
			start++;
			end = start+pass;
		}
		no_of_strings--;
	}
	
}

int main()
{	
	char str[] = "abcd";
	substrings(str,4);
	return 0;
}

Solution 11 - Algorithm

Naive algorithm takes O(n^3) time instead of O(n^2) time. There are O(n^2) number of substrings. And if you put O(n^2) number of substrings, for example, set, then set compares O(lgn) comparisons for each string to check if it alrady exists in the set or not. Besides it takes O(n) time for string comparison. Therefore, it takes O(n^3 lgn) time if you use set. and you can reduce it O(n^3) time if you use hashtable instead of set.

The point is it is string comparisons not number comparisons.

So one of the best algorithm let's say if you use suffix array and longest common prefix (LCP) algorithm, it reduces O(n^2) time for this problem. Building a suffix array using O(n) time algorithm. Time for LCP = O(n) time. Since for each pair of strings in suffix array, do LCP so total time is O(n^2) time to find the length of distinct subtrings.

Besides if you want to print all distinct substrings, it takes O(n^2) time.

Solution 12 - Algorithm

Try this code using a suffix array and longest common prefix. It can also give you the total number of unique substrings. The code might give a stack overflow in visual studio but runs fine in Eclipse C++. That's because it returns vectors for functions. Haven't tested it against extremely long strings. Will do so and report back.

  // C++ program for building LCP array for given text
#include <bits/stdc++.h>
#include <vector>
#include <string>
using namespace std;

#define MAX 100000
int cum[MAX];


// Structure to store information of a suffix
struct suffix
{
	int index; // To store original index
	int rank[2]; // To store ranks and next rank pair
};

// A comparison function used by sort() to compare two suffixes
// Compares two pairs, returns 1 if first pair is smaller
int cmp(struct suffix a, struct suffix b)
{
	return (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0):
		(a.rank[0] < b.rank[0] ?1: 0);
}

// This is the main function that takes a string 'txt' of size n as an
// argument, builds and return the suffix array for the given string
vector<int> buildSuffixArray(string txt, int n)
{
	// A structure to store suffixes and their indexes
	struct suffix suffixes[n];

	// Store suffixes and their indexes in an array of structures.
	// The structure is needed to sort the suffixes alphabatically
	// and maintain their old indexes while sorting
	for (int i = 0; i < n; i++)
	{
		suffixes[i].index = i;
		suffixes[i].rank[0] = txt[i] - 'a';
		suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a'): -1;
	}

	// Sort the suffixes using the comparison function
	// defined above.
	sort(suffixes, suffixes+n, cmp);

	// At his point, all suffixes are sorted according to first
	// 2 characters. Let us sort suffixes according to first 4
	// characters, then first 8 and so on
	int ind[n]; // This array is needed to get the index in suffixes[]
	// from original index. This mapping is needed to get
	// next suffix.
	for (int k = 4; k < 2*n; k = k*2)
	{
		// Assigning rank and index values to first suffix
		int rank = 0;
		int prev_rank = suffixes[0].rank[0];
		suffixes[0].rank[0] = rank;
		ind[suffixes[0].index] = 0;

		// Assigning rank to suffixes
		for (int i = 1; i < n; i++)
		{
			// If first rank and next ranks are same as that of previous
			// suffix in array, assign the same new rank to this suffix
			if (suffixes[i].rank[0] == prev_rank &&
					suffixes[i].rank[1] == suffixes[i-1].rank[1])
			{
				prev_rank = suffixes[i].rank[0];
				suffixes[i].rank[0] = rank;
			}
			else // Otherwise increment rank and assign
			{
				prev_rank = suffixes[i].rank[0];
				suffixes[i].rank[0] = ++rank;
			}
			ind[suffixes[i].index] = i;
		}

		// Assign next rank to every suffix
		for (int i = 0; i < n; i++)
		{
			int nextindex = suffixes[i].index + k/2;
			suffixes[i].rank[1] = (nextindex < n)?
								suffixes[ind[nextindex]].rank[0]: -1;
		}

		// Sort the suffixes according to first k characters
		sort(suffixes, suffixes+n, cmp);
	}

	// Store indexes of all sorted suffixes in the suffix array
	vector<int>suffixArr;
	for (int i = 0; i < n; i++)
		suffixArr.push_back(suffixes[i].index);

	// Return the suffix array
	return suffixArr;
}

/* To construct and return LCP */
vector<int> kasai(string txt, vector<int> suffixArr)
{
	int n = suffixArr.size();

	// To store LCP array
	vector<int> lcp(n, 0);

	// An auxiliary array to store inverse of suffix array
	// elements. For example if suffixArr[0] is 5, the
	// invSuff[5] would store 0. This is used to get next
	// suffix string from suffix array.
	vector<int> invSuff(n, 0);

	// Fill values in invSuff[]
	for (int i=0; i < n; i++)
		invSuff[suffixArr[i]] = i;

	// Initialize length of previous LCP
	int k = 0;

	// Process all suffixes one by one starting from
	// first suffix in txt[]
	for (int i=0; i<n; i++)
	{
		/* If the current suffix is at n-1, then we don’t
		have next substring to consider. So lcp is not
		defined for this substring, we put zero. */
		if (invSuff[i] == n-1)
		{
			k = 0;
			continue;
		}

		/* j contains index of the next substring to
		be considered to compare with the present
		substring, i.e., next string in suffix array */
		int j = suffixArr[invSuff[i]+1];

		// Directly start matching from k'th index as
		// at-least k-1 characters will match
		while (i+k<n && j+k<n && txt[i+k]==txt[j+k])
			k++;

		lcp[invSuff[i]] = k; // lcp for the present suffix.

		// Deleting the starting character from the string.
		if (k>0)
			k--;
	}

	// return the constructed lcp array
	return lcp;
}

// Utility function to print an array
void printArr(vector<int>arr, int n)
{
	for (int i = 0; i < n; i++)
		cout << arr[i] << " ";
	cout << endl;
}

// Driver program
int main()
{
	int t;
	cin >> t;
	//t = 1;

	while (t > 0)  {

		//string str = "banana";
		string str;

		cin >> str; // >> k;

		vector<int>suffixArr = buildSuffixArray(str, str.length());
		int n = suffixArr.size();

		cout << "Suffix Array : \n";
		printArr(suffixArr, n);

		vector<int>lcp = kasai(str, suffixArr);

		cout << "\nLCP Array : \n";
		printArr(lcp, n);

		// cum will hold number of substrings if that'a what you want (total = cum[n-1]
		cum[0] = n - suffixArr[0];
		
	//	vector <pair<int,int>> substrs[n];

		int count = 1;


		for (int i = 1; i <= n-suffixArr[0]; i++)  {
			//substrs[0].push_back({suffixArr[0],i});
			string sub_str = str.substr(suffixArr[0],i);
			cout << count << "  "  <<  sub_str << endl;
			count++;
		}

		for(int i = 1;i < n;i++)  {

			cum[i] = cum[i-1] + (n - suffixArr[i] - lcp[i - 1]);

			int end = n - suffixArr[i];
			int begin = lcp[i-1] + 1;
			int begin_suffix = suffixArr[i];

			for (int j = begin, k = 1; j <= end; j++, k++)  {

				//substrs[i].push_back({begin_suffix, lcp[i-1] + k});
		//		cout << "i push "  << i  << "   " << begin_suffix << " " << k << endl;
				string sub_str = str.substr(begin_suffix, lcp[i-1] +k);
				cout << count << "  "  <<  sub_str << endl;
				count++;

			}

		}

		/*int count = 1;

		cout << endl;

		for(int i = 0; i < n; i++){

			for (auto it = substrs[i].begin(); it != substrs[i].end(); ++it )	 {

				string sub_str = str.substr(it->first, it->second);
				cout << count << "  "  <<  sub_str << endl;
				count++;
			}

		}*/


		t--;

	}

	return 0;
}

And here's a simpler algorithm:

#include <iostream>
#include <string.h>
#include <vector>
#include <string>
#include <algorithm>
#include <time.h>

using namespace std;


char txt[100000], *p[100000];
int m, n;

int cmp(const void *p, const void *q) {
    int rc = memcmp(*(char **)p, *(char **)q, m);
    return rc;
}
int main() {
    std::cin >> txt;
	int start_s = clock();


    n = strlen(txt);
    int k; int i;
    int count = 1;
    for (m = 1; m <= n; m++) {
        for (k = 0; k+m <= n; k++)
            p[k] = txt+k;
        qsort(p, k, sizeof(p[0]), &cmp);
        for (i = 0; i < k; i++) {
            if (i != 0 && cmp(&p[i-1], &p[i]) == 0){
                continue;
            }
            char cur_txt[100000];
            memcpy(cur_txt, p[i],m);
            cur_txt[m] = '\0';
            std::cout << count << "  "  << cur_txt << std::endl;
            count++;
        }
    }

    cout << --count  << endl;

    int stop_s = clock();
    float run_time = (stop_s - start_s) / double(CLOCKS_PER_SEC);
    cout << endl << "distinct substrings  \t\tExecution time = " << run_time << " seconds" << endl;
    return 0;
}

Both algorithms listed a simply too slow for extremely long strings though. I tested the algorithms against a string of length over 47,000 and the algorithms took over 20 minutes to complete, with the first one taking 1200 seconds, and the second one taking 1360 seconds, and that's just counting the unique substrings without outputting to the terminal. So for probably strings of length up to 1000 you might get a working solution. Both solutions did compute the same total number of unique substrings though. I did test both algorithms against string lengths of 2000 and 10,000. The times were for the first algorithm: 0.33 s and 12 s; for the second algorithm it was 0.535 s and 20 s. So it looks like in general the first algorithm is faster.

Solution 13 - Algorithm

Here is my code in Python. It generates all possible substrings of any given string.

def find_substring(str_in):
    substrs = []
    if len(str_in) <= 1:
        return [str_in]
        
    s1 = find_substring(str_in[:1])
    s2 = find_substring(str_in[1:])

    substrs.append(s1)
    substrs.append(s2)
    for s11 in s1:
        substrs.append(s11)            
        for s21 in s2:            
            substrs.append("%s%s" %(s11, s21))
    
    for s21 in s2:
        substrs.append(s21)
    
    return set(substrs)

If you pass str_ = "abcdef" to the function, it generates the following results:

a, ab, abc, abcd, abcde, abcdef, abcdf, abce, abcef, abcf, abd, abde, abdef, abdf, abe, abef, abf, ac, acd, acde, acdef, acdf, ace, acef, acf, ad, ade, adef, adf, ae, aef, af, b, bc, bcd, bcde, bcdef, bcdf, bce, bcef, bcf, bd, bde, bdef, bdf, be, bef, bf, c, cd, cde, cdef, cdf, ce, cef, cf, d, de, def, df, e, ef, f

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionYuval AdamView Question on Stackoverflow
Solution 1 - AlgorithmRafał DowgirdView Answer on Stackoverflow
Solution 2 - AlgorithmIVladView Answer on Stackoverflow
Solution 3 - AlgorithmSumit Kumar SahaView Answer on Stackoverflow
Solution 4 - AlgorithmNixView Answer on Stackoverflow
Solution 5 - Algorithmback2dosView Answer on Stackoverflow
Solution 6 - AlgorithmNishant SrivastavaView Answer on Stackoverflow
Solution 7 - AlgorithmAtal KishoreView Answer on Stackoverflow
Solution 8 - AlgorithmRohit MalgaonkarView Answer on Stackoverflow
Solution 9 - AlgorithmAyomide OyekanmiView Answer on Stackoverflow
Solution 10 - AlgorithmAlienOnEarthView Answer on Stackoverflow
Solution 11 - Algorithmuser2761895View Answer on Stackoverflow
Solution 12 - Algorithmte7View Answer on Stackoverflow
Solution 13 - AlgorithmAliView Answer on Stackoverflow