Find the nth occurrence of substring in a string
Problem Overview
This seems like it should be pretty trivial, but I am new at Python and want to do it the most Pythonic way.
I want to find the index corresponding to the n'th occurrence of a substring within a string.
There's got to be something equivalent to what I WANT to do which is
mystring.find("substring", 2nd)
How can you achieve this in Python?
Python Solutions
Solution 1 - Python
Here's a more Pythonic version of the straightforward iterative solution:
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start
Example:
>>> find_nth("foofoofoofoo", "foofoo", 2)
6
If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:
def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+1)
        n -= 1
    return start
Example:
>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3
This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:
- Simple is better than complex.
- Flat is better than nested.
- Readability counts.
Solution 2 - Python
Mark's iterative approach would be the usual way, I think.
Here's an alternative with string-splitting, which can often be useful for finding-related processes:
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n+1)
    if len(parts) <= n+1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)
And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:
'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')
Solution 3 - Python
This will find the second occurrence of substring in string.
def find_2nd(string, substring):
    return string.find(substring, string.find(substring) + 1)
Edit: I haven't thought much about the performance, but a quick recursion can help with finding the nth occurrence:
def find_nth(string, substring, n):
    if (n == 1):
        return string.find(substring)
    else:
        return string.find(substring, find_nth(string, substring, n - 1) + 1)
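One caveat with the recursive version: when the string holds fewer than n occurrences, the inner call returns -1, the outer search then starts again from index 0, and an earlier occurrence can be reported instead of -1. A minimal sketch of a guarded variant follows; the name find_nth_checked is made up for this illustration and n is assumed to be 1-based.

```python
def find_nth_checked(string, substring, n):
    """Return the index of the nth (1-based, possibly overlapping)
    occurrence of substring, or -1 if there are fewer than n."""
    if n < 1:
        return -1
    if n == 1:
        return string.find(substring)
    previous = find_nth_checked(string, substring, n - 1)
    if previous < 0:
        # Stop here; searching from previous + 1 == 0 would wrap around.
        return -1
    return string.find(substring, previous + 1)


print(find_nth_checked("abcabc", "abc", 2))  # 3
print(find_nth_checked("abcabc", "abc", 4))  # -1
```

Like any recursive solution, this still uses one stack frame per occurrence, so a very large n can hit Python's recursion limit.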
Solution 4 - Python
Understanding that regex is not always the best solution, I'd probably use one here:
>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence
11
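The list comprehension above collects every match before indexing. As a hedged sketch, the same idea can stop at the nth match by slicing the iterator with itertools.islice; the helper name find_nth_re is hypothetical, n is assumed to be 1-based and at least 1, and re.escape is added so that a plain substring containing regex metacharacters is matched literally.

```python
import re
from itertools import islice


def find_nth_re(haystack, needle, n):
    """Return the start of the nth (1-based) non-overlapping match, or -1."""
    # re.escape makes the needle match literally, even if it contains
    # characters such as '.' or '*'.
    matches = re.finditer(re.escape(needle), haystack)
    # islice skips the first n - 1 matches lazily; next() takes the nth.
    match = next(islice(matches, n - 1, None), None)
    return match.start() if match is not None else -1


print(find_nth_re("ababdfegtduab", "ab", 3))  # 11
print(find_nth_re("a.b.c", ".", 2))           # 3
print(find_nth_re("ababdfegtduab", "ab", 4))  # -1
```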
Solution 5 - Python
I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n+1)
    if len(parts) <= n+1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    # xrange is Python 2; use range on Python 3
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i
Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:
In [1]: import _find_nth, find_nth, mmap
In [2]: f = open('bigfile', 'r')
In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
Here we already hit the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:
In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s
Ouch! Fortunately s still fits in the 4 GB of memory of my MacBook Air, so let's benchmark findnth():
In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop
Clearly terrible performance. Let's see how the approach based on str.find() does:
In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop
Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. This is where the second advantage of find_nth() comes in: we can use it on mm directly, such that zero copies of the file are required:
In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop
There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth's total of 47 s.
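The session above is Python 2. As a rough Python 3 sketch of the same zero-copy idea: mmap.mmap objects provide a find() method there too, but the needle has to be bytes. The helper name find_nth_bytes and the small temporary file standing in for 'bigfile' are assumptions made for this illustration.

```python
import mmap
import os
import tempfile


def find_nth_bytes(buf, needle, n):
    """Index of the nth (0-based) occurrence of needle in buf, or -1.

    buf may be a bytes object or an mmap.mmap; both provide find().
    """
    i = -len(needle)
    for _ in range(n + 1):
        i = buf.find(needle, i + len(needle))
        if i < 0:
            break
    return i


# Build a small stand-in for 'bigfile' so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line0\nline1\nline2\n")
    path = tmp.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Search the mapping directly; no copy of the file is made.
        third_newline = find_nth_bytes(mm, b"\n", 2)
    finally:
        mm.close()
os.remove(path)

print(third_newline)  # 17
```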
I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.
In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c:
#include <Python.h>
#include <string.h>
off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}
off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        /* search only the bytes that remain after the previous hit */
        b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));
        if (!b) return -1;
    } while (n--);
    return b - buf;
}
/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}
static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}
Here is the setup.py file:
from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])
Install as usual with python setup.py install. The C code has an advantage here since it is limited to finding single characters, but let's see how fast this is:
In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop
In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop
In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop
In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop
Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): The additional "optimizations" in memchr() are apparently backfiring...
In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).
There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.
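For readers who want to try the comparison themselves on a smaller scale, here is a minimal Python 3 sketch using the standard timeit module. The function names findnth_split and find_nth_find and the synthetic test string are assumptions for this illustration; timings will differ from the numbers above, which came from a 1.3 GB file under Python 2.

```python
import timeit


def findnth_split(haystack, needle, n):
    # str.split() based; n is 0-based
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)


def find_nth_find(haystack, needle, n):
    # str.find() based; n is 0-based, non-overlapping matches
    i = -len(needle)
    for _ in range(n + 1):
        i = haystack.find(needle, i + len(needle))
        if i < 0:
            break
    return i


text = "spam\n" * 100_000
n = 50_000
# Both approaches should agree before their speed is compared.
assert findnth_split(text, "\n", n) == find_nth_find(text, "\n", n)

for func in (findnth_split, find_nth_find):
    seconds = timeit.timeit(lambda: func(text, "\n", n), number=5)
    print(f"{func.__name__}: {seconds:.4f} s for 5 calls")
```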
Solution 6 - Python
Simplest way?
text = "This is a test from a test ok"
firstTest = text.find('test')
print(text.find('test', firstTest + 1))
Solution 7 - Python
I'd probably do something like this, using the find function that takes an index parameter:
def find_nth(s, x, n):
    # start at -len(x) so the first search begins at index 0
    i = -len(x)
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i

print(find_nth('bananabanana', 'an', 3))
It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:
def find_nth(s, x, n, i = 0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i
    else:
        return find_nth(s, x, n - 1, i + len(x))

print(find_nth('bananabanana', 'an', 3))
It's a functional way to solve it, but I don't know if that makes it more Pythonic.
Solution 8 - Python
This will give you an array of the starting indices for matches to yourstring:
import re
indices = [s.start() for s in re.finditer(':', yourstring)]
Then your nth entry would be:
n = 2
nth_entry = indices[n-1]
Of course you have to be careful with the index bounds. You can get the number of matches found in yourstring like this:
num_instances = len(indices)
Solution 9 - Python
Here is another approach using re.finditer. The difference is that this only looks into the haystack as far as necessary:
import re
from itertools import dropwhile
needle = 'an'
haystack = 'bananabanana'
n = 2  # 0-based: n = 2 selects the third match
next(dropwhile(lambda x: x[0] < n, enumerate(re.finditer(needle, haystack))))[1].start()
Solution 10 - Python
Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.
import itertools
import re
def find_nth(haystack, needle, n = 1):
    """
    Find the starting index of the nth occurrence of ``needle`` in \
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``haystack`` is a ``str``
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``n`` an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if (hasattr(needle, 'finditer')):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1
Solution 11 - Python
Building on modle13's answer, but without the re module dependency.
def iter_find(haystack, needle):
    return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]
I kinda wish this was a builtin string method.
>>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
[5, 6, 24, 34, 42]
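The list comprehension slices the haystack once per index, which gets expensive on long strings. As a hedged alternative, the sketch below yields the same positions lazily with str.find(); the name iter_find_lazy is hypothetical and a non-empty needle is assumed.

```python
from itertools import islice


def iter_find_lazy(haystack, needle):
    """Yield the start index of every (possibly overlapping) occurrence."""
    i = haystack.find(needle)
    while i != -1:
        yield i
        # Resume one character later so overlapping matches are kept.
        i = haystack.find(needle, i + 1)


url = "http://stackoverflow.com/questions/1883980/"
print(list(iter_find_lazy(url, "/")))                       # [5, 6, 24, 34, 42]
# Third occurrence without materializing the full list (-1 if absent):
print(next(islice(iter_find_lazy(url, "/"), 2, None), -1))  # 24
```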
Solution 12 - Python
For the special case where you search for the n'th occurrence of a character (i.e. substring of length 1), the following function works by building a list of all positions of occurrences of the given character:
def find_char_nth(string, char, n):
    """Find the n'th occurrence of a character within a string."""
    return [i for i, c in enumerate(string) if c == char][n-1]
If there are fewer than n occurrences of the given character, it will give IndexError: list index out of range.
This is derived from @Zv_oDD's answer and simplified for the case of a single character.
Solution 13 - Python
>>> s="abcdefabcdefababcdef"
>>> j=0
>>> for n,i in enumerate(s):
...     if s[n:n+2] =="ab":
...         print(n, i)
...         j=j+1
...         if j==2: print("2nd occurrence at index position:", n)
...
0 a
6 a
2nd occurrence at index position: 6
12 a
14 a
Solution 14 - Python
Here is another "tricky" solution, which uses split and join.
In your example, with ori holding the original string, we can use
len("substring".join(ori.split("substring")[:2]))
Solution 15 - Python
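# return -1 if the nth substr (0-indexed) does not exist, else return its index
def find_nth(s, substr, n):
    i = -1  # so the first search starts at index 0
    while n >= 0:
        n -= 1
        i = s.find(substr, i + 1)
        if i == -1:
            break  # avoid restarting the search from index 0
    return i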
Solution 16 - Python
Solution without using loops and recursion.
Use the required pattern in the compile method and set the desired occurrence in variable n; the last statement prints the starting index of the nth occurrence of the pattern in the given string. The iterator returned by finditer is converted to a list, and the nth match is accessed directly.
import re
n=2
sampleString="this is history"
pattern=re.compile("is")
matches=pattern.finditer(sampleString)
# list indices are 0-based, so the nth match is at position n - 1
print(list(matches)[n - 1].span()[0])
Solution 17 - Python
Here is my solution for finding the nth occurrence of b in string a:
from functools import reduce
def findNth(a, b, n):
    return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)
It is pure Python and iterative. For n equal to 0 or too large, it returns -1. It is a one-liner and can be used directly. Here is an example:
>>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
7
Solution 18 - Python
I used the findnth() function and ran into some issues, so I rewrote a faster version of the function (no list splitting):
def findnth(haystack, needle, n):
    if needle not in haystack or haystack.count(needle) < n:
        return -1
    last_index = 0
    cumulative_last_index = 0
    for i in range(0, n):
        last_index = haystack[cumulative_last_index:].find(needle)
        cumulative_last_index += last_index
        # if not last element, then jump over it
        if i < n-1:
            cumulative_last_index += len(needle)
    return cumulative_last_index
Solution 19 - Python
The replace one-liner is great, but it only works because XXX and bar have the same length.
A good and general def would be:
def findN(s,sub,N,replaceString="XXX"):
    return s.replace(sub,replaceString,N-1).find(sub) - (len(replaceString)-len(sub))*(N-1)
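One thing to watch with the replace-based approach is the not-found case: find() returns -1, and subtracting the length correction from it produces a misleading number. The sketch below checks for that first. The name find_n_replace and the default chaff are assumptions for this illustration, and the chaff is assumed never to contain the needle or combine with neighbouring text to form it.

```python
def find_n_replace(s, sub, n, chaff="\0"):
    """nth (1-based) occurrence of sub via replace(); -1 if absent."""
    # Mask the first n - 1 occurrences so find() lands on the nth one.
    masked = s.replace(sub, chaff, n - 1)
    pos = masked.find(sub)
    if pos == -1:
        return -1
    # Each of the n - 1 replacements shifted the later text by this much.
    shift = (len(chaff) - len(sub)) * (n - 1)
    return pos - shift


print(find_n_replace("foo bar bar bar", "bar", 2))  # 8
print(find_n_replace("foo bar bar bar", "bar", 4))  # -1
```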
Solution 20 - Python
Def:
def get_first_N_words(mytext, mylen = 3):
    mylist = list(mytext.split())
    if len(mylist)>=mylen: return ' '.join(mylist[:mylen])
To use:
get_first_N_words(' One Two Three Four ' , 3)
Output:
'One Two Three'
Solution 21 - Python
This avoids a failure or incorrect output when the requested occurrence number is higher than the actual number of occurrences. For example, the string 'overflow' contains only 2 occurrences of 'o', so asking for the 3rd occurrence makes the code below print a message saying that the requested occurrence exceeds the actual count:
Input Occurrence entered has exceeded the actual count of Occurrence.
def check_nth_occurrence (string, substr, n):
    ## Count the Occurrence of a substr
    cnt = 0
    for i in string:
        if i == substr:
            cnt = cnt + 1
        else:
            pass
    ## Check if the Occurrence input has exceeded the actual count of Occurrence
    if n > cnt:
        print (f' Input Occurrence entered has exceeded the actual count of Occurrence')
        return
    ## Get the Index value for first Occurrence of the substr
    index = string.find(substr)
    ## Get the Index value for nth Occurrence of Index
    while index >= 0 and n > 1:
        index = string.find(substr, index + 1)
        n -= 1
    return index
Solution 22 - Python
Here's a simple and fun way to do it:
def index_of_nth(text, substring, n) -> int:
    index = 0
    for _ in range(n):
        index = text.index(substring, index) + 1
    return index - 1
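Note that str.index() raises ValueError when no further match exists, so this version fails loudly if there are fewer than n occurrences. If a find()-style -1 is preferred, a minimal sketch could catch the exception; the name index_of_nth_or_minus_one is made up for this illustration.

```python
def index_of_nth_or_minus_one(text, substring, n):
    """Like str.find(): -1 instead of ValueError when there is no nth match."""
    index = 0
    try:
        for _ in range(n):
            # index() raises ValueError once no further match exists.
            index = text.index(substring, index) + 1
    except ValueError:
        return -1
    return index - 1


print(index_of_nth_or_minus_one("bananabanana", "an", 3))  # 7
print(index_of_nth_or_minus_one("bananabanana", "an", 9))  # -1
```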
Solution 23 - Python
Just in case anyone wants to find the n-th occurrence from the back:
def find_nth_reverse(haystack: str, needle: str, n: int) -> int:
    end = haystack.rfind(needle)
    while end >= 0 and n > 1:
        # rfind's end bound is exclusive, so this finds the previous non-overlapping match
        end = haystack.rfind(needle, 0, end)
        n -= 1
    return end
Solution 24 - Python
I solved it like this.
def second_index(text: str, symbol: str) -> [int, None]:
    """
    returns the second index of a symbol in a given text
    """
    first = text.find(symbol)
    result = text.find(symbol,first+1)
    if result > 0: return result
Solution 25 - Python
This is the answer you really want:
def Find(String,ToFind,Occurence = 1):
    index = 0
    count = 0
    while index <= len(String):
        try:
            if String[index:index + len(ToFind)] == ToFind:
                count += 1
            if count == Occurence:
                return index
                break
            index += 1
        except IndexError:
            return False
            break
    return False
Solution 26 - Python
A simple solution for those with basic programming knowledge:
# Function to find the nth occurrence of a substring in a text
def findnth(text, substring, n):
    # variable to store current index in loop
    count = -1
    # n count
    occurrence = 0
    # loop through string
    for letter in text:
        # increment count
        count += 1
        # if current letter in loop matches substring target
        if letter == substring:
            # increment occurrence
            occurrence += 1
            # if this is the nth time the substring is found
            if occurrence == n:
                # return its index
                return count
    # otherwise indicate there is no match
    return "No match"
# example of how to call function
print(findnth('C$100$150xx', "$", 2))
Solution 27 - Python
How about:
import os
c = os.getcwd().split('\\')
print('\\'.join(c[0:-2]))