Excluding directories in os.walk


Python Problem Overview


I'm writing a script that descends into a directory tree (using os.walk()) and then visits each file matching a certain file extension. However, some of the directory trees my tool will be used on also contain subdirectories that in turn contain a LOT of stuff that is useless for the purpose of this script, so I figured I'd add an option for the user to specify a list of directories to exclude from the traversal.

This is easy enough with os.walk(). After all, it's up to me to decide whether I actually want to visit the respective files / dirs yielded by os.walk() or just skip them. The problem is that if I have, for example, a directory tree like this:

root--
     |
     --- dirA
     |
     --- dirB
     |
     --- uselessStuff --
                       |
                       --- moreJunk
                       |
                       --- yetMoreJunk

and I want to exclude uselessStuff and all its children, os.walk() will still descend into all the (potentially thousands of) subdirectories of uselessStuff, which, needless to say, slows things down a lot. In an ideal world, I could tell os.walk() not to bother yielding any more children of uselessStuff, but to my knowledge there is no way of doing that (is there?).

Does anyone have an idea? Maybe there's a third-party library that provides something like that?

Python Solutions


Solution 1 - Python

Modifying dirs in-place will prune the (subsequent) files and directories visited by os.walk:

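import os
top = '.'  # placeholder: root of the tree to walk
exclude = {'New folder', 'Windows', 'Desktop'}  # example directory names to skip
for root, dirs, files in os.walk(top, topdown=True):
    # Slice assignment mutates the list os.walk holds, so the pruning takes effect
    dirs[:] = [d for d in dirs if d not in exclude]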

From help(os.walk):

> When topdown is true, the caller can modify the dirnames list in-place
> (e.g., via del or slice assignment), and walk will only recurse into
> the subdirectories whose names remain in dirnames; this can be used to
> prune the search...
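Putting the pruning together with the extension filter from the question, a minimal sketch might look like the following. The function name iter_files and its parameters are made up for illustration; they are not part of os or of the original answer.

```python
import os


def iter_files(top, extension, exclude):
    """Yield paths under `top` that end in `extension`.

    `exclude` is a collection of directory *names* (not full paths)
    that should not be descended into. Hypothetical helper.
    """
    exclude = set(exclude)
    for root, dirs, files in os.walk(top, topdown=True):
        # Slice assignment changes the very list os.walk will recurse
        # over, so excluded directories are never visited at all.
        dirs[:] = [d for d in dirs if d not in exclude]
        for name in files:
            if name.endswith(extension):
                yield os.path.join(root, name)
```

Because the comparison is on bare directory names, a directory called uselessStuff is pruned wherever it appears in the tree, not only directly under the root. Note also that a plain `dirs = [...]` would only rebind the local name and leave the list that os.walk uses untouched, which is why the slice assignment matters.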

Solution 2 - Python

An alternative form of @unutbu's excellent answer that reads a little more directly, given that the intent is to exclude directories. The cost is O(n**2) instead of O(n) time, since each dirs.remove(d) call is itself a linear scan of the list.

(Making a copy of the dirs list with list(dirs) is required for correct execution: removing items from a list while iterating over that same list causes the loop to skip elements.)

# exclude = set([...])
for root, dirs, files in os.walk(top, topdown=True):
    for d in list(dirs):  # iterate over a copy, because dirs is mutated below
        if d in exclude:
            dirs.remove(d)
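To see why the list(dirs) copy matters, here is a small sketch that runs the removal loop on a plain list, with and without the copy. The two helper names and the sample directory names are invented for this illustration only.

```python
def remove_without_copy(dirs, exclude):
    """Buggy variant: mutates `dirs` while iterating over it."""
    for d in dirs:
        if d in exclude:
            dirs.remove(d)
    return dirs


def remove_with_copy(dirs, exclude):
    """Correct variant: iterates over a snapshot, mutates the original."""
    for d in list(dirs):
        if d in exclude:
            dirs.remove(d)
    return dirs


exclude = {"moreJunk", "yetMoreJunk"}

# After "moreJunk" is removed, the remaining items shift left and the
# loop steps past "yetMoreJunk" without ever examining it.
print(remove_without_copy(["dirA", "moreJunk", "yetMoreJunk", "dirB"], exclude))
# ['dirA', 'yetMoreJunk', 'dirB']

print(remove_with_copy(["dirA", "moreJunk", "yetMoreJunk", "dirB"], exclude))
# ['dirA', 'dirB']
```

The skipped element only shows up when two excluded names sit next to each other, which makes the bug easy to miss in casual testing.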

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | antred | View Question on Stackoverflow
Solution 1 - Python | unutbu | View Answer on Stackoverflow
Solution 2 - Python | Dmitri | View Answer on Stackoverflow