Using File Extension Wildcards in os.listdir(path)
Python Problem Overview
I have a directory of files that I am trying to parse using Python. I wouldn't have a problem if they all had the same extension, but for whatever reason they are created with sequential numeric extensions after their original extension. For example: foo.log, foo.log.1, foo.log.2, bar.log, bar.log.1, bar.log.2, etc.
On top of that, foo.log is in XML format, while bar.log is not. What's the best route to take in order to read and parse only the foo.log.* and foo.log files? The bar.log files do not need to be read. Below is my code:
import os
from lxml import etree
path = 'C:/foo/bar//'
listing = os.listdir(path)
for files in listing:
    if files.endswith('.log'):
        print(files)
        data = open(os.path.join(path, files), 'rb').read()
        tree = etree.fromstring(data)
        search = tree.findall('.//QueueEntry')
This doesn't work: it doesn't read any .log.* files, and the parser chokes on the files that are read but are not in XML format. Thanks!
Python Solutions
Solution 1 - Python
Maybe the glob module can help you:
import glob
listing = glob.glob('C:/foo/bar/foo.log*')
for filename in listing:
    # do stuff with each matching file
    print(filename)
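To connect this back to the question, here is a minimal sketch of the whole loop using glob. It assumes the foo.log* files are well-formed XML, and it swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra installs; the function name find_queue_entries is made up for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def find_queue_entries(directory, pattern="foo.log*"):
    """Parse every file matching `pattern` and collect its QueueEntry elements."""
    results = {}
    # glob.glob returns paths that already include the directory,
    # so no extra os.path.join is needed before opening them.
    for filename in sorted(glob.glob(os.path.join(directory, pattern))):
        root = ET.parse(filename).getroot()
        results[os.path.basename(filename)] = root.findall(".//QueueEntry")
    return results
```

Calling find_queue_entries('C:/foo/bar') would then return a dict keyed by file name, and bar.log never gets opened because it does not match the pattern.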
Solution 2 - Python
> What's the best route to take in order to read and parse only the foo.log.* and foo.log files? The bar.log files do not need to be read.
Your code does this:
if files.endswith('.log'):
You've just translated your English description into Python a bit wrong. What you wrote in Python is: "read and parse only the *.log files", meaning bar.log is included, and foo.log.1 is not.
But if you think for a second, you can translate your English description directly into Python:
if files == 'foo.log' or files.startswith('foo.log.'):
And if you think about it, as long as there are no files such as foo.log2 or foo.logs (names that continue past foo.log without that extra dot) that you want to skip, you can collapse the two cases into one:
if files.startswith('foo.log'):
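To make the difference between the two checks concrete, here is a small sketch that runs both over a list of names. The helper names and the stray foo.logs entry are invented for illustration; the other names come from the question.

```python
def is_foo_log(name):
    """True for 'foo.log' and numbered copies such as 'foo.log.1'."""
    return name == "foo.log" or name.startswith("foo.log.")


def is_foo_log_prefix(name):
    """Looser check: True for anything that begins with 'foo.log'."""
    return name.startswith("foo.log")


# 'foo.logs' is a hypothetical stray file used to show where the checks differ.
names = ["foo.log", "foo.log.1", "foo.log.2", "bar.log", "bar.log.1", "foo.logs"]
strict = [n for n in names if is_foo_log(n)]
loose = [n for n in names if is_foo_log_prefix(n)]
print(strict)  # ['foo.log', 'foo.log.1', 'foo.log.2']
print(loose)   # ['foo.log', 'foo.log.1', 'foo.log.2', 'foo.logs']
```

If nothing like foo.logs can ever show up in the directory, the two checks select the same files and the shorter one is fine.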
However, if you know anything about POSIX shells, foo.log* matches exactly the same thing. (That's not true for Windows shells, where wildcards treat extensions specially, which is why you have to type *.* instead of *.) And Python comes with a module that does POSIX-style wildcards, even on Windows, called glob. See stranac's answer for how to use this.
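If you want to see what the wildcard selects without touching the disk, the standard library's fnmatch module applies the same shell-style pattern rules to plain strings (the glob documentation describes glob as built on fnmatch). A quick sketch with the file names from the question:

```python
import fnmatch

names = ["foo.log", "foo.log.1", "foo.log.2", "bar.log", "bar.log.1"]
# fnmatch.filter keeps only the names that match the shell-style pattern.
matches = fnmatch.filter(names, "foo.log*")
print(matches)  # ['foo.log', 'foo.log.1', 'foo.log.2']
```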
I think the glob answer is better than manually filtering listdir. It's simpler, it's a more direct match for what your question title says you want to do (just do exactly what you hoped would work with os.listdir, but with glob.glob instead), and it's more flexible. So, unless you're worried about getting confused by the two slightly different meanings of wildcards, I'd suggest accepting that instead of this one.
Solution 3 - Python
This'll give you bash-like wildcard patterns (globs, which are not quite regexes):
import glob
print(glob.glob("/tmp/o*"))
Alternatively, you could os.listdir the entire directory, and throw away files that don't match a regex via the re module.
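A rough sketch of that regex route, assuming the numbered copies always end in a dot followed by digits (that is my reading of the question, not something it guarantees). The names FOO_LOG_RE and list_foo_logs are made up for the example.

```python
import os
import re

# Matches 'foo.log' exactly, or 'foo.log.' followed by one or more digits.
FOO_LOG_RE = re.compile(r"foo\.log(\.\d+)?")


def list_foo_logs(directory):
    """Return sorted paths of entries in `directory` whose names match FOO_LOG_RE."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        # fullmatch requires the whole name to match, not just a prefix.
        if FOO_LOG_RE.fullmatch(name)
    )
```

This is stricter than the foo.log* wildcard: a name like foo.log.bak would be skipped here but picked up by glob.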
Solution 4 - Python
As several answers already mentioned, you can use glob.glob to find files using wildcards. It's a very old question, but someone pointed out that glob.glob can't expand ~ in the path. For that you can use os.path.expanduser, and os.path.expandvars to expand environment variables.
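A minimal sketch of chaining those calls before globbing; expanding_glob is just a name I picked for the helper.

```python
import glob
import os


def expanding_glob(pattern):
    """Expand ~ and environment variables in `pattern`, then glob it."""
    expanded = os.path.expandvars(os.path.expanduser(pattern))
    return sorted(glob.glob(expanded))
```

For example, expanding_glob('~/logs/foo.log*') would look in a logs folder under your home directory (a hypothetical location) instead of searching for a directory literally named ~.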