sscanf in Python

PythonParsingSplitScanfProcfs

Python Problem Overview


I'm looking for an equivalent to sscanf() in Python. I want to parse /proc/net/* files, in C I could do something like this:

int matches = sscanf(
        buffer,
        "%*d: %64[0-9A-Fa-f]:%X %64[0-9A-Fa-f]:%X %*X %*X:%*X %*X:%*X %*X %*d %*d %ld %*512s\n",
        local_addr, &local_port, rem_addr, &rem_port, &inode);

I thought at first to use str.split, however it doesn't split on the given characters, but the sep string as a whole:

>>> lines = open("/proc/net/dev").readlines()
>>> for l in lines[2:]:
>>>     cols = l.split(string.whitespace + ":")
>>>     print len(cols)
1

Which should be returning 17, as explained above.

Is there a Python equivalent to sscanf (not RE), or a string splitting function in the standard library that splits on any of a range of characters that I'm not aware of?

Python Solutions


Solution 1 - Python

There is also the parse module.

parse() is designed to be the opposite of format() (the newer string formatting function in Python 2.6 and higher).

>>> from parse import parse
>>> parse('{} fish', '1')
>>> parse('{} fish', '1 fish')
<Result ('1',) {}>
>>> parse('{} fish', '2 fish')
<Result ('2',) {}>
>>> parse('{} fish', 'red fish')
<Result ('red',) {}>
>>> parse('{} fish', 'blue fish')
<Result ('blue',) {}>

Solution 2 - Python

When I'm in a C mood, I usually use zip and list comprehensions for scanf-like behavior. Like this:

input = '1 3.0 false hello'
(a, b, c, d) = [t(s) for t,s in zip((int,float,strtobool,str),input.split())]
print (a, b, c, d)

Note that for more complex format strings, you do need to use regular expressions:

import re
input = '1:3.0 false,hello'
(a, b, c, d) = [t(s) for t,s in zip((int,float,strtobool,str),re.search('^(\d+):([\d.]+) (\w+),(\w+)$',input).groups())]
print (a, b, c, d)

Note also that you need conversion functions for all types you want to convert. For example, above I used something like:

strtobool = lambda s: {'true': True, 'false': False}[s]

Solution 3 - Python

Python doesn't have an sscanf equivalent built-in, and most of the time it actually makes a whole lot more sense to parse the input by working with the string directly, using regexps, or using a parsing tool.

Probably mostly useful for translating C, people have implemented sscanf, such as in this module: http://hkn.eecs.berkeley.edu/~dyoo/python/scanf/

In this particular case if you just want to split the data based on multiple split characters, re.split is really the right tool.

Solution 4 - Python

You can split on a range of characters using the re module.

>>> import re
>>> r = re.compile('[ \t\n\r:]+')
>>> r.split("abc:def  ghi")
['abc', 'def', 'ghi']

Solution 5 - Python

You can parse with module re using http://docs.python.org/library/re.html#regular-expression-syntax">named groups. It won't parse the substrings to their actual datatypes (e.g. int) but it's very convenient when parsing strings.

Given this sample line from /proc/net/tcp:

line="   0: 00000000:0203 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 335 1 c1674320 300 0 0 0"

An example mimicking your sscanf example with the variable could be:

import re
hex_digit_pattern = r"[\dA-Fa-f]"
pat = r"\d+: " + \
      r"(?P<local_addr>HEX+):(?P<local_port>HEX+) " + \
      r"(?P<rem_addr>HEX+):(?P<rem_port>HEX+) " + \
      r"HEX+ HEX+:HEX+ HEX+:HEX+ HEX+ +\d+ +\d+ " + \
      r"(?P<inode>\d+)"
pat = pat.replace("HEX", hex_digit_pattern)

values = re.search(pat, line).groupdict()

import pprint; pprint values
# prints:
# {'inode': '335',
#  'local_addr': '00000000',
#  'local_port': '0203',
#  'rem_addr': '00000000',
#  'rem_port': '0000'}

Solution 6 - Python

you can turn the ":" to space, and do the split.eg

>>> f=open("/proc/net/dev")
>>> for line in f:
...     line=line.replace(":"," ").split()
...     print len(line)

no regex needed (for this case)

Solution 7 - Python

You could install pandas and use pandas.read_fwf for fixed width format files. Example using /proc/net/arp:

In [230]: df = pandas.read_fwf("/proc/net/arp")

In [231]: print(df)
       IP address HW type Flags         HW address Mask Device
0   141.38.28.115     0x1   0x2  84:2b:2b:ad:e1:f4    *   eth0
1   141.38.28.203     0x1   0x2  c4:34:6b:5b:e4:7d    *   eth0
2   141.38.28.140     0x1   0x2  00:19:99:ce:00:19    *   eth0
3   141.38.28.202     0x1   0x2  90:1b:0e:14:a1:e3    *   eth0
4    141.38.28.17     0x1   0x2  90:1b:0e:1a:4b:41    *   eth0
5    141.38.28.60     0x1   0x2  00:19:99:cc:aa:58    *   eth0
6   141.38.28.233     0x1   0x2  90:1b:0e:8d:7a:c9    *   eth0
7    141.38.28.55     0x1   0x2  00:19:99:cc:ab:00    *   eth0
8   141.38.28.224     0x1   0x2  90:1b:0e:8d:7a:e2    *   eth0
9   141.38.28.148     0x1   0x0  4c:52:62:a8:08:2c    *   eth0
10  141.38.28.179     0x1   0x2  90:1b:0e:1a:4b:50    *   eth0

In [232]: df["HW address"]
Out[232]:
0     84:2b:2b:ad:e1:f4
1     c4:34:6b:5b:e4:7d
2     00:19:99:ce:00:19
3     90:1b:0e:14:a1:e3
4     90:1b:0e:1a:4b:41
5     00:19:99:cc:aa:58
6     90:1b:0e:8d:7a:c9
7     00:19:99:cc:ab:00
8     90:1b:0e:8d:7a:e2
9     4c:52:62:a8:08:2c
10    90:1b:0e:1a:4b:50

In [233]: df["HW address"][5]
Out[233]: '00:19:99:cc:aa:58'

By default it tries to figure out the format automagically, but there are options you can give for more explicit instructions (see documentation). There are also other IO routines in pandas that are powerful for other file formats.

Solution 8 - Python

There is an example in the official python docs about how to use sscanf from libc:

# import libc
from ctypes import CDLL
if(os.name=="nt"):
	libc = cdll.msvcrt 
else:
	# assuming Unix-like environment
	libc = cdll.LoadLibrary("libc.so.6")
	libc = CDLL("libc.so.6")  # alternative

# allocate vars
i = c_int()
f = c_float()
s = create_string_buffer(b'\000' * 32)

# parse with sscanf
libc.sscanf(b"1 3.14 Hello", "%d %f %s", byref(i), byref(f), s)

# read the parsed values
i.value  # 1
f.value  # 3.14
s.value # b'Hello'

Solution 9 - Python

If the separators are ':', you can split on ':', and then use x.strip() on the strings to get rid of any leading or trailing whitespace. int() will ignore the spaces.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionMatt JoinerView Question on Stackoverflow
Solution 1 - PythonCraig McQueenView Answer on Stackoverflow
Solution 2 - PythonChris DellinView Answer on Stackoverflow
Solution 3 - PythonMike GrahamView Answer on Stackoverflow
Solution 4 - PythonDietrich EppView Answer on Stackoverflow
Solution 5 - PythonoripView Answer on Stackoverflow
Solution 6 - Pythonghostdog74View Answer on Stackoverflow
Solution 7 - PythongerritView Answer on Stackoverflow
Solution 8 - PythoneadmasterView Answer on Stackoverflow
Solution 9 - PythonLennart RegebroView Answer on Stackoverflow