How to use sed to extract substring

Tags: Linux, Shell, Ubuntu, XML Parsing, Sed

Linux Problem Overview


I have a file containing the following lines:

  <parameter name="PortMappingEnabled" access="readWrite" type="xsd:boolean"></parameter>
  <parameter name="PortMappingLeaseDuration" access="readWrite" activeNotify="canDeny" type="xsd:unsignedInt"></parameter>
  <parameter name="RemoteHost" access="readWrite"></parameter>
  <parameter name="ExternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="ExternalPortEndRange" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="InternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="PortMappingProtocol" access="readWrite"></parameter>
  <parameter name="InternalClient" access="readWrite"></parameter>
  <parameter name="PortMappingDescription" access="readWrite"></parameter>

I want to run a command on this file that extracts only the parameter names, as shown in the following output:

$sedcommand file.txt
PortMappingEnabled
PortMappingLeaseDuration
RemoteHost
ExternalPort
ExternalPortEndRange
InternalPort
PortMappingProtocol
InternalClient
PortMappingDescription

What command would do this?

Linux Solutions


Solution 1 - Linux

grep was born to extract things:

grep -Po 'name="\K[^"]*'

Test with your data:

kent$  echo '<parameter name="PortMappingEnabled" access="readWrite" type="xsd:boolean"></parameter>
  <parameter name="PortMappingLeaseDuration" access="readWrite" activeNotify="canDeny" type="xsd:unsignedInt"></parameter>
  <parameter name="RemoteHost" access="readWrite"></parameter>
  <parameter name="ExternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="ExternalPortEndRange" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="InternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="PortMappingProtocol" access="readWrite"></parameter>
  <parameter name="InternalClient" access="readWrite"></parameter>
  <parameter name="PortMappingDescription" access="readWrite"></parameter>
'|grep -Po 'name="\K[^"]*'
PortMappingEnabled
PortMappingLeaseDuration
RemoteHost
ExternalPort
ExternalPortEndRange
InternalPort
PortMappingProtocol
InternalClient
PortMappingDescription
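
The -P option (Perl-compatible regular expressions) is a GNU grep feature and may be missing on other systems. As a minimal sketch, assuming only that your grep supports -o (GNU and BSD grep both do), the same names can be extracted in two steps. The variable names sample and names are illustrative, and the input is shortened to two lines from the question:

```shell
# Two sample lines taken from the question.
sample='  <parameter name="PortMappingEnabled" access="readWrite" type="xsd:boolean"></parameter>
  <parameter name="RemoteHost" access="readWrite"></parameter>'

# grep -o prints only the matched text, e.g. name="RemoteHost";
# cut then keeps what sits between the first pair of double quotes.
names=$(printf '%s\n' "$sample" | grep -o 'name="[^"]*"' | cut -d'"' -f2)
printf '%s\n' "$names"
```

This trades the \K lookbehind-style reset for an extra cut process, which is usually negligible for files of this size.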

Solution 2 - Linux

sed 's/[^"]*"\([^"]*\).*/\1/'

does the job.

Explanation of the part inside ' ':

  • s - tells sed to substitute
  • / - start of the regex string to search for
  • [^"]* - any character that is not ", any number of times (matching <parameter name=, including any leading whitespace)
  • " - just a ".
  • \([^"]*\) - anything inside \( \) is saved so it can be referenced later. The \ are there so the parentheses are treated as a group and not as literal characters to search for. [^"]* means the same as above (matching RemoteHost, for example)
  • .* - any character, any number of times (matching " access="readWrite"></parameter>)
  • / - end of the search regex, and start of the substitute string.
  • \1 - reference to the string captured by the parentheses above.
  • / - end of the substitute string.

Basically it is s/search for this/replace with this/, but we're telling sed to replace the whole line with just the piece of it we captured earlier.
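
One caveat with the command above is that it prints every line: a line without any " passes through unchanged, and a line such as an XML declaration yields its first quoted value. A hedged variant, assuming you only want lines that actually contain name="...", anchors the pattern on name=" and prints only where the substitution succeeded. The sample input (including the XML declaration line) is made up for illustration:

```shell
# Hypothetical input: an XML declaration followed by two lines from the question.
sample='<?xml version="1.0"?>
  <parameter name="RemoteHost" access="readWrite"></parameter>
  <parameter name="ExternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>'

# -n suppresses automatic printing; the trailing p flag prints a line only
# when the substitution matched, so the declaration line is skipped.
names=$(printf '%s\n' "$sample" | sed -n 's/.*name="\([^"]*\)".*/\1/p')
printf '%s\n' "$names"
```

The capture group works exactly as explained in the list above; only the leading part of the regex and the -n/p pair differ.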

Solution 3 - Linux

You want awk.

This would be a quick and dirty hack:

awk -F "\"" '{print $2}' /tmp/file.txt

PortMappingEnabled
PortMappingLeaseDuration
RemoteHost
ExternalPort
ExternalPortEndRange
InternalPort
PortMappingProtocol
InternalClient
PortMappingDescription
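
The field-splitting hack relies on name being the first quoted attribute on each line. A sketch of a slightly more defensive awk version, using the standard match() function with RSTART and RLENGTH to locate name="..." wherever it appears on the line, might look like this. The second sample line, with the attributes reordered, is hypothetical:

```shell
# First line is from the question; the second has its attributes reordered.
sample='  <parameter name="ExternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter access="readWrite" name="RemoteHost"></parameter>'

# match() sets RSTART/RLENGTH to the position and length of name="...".
# Skipping the 6 characters of name=" and dropping the closing quote
# leaves just the attribute value.
names=$(printf '%s\n' "$sample" | awk '
  match($0, /name="[^"]*"/) {
    print substr($0, RSTART + 6, RLENGTH - 7)
  }')
printf '%s\n' "$names"
```

Lines without a name attribute produce no output, because match() returns 0 and the action is not run.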

Solution 4 - Linux

You should not parse XML with tools like sed or awk. It's error-prone.

If the input changes so that a newline, rather than a space, comes before the name attribute, these commands will fail some day and produce unexpected results.

If you are really sure that your input will always be formatted this way, you can use cut. It's faster than sed and awk:

cut -d'"' -f2 < input.txt
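
To make the newline failure mode concrete, here is a small sketch with a hypothetical reformatted element whose name attribute starts on its own line. By default cut passes lines without the delimiter through unchanged; the -s option suppresses them, which hides the stray line but is still not real XML parsing:

```shell
# Hypothetical input: a newline instead of a space before the name attribute.
wrapped='<parameter
  name="RemoteHost" access="readWrite"></parameter>'

# Without -s, the delimiter-free first line is printed as-is.
plain=$(printf '%s\n' "$wrapped" | cut -d'"' -f2)

# With -s, lines that contain no delimiter are dropped.
strict=$(printf '%s\n' "$wrapped" | cut -s -d'"' -f2)

printf 'without -s:\n%s\n\nwith -s:\n%s\n' "$plain" "$strict"
```

Here plain contains the unwanted <parameter line followed by RemoteHost, while strict contains only RemoteHost.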

It is better to parse the XML first and extract only the name attribute:

xpath -q -e //@name input.txt | cut -d'"' -f2

To learn more about xpath, see this tutorial: http://www.w3schools.com/xpath/

Solution 5 - Linux

Explaining how the cut approach works:

It splits each line in the file on the " delimiter and takes the 2nd field, which is the parameter name you wanted.
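
As an illustration of how the fields are numbered, the following sketch splits one line from the question on " and stores the first four fields. The variable names are only for the example:

```shell
line='  <parameter name="RemoteHost" access="readWrite"></parameter>'

# Each double quote ends one field and starts the next.
f1=$(printf '%s\n' "$line" | cut -d'"' -f1)   # text before the first quote
f2=$(printf '%s\n' "$line" | cut -d'"' -f2)   # value of the name attribute
f3=$(printf '%s\n' "$line" | cut -d'"' -f3)   # text between the 2nd and 3rd quote
f4=$(printf '%s\n' "$line" | cut -d'"' -f4)   # value of the access attribute

printf '1=[%s]\n2=[%s]\n3=[%s]\n4=[%s]\n' "$f1" "$f2" "$f3" "$f4"
```

Field 2 is the parameter name only because name happens to be the first attribute on every line of the sample file.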

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | MOHAMED | View Question on Stackoverflow
Solution 1 - Linux | Kent | View Answer on Stackoverflow
Solution 2 - Linux | unxnut | View Answer on Stackoverflow
Solution 3 - Linux | Chris | View Answer on Stackoverflow
Solution 4 - Linux | Michał Šrajer | View Answer on Stackoverflow
Solution 5 - Linux | Rushi Agrawal | View Answer on Stackoverflow