How to urlencode data for curl command?

Tags: Bash, Shell, Curl, Scripting, Urlencode

Bash Problem Overview


I am trying to write a bash script for testing that takes a parameter and sends it through curl to a web site. I need to URL-encode the value to make sure that special characters are processed properly. What is the best way to do this?

Here is my basic script so far:

#!/bin/bash
host=${1:?'bad host'}
value=$2
shift
shift
curl -v -d "param=${value}" http://${host}/somepath $@

Bash Solutions


Solution 1 - Bash

Use curl --data-urlencode; from man curl:

> This posts data, similar to the other --data options with the exception that this performs URL-encoding. To be CGI-compliant, the <data> part should begin with a name followed by a separator and a content specification.

Example usage:

curl \
    --data-urlencode "paramName=value" \
    --data-urlencode "secondParam=value" \
    http://example.com

See the man page for more info.

This requires curl 7.18.0 or newer (released January 2008). Use curl -V to check which version you have.

You can also encode the query string this way:

curl --get \
    --data-urlencode "p1=value 1" \
    --data-urlencode "p2=value 2" \
    http://example.com
    # http://example.com?p1=value%201&p2=value%202
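
Applied to the script from the question, this might look like the following minimal sketch. The helper name build_curl_args and the CURL_ARGS array are made up for illustration; the only real change is that --data-urlencode replaces -d and that every expansion is quoted. Note that curl encodes only the part after the first =, so the parameter name is assumed to be URL-safe already.

```shell
#!/bin/bash
# Hypothetical helper: collect the curl arguments for the question's script
# into the CURL_ARGS array, letting curl do the URL-encoding itself.
build_curl_args() {
  local host=${1:?'bad host'}
  local value=${2-}
  # "${@:3}" forwards any extra curl options unchanged.
  CURL_ARGS=(-v --data-urlencode "param=${value}" "http://${host}/somepath" "${@:3}")
}

# Usage (commented out so the sketch runs without network access):
#   build_curl_args example.com 'a value & more' --max-time 5
#   curl "${CURL_ARGS[@]}"
```

Keeping the arguments in an array preserves spaces and special characters in the value, which the unquoted $@ in the original script would not.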

Solution 2 - Bash

Here is the pure BASH answer.

Update: Since many changes have been discussed, I have placed this on https://github.com/sfinktah/bash/blob/master/rawurlencode.inc.sh for anybody to issue a PR against.

Note: This solution is not intended to encode unicode or multi-byte characters - which are quite outside BASH's humble native capabilities. It's only intended to encode symbols that would otherwise ruin argument passing in POST or GET requests, e.g. '&', '=' and so forth.

Very Important Note: DO NOT ATTEMPT TO WRITE YOUR OWN UNICODE CONVERSION FUNCTION, IN ANY LANGUAGE, EVER. See end of answer.

rawurlencode() {
  local string="${1}"
  local strlen=${#string}
  local encoded=""
  local pos c o

  for (( pos=0 ; pos<strlen ; pos++ )); do
     c=${string:$pos:1}
     case "$c" in
        [-_.~a-zA-Z0-9] ) o="${c}" ;;
        * )               printf -v o '%%%02x' "'$c"
     esac
     encoded+="${o}"
  done
  echo "${encoded}"    # You can either set a return variable (FASTER) 
  REPLY="${encoded}"   #+or echo the result (EASIER)... or both... :p
}

You can use it in two ways:

easier:  echo http://url/q?=$( rawurlencode "$args" )
faster:  rawurlencode "$args"; echo http://url/q?${REPLY}
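
To assemble a whole query string from several parameters, each key and value has to be encoded separately before they are joined with = and &. A minimal sketch, assuming ASCII input as noted above; encode_component and build_query are names made up for this example, and the encoder is a compact restatement of rawurlencode that emits uppercase hex digits:

```shell
#!/bin/bash
# ASCII-only encoder in the spirit of rawurlencode; the result goes to REPLY.
encode_component() {
  local s=$1 c i
  REPLY=""
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case $c in
      [-_.~a-zA-Z0-9]) REPLY+=$c ;;                       # unreserved: keep
      *) printf -v c '%%%02X' "'$c"; REPLY+=$c ;;         # everything else: %XX
    esac
  done
}

# Hypothetical helper: build_query key1 value1 [key2 value2 ...]
build_query() {
  local out="" sep=""
  while (( $# >= 2 )); do
    encode_component "$1"; out+="${sep}${REPLY}="
    encode_component "$2"; out+="$REPLY"
    sep="&"
    shift 2
  done
  printf '%s\n' "$out"
}

build_query q 'a b&c' lang 'C++'   # prints q=a%20b%26c&lang=C%2B%2B
```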

Here's the matching rawurldecode() function, which - with all modesty - is awesome.

# Returns a string in which the sequences with percent (%) signs followed by
# two hex digits have been replaced with literal characters.
rawurldecode() {

  # This is perhaps a risky gambit, but since all escape characters must be
  # encoded, we can replace %NN with \xNN and pass the lot to printf '%b', which
  # will decode hex for us

  printf -v REPLY '%b' "${1//%/\\x}" # You can either set a return variable (FASTER)

  echo "${REPLY}"  #+or echo the result (EASIER)... or both... :p
}
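
The gambit is risky mainly because printf '%b' expands every backslash escape in its argument, not just the \xNN sequences created from %NN, so a literal \n or \t in the input gets rewritten as well. A possible hardening, sketched here under the hypothetical name rawurldecode_safe, doubles existing backslashes before the substitution:

```shell
#!/bin/bash
# Hypothetical variant of rawurldecode: escape literal backslashes first so
# that printf '%b' only expands the \xNN sequences produced from %NN.
rawurldecode_safe() {
  local s=${1//\\/\\\\}         # \   ->  \\
  printf '%b\n' "${s//%/\\x}"   # %NN ->  \xNN, then let %b decode it
}

rawurldecode_safe 'C%2b%2b'     # prints C++
```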

With the matching set, we can now perform some simple tests:

$ diff rawurlencode.inc.sh \
        <( rawurldecode "$( rawurlencode "$( cat rawurlencode.inc.sh )" )" ) \
        && echo Matched

Output: Matched

And if you really really feel that you need an external tool (well, it will go a lot faster, and might do binary files and such...) I found this on my OpenWRT router...

replace_value=$(echo $replace_value | sed -f /usr/lib/ddns/url_escape.sed)

Where url_escape.sed was a file that contained these rules:

# sed url escaping
s:%:%25:g
s: :%20:g
s:<:%3C:g
s:>:%3E:g
s:#:%23:g
s:{:%7B:g
s:}:%7D:g
s:|:%7C:g
s:\\:%5C:g
s:\^:%5E:g
s:~:%7E:g
s:\[:%5B:g
s:\]:%5D:g
s:`:%60:g
s:;:%3B:g
s:/:%2F:g
s:?:%3F:g
s^:^%3A^g
s:@:%40:g
s:=:%3D:g
s:&:%26:g
s:\$:%24:g
s:\!:%21:g
s:\*:%2A:g

While it is not impossible to write such a script in BASH (probably using xxd and a very lengthy ruleset) capable of handling UTF-8 input, there are faster and more reliable ways. Attempting to decode UTF-8 into UTF-32 is a non-trivial task to do with accuracy, though very easy to do inaccurately such that you think it works until the day it doesn't.

Even the Unicode Consortium removed their sample code after discovering it was no longer 100% compatible with the actual standard.

The Unicode standard is constantly evolving, and has become extremely nuanced. Any implementation you can whip together will not be properly compliant, and if by some extreme effort you managed it, it wouldn't stay compliant.
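
For URL encoding specifically, no Unicode decoding is needed, because percent-encoding operates on bytes: every byte that is not an unreserved character becomes %XX, whatever character it belongs to. A rough sketch of that idea, letting od produce the byte values; the function name urlencode_bytes is made up, and the input is assumed to already be in the encoding the server expects (usually UTF-8):

```shell
#!/bin/bash
# Hypothetical helper: percent-encode a string byte by byte.
# od turns the input into hex byte values, so multi-byte characters never
# need to be decoded; only RFC 3986 "unreserved" bytes are left as they are.
urlencode_bytes() {
  local hex ch out=""
  for hex in $(printf '%s' "$1" | od -An -v -tx1 | tr 'a-f' 'A-F'); do
    case $hex in
      2D|2E|5F|7E|3[0-9]|4[1-9A-F]|5[0-9A]|6[1-9A-F]|7[0-9A])
        printf -v ch "\\x$hex"   # - . _ ~ 0-9 A-Z a-z: emit the byte itself
        out+=$ch ;;
      *)
        out+="%$hex" ;;          # any other byte: %XX
    esac
  done
  printf '%s\n' "$out"
}

urlencode_bytes 'a b&c'          # prints a%20b%26c
```

This spawns od and tr once per call, so it is slower than the pure bash loop for short ASCII strings.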

Solution 3 - Bash

Another option is to use jq:

$ printf %s 'encode this'|jq -sRr @uri
encode%20this
$ jq -rn --arg x 'encode this' '$x|@uri'
encode%20this

-r (--raw-output) outputs the raw contents of strings instead of JSON string literals. -n (--null-input) doesn't read input from STDIN.

-R (--raw-input) treats input lines as strings instead of parsing them as JSON, and -sR (--slurp --raw-input) reads the input into a single string. You can replace -sRr with -Rr if your input only contains a single line, or if you don't want to replace linefeeds with %0A:

$ printf %s\\n 'multiple lines' 'of text'|jq -Rr @uri
multiple%20lines
of%20text
$ printf %s\\n 'multiple lines' 'of text'|jq -sRr @uri
multiple%20lines%0Aof%20text%0A

Or this percent-encodes all bytes:

xxd -p|tr -d \\n|sed 's/../%&/g'

Solution 4 - Bash

Use Perl's URI::Escape module and uri_escape function in the second line of your bash script:

...

value="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$2")"
...

Edit: Fix quoting problems, as suggested by Chris Johnsen in the comments. Thanks!

Solution 5 - Bash

One variant, perhaps ugly but simple:

urlencode() {
    local data
    if [[ $# != 1 ]]; then
        echo "Usage: $0 string-to-urlencode"
        return 1
    fi
    data="$(curl -s -o /dev/null -w %{url_effective} --get --data-urlencode "$1" "")"
    if [[ $? != 3 ]]; then
        echo "Unexpected error" 1>&2
        return 2
    fi
    echo "${data##/?}"
    return 0
}

Here is the one-liner version for example (as suggested by Bruno):

date | curl -Gso /dev/null -w %{url_effective} --data-urlencode @- "" | cut -c 3-

# If you experience the trailing %0A, use
date | curl -Gso /dev/null -w %{url_effective} --data-urlencode @- "" | sed -E 's/..(.*).../\1/'

Solution 6 - Bash

For the sake of completeness: many solutions using sed or awk translate only a specific set of characters. They are therefore quite large in code size, and they also don't translate other special characters that should be encoded.

A safe way to urlencode is to simply encode every single byte, even those that would have been allowed.

echo -ne 'some random\nbytes' | xxd -plain | tr -d '\n' | sed 's/\(..\)/%\1/g'

xxd is taking care here that the input is handled as bytes and not characters.

edit:

xxd comes with the vim-common package in Debian, and I was just on a system where it was not installed and I didn't want to install it. The alternative is to use hexdump from the bsdmainutils package in Debian. According to the following graph, bsdmainutils and vim-common should be about equally likely to be installed:

http://qa.debian.org/popcon-png.php?packages=vim-common%2Cbsdmainutils&show_installed=1&want_legend=1&want_ticks=1

Nevertheless, here is a version that uses hexdump instead of xxd and avoids the tr call:

echo -ne 'some random\nbytes' | hexdump -v -e '/1 "%02x"' | sed 's/\(..\)/%\1/g'

Solution 7 - Bash

I find it more readable in python:

encoded_value=$(python3 -c "import urllib.parse; print(urllib.parse.quote('''$value'''))")

The triple ' means that most single quotes in the value won't hurt (a value that ends with ', or contains ''' or a backslash, still breaks it). urllib is in the standard library. It works, for example, for this crazy (real-world) URL:

"http://www.rai.it/dl/audio/" "1264165523944Ho servito il re d'Inghilterra - Puntata 7

Solution 8 - Bash

I've found the following snippet useful to stick it into a chain of program calls, where URI::Escape might not be installed:

perl -p -e 's/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg'

Solution 9 - Bash

If you want to run a GET request using only curl, just add --get to @Jacob's solution (Solution 1).

Here is an example:

curl -v --get --data-urlencode "access_token=$(cat .fb_access_token)" https://graph.facebook.com/me/feed

Solution 10 - Bash

This may be the best one:

after=$(echo -e "$before" | od -An -tx1 | tr ' ' % | xargs printf "%s")
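
Two details of this one-liner are worth knowing: echo -e appends a newline (which ends up as a trailing %0a) and interprets backslash escapes in $before, and od without -v collapses repeated output lines into a single *. A variant that avoids both might look like this; the function name encode_all_bytes is made up for the example:

```shell
#!/bin/bash
# Hypothetical helper: percent-encode every byte of the argument.
encode_all_bytes() {
  # printf '%s' adds no newline and leaves backslashes alone;
  # od -v prints every byte even when output lines repeat.
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../%&/g'
  echo   # terminate the output line
}

encode_all_bytes 'a b'   # prints %61%20%62
```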

Solution 11 - Bash

Direct link to awk version : http://www.shelldorado.com/scripts/cmds/urlencode
I used it for years and it works like a charm

:
##########################################################################
# Title      :	urlencode - encode URL data
# Author     :	Heiner Steven
# Date       :	2000-03-15
# Requires   :	awk
# Categories :	File Conversion, WWW, CGI
# SCCS-Id.   :	@(#) urlencode	1.4 06/10/29
##########################################################################
# Description
#	Encode data according to
#	    RFC 1738: "Uniform Resource Locators (URL)" and
#	    RFC 1866: "Hypertext Markup Language - 2.0" (HTML)
#
#	This encoding is used i.e. for the MIME type
#	"application/x-www-form-urlencoded"
#
# Notes
#    o	The default behaviour is not to encode the line endings. This
#	may not be what was intended, because the result will be
#	multiple lines of output (which cannot be used in an URL or a
#	HTTP "POST" request). If the desired output should be one
#	line, use the "-l" option.
#
#    o	The "-l" option assumes, that the end-of-line is denoted by
#	the character LF (ASCII 10). This is not true for Windows or
#	Mac systems, where the end of a line is denoted by the two
#	characters CR LF (ASCII 13 10).
#	We use this for symmetry; data processed in the following way:
#		cat | urlencode -l | urldecode -l
#	should (and will) result in the original data
#
#    o	Large lines (or binary files) will break many AWK
#    	implementations. If you get the message
#		awk: record `...' too long
#		 record number xxx
#	consider using GNU AWK (gawk).
#
#    o	urlencode will always terminate its output with an EOL
#    	character
#
# Thanks to Stefan Brozinski for pointing out a bug related to non-standard
# locales.
#
# See also
#	urldecode
##########################################################################

PN=`basename "$0"`			# Program name
VER='1.4'

: ${AWK=awk}

Usage () {
    echo >&2 "$PN - encode URL data, $VER
usage: $PN [-l] [file ...]
    -l:  encode line endings (result will be one line of output)

The default is to encode each input line on its own."
    exit 1
}

Msg () {
    for MsgLine
    do echo "$PN: $MsgLine" >&2
    done
}

Fatal () { Msg "$@"; exit 1; }

set -- `getopt hl "$@" 2>/dev/null` || Usage
[ $# -lt 1 ] && Usage			# "getopt" detected an error

EncodeEOL=no
while [ $# -gt 0 ]
do
    case "$1" in
    	-l)	EncodeEOL=yes;;
	--)	shift; break;;
	-h)	Usage;;
	-*)	Usage;;
	*)	break;;			# First file name
    esac
    shift
done

LANG=C	export LANG
$AWK '
    BEGIN {
	# We assume an awk implementation that is just plain dumb.
	# We will convert an character to its ASCII value with the
	# table ord[], and produce two-digit hexadecimal output
	# without the printf("%02X") feature.

	EOL = "%0A"		# "end of line" string (encoded)
	split ("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
	hextab [0] = 0
	for ( i=1; i<=255; ++i ) ord [ sprintf ("%c", i) "" ] = i + 0
	if ("'"$EncodeEOL"'" == "yes") EncodeEOL = 1; else EncodeEOL = 0
    }
    {
	encoded = ""
	for ( i=1; i<=length ($0); ++i ) {
	    c = substr ($0, i, 1)
	    if ( c ~ /[a-zA-Z0-9.-]/ ) {
		encoded = encoded c		# safe character
	    } else if ( c == " " ) {
		encoded = encoded "+"	# special handling
	    } else {
		# unsafe character, encode it as a two-digit hex-number
		lo = ord [c] % 16
		hi = int (ord [c] / 16);
		encoded = encoded "%" hextab [hi] hextab [lo]
	    }
	}
	if ( EncodeEOL ) {
	    printf ("%s", encoded EOL)
	} else {
	    print encoded
	}
    }
    END {
    	#if ( EncodeEOL ) print ""
    }
' "$@"

Solution 12 - Bash

Here's a Bash solution which doesn't invoke any external programs:

uriencode() {
  s="${1//'%'/%25}"
  s="${s//' '/%20}"
  s="${s//'"'/%22}"
  s="${s//'#'/%23}"
  s="${s//'$'/%24}"
  s="${s//'&'/%26}"
  s="${s//'+'/%2B}"
  s="${s//','/%2C}"
  s="${s//'/'/%2F}"
  s="${s//':'/%3A}"
  s="${s//';'/%3B}"
  s="${s//'='/%3D}"
  s="${s//'?'/%3F}"
  s="${s//'@'/%40}"
  s="${s//'['/%5B}"
  s="${s//']'/%5D}"
  printf %s "$s"
}

Solution 13 - Bash

url=$(echo "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/!/%21/g' -e 's/"/%22/g' -e 's/#/%23/g' -e 's/\$/%24/g' -e 's/\&/%26/g' -e 's/'\''/%27/g' -e 's/(/%28/g' -e 's/)/%29/g' -e 's/\*/%2a/g' -e 's/+/%2b/g' -e 's/,/%2c/g' -e 's/-/%2d/g' -e 's/\./%2e/g' -e 's/\//%2f/g' -e 's/:/%3a/g' -e 's/;/%3b/g' -e 's/</%3c/g' -e 's/=/%3d/g' -e 's/>/%3e/g' -e 's/?/%3f/g' -e 's/@/%40/g' -e 's/\[/%5b/g' -e 's/\\/%5c/g' -e 's/\]/%5d/g' -e 's/\^/%5e/g' -e 's/_/%5f/g' -e 's/`/%60/g' -e 's/{/%7b/g' -e 's/|/%7c/g' -e 's/}/%7d/g' -e 's/~/%7e/g')

This encodes the string in $1 and stores the result in $url, although you don't have to put it in a variable if you don't want to. Note that I didn't include a sed expression for the tab character, since I thought it would be turned into spaces.

Solution 14 - Bash

Using php from a shell script:

value="http://www.google.com"
encoded=$(php -r "echo rawurlencode('$value');")
# encoded = "http%3A%2F%2Fwww.google.com"
echo $(php -r "echo rawurldecode('$encoded');")
# returns: "http://www.google.com"
  1. http://www.php.net/manual/en/function.rawurlencode.php
  2. http://www.php.net/manual/en/function.rawurldecode.php

Solution 15 - Bash

If you don't want to depend on Perl you can also use sed. It's a bit messy, as each character has to be escaped individually. Make a file with the following contents and call it urlencode.sed

s/%/%25/g
s/ /%20/g
# \t (tab) is a GNU sed extension
s/\t/%09/g
s/!/%21/g
s/"/%22/g
s/#/%23/g
s/\$/%24/g
s/\&/%26/g
s/'/%27/g
s/(/%28/g
s/)/%29/g
s/\*/%2a/g
s/+/%2b/g
s/,/%2c/g
s/-/%2d/g
s/\./%2e/g
s/\//%2f/g
s/:/%3a/g
s/;/%3b/g
s/</%3c/g
s/=/%3d/g
s/>/%3e/g
s/?/%3f/g
s/@/%40/g
s/\[/%5b/g
s/\\/%5c/g
s/\]/%5d/g
s/\^/%5e/g
s/_/%5f/g
s/`/%60/g
s/{/%7b/g
s/|/%7c/g
s/}/%7d/g
s/~/%7e/g

To use it do the following.

STR1=$(echo "https://www.example.com/change&$ ^this to?%checkthe@-functionality" | cut -d\? -f1)
STR2=$(echo "https://www.example.com/change&$ ^this to?%checkthe@-functionality" | cut -d\? -f2)
OUT2=$(echo "$STR2" | sed -f urlencode.sed)
echo "$STR1?$OUT2"

This splits the string into the part that is fine and the part that needs encoding, encodes the part that needs it, then stitches the two back together.

You can put that into a sh script for convenience, maybe have it take a parameter to encode, put it on your path and then you can just call:

urlencode 'https://www.example.com?isThisFun=HellNo'
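
One caveat with the cut-based split: cut -d\? -f2 returns only the text between the first and second ?, so a query that itself contains a ? is truncated, and a URL without any ? comes back whole as the "query". A sketch of the same split using shell parameter expansion, which cuts only at the first ?; the names split_url, URL_BASE and URL_QUERY are made up for this example:

```shell
#!/bin/bash
# Hypothetical helper: split a URL at the first "?" using parameter expansion.
# Sets URL_BASE and URL_QUERY (URL_QUERY is empty when there is no "?").
split_url() {
  URL_BASE=${1%%\?*}                 # drop everything from the first "?"
  case $1 in
    *\?*) URL_QUERY=${1#*\?} ;;      # keep everything after the first "?"
    *)    URL_QUERY="" ;;
  esac
}

split_url 'https://www.example.com/change?a=1?b=2'
printf '%s\n' "$URL_BASE" "$URL_QUERY"
```

The query part can then be passed to whichever encoder you prefer, for example sed -f urlencode.sed as above.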

Solution 16 - Bash

Python 3 based on @sandro's good answer from 2010:

echo "Test & /me" | python3 -c "import urllib.parse;print (urllib.parse.quote(input()))"

Test%20%26%20/me
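
Note that input() reads only the first line of stdin. If the value is already in a shell variable, another option is to hand it to Python as a command-line argument, so it is never parsed as Python source and may contain quotes or newlines. A minimal sketch, assuming python3 is on the PATH; the wrapper name py_urlencode is made up:

```shell
#!/bin/bash
# Hypothetical helper: the value travels as sys.argv[1], so it is never
# parsed as Python source and no quoting tricks are needed.
py_urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "$1"
}

py_urlencode "it's a/b"   # prints it%27s%20a/b
```

By default urllib.parse.quote leaves / unencoded; pass safe="" as a second argument to quote() if slashes should be encoded too.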

Solution 17 - Bash

For those of you looking for a solution that doesn't need perl, here is one that only needs hexdump and awk:

url_encode() {
 [ $# -lt 1 ] && { return; }

 encodedurl="$1";

 # make sure hexdump exists, if not, just give back the url
 [ ! -x "/usr/bin/hexdump" ] && { return; }

 encodedurl=`
   echo $encodedurl | hexdump -v -e '1/1 "%02x\t"' -e '1/1 "%_c\n"' |
   LANG=C awk '
     $1 == "20"                    { printf("%s",   "+"); next } # space becomes plus
     $1 ~  /0[adAD]/               {                      next } # strip newlines
     $2 ~  /^[a-zA-Z0-9.*()\/-]$/  { printf("%s",   $2);  next } # pass through what we can
                                   { printf("%%%s", $1)        } # take hex value of everything else
   '`
}

Stitched together from a couple of places across the net and some local trial and error. It works great!

Solution 18 - Bash

uni2ascii is very handy:

$ echo -ne '你好世界' | uni2ascii -aJ
%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C

Solution 19 - Bash

You can emulate javascript's encodeURIComponent in perl. Here's the command:

perl -pe 's/([^a-zA-Z0-9_.!~*()'\''-])/sprintf("%%%02X", ord($1))/ge'

You could set this as a bash alias in .bash_profile:

alias encodeURIComponent='perl -pe '\''s/([^a-zA-Z0-9_.!~*()'\''\'\'''\''-])/sprintf("%%%02X",ord($1))/ge'\'

Now you can pipe into encodeURIComponent:

$ echo -n 'hèllo wôrld!' | encodeURIComponent
h%C3%A8llo%20w%C3%B4rld!

Solution 20 - Bash

Simple PHP option:

echo 'part-that-needs-encoding' | php -R 'echo urlencode($argn);'

Solution 21 - Bash

This nodejs-based answer will use encodeURIComponent on stdin:

uriencode_stdin() {
    node -p 'encodeURIComponent(require("fs").readFileSync(0))'
}

echo -n $'hello\nwörld' | uriencode_stdin
hello%0Aw%C3%B6rld

Solution 22 - Bash

Here's the node version:

uriencode() {
  node -p "encodeURIComponent('${1//\'/\\\'}')"
}

Solution 23 - Bash

The question is about doing this in bash and there's no need for python or perl as there is in fact a single command that does exactly what you want - "urlencode".

value=$(urlencode "${2}")

This is also much better, as the above perl answer, for example, doesn't encode all characters correctly. Try it with the long dash you get from Word and you get the wrong encoding.

Note, you need "gridsite-clients" installed to provide this command.

Solution 24 - Bash

Another php approach:

echo "encode me" | php -r "echo urlencode(file_get_contents('php://stdin'));"

Solution 25 - Bash

Here is a POSIX function to do that:

url_encode() {
   awk 'BEGIN {
      for (n = 0; n < 128; n++) {   # all of ASCII, so } (125) and ~ (126) are mapped too
         m[sprintf("%c", n)] = n
      }
      n = 1
      while (1) {
         s = substr(ARGV[1], n, 1)
         if (s == "") {
            break
         }
         t = s ~ /[[:alnum:]_.!~*\47()-]/ ? t s : t sprintf("%%%02X", m[s])
         n++
      }
      print t
   }' "$1"
}

Example:

value=$(url_encode "$2")

Solution 26 - Bash

What would parse URLs better than javascript?

node -p "encodeURIComponent('$url')"

Solution 27 - Bash

Ruby, for completeness

value="$(ruby -r cgi -e 'puts CGI.escape(ARGV[0])' "$2")"

Solution 28 - Bash

Here is my version for the busybox ash shell on an embedded system. I originally adapted it from Orwellophile's variant:

urlencode()
{
    local S="${1}"
    local encoded=""
    local ch
    local o
    for i in $(seq 0 $((${#S} - 1)) )
    do
        ch=${S:$i:1}
        case "${ch}" in
            [-_.~a-zA-Z0-9]) 
                o="${ch}"
                ;;
            *) 
                o=$(printf '%%%02x' "'$ch")                
                ;;
        esac
        encoded="${encoded}${o}"
    done
    echo ${encoded}
}

urldecode() 
{
    # urldecode <string>
    local url_encoded="${1//+/ }"
    printf '%b' "${url_encoded//%/\\x}"
}

Solution 29 - Bash

Here's a one-line conversion using Lua, similar to blueyed's answer except with all the RFC 3986 Unreserved Characters left unencoded (like this answer):

url=$(echo 'print((arg[1]:gsub("([^%w%-%.%_%~])",function(c)return("%%%02X"):format(c:byte())end)))' | lua - "$1")

Additionally, you may need to ensure that newlines in your string are converted from LF to CRLF, in which case you can insert a gsub("\r?\n", "\r\n") in the chain before the percent-encoding.

Here's a variant that, in the non-standard style of application/x-www-form-urlencoded, does that newline normalization, as well as encoding spaces as '+' instead of '%20' (which could probably be added to the Perl snippet using a similar technique).

url=$(echo 'print((arg[1]:gsub("\r?\n", "\r\n"):gsub("([^%w%-%.%_%~ ])",function(c)return("%%%02X"):format(c:byte())end):gsub(" ","+"))' | lua - "$1")

Solution 30 - Bash

With php installed, I do it this way:

URL_ENCODED_DATA=`php -r "echo urlencode('$DATA');"`

Solution 31 - Bash

This is the ksh version of Orwellophile's answer containing the rawurlencode and rawurldecode functions (link: https://stackoverflow.com/questions/296536/urlencode-from-a-bash-script/10660730#10660730). I don't have enough rep to post a comment, hence the new post.

#!/bin/ksh93

function rawurlencode
{
    typeset string="${1}"
    typeset strlen=${#string}
    typeset encoded=""

    for (( pos=0 ; pos<strlen ; pos++ )); do
        c=${string:$pos:1}
        case "$c" in
            [-_.~a-zA-Z0-9] ) o="${c}" ;;
            * )               o=$(printf '%%%02x' "'$c")
        esac
        encoded+="${o}"
    done
    print "${encoded}"
}

function rawurldecode
{
    printf $(printf '%b' "${1//%/\\x}")
}

print $(rawurlencode "C++")     # --> C%2b%2b
print $(rawurldecode "C%2b%2b") # --> C++

Solution 32 - Bash

In this case, I needed to URL encode the hostname. Don't ask why. Being a minimalist, and a Perl fan, here's what I came up with.

url_encode()
  {
  echo -n "$1" | perl -pe 's/[^a-zA-Z0-9\/_.~-]/sprintf "%%%02x", ord($&)/ge'
  }

Works perfectly for me.

Solution 33 - Bash

The following is based on Orwellophile's answer, but solves the multibyte bug mentioned in the comments by setting LC_ALL=C (a trick from vte.sh). I've written it as a function suitable for PROMPT_COMMAND, because that's how I use it.

print_path_url() {
  local LC_ALL=C
  local string="$PWD"
  local strlen=${#string}
  local encoded=""
  local pos c o

  for (( pos=0 ; pos<strlen ; pos++ )); do
     c=${string:$pos:1}
     case "$c" in
        [-_.~a-zA-Z0-9/] ) o="${c}" ;;
        * )               printf -v o '%%%02x' "'$c"
     esac
     encoded+="${o}"
  done
  printf "\033]7;file://%s%s\007" "${HOSTNAME:-}" "${encoded}"
}

Solution 34 - Bash

For one of my cases I found that the NodeJS url lib had the simplest solution. Of course YMMV

$ urlencode(){ node -e "console.log(require('url').parse(process.argv.slice(1).join('+')).href)" "$@"; }

$ urlencode "https://example.com?my_database_has=these 'nasty' query strings in it"
https://example.com/?my_database_has=these%20%27nasty%27%20query%20strings%20in%20it

Solution 35 - Bash

There is an excellent answer from Orwellophile that includes a pure bash option (the rawurlencode function), which I've used on my website (a shell-based CGI script that returns a large number of URLs in response to search requests). The only drawback was high CPU usage at peak times.

I've found a modified solution that leverages bash's "global replace" feature. With this solution, URL encoding is about 5X faster. It identifies the characters to be escaped and uses the "global replace" operator (${var//source/replacement}) to process all substitutions. The speedup clearly comes from using bash's internal loops instead of an explicit loop.

Performance: on a Core i3-8100 at 3.60GHz. Test case: 1000 URLs from Stack Overflow, similar to this ticket: "https://stackoverflow.com/questions/296536/how-to-urlencode-data-for-curl-command".

  • Existing Solution: 0.807 sec
  • Optimized Solution: 0.162 sec (5X speedup)

url_encode()
{
    local key="${1}" varname="${2:-_rval}" prefix="${3:-_ENCKEY_}"
    local unsafe=${key//[-_.~a-zA-Z0-9 ]/} 
    local -i key_len=${#unsafe}
    local ch ch1 ch0

    while [ "$unsafe" ] ;do
        ch=${unsafe:0:1}
        ch0="\\$ch"
        printf -v ch1 '%%%02x' "'$ch'" 
        key=${key//$ch0/"$ch1"}
        unsafe=${unsafe//"$ch"}   # quote the bare character: a quoted "$ch0" (backslash + char) would never match
    done
    key=${key// /+} 

    REPLY="$key"
    # printf "%s" "$REPLY"
    return 0
}

As a minor extra, it uses '+' to encode the space. Slightly more compact URL.

Benchmark:

function t {
    local key
    for (( i=1 ; i<=$1 ; i++ )) do url_encode "$2" kkk2 ; done
    echo "K=$REPLY"
}

t 1000 "https://stackoverflow.com/questions/296536/how-to-urlencode-data-for-curl-command"

Solution 36 - Bash

Note

  • These functions are NOT made to encode a URL's data, but whole URLs.
  • Put the URLs in a file, one per line.

#!/bin/dash

replaceUnicodes () { # $1=input/output file
	if ! mv -f "$1" "$1".tmp 2>/dev/null; then return 1; fi
	output="$1" awk '
	function hexValue(chr) {
		if(chr=="0") return 0; if(chr=="1") return 1; if(chr=="2") return 2; if(chr=="3") return 3; if(chr=="4") return 4; if(chr=="5") return 5;
		if(chr=="6") return 6; if(chr=="7") return 7; if(chr=="8") return 8; if(chr=="9") return 9; if(chr=="A") return 10;
		if(chr=="B") return 11; if(chr=="C") return 12; if(chr=="D") return 13; if(chr=="E") return 14; return 15 }
	function hexToDecimal(str,	value,i,inc) {
		str=toupper(str); value=and(hexValue(substr(str,length(str),1)),15); inc=1;
		for(i=length(str)-1;i>0;i--) {
			value+=lshift(hexValue(substr(str,i,1)),4*inc++)
		} return value }
	function toDecimal(str,	value,i) {
		for(i=1;i<=length(str);i++) {
			value=(value*10)+substr(str,i,1)
		} return value }
	function to32BE(high,low) {
		# return 0x10000+((high-0xD800)*0x400)+(low-0xDC00) }
		return lshift((high-0xD800),10)+(low-0xDC00)+0x10000 }
	function toUTF8(value) {
		if(value<0x80) { 
			return sprintf("%%%02X",value)
		} else if(value>0xFFFF) {
			return sprintf("%%%02X%%%02X%%%02X%%%02X",or(0xF0,and(rshift(value,18),0x07)),or(0x80,and(rshift(value,12),0x3F)),or(0x80,and(rshift(value,6),0x3F)),or(0x80,and(rshift(value,0),0x3F)))
		} else if(value>0x07FF) {
			return sprintf("%%%02X%%%02X%%%02X",or(0xE0,and(rshift(value,12),0x0F)),or(0x80,and(rshift(value,6),0x3F)),or(0x80,and(rshift(value,0),0x3F)))
		} else { return sprintf("%%%02X%%%02X",or(0xC0,and(rshift(value,6),0x1F)),or(0x80,and(rshift(value,0),0x3F))) }
	}
	function trap(str) { sub(/^\\+/,"\\",str); return str }
	function esc(str) { gsub(/\\/,"\\\\",str); return str }
	BEGIN { output=ENVIRON["output"] }
	{
		finalStr=""; while(match($0,/[\\]+u[0-9a-fA-F]{4}/)) {
			p=substr($0,RSTART,RLENGTH); num=hexToDecimal(substr(p,RLENGTH-3,4));
			bfrStr=substr($0,1,RSTART-1); $0=substr($0,RSTART+RLENGTH,length($0)-(RSTART+RLENGTH-1));
			if(surrogate) {
				surrogate=0;
				if(RSTART!=1 || num<0xD800 || (num>0xDBFF && num<0xDC00) || num>0xDFFF) {
					finalStr=sprintf("%s%s%s%s",finalStr,trap(highP),bfrStr,toUTF8(num))
				} else if(num>0xD7FF && num<0xDC00) {
					surrogate=1; high=num; finalStr=sprintf("%s%s",finalStr,trap(highP))
				} else { finalStr=sprintf("%s%s",finalStr,toUTF8(to32BE(high,num))) }
			} else if(num>0xD7FF && num<0xDC00) {
				surrogate=1; highP=p; high=num; finalStr=sprintf("%s%s",finalStr,bfrStr)
			} else { finalStr=sprintf("%s%s%s",finalStr,bfrStr,toUTF8(num)) }
		} finalStr=sprintf("%s%s",finalStr,$0); $0=finalStr

		while(match($0,/[\\]+U[0-9a-fA-F]{8}/)) {
			str=substr($0,RSTART,RLENGTH); gsub(esc(str),toUTF8(hexToDecimal(substr(str,RLENGTH-7,8))),$0)
		}
		while(match($0,/[\\]*&#[xX][0-9a-fA-F]{1,8};/)) {
			str=substr($0,RSTART,RLENGTH); idx=index(str,"#");
			gsub(esc(str),toUTF8(hexToDecimal(substr(str,idx+2,RLENGTH-idx-2))),$0)
		}
		while(match($0,/[\\]*&#[0-9]{1,10};/)) {
			str=substr($0,RSTART,RLENGTH); idx=index(str,"#");
			gsub(esc(str),toUTF8(toDecimal(substr(str,idx+1,RLENGTH-idx-1))),$0)
		}
		printf("%s\n",$0) > output
	}' "$1".tmp
	rm -f "$1".tmp
}

replaceHtmlEntities () { # $1=input/output file
	if ! mv -f "$1" "$1".tmp 2>/dev/null; then return 1; fi
	sed 's/%3[aA]/:/g; s/%2[fF]/\//g; s/&quot;/%22/g; s/&lt;/%3C/g; s/&gt;/%3E/g; s/&nbsp;/%A0/g; s/&cent;/%A2/g; s/&pound;/%A3/g; s/&yen;/%A5/g; s/&copy;/%A9/g; s/&reg;/%AE/g; s/&amp;/\&/g; s/\\*\//\//g' "$1".tmp > "$1"
	rm -f "$1".tmp
}


# "od -v -A n -t u1 -w99999999"
# "hexdump -v -e \47/1 \42%d \42\47"
# Reminder :: Do not encode (, ), [, and ].
toUTF8Encoded () { # $1=input/output file
	if ! mv -f "$1" "$1".tmp 2>/dev/null; then return 1; fi
	if [ -s "$1".tmp ]; then
		# od -A n -t u1 -w99999999 "$1".tmp | \
		hexdump -v -e '/1 "%d "' "$1".tmp | \
		output="$1" awk 'function hexDigit(chr) { if((chr>47 && chr<58) || (chr>64 && chr<71) || (chr>96 && chr<103)) return 1; return 0 }
		BEGIN { output=ENVIRON["output"] }
		{	for(i=1;i<=NF;i++) {
				flushed=0; c=$(i);
				if(c==13) { if($(i+1)==10) i++; printf("%s\n",url) > output; url=""; flushed=1
				} else if(c==10) { printf("%s\n",url) > output; url=""; flushed=1
				} else if(c==37) {
					if(hexDigit($(i+1)) && hexDigit($(i+2))) {
						url=sprintf("%s%%%c%c",url,$(i+1),$(i+2)); i+=2
					} else { url=sprintf("%s%%25",url) }
				} else if(c>32 && c<127 && c!=34 && c!=39 && c!=96 && c!=60 && c!=62) {
					url=sprintf("%s%c",url,c)
				} else { url=sprintf("%s%%%02X",url,c) }
			} if(!flushed) printf("%s\n",url) > output
		}'
	fi
	rm -f "$1".tmp
}

Call replaceUnicodes() --> replaceHtmlEntities() --> toUTF8Encoded()

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Aaron | View Question on Stackoverflow
Solution 1 - Bash | Jacob Rask | View Answer on Stackoverflow
Solution 2 - Bash | Orwellophile | View Answer on Stackoverflow
Solution 3 - Bash | nisetama | View Answer on Stackoverflow
Solution 4 - Bash | dubek | View Answer on Stackoverflow
Solution 5 - Bash | Sergey | View Answer on Stackoverflow
Solution 6 - Bash | josch | View Answer on Stackoverflow
Solution 7 - Bash | sandro | View Answer on Stackoverflow
Solution 8 - Bash | blueyed | View Answer on Stackoverflow
Solution 9 - Bash | Piotr Czapla | View Answer on Stackoverflow
Solution 10 - Bash | chenzhiwei | View Answer on Stackoverflow
Solution 11 - Bash | MatthieuP | View Answer on Stackoverflow
Solution 12 - Bash | davidchambers | View Answer on Stackoverflow
Solution 13 - Bash | manoflinux | View Answer on Stackoverflow
Solution 14 - Bash | Darren Weber | View Answer on Stackoverflow
Solution 15 - Bash | Jay | View Answer on Stackoverflow
Solution 16 - Bash | Wolfgang Fahl | View Answer on Stackoverflow
Solution 17 - Bash | Louis Marascio | View Answer on Stackoverflow
Solution 18 - Bash | kev | View Answer on Stackoverflow
Solution 19 - Bash | Klaus | View Answer on Stackoverflow
Solution 20 - Bash | Ryan | View Answer on Stackoverflow
Solution 21 - Bash | masterxilo | View Answer on Stackoverflow
Solution 22 - Bash | davidchambers | View Answer on Stackoverflow
Solution 23 - Bash | Dylan | View Answer on Stackoverflow
Solution 24 - Bash | jan halfar | View Answer on Stackoverflow
Solution 25 - Bash | Zombo | View Answer on Stackoverflow
Solution 26 - Bash | Nestor Urquiza | View Answer on Stackoverflow
Solution 27 - Bash | k107 | View Answer on Stackoverflow
Solution 28 - Bash | nulleight | View Answer on Stackoverflow
Solution 29 - Bash | Stuart P. Bentley | View Answer on Stackoverflow
Solution 30 - Bash | ajaest | View Answer on Stackoverflow
Solution 31 - Bash | Ray Burgemeestre | View Answer on Stackoverflow
Solution 32 - Bash | sthames42 | View Answer on Stackoverflow
Solution 33 - Bash | Per Bothner | View Answer on Stackoverflow
Solution 34 - Bash | Bruno Bronosky | View Answer on Stackoverflow
Solution 35 - Bash | dash-o | View Answer on Stackoverflow
Solution 36 - Bash | Darkman | View Answer on Stackoverflow