What is a rune?
Problem Overview
What is a rune in Go?
I've been googling but Golang only says in one line: rune is an alias for int32.
But how come integers are used all around, like swapping cases?
The following is a function swapcase.
What is all the <= and -?
And why doesn't switch have any arguments?
&& should mean and, but what is r <= 'z'?
func SwapRune(r rune) rune {
	switch {
	case 'a' <= r && r <= 'z':
		return r - 'a' + 'A'
	case 'A' <= r && r <= 'Z':
		return r - 'A' + 'a'
	default:
		return r
	}
}
Most of them are from http://play.golang.org/p/H6wjLZj6lW
func SwapCase(str string) string {
	return strings.Map(SwapRune, str)
}
I understand this is mapping rune to string so that it can return the swapped string. But I do not understand how exactly rune or byte works here.
Go Solutions
Solution 1 - Go
Rune literals are just 32-bit integer values (however, they're untyped constants, so their type can change). They represent Unicode code points. For example, the rune literal 'a' is actually the number 97.
Therefore your program is pretty much equivalent to:
package main

import "fmt"

func SwapRune(r rune) rune {
	switch {
	case 97 <= r && r <= 122:
		return r - 32
	case 65 <= r && r <= 90:
		return r + 32
	default:
		return r
	}
}

func main() {
	fmt.Println(SwapRune('a'))
}
This becomes clear if you look at the Unicode mapping, which is identical to ASCII in that range. Furthermore, 32 is the offset between the uppercase and lowercase code points of each letter. So by adding 32 to 'A', you get 'a', and by subtracting 32 from 'a', you get 'A'.
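As a small sketch of the arithmetic described above, the following prints a rune literal as a number and derives the offset of 32 from the letters themselves. The helper name caseOffset is made up for this illustration.

```go
package main

import "fmt"

// caseOffset (hypothetical helper) returns the distance between the
// lowercase and uppercase ASCII letters, computed from rune literals.
func caseOffset() rune {
	return 'a' - 'A'
}

func main() {
	var r rune = 'a'
	fmt.Println(r)                        // 97: a rune prints as its number
	fmt.Printf("%c %d %U\n", r, r, r)     // a 97 U+0061
	fmt.Println(caseOffset())             // 32
	fmt.Println(string(r - caseOffset())) // A
}
```

Writing r - 'a' + 'A' instead of r - 32, as the original function does, keeps the intent visible without a magic number.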
Solution 2 - Go
From the Go lang release notes: http://golang.org/doc/go1#rune
rune is a type. It occupies 32 bits and is meant to represent a Unicode code point.
As an analogy, the English character set encoded in ASCII has 128 code points, and so it fits inside a byte (8 bits). From this (erroneous) assumption, C treated characters as bytes (char) and strings as a sequence of characters (char*).
But guess what: humans have invented many other symbols besides 'abcde..'. There are so many that we need 32 bits to encode them.
In Go, then, a string is a sequence of bytes. However, since multiple bytes can represent a rune code point, a string value can also contain runes. So it can be converted to a []rune, or vice versa.
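A minimal sketch of that conversion, assuming a string with one non-ASCII character. The helper name lengths is made up for this illustration; the \u00e9 escape stands for 'é' so the example does not depend on how the source file is saved.

```go
package main

import "fmt"

// lengths (hypothetical helper) reports the byte length and the rune count of s.
func lengths(s string) (byteLen, runeLen int) {
	return len(s), len([]rune(s))
}

func main() {
	s := "h\u00e9llo" // \u00e9 is 'é', which UTF-8 encodes in two bytes
	b, r := lengths(s)
	fmt.Println(b, r) // 6 5

	// Converting to []rune and back yields the original string.
	fmt.Println(string([]rune(s)) == s) // true
}
```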
The unicode package http://golang.org/pkg/unicode/ can give a taste of the richness of the challenge.
Solution 3 - Go
I have tried to keep my language simple so that a layman understands rune.
A rune is a character. That's it.
It is a single character. It's a character from any alphabet from any language from anywhere in the world.
To get a string we use double quotes "" or back-ticks ``.
A string is different from a rune. For runes we use single quotes ''.
Now a rune is also an alias for int32... Uh, what?
The reason rune is an alias for int32 is that, in a character coding scheme, each character maps to some number, and it is that number we store. For example, a maps to 97, and when we store it, it is just the number; that is why rune is an alias for int32. But it is not just any number. It is a number with 32 'zeros and ones', or 4 bytes. (Note: UTF-8 is a variable-width encoding that uses 1 to 4 bytes per code point.)
How do runes relate to strings?
A string can be read as a sequence of runes, stored as UTF-8 bytes. In the following code:
package main

import (
	"fmt"
)

func main() {
	fmt.Println([]byte("Hello"))
}
We convert a string to a slice of bytes. The output is:
[72 101 108 108 111]
Here each byte corresponds to exactly one rune, because every character in "Hello" is ASCII and takes a single byte.
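For text beyond ASCII, a single rune can span several bytes. A rough sketch, assuming a string containing the euro sign: a range loop over a string decodes UTF-8 and yields one rune per step, together with the byte offset where that rune starts. The helper name runeStarts is made up for this illustration.

```go
package main

import "fmt"

// runeStarts (hypothetical helper) returns the byte offset at which
// each rune of s begins.
func runeStarts(s string) []int {
	var starts []int
	for i := range s { // range decodes UTF-8 and steps rune by rune
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "a\u20acb" // \u20ac is the euro sign, three bytes in UTF-8
	for i, r := range s {
		fmt.Printf("offset %d: %c (%d)\n", i, r, r)
	}
	fmt.Println(runeStarts(s)) // [0 1 4]
}
```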
Solution 4 - Go
I do not have enough reputation to post a comment to fabrizioM's answer, so I will have to post it here instead.
Fabrizio's answer is largely correct, and he certainly captured the essence of the problem - though there is a distinction which must be made.
A string is NOT necessarily a sequence of runes. It is a wrapper over a 'slice of bytes', a slice being a wrapper over a Go array. What difference does this make?
A rune is necessarily a 32-bit value, so a sequence of runes always occupies a multiple of 32 bits. A string, being a sequence of bytes, instead occupies a multiple of 8 bits. If all strings actually held Unicode text, this difference would have no impact. Since strings are slices of bytes, however, Go can store ASCII or any other arbitrary byte encoding in them.
String literals, however, are required to be written into the source encoded in UTF-8.
Source of information: http://blog.golang.org/strings
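To illustrate that a string may hold bytes that are not valid UTF-8, here is a small sketch using utf8.ValidString and utf8.DecodeRuneInString from the standard unicode/utf8 package. The helper name firstRune is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune (hypothetical helper) decodes the first rune of s and
// reports how many bytes its encoding occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	bad := "\xff" // a single byte that is not valid UTF-8
	fmt.Println(len(bad), utf8.ValidString(bad)) // 1 false

	// Decoding an invalid byte yields utf8.RuneError (U+FFFD) with width 1.
	r, size := firstRune(bad)
	fmt.Println(r == utf8.RuneError, size) // true 1

	// A valid multi-byte character decodes to its code point.
	r, size = firstRune("\u20ac")
	fmt.Println(r, size) // 8364 3
}
```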
Solution 5 - Go
(I had a feeling that the above answers still didn't state the differences & relationships between string and []rune very clearly, so I will try to add another answer with an example.)
As @Strangework's answer said, string and []rune are quite different.
Differences - string & []rune:
- A string value is a read-only byte slice, and a string literal is encoded in UTF-8. Each char in a string takes 1 to 4 bytes, while each rune takes 4 bytes.
- For string, both len() and index are based on bytes.
- For []rune, both len() and index are based on rune (or int32).
Relationships - string & []rune:
- When you convert from string to []rune, each UTF-8 char in that string becomes a rune.
- Similarly, in the reverse conversion, when converting from []rune to string, each rune becomes a UTF-8 char in the string.
Tips:
- You can convert between string and []rune, but they are still different, in both type & overall size.
(I would add an example to show that more clearly.)
Code
string_rune_compare.go:
// string & rune compare,
package main

import "fmt"

// string & rune compare,
func stringAndRuneCompare() {
	// string,
	s := "hello你好"

	fmt.Printf("%s, type: %T, len: %d\n", s, s, len(s))
	fmt.Printf("s[%d]: %v, type: %T\n", 0, s[0], s[0])
	li := len(s) - 1 // last index,
	fmt.Printf("s[%d]: %v, type: %T\n\n", li, s[li], s[li])

	// []rune
	rs := []rune(s)
	fmt.Printf("%v, type: %T, len: %d\n", rs, rs, len(rs))
}

func main() {
	stringAndRuneCompare()
}
Execute:
> go run string_rune_compare.go
Output:
hello你好, type: string, len: 11
s[0]: 104, type: uint8
s[10]: 189, type: uint8

[104 101 108 108 111 20320 22909], type: []int32, len: 7
Explanation:
- The string hello你好 has length 11, because the first 5 chars each take only 1 byte, while the last 2 Chinese chars each take 3 bytes.
	- Thus, total bytes = 5 * 1 + 2 * 3 = 11.
	- Since len() on a string is based on bytes, the first line printed len: 11.
	- Since indexing a string is also based on bytes, the following 2 lines print values of type uint8 (byte is an alias of uint8 in Go).
- When converting the string to []rune, it found 7 UTF-8 chars, thus 7 runes.
	- Since len() on []rune is based on runes, the last line printed len: 7.
	- If you index a []rune, you access it rune by rune. Since each rune comes from a UTF-8 char in the original string, you can also say that both len() and indexing on []rune are based on UTF-8 chars.
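The difference between indexing a string and indexing a []rune can be sketched as follows, reusing the same hello你好 string. The helper name runeAt is made up for this illustration, and converting the whole string on every call is a simplification, not a recommendation for hot paths.

```go
package main

import "fmt"

// runeAt (hypothetical helper) returns the n-th rune (zero-based) of s,
// or false if n is out of range.
func runeAt(s string, n int) (rune, bool) {
	rs := []rune(s)
	if n < 0 || n >= len(rs) {
		return 0, false
	}
	return rs[n], true
}

func main() {
	s := "hello你好"

	// Indexing the string gives a single byte: the first of the three
	// bytes that encode '你'.
	fmt.Println(s[5]) // 228

	// Indexing by rune gives the whole character.
	r, _ := runeAt(s, 5)
	fmt.Println(r, string(r)) // 20320 你
}
```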
Solution 6 - Go
Everyone else has covered the part related to runes, so I am not going to talk about that.
However, there is also a question related to switch not having any arguments. This is simply because in Golang, switch without an expression is an alternate way to express if/else logic. For example, writing this:
t := time.Now()
switch {
case t.Hour() < 12:
	fmt.Println("It's before noon")
default:
	fmt.Println("It's after noon")
}
is the same as writing this:
t := time.Now()
if t.Hour() < 12 {
	fmt.Println("It's before noon")
} else {
	fmt.Println("It's after noon")
}
You can read more here.
Solution 7 - Go
A rune is an int32 value: a Go type used to represent a Unicode code point. A Unicode code point, or code position, is a numerical value that usually represents a single Unicode character.
Solution 8 - Go
Program
package main

import (
	"fmt"
)

func main() {
	words := "€25 or less"
	fmt.Println("as string slice")
	fmt.Println(words, len(words))

	runes := []rune(words)
	fmt.Println("\nas []rune slice")
	fmt.Printf("%v, len:%d\n", runes, len(runes))

	bytes := []byte(words)
	fmt.Println("\nas []byte slice")
	fmt.Printf("%v, len:%d\n", bytes, len(bytes))
}
Output
as string slice
€25 or less 13
as []rune slice
[8364 50 53 32 111 114 32 108 101 115 115], len:11
as []byte slice
[226 130 172 50 53 32 111 114 32 108 101 115 115], len:13
As you can see, the euro symbol '€' is represented by 3 bytes: 226, 130 & 172. The rune represents a character, any character, be it hieroglyphics. The 32 bits of a rune are sufficient to represent all the characters in the world as of today. Hence, the rune representation of the euro symbol '€' is 8364.
For ASCII characters, of which there are 128, a byte (8 bits) is sufficient. Hence, the rune and byte representations of digits or letters are the same. E.g. 2 is represented by 50.
The byte representation of a string is always greater than or equal to its rune representation in length, since certain characters take more than one byte but always fit within the 32 bits of a rune.
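A short sketch of why the byte count can exceed the rune count, using utf8.RuneLen from the standard unicode/utf8 package to report how many bytes each character needs. The helper name widths is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// widths (hypothetical helper) returns the number of UTF-8 bytes
// needed to encode each rune in s.
func widths(s string) []int {
	var w []int
	for _, r := range s {
		w = append(w, utf8.RuneLen(r))
	}
	return w
}

func main() {
	// \u20ac is the euro sign; the digits are plain ASCII.
	fmt.Println(widths("\u20ac25")) // [3 1 1]
}
```

Summing the widths gives the byte length of the string, while counting them gives its rune length.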
Solution 9 - Go
rune is an alias for int32 and is equivalent to int32 in all ways. It is used to distinguish character values from integer values.
> l = 108, o = 111
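Because rune is an alias rather than a distinct type, a rune can be used wherever an int32 is expected without a conversion. A minimal sketch; the helper name asInt32 is made up for this illustration.

```go
package main

import "fmt"

// asInt32 (hypothetical helper) returns r unchanged. No conversion is
// needed because rune and int32 are the same type.
func asInt32(r rune) int32 {
	return r
}

func main() {
	fmt.Printf("%T\n", 'l')                 // int32
	fmt.Println(asInt32('l'), asInt32('o')) // 108 111
}
```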