What are Unicode, ASCII, and UTF-8?
How characters are stored.
What is the type rune?
ASCII
UTF-8
Hexadecimal
Octal
Rune
Code Point
In a previous chapter, we covered decimal and binary notation. This chapter will talk about hexadecimal and octal. We will also speak about ASCII and UTF-8.
To represent a number in binary, you need to line up many zeros and ones. This notation is verbose: writing the decimal number 1324 takes 11 characters in binary (10100101100). That’s why we need a more convenient numeral system to express large numbers.
\nHexadecimal is also a positional numeral system that uses 16 characters to represent a number.
The prefix hexa- comes from the Greek word for six
Decimal comes from the Latin word decem, which means ten
Those characters are numbers and letters. We use the numbers from 0 to 9 (10 characters) and the letters from A to F (6 characters).
\nLet’s take an example: 1324 in base ten is equivalent to 52C in base 16
The digits from 0 to 9 have the same value as in the decimal system. The letter A corresponds to 10, the letter B to 11, and so on up to F, which corresponds to 15. This is a peculiarity of the hexadecimal numeral system: we use letters to represent numeric values.
This peculiarity usually raises questions from my students. What I typically reply is that you have to accept it: we needed more characters, so we took letters.
From 0 to 9 we have ten digits, but a base-16 numeral system needs six more characters. That’s why we took the first six letters of the alphabet. This is a historical choice; other characters could have been used in place of letters, and the system would have remained the same.
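The method for converting a hexadecimal number to a decimal number is similar to the one used for binary numbers. We take the rightmost character, find its decimal equivalent, and multiply it by 16 to the power 0. In our example (52C), the rightmost character is the letter C. The equivalent of C is 12, so the first term is 12 × 1 = 12. We then move one character to the left and raise the power by one each time: 2 × 16 = 32, then 5 × 256 = 1280. The sum 12 + 32 + 1280 gives 1324.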
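The positional method described above can be sketched in Go. This is only an illustration: the function name hexToDecimal is an assumption made for this example, and in real programs the standard library (strconv.ParseInt with base 16) does the same job.

```go
package main

import (
	"fmt"
	"strings"
)

// hexToDecimal converts a hexadecimal numeral to its decimal value by
// walking the characters from right to left and multiplying each digit
// by the matching power of 16. It returns false for an invalid character.
// (Hypothetical helper written for this example.)
func hexToDecimal(s string) (int, bool) {
	const digits = "0123456789ABCDEF"
	total := 0
	power := 1 // 16 to the power 0
	upper := strings.ToUpper(s)
	for i := len(upper) - 1; i >= 0; i-- {
		// The position of the character in digits is its numeric value.
		value := strings.IndexByte(digits, upper[i])
		if value < 0 {
			return 0, false
		}
		total += value * power
		power *= 16
	}
	return total, true
}

func main() {
	n, ok := hexToDecimal("52C")
	fmt.Println(n, ok) // 12*1 + 2*16 + 5*256
}
```

The loop mirrors the manual computation: C contributes 12, the digit 2 contributes 32, and the digit 5 contributes 1280.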
To print the hexadecimal representation of a number, you can use the functions of the fmt package:
\n// hexadecimal-octal-ascii-utf8-unicode-runes/hexa-lower/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n n := 2548\n fmt.Printf("%x", n)\n}
This program will output 9f4, which is the hexadecimal representation of the decimal number 2548. "%x" is the formatting verb for hexadecimal with lowercase letters.
Note that n is a number denoted using the decimal system.
You can also use "%X" to print a hexadecimal number with capitalized letters:
// hexadecimal-octal-ascii-utf8-unicode-runes/hexa-upper/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n n := 2548\n fmt.Printf("%X", n)\n}
\nOutput : 9F4.
\nIf you want to represent a number in hexadecimal in your code, add 0x before the numeral :
\n// hexadecimal-octal-ascii-utf8-unicode-runes/hex-number/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n n := 2548\n n2 := 0x9F4\n fmt.Printf("%X\\n", n)\n fmt.Printf("%x\\n", n2)\n}
\nOutput :
\n9F4\n9f4
To print the number in base ten, you can use the verb "%d":
// /hexadecimal-octal-ascii-utf8-unicode-runes/decimal/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n n2 := 0x9F4\n fmt.Printf("Decimal : %d\\n", n2)\n}
\nOutput :
\nDecimal : 2548
\n\nI have almost forgotten another numeral system! The octal!
It uses base 8, which means eight different characters: the digits from 0 to 7. The conversion from octal to decimal is similar to the methods that I have presented before. Let’s take the octal number 2454 as an example.
We begin with the rightmost character, 4, and we multiply it by eight to the power 0, which is 1. Then we take the next character, 5, and multiply it by eight to the power 1, which is 8, and so on.
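Assuming the example is the octal numeral 2454 (the value that the octal program in this section prints for the decimal number 1324), the full conversion can be written as a sum of powers of eight:

```latex
\begin{aligned}
2454_{8} &= 4\times8^{0} + 5\times8^{1} + 4\times8^{2} + 2\times8^{3} \\
         &= 4 + 40 + 256 + 1024 \\
         &= 1324_{10}
\end{aligned}
```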
\nThe octal system is notably used to represent permissions on a file for Unix operating systems. (see [par:octal-file-write]).
In the same fashion as hexadecimal, the fmt package defines two formatting verbs for octal:
\n// /hexadecimal-octal-ascii-utf8-unicode-runes/octal/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n n2 := 0x9F4\n fmt.Printf("Decimal : %d\\n", n2)\n\n // n3 is represented using the octal numeral system\n n3 := 02454\n // alternative : n3 := 0o2454\n\n // convert in decimal\n fmt.Printf("decimal: %d\\n", n3)\n\n // n4 is represented using the decimal numeral system\n n4 := 1324\n // output n4 (decimal) in octal\n fmt.Printf("octal: %o\\n", n4)\n // output n4 (decimal) in octal (with a 0o prefix)\n fmt.Printf("octal with prefix : %O\\n", n4)\n\n}
Output :

Decimal : 2548
decimal: 1324
octal: 2454
octal with prefix : 0o2454

"%o" allows you to print the number in octal
"%O" allows you to print the number in octal with a "0o" prefix
Bit is an abbreviation for binary digit. For instance, 10100101100 is made of 11 binary digits, in other words, 11 bits. It’s very common to group bits together. Groups exist in various sizes:
A nibble is composed of 4 bits
A byte is composed of 8 bits (two nibbles)
A word is composed of 16 bits (two bytes)
A doubleword is composed of 32 bits (two words)
A quadword is composed of 64 bits (four words, 16 × 4 = 64)
With Go, you can create a slice of bytes. Many functions and methods of the standard library take a slice of bytes as an argument. Let’s see how we can create one.
\n// /hexadecimal-octal-ascii-utf8-unicode-runes/slice-of-byte/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n b := make([]byte, 0)\n b = append(b, 255)\n b = append(b, 10)\n fmt.Println(b)\n}
In the previous snippet, we created a slice of bytes (with the builtin make), then we appended two numbers to the slice.
In Go, the byte type is an alias of uint8. uint8 means that we can store unsigned integers (without a sign, so no negative numbers) on 8 bits (a byte) of data. The minimum value is 0 (00000000_{2}) and the maximum value is 255 (11111111_{2}, which is equivalent to the decimal number 2^{7}+2^{6}+2^{5}+2^{4}+2^{3}+2^{2}+2^{1}+2^{0}).
That’s why we can only append numbers from 0 to 255 to a byte slice. If you try to append a constant greater than 255, you will get the following compilation error:
\ndataRepresentation/bytes/main.go:7:15: constant 256 overflows byte
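The compiler only catches this overflow for constants. As a small sketch (the helper names fitsInByte and truncate are assumptions made for this example), the following program shows what happens when the value is only known at run time: the conversion to byte keeps the lowest 8 bits, so it is safer to check the range first.

```go
package main

import "fmt"

// fitsInByte reports whether n can be stored in a byte without loss.
// (Hypothetical helper written for this example.)
func fitsInByte(n int) bool {
	return n >= 0 && n <= 255
}

// truncate converts n to a byte. Only the lowest 8 bits are kept,
// so values outside 0..255 wrap around silently.
func truncate(n int) byte {
	return byte(n)
}

func main() {
	for _, n := range []int{255, 256, 257} {
		fmt.Printf("%d fits: %t, byte(%d) = %d\n", n, fitsInByte(n), n, truncate(n))
	}
}
```

With these inputs, 255 is stored unchanged, while 256 and 257 wrap around to 0 and 1.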
To print the binary representation of a number, you can use the "%b" formatting verb:
// /hexadecimal-octal-ascii-utf8-unicode-runes/decimal-binary/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n n2 := 0x9F4\n fmt.Printf("Decimal : %d\\n", n2)\n fmt.Printf("Binary : %b\\n", n2)\n}
\nOutput :
\nDecimal : 2548\nBinary : 100111110100
What if you want to store something other than numbers? For instance, how could we store this haiku from Masaoka Shiki:
\nspring rain:\nbrowsing under an umbrella\nat the picture-book store
Is the byte type appropriate? A byte is nothing more than an unsigned integer stored on 8 bits. This haiku is composed of letters and special characters. We have a “:” and a “-”, and we also have line breaks. How can we store those characters?
We have to find a way to give each letter, and even each special character, a unique code. You may have heard about UTF-8, ASCII, and Unicode. This section will explain what they are and how they work. When I started programming (that was not in Go), character encoding was something obscure, and I did not find it interesting. I now think that understanding character encoding is essential, because I have spent nights at work on problems that could have been resolved with a basic understanding of it.
\nThe history of character encoding is very rich. With the development of the telegraph, we needed a way to encode messages in a way that could be transportable on an electrical wire. One of the earliest attempts was the Morse code. It is composed of four symbols: short signal, long signal, short space, long space (Wikipedia). Each letter of the alphabet could be encoded in morse. For instance, A was encoded as a short signal followed by a long signal. The plus sign “+” was encoded with “short long short long short”.
\n\nWe need to define a common vocabulary to understand character encoding :
Character: something that can be written by hand and that conveys a meaning. For instance, the sign “+” is a character. It means adding something to something else. A character can be a letter, a sign, or an ideogram.
Character set: a collection of distinct characters. Often you will see or hear the abbreviation “charset”.
Code point: each character from a character set has an equivalent numeric value that uniquely identifies this character. This numeric value is a code point.
There is one character set that you need to know: Unicode. It is a standard that lists the vast majority of characters from living languages that are used today on computers.
It is composed of 137,374 characters in its version 11.0.
With Unicode, we have our basis, our table of characters. The next challenge is to find a way to encode those characters, that is, to put those code points into bytes of data. This is precisely what ASCII and UTF-8 do.
ASCII encodes characters on seven binary digits. An eighth binary digit can be used as a parity bit, which helps detect transmission errors. It is added after the first seven bits, and its value depends on them: if the number of ones among the seven bits is odd, the parity bit is 1; if the number is even, it is set to 0.
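The parity rule can be sketched in a few lines of Go. The function name evenParityBit is an assumption made for this example; bits.OnesCount8 comes from the standard math/bits package and counts the ones in a byte.

```go
package main

import (
	"fmt"
	"math/bits"
)

// evenParityBit returns the parity bit for a 7-bit ASCII code:
// 1 when the code contains an odd number of ones, 0 otherwise.
// (Hypothetical helper written for this example.)
func evenParityBit(code byte) byte {
	// code & 0x7F keeps only the seven ASCII bits.
	return byte(bits.OnesCount8(code&0x7F) % 2)
}

func main() {
	for _, c := range []byte{'A', 'C'} {
		fmt.Printf("%c = %07b, parity bit %d\n", c, c, evenParityBit(c))
	}
}
```

The letter A (1000001) contains two ones, so its parity bit is 0; the letter C (1000011) contains three ones, so its parity bit is 1.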
A byte of data (8 bits, see [sec:Data-representation-bits,]) can store each character. How many integers can you create with only 7 bits? With one single bit, we can encode two values, 0 and 1. With 2 bits, we can encode four distinct values. Each time you add a bit, you double the number of values you can encode. With 7 bits, you can encode 128 integers. More generally, the number of unsigned integers you can encode with n binary digits is two to the power n.
Number of bits | Number of values
---|---
1 | 2
2 | 4
3 | 8
4 | 16
5 | 32
6 | 64
7 | 128
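The values in this table can be recomputed with a left shift, since shifting 1 by n positions gives two to the power n. A minimal sketch (the function name valuesForBits is an assumption made for this example, and it is only meaningful for n below 64):

```go
package main

import "fmt"

// valuesForBits returns how many distinct unsigned integers can be
// encoded with n binary digits, that is, two to the power n.
// (Hypothetical helper written for this example; valid for n < 64.)
func valuesForBits(n uint) uint64 {
	return 1 << n
}

func main() {
	for n := uint(1); n <= 8; n++ {
		fmt.Printf("%d bits: %d values\n", n, valuesForBits(n))
	}
}
```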
ASCII allows you to encode 128 different characters. For each character, we have a specific code point. Unsigned integer values represent code points.
The previous figure shows the USASCII code chart. This table allows you to convert a byte into a character. For instance, the letter B is equivalent to 1000010 in binary (column 4, row 2).
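You can check an entry of the chart with Go. The sketch below converts a character to its 7 bits and back; the helper names asciiBits and asciiFromBits are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// asciiBits returns the 7-bit binary representation of an ASCII character.
// (Hypothetical helper written for this example.)
func asciiBits(c byte) string {
	return fmt.Sprintf("%07b", c)
}

// asciiFromBits converts a string of binary digits back into a character.
// It returns false if the string is not a binary number between 0 and 127.
func asciiFromBits(s string) (byte, bool) {
	n, err := strconv.ParseUint(s, 2, 8)
	if err != nil || n > 127 {
		return 0, false
	}
	return byte(n), true
}

func main() {
	fmt.Println(asciiBits('B')) // 1000010
	c, ok := asciiFromBits("1000010")
	fmt.Printf("%c %t\n", c, ok) // B true
}
```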
UTF-8 is a variable-width encoding system. It means that characters are encoded using one to four bytes (a byte represents eight binary digits).
Figure 5 shows the encoding rules of UTF-8.
The code points that can be encoded using only one byte go from U+0000 to U+007F (included). This range is composed of 128 characters (from 0 to 127, there are 128 numbers).
But more characters need to be encoded! That’s why the creators of UTF-8 had the idea of adding bytes to the system. When a character is encoded on 2 bytes, the first byte begins with the fixed bits “110”. It says to UTF-8 decoders, “be careful; we are 2!”. The additional byte begins with a one and a zero (“10”); those bits are fixed too.
\nIf we use 2 bytes, we have 11 bits free (8 * 2 - 5 (fixed bits) =11). We can encode the characters which have the Unicode code point from U+0080 to U+07FF included. How many characters does that represent?
\n0080 in hex = 128 in decimal
07FF in hex = 2047 in decimal
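from 0080 to 07FF there are 2047-128+1=1920 code points
You might ask why we add one to the count. That’s because both ends of the range are included: the subtraction 2047-128 counts the steps between the two code points, and we add one to count the first code point itself.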
\nIf you use 3 bytes, then the first byte will start with the fixed bits 1110. This will signal to decoders that the character is encoded using 3 bytes. In other words, the next characters will begin after the third byte. The two additional bytes are beginning with 10. With three encoding bytes, you have 16 bits free (8 * 3 - 8 (fixed bits) =16). You can encode characters from U+0800 to U+FFFF.
If you have understood how it works for 3 bytes, then you should have no problem understanding how the system works with 4 bytes. Inside our first byte, we fix the first five bits (11110). Then we have three additional bytes, each beginning with 10. If we subtract the 11 fixed bits from the 32 bits in total, we have 21 bits available. It means that we can encode code points from U+10000 to U+10FFFF.
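To see the fixed bits in practice, here is a minimal sketch that prints the UTF-8 bytes of one character per encoding width. It relies on the standard unicode/utf8 package; the helper names encodedLen and utf8Bits and the four sample characters are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// encodedLen returns the number of bytes UTF-8 uses for the code point r.
// (Hypothetical helper written for this example.)
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

// utf8Bits returns the UTF-8 encoding of r, one group of 8 bits per byte.
func utf8Bits(r rune) string {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest encoding
	n := utf8.EncodeRune(buf, r)
	groups := make([]string, 0, n)
	for _, b := range buf[:n] {
		groups = append(groups, fmt.Sprintf("%08b", b))
	}
	return strings.Join(groups, " ")
}

func main() {
	// One sample character per encoding width (1, 2, 3 and 4 bytes).
	for _, r := range []rune{'A', 'é', '€', '😀'} {
		fmt.Printf("%U: %d byte(s): %s\n", r, encodedLen(r), utf8Bits(r))
	}
}
```

For the character é (U+00E9), the two bytes are 11000011 10101001: the first one starts with the fixed bits 110 and the second one with 10, which leaves 11 bits for the code point.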
A string is “a sequence of characters”. For instance, "Test" is a string composed of 4 characters: T, e, s, and t. Strings are prevalent; we use them to store raw text inside our programs. They are generally readable by humans. For instance, the first name and the last name of an application user are two strings.
Characters can come from different character sets. If you use the ASCII character set, you have 128 characters to choose from.
\nEach character has a corresponding code point in the character set. As we have seen before, the code point is an unsigned integer arbitrarily chosen. Strings are stored using bytes. Let’s take the example of a string composed only of ASCII characters :
\nHello
\nA single byte can store each character. This string can be stored with the following bits :
\n01001000 01100101 01101100 01101100 01101111
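These bits can be reproduced with a short program. This is a sketch; the function name bitsOf is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// bitsOf returns the bytes of s written in binary, 8 bits per byte.
// (Hypothetical helper written for this example.)
func bitsOf(s string) string {
	groups := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		// s[i] is the i-th byte of the string.
		groups = append(groups, fmt.Sprintf("%08b", s[i]))
	}
	return strings.Join(groups, " ")
}

func main() {
	fmt.Println(bitsOf("Hello"))
}
```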
In Go, strings are immutable, meaning that they cannot be modified once created.
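A common way to work around immutability is to copy the bytes of the string, change the copy, and build a new string. A minimal sketch, where the helper name replaceFirstByte is an assumption made for this example:

```go
package main

import "fmt"

// replaceFirstByte returns a copy of s whose first byte is replaced by c.
// The original string is left untouched, because the conversion to
// []byte copies the data. (Hypothetical helper written for this example.)
func replaceFirstByte(s string, c byte) string {
	if len(s) == 0 {
		return s
	}
	b := []byte(s) // copy of the bytes of s
	b[0] = c
	return string(b)
}

func main() {
	original := "hello"
	// original[0] = 'j' // does not compile: strings are immutable
	modified := replaceFirstByte(original, 'j')
	fmt.Println(original, modified)
}
```

The program prints hello jello: the first string keeps its value, and the change is only visible in the new string.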
There are two “types” of string literals:
raw string literals. They are defined between back quotes.
Forbidden characters are
back quotes
Discarded characters are
carriage returns (\r)
interpreted string literals. They are defined between double quotes.
Forbidden characters are
new lines
unescaped double quotes
// /hexadecimal-octal-ascii-utf8-unicode-runes/string-literals/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n\n raw := `spring rain:\nbrowsing under an umbrella\nat the picture-book store`\n fmt.Println(raw)\n\n interpreted := "i love spring"\n fmt.Println(interpreted)\n}
\nYou can note that inside this snippet of code, we did not say to Go which character set we use. This is because string literals are implicitly encoded using UTF-8.
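Behind the scenes, a string is a collection of bytes. We can iterate over the bytes of a string with an indexed for loop: for a string s, each s[i] is a byte, so printing it shows a number (104 for the letter h) and not a letter.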
\nThe message in the previous figure means “I love Golang”, the two first characters are Chinese.
This program will iterate over each character of the string. Inside the for loop, v is of type rune. rune is a built-in type that is defined as follows:
// rune is an alias for int32 and is equivalent to int32 in all ways. It is\n// used, by convention, to distinguish character values from integer values.\ntype rune = int32
A rune represents a Unicode code point.
Unicode code points are numeric values.
By convention, they are always noted with the following format: "U+X", where X is the hexadecimal representation of the code point. X should have at least four characters.
If X has fewer than four characters, we add leading zeros.
Ex: The character "o" has a code point equal to 111 (in decimal). 111 in hexadecimal is written 6F. The code point is noted U+006F.
To print the code point in the conventional format, you can use the format verb "%U".
Note that you can create a rune by using single quotes:
\n// /hexadecimal-octal-ascii-utf8-unicode-runes/rune/main.go\npackage main\n\nimport "fmt"\n\nfunc main(){\n var aRune rune = 'Z'\n fmt.Printf("Unicode Code point of '%c': %U\\n", aRune, aRune)\n}
True or false: “785G” is a hexadecimal numeral
True or false : “785f” and “785F” represent the same quantity
What is the formatting verb to represent a hexadecimal number (with a capitalized letter)?
What is the formatting verb to represent a number in decimal?
What is a code point?
Fill the blanks. _______ is a character set, ______ is an encoding standard.
True or false: UTF-8 allows you to encode fewer characters than ASCII.
How many bytes can I use to encode a character using the UTF-8 encoding system?
True or false: “785G” is a hexadecimal numeral
False
The letter G cannot be part of a hexadecimal number.
However, the letters A to F can be part of a hexadecimal number.
True or false: “785f” and “785F” represent the same quantity
This is true
The fact that a letter is capitalized does not change its meaning.
What is the formatting verb to represent a hexadecimal number (with a capitalized letter)?
%X
What is the formatting verb to represent a number in decimal?
%d
What is a code point?
A code point is the numeric value that uniquely identifies a character in a character set.
Fill the blanks. _______ is a character set, ______ is an encoding standard.
Unicode is a character set, UTF-8 is an encoding standard.
True or false: UTF-8 allows you to encode fewer characters than ASCII.
False
ASCII can encode only 128 characters, whereas UTF-8 can encode more than 1 million characters.
How many bytes can I use to encode a character using the UTF-8 encoding system?
From 1 to 4 bytes
It depends on the character
Hexadecimal is a numeration system like decimal and binary
With hexadecimal, a number is represented using 16 characters: the digits 0 to 9 and the letters A to F
With fmt functions (fmt.Sprintf and fmt.Printf) you can use “formatting verbs” to represent a number using a specific numeral system
%b for binary
%X and %x for hexadecimal
%d for decimal
%o for octal
Character: something that can be written by hand and that conveys a meaning. Ex: “-”, “A”, “a”
Character set: a collection of distinct characters. Often you will see or hear the abbreviation “charset”.
Code point: each character from a character set has an equivalent numeric value that uniquely identifies this character. This numeric value is a code point.
Unicode is a character set that is composed of more than 137,000 characters.
Each character has a code point. For instance, the character "A" is equivalent to the code point U+0041
ASCII is an encoding technique that can encode only 128 characters.
UTF-8 is an encoding technique that can encode more than 1 million characters
With UTF-8, any character is encoded using 1 to 4 bytes.
rune is a builtin type
A rune represents the Unicode code point of a character.
To create a rune, you can use single quotes:
var aRune rune = 'Z'
\n// /hexadecimal-octal-ascii-utf8-unicode-runes/iterate-over-string/main.go\npackage main\n\nimport "fmt"\n\nfunc main() {\n b := "hello"\n for i := 0; i < len(b); i++ {\n fmt.Println(b[i])\n }\n // will output :\n // 104\n // 101\n // 108\n // 108\n // 111\n // and NOT :\n // h\n // e\n // l\n // l\n // o\n}