4.4. Locale Encoding
Encodings:
utf-8 - the most common encoding of Unicode, an international standard (should always be used!)
iso-8859-1 - ISO standard for Western Europe and USA
iso-8859-2 - ISO standard for Central Europe (including Poland)
cp1250 or windows-1250 - Central European encoding on Windows
cp1251 or windows-1251 - Eastern European encoding on Windows
cp1252 or windows-1252 - Western European encoding on Windows
ASCII - ASCII characters only
Since Windows 10 version 1903, UTF-8 is the default encoding for Notepad!
4.4.1. SetUp
>>> from pathlib import Path
>>> Path('/tmp/myfile.txt').unlink(missing_ok=True)
4.4.2. ASCII Table
Standard (0–127)
Extended (128–255)
Standard ASCII is the same everywhere
Extended ASCII is Operating System dependent
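A minimal sketch of the two statements above, using only codecs shipped with Python. The byte values 0x41 and 0xEA are arbitrary examples picked for illustration: the first lies in the standard range, the second in the extended range.

```python
# Code pages to compare (all are standard Python codecs).
CODECS = ['ascii', 'iso-8859-1', 'iso-8859-2', 'cp1250', 'cp1251', 'cp1252']

standard = bytes([0x41])  # value below 128: standard ASCII range
extended = bytes([0xEA])  # value above 127: extended range

# Every codec agrees on the standard range.
standard_results = {codec: standard.decode(codec) for codec in CODECS}

# The extended range depends on the code page. Plain ASCII cannot decode it
# at all, so errors='replace' substitutes U+FFFD instead of raising.
extended_results = {
    codec: extended.decode(codec, errors='replace') for codec in CODECS
}

for codec in CODECS:
    print(f'{codec:12} {standard_results[codec]!r} {extended_results[codec]!r}')
```

The same byte comes back as a different character depending on the code page assumed by the reader, which is why a file written on one system can look garbled on another.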
4.4.3. Unicode
4.4.4. Windows Encoding
Figure 4.36. Windows 2000 Notepad "Save As" window with possibility to select encoding. UTF-8 is not selected by default... [1]
Figure 4.37. Windows 10 Notepad "Save As" window with possibility to select encoding.
Since Windows 10.1903 (May 2019) Notepad writes files in UTF-8 by default! [2] [3]
Figure 4.38. Windows 10 Notepad "Save As" window with possibility to select encoding. Since Windows 10.1903 (May 2019) Notepad writes files in UTF-8 by default!
Figure 4.39. Windows 10 Notepad "Save As" window with possibility to select encoding. Since Windows 10.1903 (May 2019) Notepad writes files in UTF-8 by default!
4.4.5. Str vs. Bytes
That was a big change in Python 3
In Python 2, str was bytes
In Python 3, str is a sequence of Unicode characters (encoded to UTF-8 by default)
>>> text = 'Księżyc'
>>> text
'Księżyc'
>>> text = b'Księżyc'
Traceback (most recent call last):
SyntaxError: bytes can only contain ASCII literal characters
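A bytes object can still hold non-ASCII data; only the literal syntax is restricted. The sketch below shows two ways to obtain the same bytes, assuming UTF-8 as the target encoding.

```python
text = 'Księżyc'

# Option 1: encode a str explicitly.
encoded = text.encode('utf-8')

# Option 2: write the non-ASCII bytes as \x escapes inside the literal.
literal = b'Ksi\xc4\x99\xc5\xbcyc'

# Both produce the same bytes object, and decoding restores the str.
print(encoded == literal)
print(literal.decode('utf-8'))
print(type(text).__name__, type(encoded).__name__)
```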
The default encoding is UTF-8. Encoding names are case-insensitive.
cp1250 and windows-1250 are aliases of the same codec:
>>> text = 'Księżyc'
>>>
>>> text.encode()
b'Ksi\xc4\x99\xc5\xbcyc'
>>> text.encode('utf-8')
b'Ksi\xc4\x99\xc5\xbcyc'
>>> text.encode('iso-8859-2')
b'Ksi\xea\xbfyc'
>>> text.encode('cp1250')
b'Ksi\xea\xbfyc'
>>> text.encode('windows-1250')
b'Ksi\xea\xbfyc'
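One way to check that two names refer to the same codec is `codecs.lookup()` from the standard library, which normalizes the name and returns the registered codec. The list of spellings below is only an illustration.

```python
import codecs

# Different spellings and letter cases of the same codec name.
names = ['cp1250', 'windows-1250', 'CP1250', 'Windows-1250']
canonical = {name: codecs.lookup(name).name for name in names}
print(canonical)

# Letter case does not matter for UTF-8 either.
same_bytes = 'Księżyc'.encode('UTF-8') == 'Księżyc'.encode('utf-8')
print(same_bytes)
```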
Note the length change while encoding:
>>> text = 'Księżyc'
>>> text
'Księżyc'
>>> len(text)
7
>>> text = 'Księżyc'.encode()
>>> text
b'Ksi\xc4\x99\xc5\xbcyc'
>>> len(text)
9
Note also that each non-ASCII character is encoded as more than one byte:
>>> 'ó'.encode()
b'\xc3\xb3'
Although the escaped representation looks long, each \xNN escape stands for a single byte, so the length is 2:
>>> len(b'\xc3\xb3')
2
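The number of bytes per character in UTF-8 depends on the code point. A short sketch with four sample characters (chosen here only for illustration) that need one, two, three and four bytes:

```python
samples = ['a', 'ó', '€', '🚀']

# Each sample is a single character, but its UTF-8 form varies in size.
widths = {char: len(char.encode('utf-8')) for char in samples}

for char, width in widths.items():
    print(f'U+{ord(char):04X} {char!r}: {width} byte(s)')
```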
Here's the output of all Polish diacritics (accented characters) with their encoding:
>>> 'ą'.encode()
b'\xc4\x85'
>>> 'ć'.encode()
b'\xc4\x87'
>>> 'ę'.encode()
b'\xc4\x99'
>>> 'ł'.encode()
b'\xc5\x82'
>>> 'ń'.encode()
b'\xc5\x84'
>>> 'ó'.encode()
b'\xc3\xb3'
>>> 'ś'.encode()
b'\xc5\x9b'
>>> 'ż'.encode()
b'\xc5\xbc'
>>> 'ź'.encode()
b'\xc5\xba'
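For comparison, the same letters can be encoded with the single-byte code pages listed at the top of this section. This is a sketch rather than a complete reference: it prints the hex value of each letter in UTF-8, ISO-8859-2 and cp1250, then lists the letters for which the two code pages disagree.

```python
DIACRITICS = 'ąćęłńóśżź'
CODECS = ['utf-8', 'iso-8859-2', 'cp1250']

# For each letter, the encoded bytes under every codec.
table = {
    char: {codec: char.encode(codec) for codec in CODECS}
    for char in DIACRITICS
}

for char, row in table.items():
    print(char, *(row[codec].hex() for codec in CODECS))

# Letters whose single-byte value differs between the two code pages.
differing = [
    char for char, row in table.items()
    if row['iso-8859-2'] != row['cp1250']
]
print(differing)
```

Any letter in `differing` turns into a wrong character when a file written in one of the two code pages is read with the other.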
Note also that iterating over str yields characters, while iterating over bytes yields integers (byte values):
>>> text = 'Księżyc'
>>>
>>> for character in text:
...     print(character)
K
s
i
ę
ż
y
c
>>>
>>> for character in text.encode():
...     print(character)
75
115
105
196
153
197
188
121
99
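The same rule applies to indexing: a single index on bytes gives an int, while a slice gives bytes. A minimal sketch:

```python
data = 'Księżyc'.encode('utf-8')

first = data[0]          # indexing returns an int
head = data[:3]          # slicing returns bytes
values = list(data)      # iteration yields ints
rebuilt = bytes(values)  # bytes() accepts an iterable of ints in range(256)

print(first, head, values)
print(rebuilt == data)
print(data.hex())
```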
4.4.6. UTF-8
>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('cześć')
5
>>>
>>> with open(FILE, encoding='utf-8') as file:
...     print(file.read())
cześć
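The value returned by `write()` counts characters, not bytes. The sketch below writes the same word and then reads the file back in binary form to show what is stored on disk. It uses a temporary directory and a hypothetical file name instead of `/tmp/myfile.txt` so that it runs on any system.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'myfile.txt'  # hypothetical file name

    # write() reports the number of characters written.
    with open(path, mode='w', encoding='utf-8') as file:
        written = file.write('cześć')

    raw = path.read_bytes()                  # bytes stored on disk
    text = path.read_text(encoding='utf-8')  # decoded back to str

print(written, len(raw), raw, text)
```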
4.4.7. Unicode Encode Error
>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='cp1250') as file:
...     file.write('cześć')
5
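The write above succeeds because cp1250 can represent every Polish letter. `UnicodeEncodeError` is raised when the target encoding lacks a character. The sketch below uses ASCII as the target to trigger the error, then shows the standard `errors=` handlers as possible fallbacks.

```python
text = 'cześć'

# cp1250 covers Polish letters, so this succeeds.
ok = text.encode('cp1250')

# ASCII has no 'ś' or 'ć', so strict encoding raises UnicodeEncodeError.
try:
    text.encode('ascii')
except UnicodeEncodeError as error:
    failure = error  # keep a reference; 'error' is cleared after the block
    print(error.encoding, error.start, error.object[error.start])

# The errors= argument selects a fallback instead of raising.
replaced = text.encode('ascii', errors='replace')
ignored = text.encode('ascii', errors='ignore')
escaped = text.encode('ascii', errors='backslashreplace')
print(ok, replaced, ignored, escaped)
```

All three fallbacks lose or alter information, so they are better suited to logging or display than to storing data.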
4.4.8. Unicode Decode Error
>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('cześć')
5
>>>
>>> with open(FILE, encoding='cp1250') as file:
...     print(file.read())
czeĹ›Ä‡
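Reading UTF-8 data as cp1250 does not raise anything, because every byte involved happens to be valid in cp1250; the result is silently wrong text. The opposite direction does fail. A minimal sketch of both cases, working on bytes in memory rather than on a file:

```python
utf8_bytes = 'cześć'.encode('utf-8')
cp1250_bytes = 'cześć'.encode('cp1250')

# Wrong codec, but each byte is valid in cp1250: no error, just mojibake.
mojibake = utf8_bytes.decode('cp1250')

# cp1250 bytes are not valid UTF-8, so strict decoding raises.
try:
    cp1250_bytes.decode('utf-8')
except UnicodeDecodeError as error:
    failure = error  # keep a reference; 'error' is cleared after the block
    print(error.reason, error.start)

# errors='replace' substitutes U+FFFD for undecodable input.
replaced = cp1250_bytes.decode('utf-8', errors='replace')
print(mojibake, replaced)
```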
4.4.9. Escape Characters
\r\n - is used on Windows
\n - is used everywhere else
More information in Builtin Printing
Learn more at https://en.wikipedia.org/wiki/List_of_Unicode_characters
Figure 4.42. Why do we have '\r\n' on Windows?
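A short sketch of how the two line-ending conventions look from Python. The sample strings are made up for illustration.

```python
windows_text = 'first\r\nsecond\r\n'
unix_text = 'first\nsecond\n'

# str.splitlines() recognizes both conventions.
print(windows_text.splitlines() == unix_text.splitlines())

# Encoded, '\r\n' is two bytes (13, 10) and '\n' is one (10).
print(list('\r\n'.encode()), list('\n'.encode()))

# Normalizing Windows line endings by hand.
normalized = windows_text.replace('\r\n', '\n')
print(normalized == unix_text)
```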
Frequently used escape characters:
\n - New line (ENTER)
\t - Horizontal Tab (TAB)
\' - Single quote ' (escape in single-quoted strings)
\" - Double quote " (escape in double-quoted strings)
\\ - Backslash \ (to indicate that this is not an escape character)
Less frequently used escape characters:
\a - Bell (BEL)
\b - Backspace (BS)
\f - New page (FF - Form Feed)
\v - Vertical Tab (VT)
\uF680 - Character with 16-bit (2 bytes) hex value F680
\U0001F680 - Character with 32-bit (4 bytes) hex value 0001F680
\ooo - Character with octal value ooo (one to three octal digits, e.g. \101 is A)
\xhh - Character with hex value hh (exactly two hex digits, e.g. \x41 is A)
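The numeric escapes are different spellings of the same thing. The sketch below writes the letter A in several ways and also shows that `\x` consumes exactly two hex digits, so any digits after them are ordinary text.

```python
# Several spellings of the same character, 'A' (code point 65).
spellings = [
    'A',
    '\x41',                         # two hex digits
    '\101',                         # octal
    '\u0041',                       # four hex digits
    '\U00000041',                   # eight hex digits
    '\N{LATIN CAPITAL LETTER A}',   # Unicode character name
]
print(all(s == 'A' for s in spellings))

# \x takes exactly two hex digits; '680' is left as plain text.
mixed = '\x1F680'
print(len(mixed), repr(mixed))

# A raw string keeps the backslash instead of interpreting the escape.
raw = r'\n'
print(len('\n'), len(raw))
```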
Emoticons:
>>> print('\U0001F680')
🚀
>>> a = '\U0001F9D1'  # 🧑
>>> b = '\U0000200D'  # '' (zero width joiner)
>>> c = '\U0001F680'  # 🚀
>>>
>>> astronaut = a + b + c
>>> print(astronaut)
🧑‍🚀
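Although the result is displayed as a single emoji (where the font supports it), the string still consists of three code points. A minimal sketch that inspects them with the standard `unicodedata` module:

```python
import unicodedata

a = '\U0001F9D1'
b = '\U0000200D'
c = '\U0001F680'
astronaut = a + b + c

# Print each code point with its official Unicode name.
for char in astronaut:
    print(f'U+{ord(char):04X}', unicodedata.name(char, '<unnamed>'))

code_points = len(astronaut)                 # number of code points
utf8_bytes = len(astronaut.encode('utf-8'))  # number of bytes in UTF-8
print(code_points, utf8_bytes)
```

So `len()` counts code points, not what a reader perceives as one character.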