# Python encoding nightmares

---

#### Encodings derive different meaning from bytes:

```python3
In [1]: '你好'.encode('utf8')
Out[1]: b'\xe4\xbd\xa0\xe5\xa5\xbd'

In [2]: '你好'.encode('utf8').decode('latin1')
Out[2]: 'ä½\xa0å¥½'
```

<table>
<tr>
<td>UTF8</td>
<td colspan=3>你</td>
<td colspan=3>好</td>
</tr>
<tr>
<td>bytes</td>
<td>0xe4</td>
<td>0xbd</td>
<td>0xa0</td>
<td>0xe5</td>
<td>0xa5</td>
<td>0xbd</td>
</tr>
<tr>
<td>latin1</td>
<td>ä</td>
<td>½</td>
<td>&nbsp;</td>
<td>å</td>
<td>¥</td>
<td>½</td>
</tr>
</table>

Note:
Bytes don't have a meaning by themselves. They only become meaningful together with an encoding or a well-defined protocol.

---

#### python2 str => array of bytes + helpers

```python2
In [1]: b'你好' == '你好'
Out[1]: True

In [2]: u'你好' == '你好'.decode('utf8')
Out[2]: True

In [3]: len('你好'), len(u'你好')
Out[3]: (6, 2)

In [4]: u'你好' == '你好'
/usr/local/bin/ipython2:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Out[4]: False
```

Note:
You can decode() python2 strings to get unicode strings.

---

#### So you want to decode? :trollface:

```python2
In [9]: '€'.encode('utf8')
----------------------------------------------------------
UnicodeDecodeError       Traceback (most recent call last)
<ipython-input-9-f4201f3d0b2f> in <module>()
----> 1 '€'.encode('utf8')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
```

Note:
The unicode type has not always been in python2, so you can call encode() on python2 strs even though they are already encoded. Nowadays this results in an implicit decode to unicode (with your default encoding, 'ascii') followed by the encode you asked for. That implicit decode can raise a UnicodeDecodeError.
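
The two hidden steps can be spelled out in python3, where nothing is decoded implicitly. This is only a rough sketch of the behaviour described above: `py2_style_encode` is a hypothetical helper made up for illustration, not something python2 or python3 ships.

```python
def py2_style_encode(raw, target='utf8', default='ascii'):
    """Hypothetical helper emulating python2's str.encode() on encoded bytes."""
    # Step 1: the implicit decode python2 performs with its default codec.
    text = raw.decode(default)
    # Step 2: the encode the caller actually asked for.
    return text.encode(target)


# Pure ASCII bytes survive the implicit decode.
ascii_result = py2_style_encode(b'hello')

# The utf8 bytes of the euro sign do not: the hidden decode step fails.
error_name = None
bad_byte = None
try:
    py2_style_encode('€'.encode('utf8'))
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
    bad_byte = exc.object[exc.start]

print(ascii_result, error_name, hex(bad_byte))
```

The failing byte is the first byte of the utf8 encoded euro sign, which matches the traceback on this slide.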

---

#### python3 str => array of unicode characters + helpers

```python3
In [1]: '你好'.encode() == '你好'
Out[1]: False

In [2]: b'\xe4\xbd\xa0\xe5\xa5\xbd' == '你好'.encode()
Out[2]: True

In [3]: len('你好')
Out[3]: 2

In [4]: u'你好' == '你好'
Out[4]: True
```

Note:
In python3, bytes hold encoded data and strings hold decoded text. You decode bytes and encode strings, never the opposite.

---

#### `from __future__ import antigravity`

```python2
In [1]: from __future__ import unicode_literals

In [2]: '你好', len('你好')
Out[2]: (u'\u4f60\u597d', 2)

In [3]: b'\xe4\xbd\xa0\xe5\xa5\xbd' == '你好'.encode('utf8')
Out[3]: True

In [4]: u'你好' == '你好'
Out[4]: True
```

Note:
The unicode_literals import changes the default string literal type from python2 style byte strings to python3 style unicode strings.

---

#### Fix strings but not default encodings

```python2
In [5]: '你好'.encode()
----------------------------------------------------------
UnicodeEncodeError       Traceback (most recent call last)
<ipython-input-11-cbd8a88207ff> in <module>()
----> 1 '你好'.encode()

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
```

Note:
The import does not change the default codec used for encoding and decoding from python2's 'ascii' to python3's 'utf8'. So let's just forget about python2.

---

#### remote shells -> polysh -> your terminal

```shell
(poly) vagrant@polysh:/vagrant$ polysh localhost
ready (1)> printf '%b' '\xac'
waiting (1/1)>
error: uncaptured python exception, closing channel
<polysh.remote_dispatcher.remote_dispatcher connected at 0x7fca92bcfe10>
(<class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte
[/usr/lib/python3.6/asyncore.py|readwrite|108]
[/usr/lib/python3.6/asyncore.py|handle_read_event|423]
[/poly/lib/python3.6/site-packages/polysh-0.5-py3.6.egg/polysh/remote_dispatcher.py|handle_read|268])
```

Note:
The python docs say: "The most important tip is: Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end."
So that's what I did, and it resulted in this. I would like to differ with the python docs and think we have to add: don't assume everybody will be nice and give you proper, valid utf8 encoded data, and don't just decode untrusted data.

---

#### remote shells -> polysh -> your terminal

```python3
In [12]: def read_from_awesome_source():
    ...:     for data in [b'fine', b'take this!\xca']:
    ...:         yield data
    ...:
    ...: for data in read_from_awesome_source():
    ...:     print(data.decode())
    ...:
fine
----------------------------------------------------------
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 10: unexpected end of data
```

Note:
Simplified by a great deal, the failing code from the last slide looks basically like this. As you can see, decoding fails at 0xCA: in utf8 that byte starts a two-byte sequence, but the data ends before the second byte arrives. Let's try to fix it.

---

#### How to handle untrusted data:

```python3
In [1]: print(b'This \xca works!'.decode('utf8', 'ignore'))
This  works!

In [2]: print(b'This \xca works!'.decode('utf8', 'replace'))
This � works!

In [3]: print(b'This \xca works!'.decode('utf8', 'backslashreplace'))
This \xca works!

In [4]: print(b'This \xca works!'.decode('utf8', 'strict'))
----------------------------------------------------------
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 5: invalid continuation byte
```

Note:
There are a couple of different ways to deal with untrusted encoded data in python3. The most useful error handlers are the ones shown here: strip out undecodable sequences ('ignore'), replace them with a marker ('replace'), or backslash escape them ('backslashreplace'). We don't do any of this in polysh, so that we don't mess with the user's data. What comes out of the remote shell should come out of our shell. Polysh therefore no longer decodes any content from the remote shell. Which brings me to my next topic.

---

#### String Formatting

```python3
In [1]: 'Named "{named_str}", indexed "{1}", objects __repr__ "{obj!r}", object attributes "{obj.__class__}", even left-padded!
"{unpadded_str:>20}"'.format(
   ...:     'first',
   ...:     'second',
   ...:     named_str='named',
   ...:     obj=list(),
   ...:     unpadded_str='unpadded',
   ...: )
Out[1]: 'Named "named", indexed "second", objects __repr__ "[]", object attributes "<class \'list\'>", even left-padded! "            unpadded"'
```

Note:
There are two ways to format a string in python. This is the new one, which I assume you have all seen and use. It uses the format method built into the python string class. PEP 3101 defined the new method and also implied that the old one would eventually be deprecated.

---

#### Bytes formatting - doesn't work with .format()

```python3
In [1]: b'{}'.format(1)
----------------------------------------------------------
AttributeError           Traceback (most recent call last)
<ipython-input-2-3404800b9d90> in <module>()
----> 1 b'{}'.format(1)

AttributeError: 'bytes' object has no attribute 'format'

In [4]: '{}'.format(1).encode()
Out[4]: b'1'
```

Note:
Sadly python3 bytes don't support formatting via the new method, while python2 strings do. Luckily you can mostly use string formatting and then encode the result to bytes. That wasn't really an option for polysh though, as we often format things we received from the remote side, and to format those we would first have to decode them. The python3 unicode support is often sold as if strings were now called bytes and unicode were now called string. For the latter that's mostly true, but the former got _much_ more strict.

---

#### Bytes formatting - works with `b'' % ()`

```python3
In [9]: b'Bytes %(action)b works in python%(version)f' % {b'action': b'formatting', b'version': 3.5}
Out[9]: b'Bytes formatting works in python3.500000'

In [10]: b'This works in python2 %s' % (u'foo')
----------------------------------------------------------
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
```

Note:
The percent operator '%' can take a string or bytes on the left and a tuple or dict on the right with the values to template.
It's pretty powerful with python2 strings, python2 unicode and python3 strings, but very basic with python3 bytes. There are also small differences in what gets automatically cast to the correct datatype when formatted into a string. Note how I used %b in the first format string. %s still works with python3 bytes, but only to keep format strings compatible with both python2 and python3.

---

#### Bytes formatting - works with `b'' % ()`.. mostly

```python3.4
In [1]: b'foo' + b'bar'
Out[1]: b'foobar'

In [2]: b'%b%b' % (b'foo', b'bar')
----------------------------------------------------------
TypeError                Traceback (most recent call last)
<ipython-input-5-d3eace36b538> in <module>()
----> 1 b'%b%b' % (b'foo', b'bar')

TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'
```

Note:
PEP 461, which added bytes formatting, is actually rather new, and the feature was only added in python3.5. We only have python3.4 on jessie though. If we want to support python3.4 and earlier, we need to concatenate bytes with pluses. I find it really unfortunate that python took so long to support bytes formatting and then decided not to go with a format method on the bytes object.

---

# Questions?