
Python encoding nightmares


Encodings derive different meaning from bytes:

In [1]: '你好'.encode('utf8')
Out[1]: b'\xe4\xbd\xa0\xe5\xa5\xbd'

In [2]: '你好'.encode('utf8').decode('latin1')
Out[2]: 'ä½\xa0å¥½'

UTF8     你                好
bytes    0xe4  0xbd  0xa0  0xe5  0xa5  0xbd
latin1   ä     ½     NBSP  å     ¥     ½

Note:
Bytes don't have a meaning by themselves; only together with an encoding or a well-defined protocol do they become meaningful.
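
As a small python3 sketch of that point (the variable names are only illustrative): the same six bytes yield two characters under utf8 and six under latin1, and because latin1 maps every byte to exactly one character, the wrong decode can be undone.

```python
# The same bytes, interpreted with two different encodings.
data = "你好".encode("utf-8")

as_utf8 = data.decode("utf-8")      # the two characters we started with
as_latin1 = data.decode("latin-1")  # six unrelated characters (mojibake)

# latin-1 maps every byte to exactly one character, so the mistake is reversible.
recovered = as_latin1.encode("latin-1").decode("utf-8")
```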


python2 str => array of bytes + helpers

In [1]: b'你好' == '你好'
Out[1]: True

In [2]: u'你好' == '你好'.decode('utf8')
Out[2]: True

In [3]: len('你好'), len(u'你好')
Out[3]: (6, 2)

In [4]: u'你好' == '你好'
/usr/local/bin/ipython2:1: UnicodeWarning: Unicode equal
comparison failed to convert both arguments to Unicode
- interpreting them as being unequal
Out[4]: False

Note:
In python2 you can decode() a str (raw bytes) to get a unicode string.


So you want to decode? :trollface:

In [9]: '€'.encode('utf8')
----------------------------------------------------------
UnicodeDecodeError   Traceback (most recent call last)
<ipython-input-9-f4201f3d0b2f> in <module>()
----> 1 '€'.encode('utf8')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2
in position 0: ordinal not in range(128)

Note:
The unicode type has not always been part of python2, so you can still call encode() on python2 strs, which are already encoded.
Nowadays this results in an implicit decode to unicode (with your default encoding, 'ascii') followed by the encode you asked for. That implicit decode can raise a UnicodeDecodeError.
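
To make the two hidden steps visible, here is a rough python3 emulation of what python2 does. The helper name py2_style_encode is made up for this sketch, and it only assumes the python2 default codec 'ascii' mentioned above.

```python
def py2_style_encode(raw: bytes, target: str = "utf-8") -> bytes:
    """Hypothetical helper: mimic python2's str.encode() on already-encoded bytes."""
    # Step 1: the implicit decode with the default codec ('ascii' in python2).
    text = raw.decode("ascii")
    # Step 2: the encode that was actually requested.
    return text.encode(target)


euro = "€".encode("utf-8")  # three utf8 bytes, the first one is 0xe2

try:
    py2_style_encode(euro)
    failed = False
except UnicodeDecodeError as exc:
    # The error comes from step 1, not from the encode we asked for.
    failed = True
    failing_byte = euro[exc.start]
```

Pure ASCII input passes through both steps unchanged, which is why this kind of bug tends to stay hidden until the first non-ASCII character shows up.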


python3 str => array of unicode characters + helpers

In [1]: '你好'.encode() == '你好'
Out[1]: False

In [2]: b'\xe4\xbd\xa0\xe5\xa5\xbd' == '你好'.encode()
Out[2]: True

In [3]: len('你好')
Out[3]: 2

In [4]: u'你好' == '你好'
Out[4]: True

Note:
In python3 bytes are encoded data and strings are decoded text: you decode() bytes and encode() strings, never the opposite.
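
A minimal sketch of that one-way rule. The helpers to_text and to_bytes are hypothetical names, not part of any library, and utf8 is just an assumed default.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return str, decoding only if given bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Hypothetical helper: return bytes, encoding only if given str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


# python3 enforces the direction: there is no bytes.encode and no str.decode.
bytes_can_encode = hasattr(b"", "encode")
str_can_decode = hasattr("", "decode")
```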


from __future__ import antigravity

In [1]: from __future__ import unicode_literals

In [2]: '你好', len('你好')
Out[2]: (u'\u4f60\u597d', 2)

In [3]: b'\xe4\xbd\xa0\xe5\xa5\xbd' == '你好'.encode('utf8')
Out[3]: True

In [4]: u'你好' == '你好'
Out[4]: True

Note:
unicode_literals changes the type of unprefixed string literals in python2 from python2-style byte strings to python3-style unicode strings.


Fix strings but not default encodings

In [5]: '你好'.encode()
----------------------------------------------------------
UnicodeEncodeError      Traceback (most recent call last)
<ipython-input-11-cbd8a88207ff> in <module>()
----> 1 '你好'.encode()

UnicodeEncodeError: 'ascii' codec can't encode characters
in position 0-1: ordinal not in range(128)

Note:
It does not change the default codec used for encoding and decoding from python2's 'ascii' to python3's 'utf8'. So let's just forget about python2.
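
For comparison, a short python3 sketch: here the default codec is utf8, so the same call succeeds. Passing the codec explicitly is a habit I would suggest anyway, since it keeps the code readable on both versions.

```python
import sys

# On python3 this reports 'utf-8'; python2 reports 'ascii'.
default_codec = sys.getdefaultencoding()

implicit = "你好".encode()          # relies on the default codec
explicit = "你好".encode("utf-8")   # says what it means
```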


remote shells -> polysh -> your terminal

(poly) vagrant@polysh:/vagrant$ polysh localhost
ready (1)> printf '%b' '\xac'
waiting (1/1)> error: uncaptured python exception, closing
channel <polysh.remote_dispatcher.remote_dispatcher
connected at 0x7fca92bcfe10> (<class 'UnicodeDecodeError'>:
'utf-8' codec can't decode byte 0xac in position 0: invalid
start byte [/usr/lib/python3.6/asyncore.py|readwrite|108]
[/usr/lib/python3.6/asyncore.py|handle_read_event|423]
[/poly/lib/python3.6/site-packages/polysh-0.5-py3.6.egg
/polysh/remote_dispatcher.py|handle_read|268])

Note:
The python docs say: "The most important tip is: Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end." So that's what I did, and it resulted in this. I would like to amend that advice: don't assume everybody will be nice and give you proper, valid utf8-encoded data, and don't blindly decode untrusted data.


remote shells -> polysh -> your terminal

In [12]: def read_from_awesome_source():
    ...:     for data in [b'fine', b'take this!\xca']:
    ...:         yield data
    ...:
    ...: for data in read_from_awesome_source():
    ...:     print(data.decode())
    ...:
fine
----------------------------------------------------------
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca
in position 10: unexpected end of data

Note:
Simplified a great deal, the failing code from the last slide looks basically like this. As you can see, decoding fails at 0xCA: in utf8 that byte starts a two-byte sequence, but no continuation byte follows. Let's try to fix it.
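
One related case worth separating out: when data arrives in chunks, a valid multi-byte character can be split across two reads. The sketch below uses the standard library's incremental decoder to hold back a dangling lead byte until the next chunk. It assumes the missing byte really does arrive later and does nothing for data that is plainly invalid.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

# 0xCA starts a two-byte sequence, so the decoder buffers it instead of failing.
first = decoder.decode(b"take this!\xca")
# The continuation byte arrives with the next chunk and completes the character.
second = decoder.decode(b"\xa0")

try:
    # final=True says no more data is coming, so a dangling lead byte is an error.
    codecs.getincrementaldecoder("utf-8")().decode(b"take this!\xca", final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```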


How to handle untrusted data:

In [1]: print(b'This \xca works!'.decode('utf8', 'ignore'))
This  works!

In [2]: print(b'This \xca works!'.decode('utf8', 'replace'))
This � works!

In [3]: print(b'This \xca works!'.decode('utf8', 'backslashreplace'))
This \xca works!

In [4]: print(b'This \xca works!'.decode('utf8', 'strict'))
----------------------------------------------------------
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca
in position 5: invalid continuation byte

Note:
There are a couple of different ways to deal with untrusted encoded data in python3. Some of the most useful error handlers are the ones shown here: 'ignore' strips undecodable sequences, 'replace' substitutes a marker for them, and 'backslashreplace' backslash-escapes them.
We don't do this in polysh, so as not to mess with the user's data. What comes out of the remote shell should come out of our shell. Polysh therefore no longer decodes any content from the remote shell. Which brings me to my next topic.
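
The idea of not decoding at all can be sketched like this. This is not polysh's actual code: forward is a made-up helper, and io.BytesIO stands in for a binary stream such as sys.stdout.buffer.

```python
import io


def forward(chunk: bytes, sink) -> None:
    """Hypothetical helper: pass remote output through untouched."""
    # No decode, no error handler: the bytes are written exactly as received.
    sink.write(chunk)


sink = io.BytesIO()  # stand-in for a binary stream such as sys.stdout.buffer
for chunk in [b"fine\n", b"take this!\xca\n"]:
    forward(chunk, sink)
```

The invalid byte survives the trip, and it becomes the terminal's job to decide how to display it.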


String Formatting

In [1]: ('Named "{named_str}", indexed "{1}", objects '
   ...:  '__repr__ "{obj!r}", object attributes "{obj.__class__}", '
   ...:  'even left-padded! "{unpadded_str:>20}"').format(
   ...:     'first',
   ...:     'second',
   ...:     named_str='named',
   ...:     obj=list(),
   ...:     unpadded_str='unpadded',
   ...: )
Out[1]: 'Named "named", indexed "second", objects __repr__ "[]", object attributes "<class \'list\'>", even left-padded! "            unpadded"'

Note:
There are two ways to format a string in python. This is the new one, which I assume you have all seen and used. It uses the format method built into the python string class.
PEP 3101 defined the new method and also implied that the old one would eventually be deprecated.
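
For reference, here are the two styles side by side in a small sketch. The values are made up; both lines produce the same str.

```python
name, count = "polysh", 3

# Old style: the percent operator with printf-like conversion codes.
old_style = "%s has %d shells" % (name, count)

# New style: the format method from PEP 3101.
new_style = "{} has {:d} shells".format(name, count)
```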


Bytes formatting - doesn't work with .format()

In [1]: b'{}'.format(1)
----------------------------------------------------------
AttributeError           Traceback (most recent call last)
<ipython-input-2-3404800b9d90> in <module>()
----> 1 b'{}'.format(1)

AttributeError: 'bytes' object has no attribute 'format'

In [4]: '{}'.format(1).encode()
Out[4]: b'1'

Note:
Sadly python3 bytes don't support formatting via the new method, while python2 strings do. Luckily you can mostly use string formatting and then encode the result to bytes.
That wasn't really an option for polysh though as we often format things we received from the remote side. To format it we would first have to decode it.
Often the python3 unicode support is sold as if python2 strings were now called bytes and unicode were now called str. For the latter that's mostly true, but the former got much more strict.
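
A short sketch of why "format as a string, then encode" gets awkward once received bytes are involved. The value of received is invented for the example, and decoding it as utf8 is an assumption that untrusted data may not honour.

```python
received = b"host1"  # pretend this came from the remote side

# Formatting bytes straight into a str embeds their repr, b'' prefix and all.
naive = "{}: ready".format(received)

# To get clean output we have to decode first, then encode the result again.
decoded_first = "{}: ready".format(received.decode("utf-8")).encode("utf-8")
```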


Bytes formatting - works with b'' % ()

In [9]: b'Bytes %(action)b works in python%(version)f' %
{b'action': b'formatting', b'version': 3.5}
Out[9]: b'Bytes formatting works in python3.500000'

In [10]: b'This works in python2 %s' % (u'foo')
----------------------------------------------------------
TypeError: %b requires a bytes-like object, or an object
that implements __bytes__, not 'str'

Note:
The percent operator '%' takes a string or bytes on the left and a tuple or dict on the right with the values to fill in. It's pretty powerful with python2 strings, python2 unicode and python3 strings, but very basic with python3 bytes. In addition, there are small differences in what is automatically cast to the correct data type when formatted into a string.
Note how I used %b in the first string. %s still works with python3 bytes, but only to keep such format strings compatible with both python2 and python3.
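
A compact sketch of the bytes conversions discussed here, assuming python3.5 or newer. The host and port values are made up.

```python
host, port = b"localhost", 22

with_b = b"%b:%d" % (host, port)   # %b takes bytes-like objects
with_s = b"%s:%d" % (host, port)   # %s is accepted too and behaves like %b

try:
    b"%b" % ("text",)              # a str is not silently encoded for us
    str_rejected = False
except TypeError:
    str_rejected = True
```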


Bytes formatting - works with b'' % ().. mostly

In [1]: b'foo' + b'bar'
Out[1]: b'foobar'

In [2]: b'%b%b' % (b'foo', b'bar')
----------------------------------------------------------
TypeError                Traceback (most recent call last)
<ipython-input-5-d3eace36b538> in <module>()
----> 1 b'%b%b' % (b'foo', b'bar')

TypeError: unsupported operand type(s) for %: 'bytes' and
'tuple'

Note:
PEP 461, which added bytes formatting, is actually rather new, and the feature was only added in python3.5. We only have python3.4 on jessie though. If we want to support python3.4 and earlier, we need to concatenate bytes with plus signs.
I find it really unfortunate that python took so long to support bytes formatting and then decided not to go with a format method on the bytes object.
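
What the python3.4-compatible workaround might look like, as a sketch. Both helper names are hypothetical, and either variant avoids the % operator entirely.

```python
def label_line(name: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: prefix a payload with a name using plain concatenation."""
    return name + b" : " + payload


def label_line_join(name: bytes, payload: bytes) -> bytes:
    """Same result via join, which reads better once there are many parts."""
    return b"".join([name, b" : ", payload])


line = label_line(b"host1", b"ok\xca")
```

Because nothing is decoded, the stray 0xCA byte in the payload passes through untouched.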


Questions?