UTF8 | 你 | | | 好 | | |
bytes | 0xe4 | 0xbd | 0xa0 | 0xe5 | 0xa5 | 0xbd |
latin1 | ä | ½ | (NBSP) | å | ¥ | ½ |
Note:
Bytes don't have a meaning by themselves. Only together with an encoding or a well-defined protocol do they become meaningful.
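A small sketch of the table above, to make the point concrete: the same six bytes give two characters when read as utf8 and six characters when read as latin1. The variable names are just mine for illustration.

```python
# The six bytes from the table.
raw = bytes([0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd])

# Same bytes, two encodings, two different texts.
as_utf8 = raw.decode('utf8')      # three bytes per character here
as_latin1 = raw.decode('latin1')  # always one byte per character

print(as_utf8)
print(len(as_utf8), len(as_latin1))
```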
Note:
In python2 you can decode() byte strings (str) to unicode strings.
Note:
The unicode type has not always been in python2, so you can call encode on python2 strs, which are already encoded.
Nowadays this results in an implicit decode to unicode (with your default encoding) followed by the encode you asked for. That implicit decode can raise a UnicodeDecodeError.
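Python2 code won't run on python3, so here is only a rough python3 spelling of what that implicit step amounts to, assuming the default codec is 'ascii'. It is a sketch of the behaviour, not what the interpreter literally executes.

```python
data = '你好'.encode('utf8')  # already encoded, like a python2 str

try:
    # Roughly what python2 does behind data.encode('utf8'):
    # decode with the default codec first, then encode.
    data.decode('ascii').encode('utf8')
    implicit_decode_failed = False
except UnicodeDecodeError as error:
    implicit_decode_failed = True
    print(error)
```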
Note:
In python3 bytes are encoded data and strings are decoded data: you decode bytes and you encode strings. Never the opposite.
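A quick way to see this for yourself: in python3 the wrong direction isn't even available as a method. The sample text is arbitrary.

```python
text = 'héllo'
data = text.encode('utf8')  # str -> bytes
back = data.decode('utf8')  # bytes -> str

# str has no decode() and bytes has no encode() in python3.
print(hasattr(text, 'decode'), hasattr(data, 'encode'))
```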
from __future__ import antigravity
Note:
`from __future__ import unicode_literals` changes the default type of string literals from python2 style strings to python3 style unicode strings.
Note:
It does not change the default encoding used for encoding and decoding from python2's 'ascii' to python3's 'utf8'. So let's just forget about python2.
Note:
The python docs say: "The most important tip is: Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end." So that's what I did. It resulted in this. I would like to differ with the python docs and add: don't assume everybody will be nice and give you proper, valid utf8 encoded data, and don't just decode untrusted data.
Note:
Simplified by a great deal, the failing code from the last slide looks basically like this. As you can see, decoding fails at 0xCA, because that byte does not form a valid utf8 sequence there. Let's try to fix it.
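For the notes, a minimal reproduction of that kind of failure. The byte string is made up by me and is not the actual data polysh received; it just ends in a 0xCA that has no continuation byte.

```python
# Hypothetical remote output: 0xCA starts a two-byte sequence that never ends.
received = b'ok \xca'

try:
    received.decode('utf8')
    decode_error = None
except UnicodeDecodeError as error:
    decode_error = error
    print(error)
```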
Note:
There are a couple of different methods to deal with untrusted encoded data in python3. Some of the most useful are these: strip out the undecodable sequences, replace them with a marker, or backslash escape them.
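Assuming those three methods map to the 'ignore', 'replace' and 'backslashreplace' error handlers of decode(), a sketch looks like this. The input bytes are invented, and 'backslashreplace' for decoding needs python3.5 or newer.

```python
received = b'a\xcab'  # hypothetical input with one undecodable byte

stripped = received.decode('utf8', errors='ignore')           # drop it
marked = received.decode('utf8', errors='replace')            # U+FFFD marker
escaped = received.decode('utf8', errors='backslashreplace')  # literal \xca

print(repr(stripped), repr(marked), repr(escaped))
```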
We don't do this in polysh, so as not to mess with the user's data. What comes out of the remote shell should come out of our shell. Polysh therefore no longer decodes any content from the remote shell. Which brings me to my next topic.
Note:
There are two ways in python to format a string. This is the new one, which I assume you have all seen and use. It uses the format method built into the python string class.
PEP 3101 defined the new method and also implied that the old version would eventually be deprecated.
Note:
Sadly python3 bytes don't support formatting via the new method, while python2 strings do. Luckily you can mostly use string formatting and then encode the result to bytes.
That wasn't really an option for polysh though as we often format things we received from the remote side. To format it we would first have to decode it.
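As a sketch of the "format, then encode" route for data you control yourself (the message text is made up):

```python
# Format as a string first, then encode the finished result to bytes.
status = 'exit code: {}'.format(3).encode('utf8')
print(status)

# bytes itself has no format() method in python3.
print(hasattr(bytes, 'format'))
```

This only works when every value going into the template is already a string or a number. Anything that arrived as bytes would have to be decoded first, which is exactly the step polysh wants to avoid.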
Often the python3 unicode support is sold as if strings were now called bytes and unicode were now called string. For the latter that's mostly true, but the former got much stricter.
b'' % ()
Note:
The percent operator '%' can take a string or bytes on the left and a tuple or dict on the right with the values to template. It's pretty powerful with python2 strings, python2 unicode and python3 strings, but very basic with python3 bytes. Furthermore, there are small differences in what is automatically cast to the correct datatype when formatted into a string.
Note how I used percent b in the first string. Percent s still works in python3, but that is only to make these format strings compatible with both python2 and python3.
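A small sketch of bytes formatting with the percent operator, assuming python3.5 or newer. The host name and exit code are invented values.

```python
host = b'web1'  # hypothetical value received as bytes

line = b'%b exited with %d' % (host, 3)
same = b'%s exited with %d' % (host, 3)  # %s is accepted as an alias for %b

# A str is not converted for you; it has to be encoded first.
try:
    b'%b' % ('web1',)
    str_rejected = False
except TypeError:
    str_rejected = True

print(line, str_rejected)
```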
b'' % () .. mostly
Note:
PEP 461, which added bytes formatting, is actually rather new and the feature was only added in python3.5. We only have python3.4 on jessie though. If we want to support python3.4 and earlier, we need to concatenate bytes with plus signs.
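Roughly what that concatenation looks like, as a sketch with made-up values and not the actual polysh code:

```python
host = b'web1'  # hypothetical value received as bytes
code = 3

# Numbers have to be turned into bytes by hand before joining.
line = host + b' exited with ' + str(code).encode('ascii')

# bytes.join is an alternative when there are many parts.
alt = b' '.join([host, b'exited with', str(code).encode('ascii')])

print(line)
```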
I find it really unfortunate that python took so long to support bytes formatting and then decided not to go with a format method on the bytes object.