

Meta How should I organize material about text encoding in Python into questions?


posted 2y ago by Karl Knechtel‭

Answer

#1: Initial revision by

Karl Knechtel‭ · 2023-08-30T16:49:39Z (almost 2 years ago)


Here is my current thinking on the matter.

1. Regarding questions/facets that are "two sides of the same coin" - encoding vs. decoding the data, reading vs. writing files - I think each pair should be addressed in one breath.

1. Regarding the Python documentation: I think it is better *cited*, on demand, in the places where it helps explain a concept; questions that exist simply to point at the relevant documentation are not very useful. Someone who has the initiative to figure out a problem from reading the documentation will probably not find searching for it a significant barrier (perhaps by including `site:docs.python.org` in a search query).

1. I imagine the topics split roughly as follows:

> * What is *an encoding*?
> * What are *encoding* (the process) and *decoding*? How do I know which is which?
> * Why do I need a text encoding? *When* do I need one?

These are the same question, and Andreas has helped me realize that it's language-agnostic. There might be a need for a separate question specifically to define/explain the concept of *ASCII-transparent* encodings.
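
To make the *ASCII-transparent* idea concrete, here is a minimal sketch. It assumes the term means that pure-ASCII text encodes to exactly the bytes ASCII itself would produce; the sample text and variable names are illustrative only.

```python
text = "hello"  # sample text containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
utf16_bytes = text.encode("utf-16-le")

# UTF-8 and Latin-1 agree with ASCII on ASCII-only text...
same_as_ascii = utf8_bytes == ascii_bytes and latin1_bytes == ascii_bytes

# ...whereas UTF-16 uses two bytes per character here, so it does not.
utf16_differs = utf16_bytes != ascii_bytes

print(same_as_ascii, utf16_differs, len(ascii_bytes), len(utf16_bytes))
```

This only demonstrates the property for one input; it is not a proof that a given encoding is ASCII-transparent in general.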

> * How can I know which text encoding to use?
> * How can I know if/how much freedom I have in choosing a text encoding?

There's one practical underlying question here, which is "how do I determine what encoding is appropriate to decode some existing data?". In less formal language, that is "how do I determine the encoding of text?", and that should be a separate question. The other ways to interpret these questions are trivial, and their answers are *implied by* a proper understanding of what text encodings actually are.
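
As a rough sketch of why "how do I determine the encoding of text?" deserves its own question: one naive approach is to try candidate encodings in order and keep the first that decodes without error. The function name `guess_encoding` and the candidate list are hypothetical, chosen only for illustration.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right; try the next one
        return name
    return None


sample = "café".encode("utf-8")
guess = guess_encoding(sample)

# Decoding "successfully" is not proof of correctness: the same bytes also
# decode as cp1252, just to different (wrong) text.
mojibake = sample.decode("cp1252")
print(guess, mojibake)
```

The result depends on the order of the candidates, and a clean decode only rules encodings *out*; it cannot confirm that the chosen one was the encoding actually used.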

> * Are encodings used for other things? Why?
> * What is the `codecs` standard library module for, and how does it relate to text encoding?

These belong together. The concept of non-text encodings that I have in mind is really a Python-specific idea that pertains specifically to the `codecs` module, and it's somewhat esoteric - just like the need to use `codecs` generally.
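
For illustration, a small sketch of what "non-text encodings" looks like in Python 3, assuming the `hex` and `rot13` codecs from the standard library are representative examples. The variable names are illustrative.

```python
import codecs

# A bytes-to-bytes transform: no text is involved at all.
hex_bytes = codecs.encode(b"hi", "hex")
round_trip = codecs.decode(hex_bytes, "hex")

# A str-to-str transform.
rot = codecs.encode("Hello", "rot13")

# str.encode() refuses codecs that are not text encodings,
# which is why codecs.encode() is needed for them.
try:
    "hi".encode("hex")
    refused = False
except LookupError:
    refused = True

print(hex_bytes, round_trip, rot, refused)
```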

> How do I specify an encoding for converting bytes to a string or vice-versa?

Originally I thought this belonged as part of the first question, defining encodings generally. However, Andreas' arguments have convinced me to separate it. Rather than "specify an encoding", it should really just be phrased in terms of doing the conversion; the *fact that this involves an encoding* should only surface in the *answer*, which will cite the first question for background.
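
An answer to that question would presumably centre on something like the following minimal sketch, which uses UTF-8 as an assumed example encoding:

```python
text = "naïve"

# str -> bytes: encoding
data = text.encode("utf-8")

# bytes -> str: decoding
restored = data.decode("utf-8")

# The constructors accept an encoding too and do the same conversions.
data_via_constructor = bytes(text, "utf-8")
text_via_constructor = str(data, "utf-8")

print(data, restored)
```

The encoding name appears only as an argument to the conversion, which matches the idea that the encoding should surface in the answer rather than in the question.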

> ... for reading and writing files?

The same reasoning applies here: this should also be a separate question.
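
A minimal sketch of the file case, assuming UTF-8 and a throwaway temporary file (the file name is arbitrary):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # Text mode: pass the encoding explicitly rather than relying on the
    # default, which can depend on the platform and Python version.
    with open(path, "w", encoding="utf-8") as f:
        f.write("naïve")

    with open(path, "r", encoding="utf-8") as f:
        text_back = f.read()

    # Binary mode: no decoding happens, so we see the encoded bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(text_back, raw)
```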

> [... for other purposes?]

These should also be separate, but I also want to hold off on them entirely until a need is demonstrated. While "reading and writing files" is obvious, I can't foresee all the possible libraries people could ask about, or predict which are most important / relevant to others on average. Once there are multiple "motivated" questions, if none is at canonical quality, I can make a canonical.

> * Historical: in Python 2.x, why can attempts to decode cause `UnicodeEncodeError`, and vice-versa?
> * Historical / migration: how should I understand the type names `bytes`, `str` and `unicode` in 2.x vs 3.x?
> * Historical: What was `basestring` in 2.x and why was it needed?
> * Historical / migration: why did 2.x treat those types the way it did, and why does 3.x treat them differently? Why shouldn't I try to emulate the old approaches in new code?

I want to combine these general ideas and then most likely split them back into two questions: 

* one about the differences between 2.x and 3.x handling of these types;

* and one about how to understand what legacy 2.x code is doing and migrate it to 3.x.

The "Why shouldn't I try to emulate the old approaches in new code?" part doesn't actually need to be called out specially; to the extent that it isn't obvious from the rest of the explanation, it's subjective.

> What are `UnicodeEncodeError` and `UnicodeDecodeError`? What do they mean; what causes them; and how do I resolve them?

On reflection, I don't think this is actually a separate topic worth addressing. The meanings of the names are straightforward; the only confusing part is when 2.x gives `UnicodeEncodeError` from a decoding attempt or vice-versa, which is explained in stride when considering the legacy text-handling behaviour. Anything else that needs to be said about these exceptions can similarly be explained in stride in other questions.
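
For reference, a short Python 3 sketch of how each exception arises in the straightforward case; the sample strings are illustrative, and `errors="replace"` is shown as just one of the available error handlers.

```python
# Encoding fails when the target encoding cannot represent a character.
try:
    "café".encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

# Decoding fails when the bytes are not valid in the assumed encoding
# (here, Latin-1 bytes wrongly treated as UTF-8).
try:
    b"caf\xe9".decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc

# An error handler substitutes a placeholder instead of raising.
lossy_bytes = "café".encode("ascii", errors="replace")
lossy_text = b"caf\xe9".decode("utf-8", errors="replace")

print(type(encode_error).__name__, type(decode_error).__name__)
print(lossy_bytes, lossy_text)
```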

By my count, this makes 6 to 8 (probably 8) Q&A pairs, from 16 originally proposed facets (that could be broken down further if one tried). This seems like a satisfactory result.
