
Welcome to Software Development on Codidact!

Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.

How should I organize material about text encoding in Python into questions?

+6
−0

I want to write one or more self-answered Q&As on the topic of text encoding in Python, to serve as canonicals and preempt future lower-quality questions. I can think of the following things that need to be addressed:

  • What is an encoding?
  • What are encoding (the process) and decoding? How do I know which is which?
  • Why do I need a text encoding? When do I need one?
  • How can I know which text encoding to use?
  • How can I know if/how much freedom I have in choosing a text encoding?
  • Are encodings used for other things? Why?
  • How do I specify an encoding...
    • for converting bytes to a string or vice-versa?
    • for reading and writing files?
    • when working with web libraries such as Requests, BeautifulSoup etc.?
    • when using a library to parse formats like CSV, JSON etc.?
  • What is the codecs standard library module for, and how does it relate to text encoding?
  • What are UnicodeEncodeError and UnicodeDecodeError? What do they mean; what causes them; and how do I resolve them?
  • Historical: in Python 2.x, why can attempts to decode cause UnicodeEncodeError, and vice-versa?
  • Historical / migration: how should I understand the type names bytes, str and unicode in 2.x vs 3.x?
  • Historical: What was basestring in 2.x and why was it needed?
  • Historical / migration: why did 2.x treat those types the way it did, and why does 3.x treat them differently? Why shouldn't I try to emulate the old approaches in new code?

There might be more that I'm forgetting.

My question here is, how should I organize these facets of the topic into questions? I don't think all of this material can be covered in a single post, but making things too fine-grained makes things awkward in the future - it becomes too hard to search for the right question because you find the other ones instead, and the material becomes redundant between questions.


3 answers


+0
−1

Much of this is already covered in various sources like https://docs.python.org/3/howto/unicode.html. Although there are issues with relying on links, I figure official documentation is probably fair game.

I would start with two types of question:

  • "Where can I find detailed information about text encoding in Python?" and in the answer link to authoritative sources like the official docs.
  • Questions covering individual points that are not clear from the previous ones.

Also, this is a great argument for enabling articles in Software Development.


+4
−0

The list you provided seems too large for a single Q/A. I think you're better off breaking it up and linking the pieces to each other in the answers. The theory behind text encoding can be decoupled from Python, as it's relevant to the whole world of programming. If you already know what text encoding is and how it works, and you're just looking for how to use it in Python (perhaps because you forgot, or because you're coming from a different language), you shouldn't have to scroll through a long description of the concept first.

What are UnicodeEncodeError and UnicodeDecodeError? What do they mean; what causes them; and how do I resolve them?

This could be a debugging Q/A for text encoding in Python.

How do I specify an encoding...

  • for converting bytes to a string or vice-versa?
  • for reading and writing files?
  • when working with web libraries such as Requests, BeautifulSoup etc.?
  • when using a library to parse formats like CSV, JSON etc.?

These don't belong in one and the same Q/A. Have a separate Q/A for each one of them. You can give a broad overview of strings vs blobs, though, in an explanation specific to Python.
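As a rough illustration of what a "strings vs blobs" overview might open with, here is a minimal sketch. The sample text and the choice of UTF-8 are assumptions made only for the example.

```python
# str holds Unicode code points; bytes holds raw integers in range(256).
text = "caf\u00e9"            # "café", with a precomposed é (U+00E9)
blob = text.encode("utf-8")   # UTF-8 is chosen only for illustration

text_length = len(text)       # counts code points
blob_length = len(blob)       # counts bytes; é takes two bytes in UTF-8

first_char = text[0]          # indexing a str gives a one-character str
first_byte = blob[0]          # indexing bytes gives an int

# The two types never compare equal, even for pure-ASCII content.
same = ("abc" == b"abc")

print(text_length, blob_length, repr(first_char), first_byte, same)
```

The point of such an overview would be that the two types are not interchangeable, which is why a conversion step (and therefore an encoding) is needed at all.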

What is the codecs standard library module for, and how does it relate to text encoding?

Probably works best as a separate Q/A.

  • Historical: in Python 2.x, why can attempts to decode cause UnicodeEncodeError, and vice-versa?
  • Historical / migration: how should I understand the type names bytes, str and unicode in 2.x vs 3.x?
  • Historical: What was basestring in 2.x and why was it needed?
  • Historical / migration: why did 2.x treat those types the way it did, and why does 3.x treat them differently? Why shouldn't I try to emulate the old approaches in new code?

These can either be separate Q/As, tagged accordingly, or coupled together in a "text encoding in Python 2 vs Python 3" Q/A.


+1
−0

Here is my current thinking on the matter.

  1. Regarding questions/facets that are "two sides of the same coin" - encoding vs decoding the data, reading vs. writing files - I think they should be addressed in one breath.

  2. Regarding the Python documentation: I think it is better cited on demand, in the places where it helps explain a concept; questions that exist simply to point at the relevant documentation are not very useful. Someone who has the initiative to figure out a problem from reading the documentation will probably not find a significant barrier in searching for it (perhaps including site:docs.python.org in a search query).

  3. I imagine the topics split roughly as follows:

  • What is an encoding?
  • What are encoding (the process) and decoding? How do I know which is which?
  • Why do I need a text encoding? When do I need one?

These are the same question, and Andreas has helped me realize that it's language-agnostic. There might be a need for a separate question specifically to define/explain the concept of ASCII-transparent encodings.
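To make "which is which" and "ASCII-transparent" concrete, a small sketch follows. The list of encodings is an arbitrary selection for illustration, not a complete catalogue.

```python
# Encoding goes from str to bytes; decoding goes from bytes to str.
ascii_text = "snake"

# Several common encodings produce identical bytes for pure-ASCII text.
names = ["ascii", "utf-8", "latin-1", "cp1252"]
encoded = {name: ascii_text.encode(name) for name in names}
all_agree = len(set(encoded.values())) == 1

# UTF-16 is not ASCII-transparent: each character becomes two bytes.
utf16 = ascii_text.encode("utf-16-le")

# Decoding reverses the process, given the same encoding.
decoded = utf16.decode("utf-16-le")

print(all_agree, utf16, decoded)
```

Here `all_agree` is true because those four encodings map the ASCII range to the same byte values, while the UTF-16 result is twice as long and differs from the others.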

  • How can I know which text encoding to use?
  • How can I know if/how much freedom I have in choosing a text encoding?

There's one practical underlying question here, which is "how do I determine what encoding is appropriate to decode some existing data?". In less formal language, that is "how do I determine the encoding of text?", and that should be a separate question. The other ways to interpret these questions are trivial, and their answers are implied by a proper understanding of what text encodings actually are.
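One partial approach to "how do I determine the encoding of text?" is trial decoding. The sketch below uses a hypothetical helper, `candidate_encodings`, and a made-up sample; it can only rule candidates out, not prove that a candidate is the right one.

```python
def candidate_encodings(data, candidates):
    """Return the candidate names that decode `data` without an error.

    Hypothetical helper for illustration only. Success does not prove
    the encoding is correct: latin-1, for instance, accepts any bytes.
    """
    workable = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        workable.append(name)
    return workable


# Sample data: "naïve" encoded as Latin-1, i.e. b'na\xefve'.
data = "na\u00efve".encode("latin-1")
result = candidate_encodings(data, ["ascii", "utf-8", "latin-1"])
print(result)
```

In this case ASCII and UTF-8 are both rejected, because the byte 0xEF is outside the ASCII range and is not followed by valid UTF-8 continuation bytes. In general, metadata about the data (headers, file format specifications, documentation from whoever produced it) is a more dependable source than guessing.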

  • Are encodings used for other things? Why?
  • What is the codecs standard library module for, and how does it relate to text encoding?

These belong together. The concept of non-text encodings that I have in mind is really a Python-specific idea that pertains specifically to the codecs module, and it's somewhat esoteric - just like the need to use codecs generally.
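For readers unfamiliar with the idea, here is a short sketch of non-text codecs, using the standard library's "hex" (bytes-to-bytes) and "rot13" (str-to-str) codecs as examples. It assumes Python 3.4 or later.

```python
import codecs

# A bytes-to-bytes codec: the result is still bytes, not text.
hex_bytes = codecs.encode(b"snake", "hex")
restored = codecs.decode(hex_bytes, "hex")

# A str-to-str codec: the result is still text, not bytes.
rotated = codecs.encode("snake", "rot13")

# str.encode only accepts text encodings, so it rejects these codecs.
try:
    "snake".encode("rot13")
    str_method_rejected = False
except LookupError:
    str_method_rejected = True

print(hex_bytes, restored, rotated, str_method_rejected)
```

Neither codec converts between str and bytes, which is why they are reachable through `codecs.encode` and `codecs.decode` rather than through the `str.encode` and `bytes.decode` methods.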

How do I specify an encoding for converting bytes to a string or vice-versa?

Originally I thought this belonged as part of the first question, defining encodings generally. However, Andreas' arguments have convinced me to separate it. Rather than "specify an encoding", it should really just be phrased in terms of doing the conversion; the fact that this involves an encoding should only surface in the answer, which will cite the first question for background.
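For reference, the conversion itself is brief. This is a minimal sketch in which the sample text and the choice of UTF-8 are assumptions; the appropriate encoding depends on where the bytes come from or where they are going.

```python
text = "sn\u00e4ke oil"            # "snäke oil"

# str -> bytes
data = text.encode("utf-8")

# bytes -> str
round_trip = data.decode("utf-8")

# The constructors accept an encoding too and do the same job.
also_bytes = bytes(text, "utf-8")
also_text = str(data, "utf-8")

print(data, round_trip)
```

Both spellings take the encoding name as a string; the method forms are usually easier to read.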

... for reading and writing files?

Similarly. Also a separate question.
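A sketch of the file case, writing to a temporary directory so that it is self-contained. The file name, the sample text and the choice of UTF-8 are all assumptions for the example.

```python
import os
import tempfile

text = "na\u00efve caf\u00e9\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file name

    # Text mode: pass the encoding explicitly instead of relying on the
    # platform default. newline="" turns off newline translation so the
    # bytes on disk are the same on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8", newline="") as f:
        read_back = f.read()

    # Binary mode: no encoding is involved; the result is bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(repr(read_back), raw)
```

Reading the same file in binary mode shows that the text-mode `open` was doing the encoding and decoding on the caller's behalf.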

[... for other purposes?]

These should also be separate, but I also want to hold off on them entirely until a need is demonstrated. While "reading and writing files" is obvious, I can't foresee all the possible libraries people could ask about, or predict which are most important / relevant to others on average. Once there are multiple "motivated" questions, if none is at canonical quality, I can make a canonical.

  • Historical: in Python 2.x, why can attempts to decode cause UnicodeEncodeError, and vice-versa?
  • Historical / migration: how should I understand the type names bytes, str and unicode in 2.x vs 3.x?
  • Historical: What was basestring in 2.x and why was it needed?
  • Historical / migration: why did 2.x treat those types the way it did, and why does 3.x treat them differently? Why shouldn't I try to emulate the old approaches in new code?

I want to combine these general ideas and then most likely split them back into two questions:

  • one about the differences between 2.x and 3.x handling of these types;

  • and one about how to understand what legacy 2.x code is doing and migrate it to 3.x.

The "Why shouldn't I try to emulate the old approaches in new code?" part doesn't actually need to be called out specially; to the extent that it isn't obvious from the rest of the explanation, it's subjective.

What are UnicodeEncodeError and UnicodeDecodeError? What do they mean; what causes them; and how do I resolve them?

On reflection, I don't think this is actually a separate topic worth addressing. The meanings of the names are straightforward; the only confusing part is when 2.x gives UnicodeEncodeError from a decoding attempt or vice-versa, which is explained in stride when considering the legacy text-handling behaviour. Anything else that needs to be said about these exceptions can similarly be explained in stride in other questions.
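For reference, this is how each exception arises in 3.x, where the names match the operations. The sample strings are made up, and the use of `errors="replace"` is just one of several available handlers, shown as an assumption about what a reader might want.

```python
# Encoding a character that the target encoding cannot represent.
try:
    "caf\u00e9".encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

# Decoding bytes that are not valid in the stated encoding.
try:
    b"caf\xc3\xa9".decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# Both exceptions record where the problem starts.
print(encode_error.encoding, encode_error.start)
print(decode_error.encoding, decode_error.start)

# An error handler substitutes a placeholder instead of raising.
lossy_bytes = "caf\u00e9".encode("ascii", errors="replace")
lossy_text = b"caf\xc3\xa9".decode("ascii", errors="replace")
print(lossy_bytes, repr(lossy_text))
```

The usual fix is to use the encoding the data is actually in rather than to suppress the error, since the replacement handlers lose information.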

By my count, this makes 6 to 8 (probably 8) Q&A pairs, from 16 originally proposed facets (that could be broken down further if one tried). This seems like a satisfactory result.
