I thought about writing a really long post about handling Unicode in Python, but, honestly, you should go watch this video; that’s where most of my points would have come from, anyway. (It’s a great video! It’s funny and helpful and relevant, whether you use Python 2 or 3. I hope I get to go to PyCon and meet Ned in person and thank him for it!)
If you wonder how I ended up watching that video (along with several coworkers): we were doing a lot of metadata parsing as part of our work on the SHARE project. We were building an alpha version of a notification service for research events (paper publications, dataset releases, etc.). As you’d imagine, not all of the names of the contributors and items are in ASCII (“ASCII” roughly means “A-Z, a-z, 0-9, and common punctuation,” 128 characters in all); we also get æ, ê, ī, ø, ü, and sometimes ÿ, so we needed to support Unicode. As an added complication (in my opinion), while we tried to stay fairly compatible with both Python 2 and 3, we were running our code with 2, which treats ordinary strings as bytes and falls back to ASCII whenever it has to convert between bytes and Unicode text.
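To make the problem concrete, here is a minimal sketch of what happens when names like these meet an ASCII-only assumption. It is written for Python 3; the contributor names and the `fits_in_ascii` helper are made up for illustration and are not from the SHARE code.

```python
# Hypothetical contributor names; not taken from real SHARE data.
names = ["Ada Lovelace", "Søren Kierkegaard", "Zoë Ünal"]


def fits_in_ascii(text):
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Only the names made entirely of ASCII characters survive this filter.
ascii_only = [name for name in names if fits_in_ascii(name)]

# UTF-8, by contrast, can encode every name in the list.
utf8_bytes = [name.encode("utf-8") for name in names]
```

Anything that quietly assumes ASCII either raises an error on the second and third names or drops them, which is not acceptable for a service that is supposed to credit contributors.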
A couple of us ran into the issues brought up in that video: cargo-culting a “u” in front of our strings and converting things to Unicode all willy-nilly. The video helped us reach clarity.
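As a rough sketch of the alternative to scattering conversions everywhere: decode bytes to text once where data comes in, work only with text in the middle, and encode once where data goes out. The function names, the UTF-8 default, and the sample bytes below are assumptions for illustration, not the code we actually ran.

```python
def read_name(raw_bytes, encoding="utf-8"):
    """Input boundary: decode bytes to text exactly once, then tidy whitespace."""
    text = raw_bytes.decode(encoding)
    return " ".join(text.split())


def write_name(name, encoding="utf-8"):
    """Output boundary: encode text back to bytes exactly once."""
    return name.encode(encoding)


# Hypothetical raw input, as it might arrive from a metadata feed.
incoming = b"  Bj\xc3\xb6rn   \xc3\x98stergaard \n"

name = read_name(incoming)   # text from here on, no more conversions
outgoing = write_name(name)  # bytes again only at the very end
```

With the conversions pinned to the edges, nothing in the middle needs to guess whether it is holding bytes or text.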
I hope that it will help you reach clarity, too, because Unicode support is important; I’ll go so far as to say Unicode should be the default, never ASCII, because more people write with characters that ASCII can’t represent than with ones it can.