A little bit about Unicode

I thought about writing a really long post about handling Unicode in Python, but, honestly, you should go watch this video; that’s where most of my points would have come from, anyway. (It’s a great video! It’s funny and helpful and relevant, whether you use Python 2 or 3. I hope I get to go to PyCon and meet Ned in person and thank him for it!)

If you wonder how I ended up watching that video, along with several coworkers: we were doing a lot of metadata parsing as part of our work on the SHARE project. We were building an alpha version of a notification service for research events (paper publications, dataset releases, etc.). As you’d imagine, not all of the names of the contributors and items are in ASCII (“ASCII” covers just 128 characters: A-Z, a-z, 0-9, common punctuation, and some control codes); we also get æ, ê, ī, ø, ü, and sometimes ÿ, so we needed to support Unicode. As an added complication (in my opinion), while we tried to stay compatible with both Python 2 and 3, we were running our code with 2, which defaults to ASCII whenever it implicitly converts between byte strings and Unicode strings.

A couple of us ran into the issues brought up in that video: cargo-culting a “u” in front of our strings and converting things to Unicode all willy-nilly. Watching the video helped us reach clarity.
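For a rough idea of what the alternative to willy-nilly conversion can look like, here is a minimal sketch of the "decode early, encode late" habit: turn bytes into text once on the way in, work only with text in the middle, and turn text back into bytes once on the way out. This is not our actual SHARE code. The function names and the sample contributor names are made up for illustration, the sketch assumes the input is UTF-8, and it is written for Python 3.

```python
# Hypothetical sketch (Python 3): keep bytes at the edges, text in the middle.
# None of these names come from the SHARE codebase.

def read_names(raw_bytes, encoding="utf-8"):
    """Decode once, at the input boundary. Assumes the bytes are UTF-8."""
    text = raw_bytes.decode(encoding)
    return [line.strip() for line in text.splitlines() if line.strip()]


def format_notification(names):
    """Work only with text (str) here; no encode/decode calls in the middle."""
    return "New paper by " + ", ".join(names)


def write_out(message, encoding="utf-8"):
    """Encode once, at the output boundary."""
    return message.encode(encoding)


# Made-up sample input, as it might arrive from a file or a socket.
raw = "Søren Ærøe\nZoë Müller\n".encode("utf-8")

names = read_names(raw)                           # bytes -> text
payload = write_out(format_notification(names))   # text -> bytes
```

The point of the shape is that there are exactly two places where an encoding is named, so there are exactly two places to look when something comes out garbled.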

I hope that it will help you reach clarity, too, because Unicode support is important; I’ll go so far as to say Unicode should be the default, never ASCII, because more people write with characters outside the ASCII set than within it.
