Sunday, January 18, 2015

Python 2.x and Unicode

Preface

I have been working [and still have some work to do] on a web-scraping project for the last couple of months. I started this project because I wanted to get more comfortable with Python and some non-standard libraries like Requests and SQLAlchemy.
I chose Python 2.7.6 to scrape websites that use the UTF-8 character encoding. That's how it all began.


The default

The default encoding in Python 2.x is ASCII (since Python 2.4). Therefore you will see plenty of UnicodeDecodeError exceptions when trying to process non-ASCII strings.
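To see where the error comes from, here is a minimal sketch (the variable names and the sample word are mine, chosen for illustration). Python 2 attempts this ASCII decode implicitly whenever a byte string meets a unicode string; the explicit call below fails in the same way and also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for the Slovak word "kon" with diacritics; illustrative sample data.
raw = b'k\xc3\xb4\xc5\x88'

try:
    # Roughly what Python 2 attempts behind the scenes for raw + u''.
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode('utf-8')
```

The failure is not caused by the data being "bad"; the bytes are simply interpreted with the wrong codec.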
I recommend going through these blog posts to learn more about encoding and related topics:
So if you have read at least these two articles, you know that Unicode is a standard that maps characters to code points, and that an encoding, for example UTF-8, defines how those code points are represented as bytes.
Another great resource on this topic is the presentation Developing Unicode-aware Applications in Python by Marc-André Lemburg.
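The difference between a code point and its encoded bytes is easier to see in code. This is a small sketch of my own, not taken from the articles above: one character has a single code point, but different encodings turn it into different byte sequences.

```python
# -*- coding: utf-8 -*-
z_caron = u'\u017e'  # LATIN SMALL LETTER Z WITH CARON

# The code point is a property of the character, independent of any encoding.
code_point = ord(z_caron)

# The same character serialized by two different encodings.
as_utf8 = z_caron.encode('utf-8')        # two bytes: 0xC5 0xBE
as_utf16 = z_caron.encode('utf-16-be')   # two different bytes: 0x01 0x7E

# Decoding with the matching codec always gets the original character back.
round_trip_ok = (as_utf8.decode('utf-8') == z_caron and
                 as_utf16.decode('utf-16-be') == z_caron)
```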

We want UTF-8

How do you convince Python to work with UTF-8 without having to use those low-level encode() and decode() methods? I struggled to get this working for quite a while, even though the recipe is, let's say, simple:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Using this solution I can work with UTF-8 encoded strings and have string literals containing non-ASCII characters [without the 'u' prefix] in the code base without a problem. But be careful with unicode_literals: it turns every unprefixed string literal in the module from str into unicode, so it should be used with caution (for more details refer to Should I import unicode_literals?). When unicode_literals is not imported, unicode strings must be prefixed with 'u' or converted with the unicode() function.
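For comparison, this is roughly what the two options look like without unicode_literals. It is a sketch with made-up variable names; I use decode('utf-8') instead of unicode() so that the snippet also runs on Python 3, where unicode() no longer exists.

```python
# -*- coding: utf-8 -*-
# Option 1: a unicode literal, marked with the 'u' prefix.
prefixed = u'\u00f3dy'

# Option 2: start from UTF-8 bytes and convert them explicitly.
raw = b'\xc3\xb3dy'
converted = raw.decode('utf-8')  # on Python 2, unicode(raw, 'utf-8') does the same

# The text type is unicode on Python 2 and str on Python 3.
text_type = type(u'')
```

Both routes end up with the same three-character text object; only the byte string is four bytes long.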

Remove diacritics

Sometimes it's useful to represent a UTF-8 encoded string in ASCII. This can be handy when you want to compare user input and there is a chance that the user is not using diacritics correctly. This snippet helps me accomplish the task:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from unicodedata import normalize
 
 
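def string_to_ascii(string):
    """Convert a UTF-8 encoded string to ASCII by stripping diacritics

    :argument string: UTF-8 encoded byte string or unicode string
    :type string: str, unicode or None

    :returns: ASCII-only byte string, or None if the input is None
    :rtype: str or None

    """
    if string is None:
        return None

    if isinstance(string, str):
        # Implicit decode; this handles UTF-8 only because of the
        # sys.setdefaultencoding('utf-8') call above.
        string = unicode(string)

    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the combining marks.
    return normalize('NFKD', string).encode('ASCII', 'ignore')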

And here is an example of how it works [sorry for the crazy sentence]:
>>> str1 = 'Príliš žltučký kôň úpel diabolské ódy'
>>> print str1
Príliš žltučký kôň úpel diabolské ódy
>>> print repr(str1)
'Pr\xc3\xadli\xc5\xa1 \xc5\xbeltu\xc4\x8dk\xc3\xbd k\xc3\xb4\xc5\x88 \xc3\xbapel diabolsk\xc3\xa9 \xc3\xb3dy'

>>> str2 = string_to_ascii(str1)
>>> print str2
Prilis zltucky kon upel diabolske ody
>>> print repr(str2)
'Prilis zltucky kon upel diabolske ody'
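What happens inside normalize() and encode() can be shown on a single character. The snippet below is my own illustration and also points out a limitation: letters that have no decomposition in Unicode, such as the Polish "L with stroke", are not transliterated but silently dropped.

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize

# U+0161 (s with caron) decomposes into "s" plus COMBINING CARON (U+030C).
decomposed = normalize('NFKD', u'\u0161')

# Encoding with 'ignore' drops the combining mark and keeps the base letter.
stripped = decomposed.encode('ascii', 'ignore')

# Caveat: U+0141 (L with stroke) has no decomposition, so the whole letter
# disappears. The sample word is the Polish city name Lodz with diacritics.
lodz = normalize('NFKD', u'\u0141\u00f3d\u017a').encode('ascii', 'ignore')
```

So the approach works well for Slovak or Czech text, but it should not be treated as a general transliteration tool.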

Caution!

As I learned from Martijn Pieters on Stack Overflow, using sys.setdefaultencoding() is a dirty, nasty hack and has always been discouraged (something I was totally unaware of). It changes the implicit str/unicode conversions for the whole process, including third-party libraries that were written with the ASCII default in mind. Therefore, try to avoid it as much as possible; read more here: Why we need sys.setdefaultencoding("utf-8") in a py script?
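One way to avoid the hack is to decode explicitly at the point where bytes enter the program. The function below is a hypothetical variant of string_to_ascii() written under that assumption; the name string_to_ascii_explicit and the encoding parameter are mine, and the code is meant as a sketch that runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize


def string_to_ascii_explicit(value, encoding='utf-8'):
    """Hypothetical variant of string_to_ascii() that decodes explicitly."""
    if value is None:
        return None

    # bytes is an alias of str on Python 2, so this catches byte strings
    # on both major versions without touching the default encoding.
    if isinstance(value, bytes):
        value = value.decode(encoding)

    return normalize('NFKD', value).encode('ascii', 'ignore')


# UTF-8 bytes such as those returned by a scraped page (sample data).
raw = b'k\xc3\xb4\xc5\x88 \xc3\xbapel'
result = string_to_ascii_explicit(raw)
```

Because the encoding is an argument, a page served in a different encoding can be handled by passing its name instead of changing a global setting.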