Preface
I have been working [and still have some work to do] on a web-scraping project for the last couple of months. I started this project because I wanted to get more comfortable with Python and some nonstandard libraries like Requests and SQLAlchemy.
I chose Python 2.7.6 to scrape websites that used the UTF-8 character set. That's how it all began.
The default
The default encoding in Python 2.x is ASCII (since Python 2.4). Therefore you will see plenty of UnicodeDecodeError exceptions when trying to process non-ASCII strings. I recommend going through these blog posts to learn more about encoding and related topics:
- Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Akshar Raaj: Understanding Python unicode, str, UnicodeEncodeError and UnicodeDecodeError
Another great resource on this topic is the presentation Developing Unicode-aware Applications in Python from Marc-André Lemburg.
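To make the failure mentioned above concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded with the ASCII codec. The helper name decodes_as and the sample bytes are my own, chosen for illustration only; the snippet is written so that it should run on Python 2.7 as well as Python 3.

```python
# -*- coding: utf-8 -*-
import sys

# UTF-8 encoded bytes of the word "kôň" (illustrative sample data)
raw = b'k\xc3\xb4\xc5\x88'


def decodes_as(data, encoding):
    """Return True if the byte string can be decoded with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        # The same exception surfaces when Python 2 decodes implicitly
        return False
    return True


# 'ascii' on an unmodified Python 2.x, 'utf-8' on Python 3
print(sys.getdefaultencoding())
print(decodes_as(raw, 'ascii'))   # False
print(decodes_as(raw, 'utf-8'))   # True
```

In Python 2 the same error appears without an explicit decode() call whenever a byte string with non-ASCII content is mixed with a unicode string, because the interpreter decodes it implicitly using the default encoding.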
We want UTF-8
How do you convince Python to work with UTF-8 without having to use the low-level encode() and decode() methods? I struggled to get this working for quite a while, even though the recipe is, let's say, simple:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import sys

    # reload() restores sys.setdefaultencoding(), which site.py removes at startup
    reload(sys)
    sys.setdefaultencoding('utf-8')
Using this solution I can work with UTF-8 encoded strings and keep strings containing non-ASCII characters [without the 'u' prefix] in the code base without a problem. Treat the unicode_literals import with caution, though (for more details refer to Should I import unicode_literals?). When unicode_literals is not imported, unicode strings must be prefixed with 'u' or converted with the unicode() function.
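As a rough illustration of what the 'u' prefix changes, the sketch below compares a unicode literal with its UTF-8 encoded bytes. The variable names are mine and the explicit encode()/decode() calls are there only to show the two representations side by side; this is not part of the recipe above. It should behave the same on Python 2.7 and Python 3.3+.

```python
# -*- coding: utf-8 -*-

# Text (unicode) literal: three characters
text = u'kôň'

# The same word as a UTF-8 byte string: 'ô' and 'ň' take two bytes each
raw = text.encode('utf-8')

print(len(text))  # 3
print(len(raw))   # 5

# Decoding with the right codec gives the original text back
print(raw.decode('utf-8') == text)  # True
```

The difference in length is why mixing the two types without knowing the encoding leads to errors or corrupted text.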
Remove diacritics
Sometimes it's useful to represent a UTF-8 encoded string in ASCII. This can be handy when you want to compare user input and there is a chance that the user is not using diacritics correctly. This snippet helps me accomplish the task:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    from unicodedata import normalize


    def string_to_ascii(string):
        """Convert UTF-8 encoded string to ASCII

        :argument string: utf-8 string
        :type string: str or None

        :returns str or None
        """
        if string is None:
            return
        # Byte strings are decoded using the default encoding set above
        if isinstance(string, str):
            string = unicode(string)
        # NFKD splits accented letters into base letter + combining mark;
        # encoding to ASCII with 'ignore' then drops the marks
        return normalize('NFKD', string).encode('ASCII', 'ignore')
And here is an example of how it works [sorry for the crazy sentence]:

    str1 = 'Príliš žltučký kôň úpel diabolské ódy'
    print str1
    # prints: Príliš žltučký kôň úpel diabolské ódy
    print repr(str1)
    # prints: 'Pr\xc3\xadli\xc5\xa1 \xc5\xbeltu\xc4\x8dk\xc3\xbd k\xc3\xb4\xc5\x88 \xc3\xbapel diabolsk\xc3\xa9 \xc3\xb3dy'
    str2 = string_to_ascii(str1)
    print str2
    # prints: Prilis zltucky kon upel diabolske ody
    print repr(str2)
    # prints: 'Prilis zltucky kon upel diabolske ody'
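The user-input comparison mentioned earlier could look roughly like the sketch below. The helpers strip_diacritics and loosely_equal are hypothetical names of mine, not part of the original snippet, and the code decodes byte strings explicitly as UTF-8 instead of relying on the default encoding, so it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize


def strip_diacritics(text):
    """Hypothetical helper: return text with accents removed, ASCII only."""
    # Assumption: byte strings passed in are UTF-8 encoded
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # NFKD separates base letters from combining marks; 'ignore' drops
    # everything that has no ASCII representation
    return normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


def loosely_equal(left, right):
    """Hypothetical helper: compare two strings ignoring accents and case."""
    return strip_diacritics(left).lower() == strip_diacritics(right).lower()


print(strip_diacritics(u'Príliš žltučký kôň'))  # Prilis zltucky kon
print(loosely_equal(u'kôň', u'Kon'))            # True
print(loosely_equal(u'kôň', u'koza'))           # False
```

Note that characters without an ASCII base letter are dropped entirely by the 'ignore' error handler, so this kind of comparison is only a loose match and not a substitute for proper locale-aware collation.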