How can Python not have HTML unescaping?

I’m in the process of trying to migrate a MovableType blog to WordPress. The process itself isn’t that bad, but I’ve found a whole slew of problems with preserving links. In the end it was decided to preserve internal link integrity but not worry about other people’s links. OK.

Due to quirks in how both MT and WP handle post IDs, I had to write a bunch of scripts that extract post IDs and titles from the old MT blog and from a test installation of the WP blog with the imported entries, then compare the titles to figure out the new URL of every post that was internally linked to.
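The title-matching step can be sketched roughly as follows. This is only an illustration: the function name, the shape of the two dictionaries, and the sample data are all hypothetical, and the real scripts may have worked quite differently.

```python
def map_old_ids_to_new_urls(mt_titles, wp_urls):
    """Match old MT posts to new WP URLs by comparing titles.

    mt_titles: dict of old MT post ID -> title (assumed input shape)
    wp_urls:   dict of WP post title -> new URL (assumed input shape)
    Returns (mapping, unmatched): old ID -> new URL, plus IDs with no match.
    """
    # Normalise titles so trivial whitespace/case differences still match.
    normalised = {title.strip().lower(): url for title, url in wp_urls.items()}
    mapping, unmatched = {}, []
    for post_id, title in mt_titles.items():
        url = normalised.get(title.strip().lower())
        if url is None:
            unmatched.append(post_id)
        else:
            mapping[post_id] = url
    return mapping, unmatched


# Made-up sample data, purely for illustration.
mt = {101: "Hello World ", 102: "A post that went missing"}
wp = {"hello world": "/2007/01/hello-world/"}
mapping, unmatched = map_old_ids_to_new_urls(mt, wp)
print(mapping, unmatched)
```

Titles that differ only in how their HTML entities are escaped will not match under this scheme, which is where the unescaping problem comes in.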

In the process of doing that I found out that Python doesn’t have sane handling of HTML unescaping!

“Huh? What?” you may ask.

Python has nice methods for escaping and unescaping URLs in urllib. But guess what? There’s no single function to do the same for HTML code. Let’s see. Anything in htmllib? Nah, that’d be too easy. Anything in urllib, or urllib2, maybe? Just as a favour? Nah…
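To make the URL-versus-HTML distinction concrete, here is a small sketch. It assumes Python 3, where the function lives at urllib.parse.unquote; in the Python 2 of this post's era the same function was urllib.unquote. The sample strings are made up.

```python
from urllib.parse import unquote

# Percent-escapes in a URL are decoded as expected.
print(unquote("my%20post%3Fid%3D42"))

# HTML entities are not URL escapes, so they pass through untouched.
print(unquote("Tom &amp; Jerry"))
```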

I’m still shocked that there’s no function for this. In the end I used xml.sax.saxutils.unescape(), which handles only a few of the basic escapes (&amp;amp;, &amp;lt; and &amp;gt;) and, who knows why, none of the other HTML entities. So I had to manually add the entities that I encountered in the titles.
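A minimal sketch of that workaround looks something like this. The entity table and the helper name are hypothetical, since the post does not say which entities actually turned up in the titles; the second argument to xml.sax.saxutils.unescape() is a dictionary of extra replacements applied on top of the built-in ones.

```python
from xml.sax.saxutils import unescape

# Hypothetical extra entities; extend this with whatever shows up in the data.
EXTRA_ENTITIES = {
    "&quot;": '"',
    "&apos;": "'",
    "&nbsp;": "\u00a0",   # non-breaking space
    "&rsquo;": "\u2019",  # right single quotation mark
}


def unescape_title(title):
    """Unescape &lt;, &gt; and &amp; plus the extra entities listed above."""
    return unescape(title, EXTRA_ENTITIES)


print(unescape_title("Tom &amp; Jerry&rsquo;s &quot;big&quot; day"))
```

Any entity missing from the table is left in the string as-is, so the table has to grow as new entities are encountered.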

Wow…

6 Responses to “How can Python not have HTML unescaping?”

  1. pedro says:

    Yep.. quite incredible.
    So much talk about Python, and I already feel I have wasted time trying to move from PHP.

  2. James Casbon says:

    You can use urllib (in standard library) or mx.URL (faster).

    e.g.: urllib.unquote(url)

  3. Andrew says:

    James: those functions unescape URLs, not HTML.

  4. No Name says:

    try

    import htmllib

    def unescape(s):
        p = htmllib.HTMLParser(None)
        p.save_bgn()
        p.feed(s)
        return p.save_end()

    from http://wiki.python.org/moin/EscapingHtml

    It worked for me.

  5. This is great! Thanks for your post. I am new at python and this will help a lot.

  6. Kai says:

    myescapedstring.decode("string-escape")
