How can Python not have HTML unescaping?

I’m in the process of trying to migrate a MovableType blog to WordPress. The process itself isn’t that bad, but I’ve found a whole slew of problems with preserving links. In the end it was decided to preserve internal link integrity but not worry about other people’s links. OK.

Due to quirks in how both MT and WP handle post IDs, I had to write a bunch of scripts that would extract post IDs and titles from the old MT blog and from a test installation of the WP blog with the imported entries, then compare the titles and figure out the new URL of every post that was internally linked to.
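A minimal sketch of that title-matching step, assuming the posts have already been pulled out of both blogs as simple pairs. The function name, the data shapes, and the whitespace/case normalization are all illustrative assumptions, not the scripts actually used for the migration.

```python
def build_url_map(mt_posts, wp_posts):
    """Map old MT post IDs to new WP URLs by matching post titles.

    mt_posts: iterable of (mt_id, title) pairs   (hypothetical shape)
    wp_posts: iterable of (wp_url, title) pairs  (hypothetical shape)
    Returns (url_map, unmatched), where unmatched lists the MT posts
    whose title was not found on the WP side.
    """
    def normalize(title):
        # Collapse runs of whitespace and ignore case so trivial
        # differences between the two exports do not break the match.
        return " ".join(title.split()).lower()

    wp_by_title = {normalize(title): url for url, title in wp_posts}

    url_map = {}
    unmatched = []
    for mt_id, title in mt_posts:
        url = wp_by_title.get(normalize(title))
        if url is None:
            unmatched.append((mt_id, title))
        else:
            url_map[mt_id] = url
    return url_map, unmatched
```

Titles that differ only in how their HTML entities are escaped will land in `unmatched`, which is where the unescaping problem comes in.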

In the process of doing that I found out that Python doesn’t have sane handling of HTML unescaping!

Huh? What? You may ask.

Python has nice methods for escaping and unescaping URLs in urllib. But guess what? There’s no single function to do the same for HTML code. Let’s see. Anything in htmllib? Nah, that’d be too easy. Anything in urllib, or urllib2, maybe? Just as a favour? Nah…

I’m still shocked that there’s no function for this. In the end I used xml.sax.saxutils.unescape(), which does some of the unescaping: by default it handles only &amp;amp;, &amp;lt; and &amp;gt;, not the full set of HTML entities. So I had to manually add the other entities that I encountered in the titles.
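A rough sketch of that workaround, assuming the extra entities are passed through the optional `entities` argument of `xml.sax.saxutils.unescape()`. The entities listed here are examples only (the post does not say which ones were actually added), and `unescape_title` is a hypothetical helper name.

```python
from xml.sax.saxutils import unescape

# Example entities only: extend this with whatever shows up in the titles.
EXTRA_ENTITIES = {
    "&quot;": '"',
    "&apos;": "'",
    "&nbsp;": "\u00a0",   # non-breaking space
    "&rsquo;": "\u2019",  # right single quotation mark
}


def unescape_title(title):
    # unescape() always handles &lt;, &gt; and &amp; itself; the dict
    # supplies the additional replacements to apply.
    return unescape(title, EXTRA_ENTITIES)


print(unescape_title("Tom &amp; Jerry&rsquo;s &quot;Best&quot;"))
# prints: Tom & Jerry’s "Best"
```

Any entity missing from the dictionary is left in the string untouched, so this only works as long as the list is kept in step with the data.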

Wow…

4 Comments

  1. pedro wrote:

    Yep.. quite incredible.
    So much talk about Python, and I already feel I have wasted time trying to move from PHP.

    Thursday, January 4, 2007 at 4:19 pm | Permalink
  2. You can use urllib (in standard library) or mx.URL (faster).

    eg: urllib.unquote( url )

    Thursday, August 23, 2007 at 4:31 am | Permalink
  3. Andrew wrote:

    James: those functions unescape URLs, not HTML.

    Tuesday, January 22, 2008 at 5:39 pm | Permalink
  4. No Name wrote:

    try

    import htmllib

    def unescape(s):
        p = htmllib.HTMLParser(None)
        p.save_bgn()
        p.feed(s)
        return p.save_end()

    from http://wiki.python.org/moin/EscapingHtml

    It worked for me.

    Monday, February 25, 2008 at 2:58 am | Permalink
