I’m in the process of migrating a Movable Type blog to WordPress. The migration itself isn’t that bad, but I’ve run into a whole slew of problems with preserving links. In the end the decision was to keep internal links intact and not worry about links from other people’s sites. OK.
Because of weirdnesses in how both MT and WP handle post IDs, I had to write a bunch of scripts. They extract post IDs and titles from the old MT blog and from a test installation of the WP blog with the imported entries, compare the titles, and work out the new URL of every post that was linked to internally.
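The title comparison step can be sketched roughly like this. It is only an illustration, not the scripts actually used: the function names, the dictionary inputs (post ID mapped to title) and the normalization rule are all assumptions, and turning the matched IDs into URLs is left out because that depends on each blog's permalink settings.

```python
def normalize(title):
    # Hypothetical normalization: collapse whitespace and ignore case.
    return " ".join(title.split()).lower()


def match_posts(mt_posts, wp_posts):
    """Map old MT post IDs to new WP post IDs by comparing titles.

    Both arguments are assumed to be dicts of {post_id: title}.
    Returns (mapping, unmatched), where unmatched lists the MT IDs
    with no title match or with an ambiguous one.
    """
    wp_by_title = {}
    for wp_id, title in wp_posts.items():
        wp_by_title.setdefault(normalize(title), []).append(wp_id)

    mapping = {}
    unmatched = []
    for mt_id, title in mt_posts.items():
        candidates = wp_by_title.get(normalize(title), [])
        if len(candidates) == 1:
            mapping[mt_id] = candidates[0]
        else:
            # Zero or several candidates: leave for manual review.
            unmatched.append(mt_id)
    return mapping, unmatched
```

Titles only compare equal if both sides are unescaped the same way first, which is where the HTML entity problem below comes in.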
In the process of doing that I found out that Python doesn’t have sane handling of HTML unescaping!
Huh? What? You may ask.
Python has nice functions for escaping and unescaping URLs in urllib. But guess what? There’s no single function that does the same for HTML. Let’s see. Anything in htmllib? Nah, that’d be too easy. Anything in urllib, or urllib2, maybe? Just as a favour? Nah…
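To make the distinction concrete, here is a small sketch of URL unquoting applied to both kinds of escaping. It assumes Python 3, where the function lives at urllib.parse.unquote rather than urllib.unquote, and the sample strings are made up.

```python
from urllib.parse import unquote  # Python 3 location of urllib.unquote

url_escaped = "caf%C3%A9%20%26%20bar"    # percent-encoded, as in a URL
html_escaped = "caf&eacute; &amp; bar"   # HTML entities, as in a post title

decoded_url = unquote(url_escaped)    # percent escapes are decoded
decoded_html = unquote(html_escaped)  # entities pass through untouched

print(decoded_url)
print(decoded_html)
```

The second call returns its input unchanged, because URL unquoting only looks at percent escapes and knows nothing about HTML entities.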
I’m still shocked that there’s no function for this. In the end I used xml.sax.saxutils.unescape(), which by default only handles the basic XML entities (&amp;amp;, &amp;lt; and &amp;gt;) and none of the HTML-specific ones, so I had to manually add the entities that I encountered in the titles.
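A minimal sketch of that workaround follows. The entity table is hypothetical: it lists a few common entities for illustration, not the ones actually found in the titles, and unescape_title is an invented helper name.

```python
from xml.sax.saxutils import unescape

# Hypothetical table of extra entities; extend it with whatever
# actually shows up in the titles being compared.
EXTRA_ENTITIES = {
    "&quot;": '"',
    "&apos;": "'",
    "&nbsp;": "\u00a0",   # non-breaking space
    "&rsquo;": "\u2019",  # right single quotation mark
    "&ldquo;": "\u201c",  # left double quotation mark
    "&rdquo;": "\u201d",  # right double quotation mark
}


def unescape_title(text):
    # unescape() handles &lt;, &gt; and &amp; on its own; the second
    # argument supplies additional replacements to apply as well.
    return unescape(text, EXTRA_ENTITIES)


print(unescape_title("Tom &amp; Jerry&rsquo;s &lt;b&gt;"))
```

Any entity missing from the table is left as-is in the output, so unmatched titles are worth checking for stray ampersands. On Python 3.4 and later, html.unescape() in the standard library covers the full set of named and numeric entities, so the manual table is only needed on older interpreters.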
Wow…
4 Comments
Yep.. quite incredible.
So much talk about Python, and I already feel I have wasted time trying to move from PHP.
You can use urllib (in standard library) or mx.URL (faster).
eg: urllib.unquote( url )
James: those functions unescape URLs, not HTML.
try
import htmllib

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()
from http://wiki.python.org/moin/EscapingHtml
It worked for me.