In which I avoid the inverse unicode sandwich

Jun 22, 2012 | categories: python, toscawidgets, testing, turbogears

Problem #1 - I need to test tons of HTML output for correctness (because I maintain toscawidgets2). That output varies slightly from one template engine to the next, because tw2 supports five different templating languages (mako, genshi, jinja2, kajiki, and chameleon). Comparing the rendered strings with double-equals (==) just won't do it.

Solution #1 - We used strainer. It works!

Problem #2 - Imagine porting strainer to Python 3. Yes, that's right: the encoding is sniffed by hand and then used to encode regular expressions, which are in turn applied to parse XML. Think "inverse unicode sandwich with a side of Cthulhu."
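
For contrast, the regular unicode sandwich goes: decode bytes at the input edge, work only with text in the middle, encode again at the output edge. Here is a minimal sketch of that shape, assuming UTF-8 input. The helper name normalize_markup is made up for illustration and is not taken from strainer or sieve.

```python
import re

# Compile the pattern once, as text (never as bytes).
_WHITESPACE = re.compile(r"\s+")


def normalize_markup(raw, encoding="utf-8"):
    """Hypothetical helper: collapse whitespace runs in a byte string of markup."""
    text = raw.decode(encoding)                # bottom slice: bytes -> text
    text = _WHITESPACE.sub(" ", text).strip()  # filling: text-only processing
    return text.encode(encoding)               # top slice: text -> bytes
```

The regular expression never sees bytes, so nothing has to be encoded to match a sniffed charset.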

Solution #2 - I wrote sieve: a baby module, the child of one corner of FormEncode and another corner of strainer. It works on Python 2.6, 2.7, and 3.2. If you like, you may use it:

>>> from sieve.operators import eq_xml, in_xml
>>> a = "<foo><bar>Value</bar></foo>"
>>> b = """
... <foo>
...     <bar>
...         Value
...     </bar>
... </foo>
... """
>>> eq_xml(a, b)
>>> c = "<html><body><foo><bar>Value</bar></foo></body></html>"
>>> in_xml(a, c)  # 'needle' in a 'haystack'
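
To illustrate the general idea of comparing markup by structure rather than by raw string, here is a rough sketch using only the standard library's xml.etree.ElementTree. It is not sieve's implementation. The names xml_equal and xml_contains are hypothetical stand-ins, and the sketch assumes the input is well-formed XML and that leading and trailing whitespace in text nodes is insignificant.

```python
import xml.etree.ElementTree as ET


def _clean(text):
    """Treat None, empty, and whitespace-padded text as equivalent."""
    return (text or "").strip()


def elements_equal(x, y):
    """Recursively compare tag, attributes, text, and children."""
    if x.tag != y.tag or x.attrib != y.attrib:
        return False
    if _clean(x.text) != _clean(y.text):
        return False
    if len(x) != len(y):
        return False
    # Tails are compared for children only, so text that follows the
    # top-level element in a larger document is ignored.
    return all(
        _clean(cx.tail) == _clean(cy.tail) and elements_equal(cx, cy)
        for cx, cy in zip(x, y)
    )


def xml_equal(a, b):
    """Hypothetical stand-in for the idea behind eq_xml."""
    return elements_equal(ET.fromstring(a.strip()), ET.fromstring(b.strip()))


def xml_contains(needle, haystack):
    """Hypothetical stand-in for the idea behind in_xml: match any subtree."""
    target = ET.fromstring(needle.strip())
    root = ET.fromstring(haystack.strip())
    return any(elements_equal(target, node) for node in root.iter())
```

Attribute order is ignored for free because attrib is a dict. Real template output that is HTML but not well-formed XML would need a more forgiving parser than this sketch uses.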

p.s. -- I looked into xmldiff. Awesome!

