Disable escaping of `href` attribute #152

th0r · 2018-09-07T14:48:36Z

Problem:

I use Rails and want to sanitize HTML that will be used as input for a client-side template engine.
For example, the following HTML:

<a href="{{ foo }}" other-valid-attr="{{ bar }}" other-invalid-attr="{{ baz }}">Bar</a>

should be sanitized to this:

<a href="{{ foo }}" other-valid-attr="{{ bar }}">Bar</a>

However, the value of the href attribute is being escaped, and the HTML becomes this:

<a href="%7B%7B%20foo%20%7D%7D" other-valid-attr="{{ bar }}">Bar</a>

Is there a way to disable encoding of the href attribute?

Reproduction:

f = Loofah.fragment('<a href="{{ foo }}" other-attr="{{ bar }}"></a>')
f.scrub!(:nofollow).to_s
=> "<a href=\"%7B%7B%20foo%20%7D%7D\" other-attr=\"{{ bar }}\" rel=\"nofollow\"></a>"

Note that the other-attr value is left untouched.

flavorjones · 2018-10-28T18:09:15Z

Hi @th0r, thanks for asking this question, and apologies for my slow reply.

What you're seeing is behavior that's built into libxml2. The function htmlAttrDumpOutput ensures that href (and a few other attributes) are escaped when emitted as HTML. (Check out https://gitlab.gnome.org/GNOME/libxml2/blob/v2.9.2/HTMLtree.c#L688 for specifics if you're interested.)

I've given some thought to what it would take for Loofah to support templating, and am open to exploring how to let people opt in to it. However, we'd have to be very cautious about how we do it to avoid allowing embedded quotes to be rendered unescaped.

I'm curious how other people are handling templating given this behavior. I don't write much Rails anymore, so I'd appreciate it if anyone could enlighten me.

flavorjones · 2018-12-29T06:12:48Z

I'll note that #160 asks a similar question about img src attributes; my explanation above of the underlying mechanism applies to that attribute as well.

ndbroadbent · 2018-12-29T06:51:13Z

Hi @flavorjones, thanks for the reply!

I was trying to sanitize the user's HTML templates before I processed them with Liquid. But now I've realized that I can just change the order of operations. Instead of user template => Loofah => Liquid => HTML, I can just do user template => Liquid => Loofah => HTML. Not sure why I didn't think of that before! I guess it's good to have a break and come back to something later with a fresh perspective.

I would suggest doing the same thing for the client-side template engine. In that case, you can't run Loofah in the browser, but DOMPurify is very similar (I'm also using that in my frontend.) So instead of user template => Loofah => render sanitized template in the browser => HTML, you would do: user template => render template in the browser => sanitize output with DOMPurify => HTML.

I think this might be a better solution, and I wouldn't want to try to hack around libxml2 (especially because it's a native extension with C code.)
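
To make the reordering concrete, here is a minimal sketch of the "render first, sanitize last" pipeline. The names render_template and render_then_sanitize are hypothetical, and render_template is only a toy stand-in for Liquid that substitutes {{ name }} placeholders. The sanitizer is passed in as a callable so that the sketch does not assume any particular Loofah configuration.

```ruby
# Hypothetical stand-in for a template engine such as Liquid:
# replaces "{{ name }}" with vars["name"] (empty string if missing).
def render_template(template, vars)
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) { vars.fetch(Regexp.last_match(1), "") }
end

# Renders the user's template first, then hands the resulting HTML to the
# sanitizer, so the sanitizer never sees raw "{{ ... }}" placeholders.
def render_then_sanitize(template, vars, sanitizer:)
  sanitizer.call(render_template(template, vars))
end
```

In a real application the sanitizer might be something like `->(html) { Loofah.fragment(html).scrub!(:prune).to_s }`. Because the placeholders are already gone by the time the sanitizer runs, the href value it sees is an ordinary URL.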

flavorjones · 2019-09-28T17:23:11Z

@ndbroadbent Thanks for closing the loop here. Your approach on templates makes sense.

I'm going to close this, since it doesn't seem like there's anything actionable for Loofah at this point. Happy to continue the conversation if anyone wants to, though.

bbugh · 2020-04-20T22:59:35Z

Hi! 👋 Has anything like this come up again? We're offering Liquid for some users' custom templating, and need to store and retrieve raw (but sanitized) Liquid in the database. Loofah is sanitizing everything perfectly, except that our a[href] and img[src] use cases are getting escaped!

One not-so-great solution I've come up with is to use Loofah as expected, then scan the result for template tags that have been encoded and decode them. This seems error-prone, though. @flavorjones did you have a better idea of how this could work for Loofah, based on your previous thinking?

Or, does libxml2 only do this to a[href] and img[src] because they're URLs? If libxml2 is doing this in a limited number of cases, maybe our code can special case just those situations.

Thank you!
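
The "decode after sanitizing" idea described in this comment could be sketched roughly as follows. This is an illustration only, not anything Loofah provides: the names ENCODED_PLACEHOLDER, percent_decode, and restore_placeholders are hypothetical, and the sketch assumes only {{ ... }} output placeholders need restoring (Liquid {% ... %} tags are not handled).

```ruby
# Matches a percent-encoded "{{ ... }}" placeholder,
# e.g. "%7B%7B%20foo%20%7D%7D" (hypothetical helper pattern).
ENCODED_PLACEHOLDER = /%7B%7B(?:(?!%7D%7D).)*%7D%7D/i

# Decodes every %XX escape in the given text back to its raw byte.
def percent_decode(text)
  text.b.gsub(/%(\h\h)/) { Regexp.last_match(1).hex.chr }.force_encoding(Encoding::UTF_8)
end

# Restores encoded placeholders in already-sanitized HTML,
# leaving all other percent-escapes untouched.
def restore_placeholders(html)
  html.gsub(ENCODED_PLACEHOLDER) { |encoded| percent_decode(encoded) }
end
```

As the comment says, this is error-prone: a URL that legitimately contained an already-encoded %7B%7B...%7D%7D sequence before sanitizing would be decoded as well, so the approach only makes sense when that cannot occur in the input.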

flavorjones · 2020-04-21T13:52:10Z

Hi @bbugh,

I linked to the underlying libxml2 code in a comment above, here it is again: https://gitlab.gnome.org/GNOME/libxml2/blob/v2.9.2/HTMLtree.c#L688

The important bit is:

	    if ((cur->ns == NULL) && (cur->parent != NULL) &&
		(cur->parent->ns == NULL) &&
		((!xmlStrcasecmp(cur->name, BAD_CAST "href")) ||
	         (!xmlStrcasecmp(cur->name, BAD_CAST "action")) ||
		 (!xmlStrcasecmp(cur->name, BAD_CAST "src")) ||
		 ((!xmlStrcasecmp(cur->name, BAD_CAST "name")) &&
		  (!xmlStrcasecmp(cur->parent->name, BAD_CAST "a"))))) {

which you can interpret as detecting:

any href, action, or src attribute
name attributes, but only in an a tag
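
For anyone who wants to special-case these attributes in their own code, the condition above can be restated as a small Ruby predicate. This is a sketch of the interpretation, not part of Loofah or Nokogiri: the names URI_ATTRIBUTES and uri_escaped_attribute? are hypothetical, and the namespace checks from the C code are left out on the assumption that ordinary HTML elements and attributes have no namespace.

```ruby
# Attribute names the quoted condition matches on any element.
URI_ATTRIBUTES = %w[href action src].freeze

# Returns true when the quoted libxml2 condition would match this
# element/attribute pair. Comparison is case-insensitive, mirroring
# xmlStrcasecmp. Namespace checks are intentionally omitted.
def uri_escaped_attribute?(element_name, attribute_name)
  attr = attribute_name.downcase
  return true if URI_ATTRIBUTES.include?(attr)

  attr == "name" && element_name.downcase == "a"
end
```

With this, `uri_escaped_attribute?("img", "src")` and `uri_escaped_attribute?("a", "name")` are true, while `uri_escaped_attribute?("input", "name")` is false.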

bbugh · 2020-04-21T15:18:47Z

Thank you @flavorjones for the interpretation! Sorry I didn't notice that link earlier. I will dig in and try to remember to post the solution here when I get one.

timfjord · 2023-03-27T14:30:02Z

I also faced this issue, and I came up with a workaround similar to the one in #240 (comment):

value
  .gsub(/(<(?:a|img)[^>]+)(href|src)(=[^>]*>)/, '\1protected-attribute-\2\3')
  .then { |v| SANITISERS.inject(Loofah.fragment(v)) { |n, scrubber| n.scrub!(scrubber) } }
  .to_s
  .gsub(/protected-attribute-(href|src)/, '\1')

The regexp could be improved, because currently it also handles <a src=""> and <img href="">.
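
One way to tighten the regexp is to pair each tag with its own attribute, so that only a[href] and img[src] are renamed. The following is a sketch under that assumption. The method names protect_url_attributes and restore_url_attributes are hypothetical, and the protected-attribute- prefix is taken from the snippet above. The scrubbing step is omitted and would run between the two calls.

```ruby
# Renames href on <a> and src on <img> so the serializer does not treat
# them as URL attributes. Other tag/attribute pairs are left alone.
def protect_url_attributes(html)
  html
    .gsub(/(<a\s(?:[^>]*?\s)?)href(?=\s*=)/i, '\1protected-attribute-href')
    .gsub(/(<img\s(?:[^>]*?\s)?)src(?=\s*=)/i, '\1protected-attribute-src')
end

# Restores the original attribute names after scrubbing.
def restore_url_attributes(html)
  html.gsub(/protected-attribute-(href|src)/, '\1')
end
```

This is still regexp-based HTML handling, so it can be fooled by a ">" or a " href=" sequence inside an attribute value. It also relies on the scrubbers in use allowing the renamed attributes through, and a renamed attribute would presumably skip any URL-specific checks a scrubber applies to href and src, so the values should be validated separately.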

flavorjones · 2023-03-30T06:04:10Z

It's probably worth folks following #239 which will (eventually) update Loofah to use Nokogiri's HTML5 parser by default, which does not have this escaping behavior:

puts Nokogiri::HTML4.parse('<a href="{{ foo }}" style="{{ foo }}">').to_html
# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#   <html><body><a href="%7B%7B%20foo%20%7D%7D" style="{{ foo }}"></a></body></html>
                           
puts Nokogiri::HTML5.parse('<a href="{{ foo }}" style="{{ foo }}">').to_html
# => <html><head></head><body><a href="{{ foo }}" style="{{ foo }}"></a></body></html>

I think that's the right fix for these situations.

flavorjones added the discussion label Oct 28, 2018

flavorjones mentioned this issue Dec 29, 2018

How can I allow liquid tags inside img src attribute without encoding them? #160

Closed

flavorjones closed this as completed Sep 28, 2019

rolfschmidt mentioned this issue Oct 1, 2021

Signatures sanitize variables in URLs zammad/zammad#3754

Closed
