Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable escaping of href attribute #152

Closed
th0r opened this issue Sep 7, 2018 · 9 comments
Closed

Disable escaping of href attribute #152

th0r opened this issue Sep 7, 2018 · 9 comments

Comments

@th0r
Copy link

th0r commented Sep 7, 2018

Problem:

I use Rails and want to sanitize HTML that will be used as input for client-side template engine.
For example the following HTML:

<a href="{{ foo }}" other-valid-attr="{{ bar }}" other-invalid-attr="{{ baz }}">Bar</a>

should be sanitized to this:

<a href="{{ foo }}" other-valid-attr="{{ bar }}">Bar</a>

But the thing is the value for href attribute is being escaped for some reason and HTML becomes this:

<a href="%7B%7B%20foo%20%7D%7D" other-valid-attr="{{ bar }}">Bar</a>

Is there a way to disable encoding of href attribute?

Reproduction:

f = Loofah.fragment('<a href="{{ foo }}" other-attr="{{ bar }}"></a>')
f.scrub!(:nofollow).to_s
=> "<a href=\"%7B%7B%20foo%20%7D%7D\" other-attr=\"{{ bar }}\" rel=\"nofollow\"></a>"

Note that other-attr value is left untouched.

@flavorjones
Copy link
Owner

Hi @th0r, thanks for asking this question, and apologies for my slow reply.

What you're seeing is actually behavior that's built into libxml2. The function htmlAttrDumpOutput will actually ensure that href (and a few other attributes) are escaped when emitted as HTML. (Check out https://gitlab.gnome.org/GNOME/libxml2/blob/v2.9.2/HTMLtree.c#L688 for specifics if you're interested.)

I've given some thought about what it would take for Loofah to support templating, and am open to exploring how to have people opt-in to it. However, we'd have to be very cautious about how we do it to avoid allowing embedded quotes to be rendered unescaped.

I'm curious how other people are getting around templating given this behavior? I don't write Rails much anymore so I'm curious if anyone could enlighten me.

@flavorjones
Copy link
Owner

I'll note #160 asks a similar question for img src attributes, and my above explanation as to the underlying mechanism applies to this attribute as well.

@ndbroadbent
Copy link

ndbroadbent commented Dec 29, 2018

Hi @flavorjones, thanks for the reply!

I was trying to sanitize the user's HTML templates before I processed them with Liquid. But now I've realized that I can just change the order of operations. Instead of user template => Loofah => Liquid => HTML, I can just do user template => Liquid => Loofah => HTML. Not sure why I didn't think of that before! I guess it's good to have a break and come back to something later with a fresh perspective.

I would suggest doing the same thing for the client-side template engine. In that case, you can't run Loofah in the browser, but DOMPurify is very similar (I'm also using that in my frontend.) So instead of user template => Loofah => render sanitized template in the browser => HTML, you would do: user template => render template in the browser => sanitize output with DOMPurify => HTML.

I think this might be a better solution, and I wouldn't want to try to hack around libxml2 (especially because it's a native extension with C code.)

@flavorjones
Copy link
Owner

@ndbroadbent Thanks for closing the loop here. Your approach on templates makes sense.

I'm going to close this, since it doesn't seem like there's anything actionable for Loofah at this point. Happy to continue the conversation if anyone wants to, though.

@bbugh
Copy link

bbugh commented Apr 20, 2020

Hi! 👋 Has anything like this come up again? We're offering Liquid for some user's custom templating, and need to store and retrieve raw (but sanitized) Liquid in the database. Loofah is sanitizing everything perfectly, except that our a[href] and img[src] use cases are getting escaped!

One not-so-great solution I've come up with is to use Loofah as expected, and then scan the result for markup tags that have been encoded, then decode them. This seems error prone, though. @flavorjones did you have some better idea of how this could work for Loofah based on your previous thinking?

Or, does libxml2 only do this to a[href] and img[src] because they're URLs? If libxml2 is doing this in a limited number of cases, maybe our code can special case just those situations.

Thank you!

@flavorjones
Copy link
Owner

flavorjones commented Apr 21, 2020

Hi @bbugh,

I linked to the underlying libxml2 code in a comment above, here it is again: https://gitlab.gnome.org/GNOME/libxml2/blob/v2.9.2/HTMLtree.c#L688

The important bit is:

	    if ((cur->ns == NULL) && (cur->parent != NULL) &&
		(cur->parent->ns == NULL) &&
		((!xmlStrcasecmp(cur->name, BAD_CAST "href")) ||
	         (!xmlStrcasecmp(cur->name, BAD_CAST "action")) ||
		 (!xmlStrcasecmp(cur->name, BAD_CAST "src")) ||
		 ((!xmlStrcasecmp(cur->name, BAD_CAST "name")) &&
		  (!xmlStrcasecmp(cur->parent->name, BAD_CAST "a"))))) {

which you can interpret as detecting:

  • any href, action, or src attribute
  • name attributes, but only in an a tag

@bbugh
Copy link

bbugh commented Apr 21, 2020

Thank you @flavorjones for the interpretation! Sorry I didn't notice that link earlier. I will dig in and try to remember to post the solution here when I get one.

@timfjord
Copy link

timfjord commented Mar 27, 2023

I also faced this issue, and I came up with a workaround similar to this one #240 (comment)

value
  .gsub(/(<(?:a|img)[^>]+)(href|src)(=[^>]*>)/, '\1protected-attribute-\2\3')
  .then { |v| SANITISERS.inject(Loofah.fragment(v)) { |n, scrubber| n.scrub!(scrubber) } }
  .to_s
  .gsub(/protected-attribute-(href|src)/, '\1')

The regexp might be better because currently, it also handles <a src=""> and <img href="">.

@flavorjones
Copy link
Owner

It's probably worth folks following #239 which will (eventually) update Loofah to use Nokogiri's HTML5 parser by default, which does not have this escaping behavior:

puts Nokogiri::HTML4.parse('<a href="{{ foo }}" style="{{ foo }}">').to_html
# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#   <html><body><a href="%7B%7B%20foo%20%7D%7D" style="{{ foo }}"></a></body></html>
                           
puts Nokogiri::HTML5.parse('<a href="{{ foo }}" style="{{ foo }}">').to_html
# => <html><head></head><body><a href="{{ foo }}" style="{{ foo }}"></a></body></html>

I think that's the right fix for these situations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants