Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[draft] default to html5 parsing #239

Closed
wants to merge 4 commits into from

Conversation

flavorjones
Copy link
Owner

@flavorjones flavorjones commented May 30, 2022

The libgumbo parser used by Nokogiri::HTML5 is superior to the libxml2 parser used by Nokogiri::HTML4 (the default).

This is a draft pull request to see how hard it would be to default to use that parser for Loofah's sanitization, and evaluate what changes might be breaking to the many Rails apps that use it.

Note that Loofah can only support HTML5 in Nokogiri >= 1.14.0 because it requires the subclassing fix at sparklemotion/nokogiri@ebde7da

See a related but orthogonal issue to default Nokogiri to HTML5: sparklemotion/nokogiri#2331

  • run against some large Rails apps in CI
  • generate some performance benchmark results
  • write a nice changelog entry detailing the known behavior changes that might trip people up, and how to set the environment variable to escape back to the old behavior in a pinch
  • run a CI job against an older Nokogiri version to make sure we keep backwards compatibility
  • blocked on a Nokogiri v1.14.0 release

allow us to add sanitized values with arbitrary keys to the data file
this removes noisy bad attributes from output for unsafe nodes that
are being escaped
@flavorjones
Copy link
Owner Author

See downstream rails/rails-html-sanitizer#133 for an indication of impact of this change

@flavorjones
Copy link
Owner Author

flavorjones commented Jun 2, 2022

Running some large Rails apps' CI with the HTML5 parser, found another notable behavioral difference that didn't show up in this test suite: binary attributes which don't have values in the libxml2 HTML4 parser (<option disabled>) but do in the libgumbo HTML5 parser (<option disabled="">).

@flavorjones
Copy link
Owner Author

I'd love to get some benchmarks just to understand the impact of making this change. It's unlikely to change my mind but would be good information to have handy.

@flavorjones
Copy link
Owner Author

I've run three large rails apps at Shopify through CI with this branch of Loofah and (except for three tests that were relying on the above binary attribute behavior) everything was green. This is encouraging.

Note that there is an escape hatch which is to set the environment
variable LOOFAH_HTML4_MODE to return to the previous behavior.

CI tests both html4 and html5 modes
@DanielHeath
Copy link

Would it make sense to get this merged with the default set to the existing HTML4, so apps can opt in to the HTML5 behavior and start testing? That would presumably make this a far safer change to release.

@flavorjones
Copy link
Owner Author

@DanielHeath thanks for asking. There's a bit of a yak shave of dependencies here.

Loofah can only support HTML5 in Nokogiri >= 1.14.0 because it requires the subclassing fix at sparklemotion/nokogiri@ebde7da

Hoping to get that nokogiri release out in an RC this weekend if I can finish Ruby 3.2 support in rake-compiler-doc.

@flavorjones
Copy link
Owner Author

flavorjones commented Jan 3, 2023

I picked this back up again this week, now that Nokogiri 1.14.0.rc1 has been prereleased. This work is mostly done, though there are still details to work through here and in rails-html-sanitizer.

@flavorjones
Copy link
Owner Author

Closing this in favor of #261 which has a much better API for introducing HTML5 support.

@flavorjones flavorjones closed this Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants