HTML4::DocumentFragment4 truncates text in a <div> tag at about 10mb #2941

Closed
bensherman opened this issue Jul 31, 2023 · 6 comments

bensherman commented Jul 31, 2023

It looks like Nokogiri::HTML4::DocumentFragment behaves differently from Nokogiri::HTML5::DocumentFragment: if there is more than about 10 MB of text inside a <div> tag, the HTML4 parser silently drops some of it.

There appears to be a buffer in the parser with roughly a 10 MB limit, and data after that point gets dropped. The closing </div> tag is still present in the output, so it isn't simply truncating the final string.

A quickly written script that escapes text into HTML and shows where the parsers diverge:

require "nokogiri"
require "cgi"

DocumentFragment4 = Nokogiri::HTML4::DocumentFragment
DocumentFragment5 = Nokogiri::HTML5::DocumentFragment

# Start just below the ~10 MB boundary; each "a\n" contributes 2 bytes.
times = 9_999_980 / 2

loop do
  text = "a\n" * times
  html = "<div>#{CGI.escape_html(text)}</div>"
  parsed4 = DocumentFragment4.parse(html)
  parsed5 = DocumentFragment5.parse(html)

  if parsed4.to_html != parsed5.to_html
    puts "HTML4 and HTML5 parsers differ"
    puts "length of text: #{text.length}"
    puts "length of parsed4: #{parsed4.to_s.length}"
    puts "length of parsed5: #{parsed5.to_s.length}"
    exit
  end

  puts "#{times} is ok"
  times += 1
end

Environment

A quick environment to reproduce; it happens on older versions too. This is the docker image ruby:3.2.2, running only gem install nokogiri and then the above script.

# Nokogiri (1.15.3)
    ---
    warnings: []
    nokogiri:
      version: 1.15.3
      cppflags:
      - "-I/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri"
      - "-I/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri/include"
      - "-I/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.2.2
      platform: aarch64-linux
      gem_platform: aarch64-linux
      description: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
- 0009-allow-wildcard-namespaces.patch
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      libxml2_path: "/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.11.4
      loaded: 2.11.4
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.38
      loaded: 1.1.38
    other_libraries:
      zlib: 1.2.13
      libgumbo: 1.0.0-nokogiri

@bensherman bensherman added the state/needs-triage label Jul 31, 2023
@flavorjones
Member

@bensherman Thanks for asking this question.

Diagnosis

libxml2 has some soft limits that you're running into here. The bad news is that it's not obvious when you've hit them. The good news is that if you know what you're looking for, it's easy to detect and easy to work around.

Explanation

Let's start with a slightly simpler reproduction:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html)

  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  puts
end

which outputs:

length of source html: 9999994
length of parsed html: 9999994

length of source html: 9999995
length of parsed html: 9999011

Detection

Let's check whether the parser reported any errors:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html)

  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  pp parsed4.errors
  puts
end

outputs:

length of source html: 9999994
length of parsed html: 9999994
[]

length of source html: 9999995
length of parsed html: 9999011
[#<Nokogiri::XML::SyntaxError: 1:10000002: FATAL: Huge input lookup>]

Aha! So we've hit the soft limits.

Configure the parser

Let's tell libxml2 to turn off the safety limits and handle huge inputs:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html) { |config| config.huge }

  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  pp parsed4.errors
  puts
end

outputs:

length of source html: 9999994
length of parsed html: 9999994
[]

length of source html: 9999995
length of parsed html: 9999995
[]

So now the full text is parsed correctly. Hopefully this helps you!
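
If you don't want to pay for huge-mode parsing on every document, one option is to parse with the default (safe) limits first and retry in huge mode only when the parser reports a fatal error. A minimal sketch (the helper name is hypothetical; DocumentFragment.parse, #errors, and config.huge are real Nokogiri API):

require "nokogiri"

# Hypothetical helper: parse with the default limits, then re-parse in
# huge mode only when the first pass reports a fatal error (such as the
# "Huge input lookup" error shown above).
def parse_fragment_allowing_huge(html)
  parsed = Nokogiri::HTML4::DocumentFragment.parse(html)
  return parsed unless parsed.errors.any?(&:fatal?)

  Nokogiri::HTML4::DocumentFragment.parse(html) { |config| config.huge }
end

Keep in mind that the retry disables libxml2's safety limits for that document, so it's best reserved for inputs you at least partially trust.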

Epilogue

This is weird to have to work around, and I'm open to suggestions on how to balance these principles:

  • handle a broad set of untrusted inputs
  • limit the blast radius from parsing a malicious document

libxml2's strategy over the years has been to adopt a set of reasonably large size limits that protect against unbounded memory allocation while also enabling most inputs to be handled correctly. This has been "good enough" for Nokogiri for a pretty long time.
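
For reference, the config block is shorthand for passing parse options explicitly. A sketch, assuming the documented parse signature (tags, encoding, options) and the Nokogiri::XML::ParseOptions::HUGE flag that config.huge sets:

require "nokogiri"

html = "<div>" + ("a" * 10_000_000) + "</div>"

# HUGE is the libxml2 flag behind config.huge; DEFAULT_HTML is the option
# set that HTML4 parsing uses by default.
options = Nokogiri::XML::ParseOptions::DEFAULT_HTML | Nokogiri::XML::ParseOptions::HUGE
parsed = Nokogiri::HTML4::DocumentFragment.parse(html, nil, options)
puts parsed.to_html.length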

It might also be interesting to think about whether libgumbo should have similar soft limits. @stevecheckoway is this something we can chat about, maybe in a new issue?

@flavorjones flavorjones added meta/user-help and removed state/needs-triage labels Aug 4, 2023
@bensherman
Author

bensherman commented Aug 7, 2023

Thanks for the quick response; we can work this into our code.

Since this isn't an issue with the HTML5 parser, how does it handle the risks you outlined in the epilogue?

edit: I will follow along to see how the libgumbo work goes - thanks again.

@stevecheckoway
Contributor

I'm not entirely sure what the limit is designed to do.

The way Gumbo tries to handle parsing a malicious document is to adhere to the standard as closely as we can. Barring standards bugs (which do exist and get fixed with some regularity), the HTML standard specifies a way to parse any sequence of bytes. A malicious document could probably still cause Gumbo to allocate a lot of memory, particularly by provoking many parse errors.

We have soft limits on number of errors and tree depth because they were requested by a downstream project.
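
For anyone following along, those knobs are exposed through Nokogiri's HTML5 entry points. A sketch, assuming the documented :max_errors and :max_tree_depth keyword arguments:

require "nokogiri"

# :max_errors caps how many parse errors are kept (the default keeps none);
# :max_tree_depth caps nesting depth (default 400) so a deeply nested
# document can't build an unbounded tree.
html = "<b><i>misnested</b></i>"
doc = Nokogiri::HTML5.parse(html, max_errors: 10, max_tree_depth: 400)
pp doc.errors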

Currently, if Gumbo fails to allocate memory, the whole process is killed. It may be worth thinking about how to handle that failure gracefully instead. Since Gumbo doesn't do anything like expanding XML external entities (e.g., the billion laughs attack), triggering an allocation failure would probably require a very large document.

@flavorjones
Member

@stevecheckoway Thanks for the thoughtful reply.

The libxml2 limits are in place to cap the memory allocated while parsing an untrusted document. They halt the parser when any of several limits is exceeded: total input length, tree depth, or even the size of an individual text node. That's enough to prevent an OOM condition.

I'm primarily concerned about unbounded memory allocation when a large untrusted document is parsed. It seems like it could be trivial to craft a document long enough to trigger an OOM condition, which could be used for a denial-of-service attack.

@stevecheckoway
Contributor

@flavorjones Gotcha! That does make sense.

We'll have to think about threading allocation failure all the way back to the main gumbo parse functions. Makes me pine for Rust's Result<T, E> and the ? operator.

@flavorjones
Member

I don't think it's particularly urgent that we address this, but I have created a new issue #2949 so we don't forget and can continue the conversation.
