HTML4::DocumentFragment4 truncates text in a <div> tag at about 10mb #2941

Closed
bensherman opened this issue Jul 31, 2023 · 6 comments

bensherman commented Jul 31, 2023

It looks like Nokogiri::HTML4::DocumentFragment behaves differently from Nokogiri::HTML5::DocumentFragment: if there is more than about 10 MB of text inside a <div> tag, the HTML4 parser silently drops some of it.

There appears to be a buffer in the parser with roughly a 10 MB limit, and data after that point gets dropped. The closing </div> tag is still present in the output, so it isn't simply truncating the final string.

A quickly written script that escapes text into HTML and shows where the parsers diverge:

require "nokogiri"
require "cgi"

DocumentFragment4 = Nokogiri::HTML4::DocumentFragment
DocumentFragment5 = Nokogiri::HTML5::DocumentFragment

# Start just below the ~10 MB boundary; each "a\n" contributes 2 bytes.
times = 9_999_980 / 2

loop do
  text = "a\n" * times
  html = "<div>#{CGI.escape_html(text)}</div>"
  parsed4 = DocumentFragment4.parse(html)
  parsed5 = DocumentFragment5.parse(html)

  if parsed4.to_html != parsed5.to_html
    puts "HTML4 and HTML5 parsers differ"
    puts "length of text: #{text.length}"
    puts "length of parsed4: #{parsed4.to_s.length}"
    puts "length of parsed5: #{parsed5.to_s.length}"
    exit
  end

  puts "#{times} is ok"
  times += 1
end

Environment

A quick environment to reproduce; it happens on older versions too. This is the docker image ruby:3.2.2, running only gem install nokogiri and then the above script.

# Nokogiri (1.15.3)
    ---
    warnings: []
    nokogiri:
      version: 1.15.3
      cppflags:
      - "-I/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri"
      - "-I/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri/include"
      - "-I/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.2.2
      platform: aarch64-linux
      gem_platform: aarch64-linux
      description: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
- 0009-allow-wildcard-namespaces.patch
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      libxml2_path: "/usr/local/bundle/gems/nokogiri-1.15.3-aarch64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.11.4
      loaded: 2.11.4
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.38
      loaded: 1.1.38
    other_libraries:
      zlib: 1.2.13
      libgumbo: 1.0.0-nokogiri

@bensherman bensherman added the state/needs-triage label Jul 31, 2023
@flavorjones
Member

@bensherman Thanks for asking this question.

Diagnosis

libxml2 has some soft limits that you're running into here. The bad news is that it's not obvious when you've hit them. The good news is that if you know what you're looking for, it's easy to detect and easy to work around.

Explanation

Let's start with a slightly simpler reproduction:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html)

  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  puts
end

which outputs:

length of source html: 9999994
length of parsed html: 9999994

length of source html: 9999995
length of parsed html: 9999011

Detection

Let's check whether the parser reported any errors:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html)

  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  pp parsed4.errors
  puts
end

outputs:

length of source html: 9999994
length of parsed html: 9999994
[]

length of source html: 9999995
length of parsed html: 9999011
[#<Nokogiri::XML::SyntaxError: 1:10000002: FATAL: Huge input lookup>]

Aha! So we've hit the soft limits.

Configure the parser

Let's tell libxml2 to turn off the safety limits and handle huge inputs:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html) { |config| config.huge }

  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  pp parsed4.errors
  puts
end

outputs:

length of source html: 9999994
length of parsed html: 9999994
[]

length of source html: 9999995
length of parsed html: 9999995
[]

So now the full text is parsed correctly. Hopefully this helps you!
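
If you don't want to pay for huge-mode parsing on every document, one option is to parse with the default (safe) limits first and retry in huge mode only when the parser reports a fatal error. A minimal sketch (the helper name is hypothetical; DocumentFragment.parse, #errors, and config.huge are real Nokogiri API):

require "nokogiri"

# Hypothetical helper: parse with the default limits, then re-parse in
# huge mode only when the first pass reports a fatal error (such as the
# "Huge input lookup" error shown above).
def parse_fragment_allowing_huge(html)
  parsed = Nokogiri::HTML4::DocumentFragment.parse(html)
  return parsed unless parsed.errors.any?(&:fatal?)

  Nokogiri::HTML4::DocumentFragment.parse(html) { |config| config.huge }
end

Keep in mind that the retry disables libxml2's safety limits for that document, so it's best reserved for inputs you at least partially trust.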

Epilogue

This is weird to have to work around, and I'm open to suggestions on how to balance these principles:

  • handle a broad set of untrusted inputs
  • limit the blast radius from parsing a malicious document

libxml2's strategy over the years has been to adopt a set of reasonably large size limits that protect against unbounded memory allocation while also enabling most inputs to be handled correctly. This has been "good enough" for Nokogiri for a pretty long time.
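
For reference, the config block is shorthand for passing parse options explicitly. A sketch, assuming the documented parse signature (tags, encoding, options) and the Nokogiri::XML::ParseOptions::HUGE flag that config.huge sets:

require "nokogiri"

html = "<div>" + ("a" * 10_000_000) + "</div>"

# HUGE is the libxml2 flag behind config.huge; DEFAULT_HTML is the option
# set that HTML4 parsing uses by default.
options = Nokogiri::XML::ParseOptions::DEFAULT_HTML | Nokogiri::XML::ParseOptions::HUGE
parsed = Nokogiri::HTML4::DocumentFragment.parse(html, nil, options)
puts parsed.to_html.length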

It might also be interesting to think about whether libgumbo should have similar soft limits. @stevecheckoway is this something we can chat about, maybe in a new issue?

@flavorjones flavorjones added meta/user-help and removed state/needs-triage labels Aug 4, 2023
@bensherman
Author

bensherman commented Aug 7, 2023

Thanks for the quick response; we can work this into our code.

Since this isn't an issue with the HTML5 parser, how does it handle the risks you outlined in the epilogue?

edit: I will follow along to see how the libgumbo work goes - thanks again.

@stevecheckoway
Contributor

I'm not entirely sure what the limit is designed to do.

The way Gumbo tries to handle parsing a malicious document is to adhere to the standard as closely as we can. Barring standards bugs (which do exist and get fixed with some regularity), the HTML standard specifies a way to parse any sequence of bytes. A malicious document could probably still cause Gumbo to allocate a lot of memory, particularly by provoking many parse errors.

We have soft limits on number of errors and tree depth because they were requested by a downstream project.
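
For anyone following along, those knobs are exposed through Nokogiri's HTML5 entry points. A sketch, assuming the documented :max_errors and :max_tree_depth keyword arguments:

require "nokogiri"

# :max_errors caps how many parse errors are kept (the default keeps none);
# :max_tree_depth caps nesting depth (default 400) so a deeply nested
# document can't build an unbounded tree.
html = "<b><i>misnested</b></i>"
doc = Nokogiri::HTML5.parse(html, max_errors: 10, max_tree_depth: 400)
pp doc.errors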

Currently, if Gumbo fails to allocate memory, the whole process is killed. It may be worth thinking about how to handle that failure gracefully instead. Since Gumbo doesn't do anything like expanding XML external entities (e.g., the billion laughs attack), triggering an allocation failure would probably require a very large document.

@flavorjones
Member

@stevecheckoway Thanks for the thoughtful reply.

The libxml2 limits are in place to cap the memory allocated while parsing an untrusted document. They halt the parser when any of several limits is exceeded: total input length, tree depth, or even the size of an individual text node. That's enough to prevent an OOM condition.

I'm primarily concerned about unbounded memory allocation when a large untrusted document is parsed. It seems like it could be trivial to craft a document long enough to trigger an OOM condition, which could be used for a denial-of-service attack.

@stevecheckoway
Contributor

@flavorjones Gotcha! That does make sense.

We'll have to think about threading allocation failure all the way back to the main gumbo parse functions. Makes me pine for Rust's Result<T, E> and the ? operator.

@flavorjones
Member

I don't think it's particularly urgent that we address this, but I have created a new issue #2949 so we don't forget and can continue the conversation.
