explore further optimizing the HTML5 parser and serializer #2722
Let's capture a benchmark at the start of this, on my development machine.

```ruby
#! /usr/bin/env ruby
# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

require "nokogiri"
require "benchmark/ips"

filenames = [
  "test/files/GH_1042.html", # 650b
  "test/files/tlm.html", # 70kb
  "big_shopping.html", # 1.9mb
]
inputs = filenames.map { |fn| File.read(fn) }

puts RUBY_DESCRIPTION

inputs.each do |input|
  len = input.length
  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10
    x.report("html5 parse #{len}") do
      Nokogiri::HTML5::Document.parse(input)
    end
    x.report("html4 parse #{len}") do
      Nokogiri::HTML4::Document.parse(input)
    end
    x.compare!
  end
end

puts "=========="

inputs.each do |input|
  len = input.length
  html4_doc = Nokogiri::HTML4::Document.parse(input)
  html5_doc = Nokogiri::HTML5::Document.parse(input)
  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10
    x.report("html5 serlz #{len}") do
      html5_doc.to_html
    end
    x.report("html4 serlz #{len}") do
      html4_doc.to_html
    end
    x.compare!
  end
end
```

(The `big_shopping.html` file is linked in a comment on #2331.)
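The scripts in this thread rely on the benchmark-ips gem. To show what its headline "i/s" figure measures, here is a rough, stdlib-only sketch: run a block repeatedly for a fixed wall-clock budget and divide the iteration count by the elapsed time. `measure_ips` is a hypothetical helper, not part of benchmark-ips; it has no warmup, no batching, and no error estimate, so treat its numbers as approximate.

```ruby
# Hypothetical, stdlib-only stand-in for benchmark-ips: run the block for
# roughly `seconds` of wall-clock time and return iterations per second.
# Reading the clock on every iteration adds overhead that matters for very
# fast blocks; benchmark-ips avoids this by calibrating batch sizes.
def measure_ips(seconds: 1.0)
  clock = Process::CLOCK_MONOTONIC
  start = Process.clock_gettime(clock)
  deadline = start + seconds
  iterations = 0
  loop do
    yield
    iterations += 1
    break if Process.clock_gettime(clock) >= deadline
  end
  iterations / (Process.clock_gettime(clock) - start)
end

# Example workload (an assumption for illustration, not an HTML parser).
html = "<p>hello</p>" * 100
ips = measure_ips(seconds: 0.2) { html.scan(/<p>/).size }
puts format("%.1f i/s", ips)
```

Comparing two implementations this way means calling `measure_ips` once per block and dividing the faster rate by the slower one.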
And also some profiling information.

`stackprof-big-shopping.sh`:

```bash
#! /usr/bin/env bash

if [[ $# -lt 1 ]] ; then
  echo "usage: $0 <output-filename>"
  exit 1
fi

cmd=$(rbenv which ruby)

env \
  LD_PRELOAD="$HOME/local/lib/libprofiler.so" \
  CPUPROFILE="$1" \
  "$cmd" ./stackprof-big-shopping.rb | tee "$1.log"

pprof --gif "$cmd" "$1" > "$1.gif"
pprof --text "$cmd" "$1" > "$1.text"
```

`stackprof-big-shopping.rb`:
```ruby
#! /usr/bin/env ruby
# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

require "nokogiri"
require "benchmark/ips"

input = File.read("big_shopping.html") # 1.9mb
puts "input #{input.bytesize} bytes"
puts RUBY_DESCRIPTION

Benchmark.ips do |x|
  x.warmup = false
  x.time = 10
  x.report("parsing") { Nokogiri::HTML5::Document.parse(input) }
end
```

Text output:

Top of the stack profile:
@stevecheckoway suggested optimizing out calls to
Already an improvement! A 14% baseline improvement, going from 2.92x slower to 2.46x slower than html4 on the big file. Here's the stack profile:

Part of #2722

Co-authored-by: Stephen Checkoway
@stevecheckoway I tried link-time optimization (LTO) but didn't see a noticeable performance improvement. Here's the patch:

```diff
diff --git a/ext/nokogiri/extconf.rb b/ext/nokogiri/extconf.rb
index daf6094..7745e4f 100644
--- a/ext/nokogiri/extconf.rb
+++ b/ext/nokogiri/extconf.rb
@@ -618,6 +618,9 @@ def do_clean
# gumbo html5 serialization is slower with O3, let's make sure we use O2
append_cflags("-O2")
+# link-time optimization
+append_cflags("-flto")
+
# always include debugging information
append_cflags("-g")
@@ -725,7 +728,7 @@ def install
class << recipe
def configure
env = {}
- env["CFLAGS"] = concat_flags(ENV["CFLAGS"], "-fPIC", "-g")
+ env["CFLAGS"] = concat_flags(ENV["CFLAGS"], "-fPIC", "-g", "-flto")
env["CHOST"] = host
execute("configure", ["./configure", "--static", configure_prefix], { env: env })
if darwin?
@@ -751,7 +754,7 @@ def configure
# The libiconv configure script doesn't accept "arm64" host string but "aarch64"
recipe.host = recipe.host.gsub("arm64-apple-darwin", "aarch64-apple-darwin")
- cflags = concat_flags(ENV["CFLAGS"], "-O2", "-U_FORTIFY_SOURCE", "-g")
+ cflags = concat_flags(ENV["CFLAGS"], "-O2", "-U_FORTIFY_SOURCE", "-g", "-flto")
recipe.configure_options += [
"--disable-dependency-tracking",
@@ -804,7 +807,7 @@ def configure
recipe.patch_files = Dir[File.join(PACKAGE_ROOT_DIR, "patches", "libxml2", "*.patch")].sort
end
- cflags = concat_flags(ENV["CFLAGS"], "-O2", "-U_FORTIFY_SOURCE", "-g")
+ cflags = concat_flags(ENV["CFLAGS"], "-O2", "-U_FORTIFY_SOURCE", "-g", "-flto")
if zlib_recipe
recipe.configure_options << "--with-zlib=#{zlib_recipe.path}"
@@ -853,7 +856,7 @@ def configure
recipe.patch_files = Dir[File.join(PACKAGE_ROOT_DIR, "patches", "libxslt", "*.patch")].sort
end
- cflags = concat_flags(ENV["CFLAGS"], "-O2", "-U_FORTIFY_SOURCE", "-g")
+ cflags = concat_flags(ENV["CFLAGS"], "-O2", "-U_FORTIFY_SOURCE", "-g", "-flto")
if darwin? && !cross_build_p
recipe.configure_options += ["RANLIB=/usr/bin/ranlib", "AR=/usr/bin/ar"]
@@ -974,9 +977,10 @@ def install
end
def compile
- cflags = concat_flags(ENV["CFLAGS"], "-fPIC", "-O2", "-g")
+ cflags = concat_flags(ENV["CFLAGS"], "-fPIC", "-O2", "-g", "-flto")
+ ldflags = concat_flags(ENV["LDFLAGS"], "-flto")
- env = { "CC" => gcc_cmd, "CFLAGS" => cflags }
+ env = { "CC" => gcc_cmd, "CFLAGS" => cflags, "LDFLAGS" => ldflags }
if config_cross_build?
if /darwin/.match?(host)
env["AR"] = "#{host}-libtool"
```

Is there anything obvious I'm missing? I don't have much experience with link-time optimization, so I may be missing something elementary. (Note that flags passed to
Also, for comparison, this is the stack profile for libxml2's HTML4 parser on the same file:
It's interesting that
Pulling on the malloc thread (pun intended): there's a significant overall performance improvement for libxml2 (and therefore also for HTML5 serialization) if we don't tell libxml2 to use Ruby's memory management functions. I'll ship a PR soon. Edit: PR at #2734
@stevecheckoway I'm out of ideas. Anything else you'd like me to explore?
With I tried adding I'm surprised
It's surprising that we'd be spending such a large amount of time in this handful of instructions given how much code it takes to handle each byte of input.
@stevecheckoway although I can get small (1%-6%) speedups on libgumbo parsing with
Maybe I'm making systematic errors doing this? I'll go back and try to reproduce these results. I'm also surprised to see
I disassembled utf8.o on my machine, too, and saw that it looks like I ran The whole function's analysis is here.
But here's the interesting bit that accounts for the high number of samples for "decode":
So it looks like the
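To make the per-byte cost discussed above concrete, here is a minimal Ruby sketch of a byte-at-a-time UTF-8 decoder. It is a simplified, branch-based state machine chosen for readability and is not a port of the table-driven `decode` routine in utf8.o; `decode_utf8` and `REPLACEMENT` are hypothetical names. The point it illustrates is that every input byte goes through one state-machine step, so even a few instructions per step add up over a 1.9 MB document. The accepted byte ranges follow the Unicode well-formed UTF-8 table, and ill-formed input becomes U+FFFD.

```ruby
# Hypothetical sketch of a byte-at-a-time UTF-8 decoder (not gumbo's code).
# Each loop iteration is one state-machine step over one input byte.
REPLACEMENT = 0xFFFD

def decode_utf8(bytes)
  codepoints = []
  need = 0                  # continuation bytes still expected
  cp = 0                    # code point accumulated so far
  lower, upper = 0x80, 0xBF # allowed range for the next continuation byte
  i = 0

  while i < bytes.length
    byte = bytes[i]
    if need.zero?
      case byte
      when 0x00..0x7F
        codepoints << byte
      when 0xC2..0xDF
        need, cp = 1, byte & 0x1F
      when 0xE0..0xEF
        lower = 0xA0 if byte == 0xE0 # reject overlong 3-byte forms
        upper = 0x9F if byte == 0xED # reject UTF-16 surrogates
        need, cp = 2, byte & 0x0F
      when 0xF0..0xF4
        lower = 0x90 if byte == 0xF0 # reject overlong 4-byte forms
        upper = 0x8F if byte == 0xF4 # reject code points above U+10FFFF
        need, cp = 3, byte & 0x07
      else
        codepoints << REPLACEMENT    # stray continuation or invalid lead byte
      end
      i += 1
    elsif byte.between?(lower, upper)
      cp = (cp << 6) | (byte & 0x3F)
      lower, upper = 0x80, 0xBF
      need -= 1
      codepoints << cp if need.zero?
      i += 1
    else
      # Sequence cut short: emit one U+FFFD, reset, and re-examine this byte
      # as a possible lead byte (i is deliberately not advanced).
      codepoints << REPLACEMENT
      need = 0
      lower, upper = 0x80, 0xBF
    end
  end

  codepoints << REPLACEMENT unless need.zero? # input ended mid-sequence
  codepoints
end

p decode_utf8("héllo €😀".bytes)
```

A table-driven decoder replaces most of these branches with array lookups, which is one common way to shrink the per-byte cost further.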
Thanks for your work! I used your benchmark with Ruby 3.2.2 and 2.7.2 and added Nokolexbor to it. In these runs Nokolexbor is roughly 2-12 times faster when parsing and 3-7 times faster when serializing. @flavorjones What do you think about using Nokolexbor for the HTML processing in Nokogiri?

```ruby
#! /usr/bin/env ruby
# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "nokolexbor"
  gem "benchmark-ips"
end

require "nokogiri"
require "nokolexbor"
require "benchmark/ips"

filenames = [
  "test/files/GH_1042.html", # 650b
  "test/files/tlm.html", # 70kb
  "big_shopping.html", # 1.9mb
]
inputs = filenames.map { |fn| File.read(fn) }

puts RUBY_DESCRIPTION

inputs.each do |input|
  len = input.length
  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10
    x.report("html5 parse #{len}") do
      Nokogiri::HTML5::Document.parse(input)
    end
    x.report("html4 parse #{len}") do
      Nokogiri::HTML4::Document.parse(input)
    end
    x.report("nokolexbor html5 parse #{len}") do
      Nokolexbor::HTML(input)
    end
    x.compare!
  end
end

puts "=========="

inputs.each do |input|
  len = input.length
  html4_doc = Nokogiri::HTML4::Document.parse(input)
  html5_doc = Nokogiri::HTML5::Document.parse(input)
  html5_doc_nokolexbor = Nokolexbor::HTML(input)
  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10
    x.report("html5 serlz #{len}") do
      html5_doc.to_html
    end
    x.report("html4 serlz #{len}") do
      html4_doc.to_html
    end
    x.report("html5 nokolexbor serlz #{len}") do
      html5_doc_nokolexbor.to_html
    end
    x.compare!
  end
end
```

Ruby 2.7.2 (`benchmark$ ruby bench.rb`):
```
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
Calculating -------------------------------------
html5 parse 656 21.049k (±23.5%) i/s - 179.547k in 9.929976s
html4 parse 656 22.142k (±22.3%) i/s - 189.923k in 9.926466s
nokolexbor html5 parse 656
43.945k (±21.3%) i/s - 296.049k in 9.900173s
Comparison:
nokolexbor html5 parse 656: 43944.8 i/s
html4 parse 656: 22141.7 i/s - 1.98x (± 0.00) slower
html5 parse 656: 21048.8 i/s - 2.09x (± 0.00) slower
Calculating -------------------------------------
html5 parse 70095 300.102 (±18.7%) i/s - 2.684k in 9.997238s
html4 parse 70095 450.409 (±22.6%) i/s - 3.978k in 9.997504s
nokolexbor html5 parse 70095
1.406k (±20.4%) i/s - 13.083k in 9.984839s
Comparison:
nokolexbor html5 parse 70095: 1405.6 i/s
html4 parse 70095: 450.4 i/s - 3.12x (± 0.00) slower
html5 parse 70095: 300.1 i/s - 4.68x (± 0.00) slower
Calculating -------------------------------------
html5 parse 1929522 13.132 (± 7.6%) i/s - 131.000 in 10.075865s
html4 parse 1929522 37.880 (±13.2%) i/s - 370.000 in 10.017928s
nokolexbor html5 parse 1929522
157.773 (± 9.5%) i/s - 1.561k in 9.999853s
Comparison:
nokolexbor html5 parse 1929522: 157.8 i/s
html4 parse 1929522: 37.9 i/s - 4.17x (± 0.00) slower
html5 parse 1929522: 13.1 i/s - 12.01x (± 0.00) slower
==========
Calculating -------------------------------------
html5 serlz 656 40.303k (±17.2%) i/s - 373.898k in 9.891472s
html4 serlz 656 53.260k (±18.3%) i/s - 484.973k in 9.844606s
html5 nokolexbor serlz 656
263.888k (±15.5%) i/s - 2.270M in 9.493963s
Comparison:
html5 nokolexbor serlz 656: 263887.5 i/s
html4 serlz 656: 53260.0 i/s - 4.95x (± 0.00) slower
html5 serlz 656: 40303.4 i/s - 6.55x (± 0.00) slower
Calculating -------------------------------------
html5 serlz 70095 918.855 (±15.1%) i/s - 8.842k in 9.993063s
html4 serlz 70095 1.112k (±13.3%) i/s - 10.828k in 9.992264s
html5 nokolexbor serlz 70095
3.359k (±14.9%) i/s - 32.417k in 9.985435s
Comparison:
html5 nokolexbor serlz 70095: 3358.8 i/s
html4 serlz 70095: 1112.0 i/s - 3.02x (± 0.00) slower
html5 serlz 70095: 918.9 i/s - 3.66x (± 0.00) slower
Calculating -------------------------------------
html5 serlz 1929522 107.234 (±12.1%) i/s - 1.055k in 10.007869s
html4 serlz 1929522 115.701 (±11.2%) i/s - 1.140k in 9.999178s
html5 nokolexbor serlz 1929522
425.103 (±19.8%) i/s - 4.042k in 9.994780s
Comparison:
html5 nokolexbor serlz 1929522: 425.1 i/s
html4 serlz 1929522: 115.7 i/s - 3.67x (± 0.00) slower
html5 serlz 1929522: 107.2 i/s - 3.96x (± 0.00) slower
```

Ruby 3.2.2 (`benchmark$ ruby ./bench.rb`):

```
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
Calculating -------------------------------------
html5 parse 656 21.030k (±18.5%) i/s - 170.856k
html4 parse 656 21.118k (±18.9%) i/s - 172.096k in 9.886192s
nokolexbor html5 parse 656
38.215k (±24.8%) i/s - 243.899k in 9.856369s
Comparison:
nokolexbor html5 parse 656: 38214.8 i/s
html4 parse 656: 21118.4 i/s - 1.81x slower
html5 parse 656: 21029.8 i/s - 1.82x slower
Calculating -------------------------------------
html5 parse 70095 275.828 (±21.0%) i/s - 2.421k in 9.996074s
html4 parse 70095 439.891 (±20.9%) i/s - 3.646k in 9.995517s
nokolexbor html5 parse 70095
1.467k (±18.5%) i/s - 13.797k in 9.983325s
Comparison:
nokolexbor html5 parse 70095: 1466.9 i/s
html4 parse 70095: 439.9 i/s - 3.33x slower
html5 parse 70095: 275.8 i/s - 5.32x slower
Calculating -------------------------------------
html5 parse 1929522 12.321 (± 8.1%) i/s - 122.000 in 10.067774s
html4 parse 1929522 36.420 (±19.2%) i/s - 351.000 in 10.018349s
nokolexbor html5 parse 1929522
146.070 (±15.1%) i/s - 1.423k in 10.001315s
Comparison:
nokolexbor html5 parse 1929522: 146.1 i/s
html4 parse 1929522: 36.4 i/s - 4.01x slower
html5 parse 1929522: 12.3 i/s - 11.86x slower
==========
Calculating -------------------------------------
html5 serlz 656 39.037k (±22.6%) i/s - 335.023k in 9.824201s
html4 serlz 656 52.522k (±21.3%) i/s - 452.027k in 9.742767s
html5 nokolexbor serlz 656
260.432k (±19.0%) i/s - 2.064M in 9.155473s
Comparison:
html5 nokolexbor serlz 656: 260432.1 i/s
html4 serlz 656: 52521.9 i/s - 4.96x slower
html5 serlz 656: 39037.3 i/s - 6.67x slower
Calculating -------------------------------------
html5 serlz 70095 950.690 (±15.6%) i/s - 9.173k in 9.989867s
html4 serlz 70095 1.049k (±15.6%) i/s - 10.090k in 9.988001s
html5 nokolexbor serlz 70095
3.464k (±16.9%) i/s - 32.979k in 9.976496s
Comparison:
html5 nokolexbor serlz 70095: 3464.2 i/s
html4 serlz 70095: 1049.5 i/s - 3.30x slower
html5 serlz 70095: 950.7 i/s - 3.64x slower
Calculating -------------------------------------
html5 serlz 1929522 114.167 (± 9.6%) i/s - 1.130k in 10.002443s
html4 serlz 1929522 112.654 (±12.4%) i/s - 1.107k in 10.006577s
html5 nokolexbor serlz 1929522
412.097 (±18.9%) i/s - 3.934k in 9.992725s
Comparison:
html5 nokolexbor serlz 1929522: 412.1 i/s
html5 serlz 1929522: 114.2 i/s - 3.61x slower
html4 serlz 1929522: 112.7 i/s - 3.66x slower
```
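The "Nx slower" figures in the Comparison sections can be reproduced, to the printed precision, by dividing the fastest entry's i/s by each entry's i/s. The sketch below does that for the Ruby 2.7.2 parse run on the 1.9 MB file, using the i/s values copied from the output. `slowdowns` is a hypothetical helper, not benchmark-ips's own code, and it does not model the `(± 0.00)` error term that benchmark-ips prints.

```ruby
# Hypothetical helper: rank entries by throughput and express each one as a
# multiple of the fastest entry's iterations per second.
def slowdowns(ips_by_label)
  fastest = ips_by_label.values.max
  ips_by_label
    .sort_by { |_label, ips| -ips }
    .map { |label, ips| [label, (fastest / ips).round(2)] }
end

# i/s figures from the Ruby 2.7.2 run on the 1929522-character input.
results = {
  "nokolexbor html5 parse" => 157.773,
  "html4 parse" => 37.880,
  "html5 parse" => 13.132,
}

slowdowns(results).each do |label, factor|
  puts "#{label}: #{factor}x"
end
```

The same arithmetic explains why the ratios shift from run to run: each "x slower" value depends on both the measured entry and the fastest entry in that run.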
@ilyazub please open a new issue.
@ilyazub Seriously! I haven't looked hard at Nokolexbor, but there are some incompatibilities. Please open an issue if this is something you think warrants investigation and we can talk about it!
@flavorjones Thank you for following up! I created an issue: #3043 |
Ideally we want HTML5 to be the default HTML parser in Nokogiri (see #2331). Some necessary work before we do that is to make sure it's as performant as we can make it.
This issue is open-ended and meant to collect the conversations and optimization attempts we've made.

gumbo-specific speedup:

- `-O2` to compile libgumbo and the nokogiri extension #2639 (previous work)

General speedup: