Extend allowed characters' set #93
This refactoring will allow for an easier addition/deletion of allowed characters, without directly manipulating the regular expression.
This refactoring could facilitate adding more characters to the allowed set. @flooose what do you think? |
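To illustrate the idea of keeping the allowed characters in a plain list rather than editing the regular expression by hand, here is a minimal sketch. The helper name `build_disallowed_regex` and the inline character list are assumptions made for illustration; the actual change in this pull request may load and structure its list differently.

```ruby
# Hypothetical helper (not taken from the PR): builds a regex that matches
# every character NOT contained in the given list of allowed characters.
def build_disallowed_regex(allowed_chars)
  # Regexp.escape protects characters such as '-', ']' and '^', which
  # would otherwise have a special meaning inside a character class.
  escaped = allowed_chars.uniq.map { |char| Regexp.escape(char) }.join
  Regexp.new("[^#{escaped}]")
end

# Assumed inline list; it mirrors the characters of the regex literal
# quoted later in this conversation.
allowed = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a +
          "ÄÖÜäöüß&*$% ':?,-(+.)/".chars

DISALLOWED = build_disallowed_regex(allowed)

sanitized = 'Müller & Söhne <GmbH>'.gsub(DISALLOWED, '')
# sanitized => "Müller & Söhne GmbH"
```

With this shape, adding or removing a character only touches the list; the regex is regenerated from it.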
It seems like it would be fine, but I'd be interested in what happens in terms of performance when we start adding more characters to allowed_chars.txt. My understanding of the xls file was that all of the characters marked in green should be allowed. That would be a large regex :) Any idea of how to test this? Have you tried adding just the Greek alphabet to see if there are any performance hits? |
I ran the following code:
require 'benchmark'

STRING_LENGTH = 100_000

# Negated character classes: they match every character that is NOT allowed.
regex_filter = /[^a-zA-Z0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
# The same negated set expressed as a String#tr specification.
string_filter = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"
regex_filter_with_greek = /[^a-zA-ZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter_with_greek = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"
# Pool of allowed and disallowed characters for the random input.
# .chars splits each group so that every character is a separate sample
# candidate (%w[...] would yield one multi-character element per group).
charset = Array('A'..'Z') + Array('a'..'z') + Array(0..9) + 'ÄÖÜäöüß&*$%'.chars + '!^@~\\'.chars + 'ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ'.chars
my_string = Array.new(STRING_LENGTH) { charset.sample }.join
# Compare regex-based gsub with String#tr; an empty replacement makes tr delete the matched characters.
results = Benchmark.bm do |x|
  x.report('gsub with greek:') { my_string.gsub(regex_filter_with_greek, '') }
  x.report('gsub no greek:') { my_string.gsub(regex_filter, '') }
  x.report('tr with greek:') { my_string.tr(string_filter_with_greek, '') }
  x.report('tr no greek:') { my_string.tr(string_filter, '') }
end
For a
What I find somewhat surprising, is the fact that |
Interesting. I guess, let's see what the maintainers say. |
Ping. Maybe no one is responding because the pull request is still a "draft"? |
Closed in favour of #105. |
Resolves #92.