Skip to content

Add Base.valid{n}?/2 functions #14417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 11, 2025
Merged

Add Base.valid{n}?/2 functions #14417

merged 13 commits into from
Apr 11, 2025

Conversation

whatyouhide
Copy link
Member

This is a first PR to add this kind of functions. If we like the direction, I'll go ahead and add valid32?/2 and valid64?/2.

@josevalim
Copy link
Member

LGTM. What are we going to call the uri_base64 one? :D

@whatyouhide
Copy link
Member Author

@josevalim I'd just replace decode with valid in function names I think, so url_valid64? and friends.

@josevalim
Copy link
Member

@whatyouhide perhaps we should go with valid_decode64? and valid_url_decode64? as those are specific to the decoding operation? In theory we don't need decoding in the name, as all encoding is valid, but perhaps this makes it clearer anyway.

@whatyouhide
Copy link
Member Author

Okay I’m down to go with valid_decode*. Gimme a sec.

@whatyouhide
Copy link
Member Author

Added all bases. The code is a bit repetitive but I’m afraid getting more clever would make it really hard to read. For base64, ignoring whitespace is nasty because we still have to remove it completely and then do a validation pass (so one copy), but I think that's fine as the first step.

@josevalim
Copy link
Member

@whatyouhide how big is the Base module before and after these changes? Should we also benchmark to see if the performance difference is significant?

{min, decoded} = alphabet |> Enum.with_index() |> to_decode_list.()

defp unquote(name)(char) do
defp unquote(validate_name)(char) when char in unquote(alphabet), do: char
Copy link
Member

@josevalim josevalim Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For validating each char, we should use the same decode/1 function instead of introducing a new validate function. The decode function is a subtraction plus a tuple operation, which is likely faster than checking for char in unquote(alphabet).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should apply to all of them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is not done :D My suggestion is to remove defp unquote(validate_name)(char) and use the decoding function instead, which is likely faster?

@sabiwara
Copy link
Contributor

sabiwara commented Apr 9, 2025

Should we also benchmark to see if the performance difference is significant?

I think this is a good point, also I'm wondering if replacing the raise/rescue approach by generating and clauses would improve these benchmarks.

@whatyouhide
Copy link
Member Author

to see if the performance difference is significant?

The huge difference here is the memory difference anyway?

@whatyouhide
Copy link
Member Author

  • Byte size of Elixir.Base.beam before: 89548
  • Byte size of Elixir.Base.beam after: 115360 (~28% bigger)

Performance wise.

Bench script
Mix.install([:benchee, :benchee_html])

defmodule Helpers do
  def decode(16, input), do: Base.decode16!(input)
  def decode(32, input), do: Base.decode32!(input)
  def decode(64, input), do: Base.decode64!(input)

  def validate(16, input), do: Base.valid_decode16?(input)
  def validate(32, input), do: Base.valid_decode32?(input)
  def validate(64, input), do: Base.valid_decode64?(input)
end

inputs = %{
  "small string" => "hello world",
  "big string" => :crypto.strong_rand_bytes(1_000_000)
}

inputs =
  for base <- [16, 32, 64],
      {input_name, input_value} <- inputs,
      into: %{} do
    encoded_value = apply(Base, :"encode#{base}", [input_value])
    {"base#{base} - #{input_name}", {base, encoded_value}}
  end

Benchee.run(
  %{
    "decode" => fn {base, input} -> Helpers.decode(base, input) end,
    "validate" => fn {base, input} -> Helpers.validate(base, input) end
  },
  memory_time: 3,
  reduction_time: 2,
  inputs: inputs,
  formatters: [
    {Benchee.Formatters.HTML, file: "bench.html"},
    Benchee.Formatters.Console
  ]
)
Results
##### With input base16 - big string #####
Name               ips        average  deviation         median         99th %
validate        286.82        3.49 ms    ±12.32%        3.62 ms        4.43 ms
decode          244.58        4.09 ms     ±3.14%        4.09 ms        4.59 ms

Comparison: 
validate        286.82
decode          244.58 - 1.17x slower +0.60 ms

Memory usage statistics:

Name        Memory usage
validate            40 B
decode             104 B - 2.60x memory usage +64 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name     Reduction count
validate          1.00 M
decode            4.25 M - 4.25x reduction count +3.25 M

**All measurements for reduction count were the same**

##### With input base16 - small string #####
Name               ips        average  deviation         median         99th %
validate       13.95 M       71.70 ns  ±7684.41%          83 ns         125 ns
decode         10.26 M       97.50 ns   ±113.52%          84 ns         208 ns

Comparison: 
validate       13.95 M
decode         10.26 M - 1.36x slower +25.80 ns

Memory usage statistics:

Name        Memory usage
validate            40 B
decode             104 B - 2.60x memory usage +64 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name     Reduction count
validate              20
decode                56 - 2.80x reduction count +36

**All measurements for reduction count were the same**

##### With input base32 - big string #####
Name               ips        average  deviation         median         99th %
validate        421.79        2.37 ms     ±8.23%        2.31 ms        3.04 ms
decode          306.89        3.26 ms    ±11.75%        3.01 ms        4.01 ms

Comparison: 
validate        421.79
decode          306.89 - 1.37x slower +0.89 ms

Memory usage statistics:

Name        Memory usage
validate           120 B
decode             184 B - 1.53x memory usage +64 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name     Reduction count
validate          3.40 M
decode            3.40 M - 1.00x reduction count +0.00000 M

**All measurements for reduction count were the same**

##### With input base32 - small string #####
Name               ips        average  deviation         median         99th %
decode          7.17 M      139.56 ns  ±4817.17%         125 ns         250 ns
validate        6.54 M      153.01 ns ±25536.65%          83 ns         167 ns

Comparison: 
decode          7.17 M
validate        6.54 M - 1.10x slower +13.46 ns

Memory usage statistics:

Name        Memory usage
decode             176 B
validate           112 B - 0.64x memory usage -64 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name     Reduction count
decode                51
validate              51 - 1.00x reduction count +0

**All measurements for reduction count were the same**

##### With input base64 - big string #####
Name               ips        average  deviation         median         99th %
validate        513.06        1.95 ms     ±2.97%        1.95 ms        2.05 ms
decode          397.90        2.51 ms     ±6.27%        2.50 ms        2.85 ms

Comparison: 
validate        513.06
decode          397.90 - 1.29x slower +0.56 ms

Memory usage statistics:

Name        Memory usage
validate           120 B
decode             184 B - 1.53x memory usage +64 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name     Reduction count
validate          2.83 M
decode            2.83 M - 1.00x reduction count +0.00000 M

**All measurements for reduction count were the same**

##### With input base64 - small string #####
Name               ips        average  deviation         median         99th %
validate        7.12 M      140.53 ns ±27766.24%          83 ns         167 ns
decode          6.28 M      159.26 ns ±10765.68%         125 ns         209 ns

Comparison: 
validate        7.12 M
decode          6.28 M - 1.13x slower +18.73 ns

Memory usage statistics:

Name        Memory usage
validate           104 B
decode             168 B - 1.62x memory usage +64 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name     Reduction count
validate              47
decode                47 - 1.00x reduction count +0

TL;DR: validating is basically always faster and uses less memory than decoding 😄

@sleepiecappy
Copy link

Hi! Sorry to just chime in, but I'm confused about the naming. Is it because we want to validate if the decode is valid before using it? Or would these functions be used standing alone to validate the bases? I take that we want a naming scheme as: valid_[subject]?, so if the decode is the subject then it makes sense, but if the subject is the base I feel it can create confusion when using them. If you could explain the rationale behind the names it would be appreciated 🙏🏻

@whatyouhide
Copy link
Member Author

@sleepiecappy sometimes you want to efficiently validate that something is valid baseN, but without actually decoding it because that causes an output binary to be created and allocated and whatnot. This set of new functions is exactly for that use case.

@whatyouhide whatyouhide requested a review from josevalim April 9, 2025 18:31
@whatyouhide whatyouhide changed the title Add Base.valid16?/2 Add Base.valid_decode{n}?/2 functions Apr 9, 2025
Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code wise it looks good to me, but let's decide on the name before merging it.

Copy link
Contributor

@sabiwara sabiwara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

whatyouhide and others added 2 commits April 10, 2025 15:06
@josevalim josevalim changed the title Add Base.valid_decode{n}?/2 functions Add Base.valid{n}?/2 functions Apr 11, 2025
@whatyouhide whatyouhide merged commit fdbc664 into main Apr 11, 2025
22 checks passed
@whatyouhide whatyouhide deleted the al/base-valid16 branch April 11, 2025 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants