When skipping a comment, the total number of expected comments goes
down. But it should not fall below zero.
This issue was benign: in the case of wraparound, we would still read to
the end of the block as usual, and no further. The 'comments_len'
variable was only used after that to report a format error in case of a
mismatch between how many comments we read, and how many the header said
there would be. The preallocation for the comments vector has already
happened at that point, so this could not cause a large allocation.
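The fix amounts to refusing to let the counter wrap. A minimal sketch, assuming a u32 counter and a hypothetical skip_comment helper (not Claxon's actual API):

```rust
// Hypothetical sketch: decrement the expected-comment counter without
// wrapping below zero. `comments_len` mirrors the u32 count read from
// the Vorbis comment header.
fn skip_comment(comments_len: u32) -> u32 {
    // saturating_sub clamps at zero instead of wrapping to u32::MAX.
    comments_len.saturating_sub(1)
}
```

With a plain `comments_len - 1`, a count of zero would wrap around to u32::MAX in a release build.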
This issue was caught on CI by LibFuzzer.
This file was produced by modifying the double Vorbis comment test
sample in a hex editor. The second comment (7 bytes) was just long
enough to fit one length prefix (4 bytes) and one "X=Y" comment.
While this example file is artificial, I did receive a real-world file
where this happens in a bug report.
The test currently fails; the fix will be in a separate commit.
This is the same data as in the non_subset test sample. That sample has
been verified against the reference decoder, so the expected output here
must be the correct output. This is more of a test that will catch bugs
early, and pinpoint the cause of a difference, if I ever change the
high-order LPC prediction code, than a test that improves coverage -- we
already have the test sample for that.
The test sample was constructed by identifying the start and end byte of
the frame that caused the decode to fail, dd-ing that frame out, then
dd-ing out the header of the file too, up to the end of the STREAMINFO
block, and concatenating those. (I first removed all other metadata
blocks with metaflac to ensure that the last block flag is set correctly
in the STREAMINFO block.)
That results in a file that is decodable by Claxon, but it suffers from
two issues:
* The MD5 signature of the audio data in the STREAMINFO block is now
wrong, so libflac fails the decode.
* The number of samples in the STREAMINFO block is now wrong.
I fixed both of these in a hex editor, by zeroing out the MD5 signature
and setting the correct length. Even better would be to compute the
correct MD5 signature, but frankly I never reverse-engineered what it is
a signature of. The spec says "MD5 signature of the unencoded audio
data", but there is no such thing as "unencoded" audio. For example, in
Claxon you always get i32 samples, but often the bit depth will be 16
bits per sample, and the upper bits are zero. Whether you include those
zero bits matters a lot for the signature! And then still, are samples
of channels interleaved? And in what order? In what endianness? So it is
not clear to me what bytes to take the MD5 of.
I searched for an existing tool that can replace the signature with one
that it computes over the data, but I found no such thing, and flac
itself reencodes the audio data too. So for now, just zeroing the
signature will have to do.
This is a single frame extracted from a larger file that could not be
decoded before implementing support for LPC > 12. We deliberately
include only a single frame to make the original audio unrecognizable.
The header still includes the MD5 checksum of the audio data of the full
file, so when decoding with libflac, checksum verification fails. I
should still fix the checksum in the file to make the test pass.
FLAC files that are not subset-compliant may contain LPC orders greater
than 12. I had never observed such a file in the wild, so I optimized
for subset files, and this optimization relies on the upper bound. But
we can still provide a fallback for high LPC orders.
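Such a fallback boils down to a generic prediction loop that works for any order. A hedged sketch (names and signature are illustrative, not Claxon's internal API; FLAC convention puts the first coefficient on the most recent sample):

```rust
// Generic LPC prediction for arbitrary orders. Subset streams bound the
// order at 12, which allows a fixed-size scratch buffer; this loop has
// no such bound. The first `order` entries of `buffer` are warm-up
// samples; the remaining entries hold residuals on entry and decoded
// samples on exit.
fn predict_lpc(coefficients: &[i64], qlp_shift: i32, buffer: &mut [i64]) {
    let order = coefficients.len();
    for i in order..buffer.len() {
        // Dot product of the coefficients with the preceding `order`
        // samples, most recent sample first, then apply the
        // quantization shift.
        let window = &buffer[i - order..i];
        let prediction: i64 = coefficients
            .iter()
            .zip(window.iter().rev())
            .map(|(&c, &s)| c * s)
            .sum();
        buffer[i] += prediction >> qlp_shift;
    }
}
```

The subset fast path can keep its fixed-size buffers; this only needs to run for the rare high-order case, so its lower speed does not matter much.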
Temporarily disable builds that use the nightly toolchain, as it cannot
be installed at the moment due to [1]. This issue will sort itself out
soon, after which this commit can be reverted.
It's a bit of a coincidence that the Travis cron build happened to run
today, of all days; had it run tomorrow, then I probably would not have
noticed a thing.
[1]: https://github.com/rust-lang/rust/issues/57488
Previously, the buffer was allocated to the right size, but the memory
was not zeroed, because the entire buffer would be overwritten anyway,
so zeroing would be a waste. But in the presence of bugs, this can cause
uninitialized memory to be exposed.
I did some careful benchmarking, and the performance impact of zeroing
is negligible for the common case where the buffer can be recycled, and
the block size is stable. That means that only initially the buffer
needs to be resized, and subsequent blocks all reuse the buffer without
any resizing (not in capacity, nor in vec len). Because of that, zeroing
is very uncommon, so its expense is not that important, because it is a
cold code path.
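The buffer preparation can be sketched as follows (a simplified version of the idea; the real Claxon code differs in details):

```rust
// Prepare the decode buffer for a block of `new_len` samples.
// `Vec::resize` zeroes any newly exposed elements, so the buffer never
// contains uninitialized memory. When the buffer already has exactly
// the right length, both branches are effectively no-ops, so the common
// recycling path pays almost nothing.
fn ensure_buffer_len(mut buffer: Vec<i32>, new_len: usize) -> Vec<i32> {
    if buffer.len() < new_len {
        // Cold path: grow and zero-fill. This only runs on the first
        // block, or when the block size grows.
        buffer.resize(new_len, 0);
    } else {
        // Shrink without touching memory; existing elements stay intact.
        buffer.truncate(new_len);
    }
    buffer
}
```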
I changed the benchmarks to report the raw decode times per sample
(averaged over a block still), and decoded 5 distinct files each 50
times. This made for a total of 1094220 timing measurements. With R we
can do a t-test to see the impact of zeroing.
In the first case, we compare the typical case: recycling the buffer.
> t.test(nozero_recycle, zero_recycle)
Welch Two Sample t-test
data: nozero_recycle and zero_recycle
t = -2.8154, df = 2186400, p-value = 0.004872
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.006822184 -0.001222073
sample estimates:
mean of x mean of y
14.12020 14.12422
> mean(zero_recycle) / mean(nozero_recycle)
[1] 1.000285
Values are decode time per sample in nanoseconds, so on my system Claxon
can do a sample in about 14 ns. As we can see, the impact of zeroing is
slight, but statistically significant.
In the second case, we truncate the decode buffer to 10 samples before
decoding a block. This is a worst-case scenario for zeroing, because it
has to zero out practically the entire buffer, and it can't be smart
about requesting zeroed pages from the OS, because the first 10 samples
have to be preserved. Truncating the buffer should not negatively affect
the no-zeroing code, because the capacity is still sufficient.
> t.test(nozero_truncate, zero_truncate)
Welch Two Sample t-test
data: nozero_truncate and zero_truncate
t = -23.217, df = 2188400, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02758650 -0.02329142
sample estimates:
mean of x mean of y
14.10286 14.12830
> mean(zero_truncate) / mean(nozero_truncate)
[1] 1.001804
Here we can see a larger impact of about a 0.2% increase in decode time.
That's somewhat painful, given that many of the optimizations in Claxon
were wins of a similar order of magnitude. However, keep in mind that
this is for an atypical case.
As a sanity check, we should see no significant difference in decode
times between nozero_truncate and nozero_recycle, but that is not the
case: nozero_truncate is significantly faster.
> t.test(nozero_truncate, nozero_recycle)
Welch Two Sample t-test
data: nozero_truncate and nozero_recycle
t = -13.487, df = 2036300, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.01985537 -0.01481676
sample estimates:
mean of x mean of y
14.10286 14.12020
One difference is that in the truncate case, we mostly hit the code path
that does an unsafe Vec::set_len(), whereas in the recycling case, the
check for "buffer.len() < new_len" fails when the buffer already has
exactly the right size, and we call Vec::truncate() instead. Digging
into the Vec::truncate source code reveals that it is implemented as a
loop that drops the values. I have not confirmed whether this loop is
optimized away (it should be, because i32s do not need to be dropped),
but if it is not, that might explain the difference.
In any case, I think the overhead of zeroing is small enough that we can
afford it.
Thanks to Sergey "Shnatsel" Davidoff for getting this discussion
started.
To trigger the replenishing logic earlier. This improves coverage when
the sample size is limited, or conversely, it allows us to get the same
coverage with smaller samples, which speeds up fuzzing.
After the LLVM upgrade in Rust 1.25.0, rustc started generating better
code. This brings Claxon closer to libflac performance, without any
effort on my part! I re-ran the tools/benchmark_against_libflac.sh
script with a Rust 1.26.0 toolchain, and now Claxon is within 10% of
libflac performance, yay!
Somebody should try recompiling libflac with a recent Clang though; it
might undo the advantage again. Although Claxon probably benefits
more from optimizations than libflac does, because libflac is full of
manually vectorized code and low-level optimizations. Claxon's code is
far more high-level (still optimized to ensure that the compiler can
generate fast code from the high-level code though).
Since Rust 1.24.0, the number of codegen units is no longer 1, even for
release builds. This speeds up builds slightly, but it reduces the
quality of the generated code tremendously. To get decent performance
out of Claxon, we need to explicitly pass -C codegen-units=1 as
RUSTFLAGS. Update the benchmarks to do so, and add a note in the readme.
Thanks to est31 for pointing this out.
See also https://github.com/rust-lang/rust/issues/53833.
This happening by chance on a corrupted file is very unlikely due to
flac's checksums. It could be exploited by a maliciously crafted input,
but decoding valid flac files worked properly.
Many thanks to Sergey "Shnatsel" Davidoff for providing some of these
(the ones with a valid checksum). The smaller ones were found by the
fuzzer that I just added, but they don't have valid checksums, so they
don't cause the test to fail.
This is to reproduce the issue where the contents of the buffer affect
the decoding outcome. By decoding twice with different initial buffers,
we can discover dependencies on the buffer contents, which should not
exist.
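The check can be sketched like this, with a hypothetical decode_block standing in for the real decoder (the actual fuzz target is wired up differently):

```rust
// Decode the same input into two buffers with different initial
// contents, and verify that the result does not depend on what was in
// the buffer beforehand. A decoder that overwrites the entire buffer
// passes; a decoder that leaves stale contents in place fails.
fn decode_is_buffer_independent<F>(decode_block: F, input: &[u8], len: usize) -> bool
where
    F: Fn(&[u8], Vec<i32>) -> Vec<i32>,
{
    let out_a = decode_block(input, vec![0; len]);
    let out_b = decode_block(input, vec![-1; len]);
    out_a == out_b
}
```

Using all-zeros and all-ones bit patterns makes any leak of the initial contents show up as a mismatch.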
It is very unsafe, because it expects the caller to agree to overwrite
all bytes. This makes that a bit clearer. Just doing this does not
improve safety though.
A partition order could occur such that the block size was not a
multiple of 2^order. The computation of the number of samples per
partition did not account for this case and rounded down due to the bit
shift. This meant that we would not fill the entire decode buffer.
Claxon does not zero the decode buffer because it is (should be)
overwritten anyway, and in the case of a format error, where the buffer
might be only partially full, the buffer is not exposed again.
Furthermore, the way decoding works in most places is that we fill the
entire buffer by looping until it is full. If the input bitstream does
not contain enough data to fill the buffer, that is a format error.
In a few places though, we need to slice up the buffer before decoding
into it: for decoding individual channels, and also for decoding
residuals, which are split into partitions.
This particular format error was especially nasty because it did not
cause a format error down the line. Instead, it caused the buffer to be
sliced in a way where the slices together did not cover the entire
buffer, and so parts of uninitialized memory could remain in the buffer.
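The essence of the fix is a divisibility check before the shift. A minimal sketch with hypothetical names: the block is split into 2^order partitions, so the block size must be an exact multiple of 2^order.

```rust
// Compute the number of samples per residual partition, rejecting
// block sizes that are not an exact multiple of the partition count.
// Without the check, the bit shift would round down, and the partition
// slices together would not cover the entire decode buffer.
fn samples_per_partition(block_size: u32, partition_order: u32) -> Option<u32> {
    let n_partitions = 1u32 << partition_order;
    if block_size % n_partitions != 0 {
        // Not an exact multiple: treat this as a format error instead
        // of silently leaving part of the buffer unwritten.
        return None;
    }
    Some(block_size >> partition_order)
}
```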
Thanks a lot to Sergey "Shnatsel" Davidoff for reporting this bug,
together with elaborate steps to reproduce that allowed me to pinpoint
the cause quickly.