Conversation
If the figures from #60399 still apply for ArrayBuffer-backed buffers, is there scope to add

Doing all of these type checks also adds overhead. Furthermore, the benchmark from that PR applies to concat, so it's unclear what its actual effect on
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #62032 +/- ##
==========================================
+ Coverage 89.65% 89.69% +0.03%
==========================================
Files 676 676
Lines 206231 206236 +5
Branches 39505 39759 +254
==========================================
+ Hits 184898 184981 +83
+ Misses 13463 13404 -59
+ Partials 7870 7851 -19
Renegade334
left a comment
Think it's worth mentioning in the commit message that this is essentially a reversion of 24bebd0, sans tests.
This fixes a performance regression for Buffer.copy(target, 0) and brings it back in line with Buffer.write.
V8 has a massive TypedArray.prototype.set penalty on SharedArrayBuffer
Buffer.set and Buffer.copy are up to 8.4x slower when writing to a SharedArrayBuffer vs a regular ArrayBuffer, while Buffer.write (string encoding) is completely unaffected.
256 bytes, varying offset (Apple M3 Pro, Node 25.6.1):
              ArrayBuffer  SharedArrayBuffer  Slowdown
Buffer.set    13.6 ns      56.1 ns            4.1x
Buffer.copy   17.0 ns      65.1 ns            3.8x
Buffer.write  75.8 ns      74.1 ns            1.0x (unaffected)
4096 bytes, varying offset:
              ArrayBuffer  SharedArrayBuffer  Slowdown
Buffer.set    80.3 ns      674.2 ns           8.4x
Buffer.copy   78.4 ns      677.7 ns           8.6x
Buffer.write  190.6 ns     186.1 ns           1.0x (unaffected)
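For context, the SAB-backed case in the tables above arises whenever a Buffer view is created over a SharedArrayBuffer rather than a plain ArrayBuffer. A minimal sketch (the fast/penalized annotations summarize this thread's findings, not documented behavior):

```javascript
// Minimal sketch of the two backing stores compared above. A Buffer is a
// Uint8Array subclass, so .buffer exposes whatever backing store it wraps.
const abDst = Buffer.from(new ArrayBuffer(256));        // plain backing store
const sabDst = Buffer.from(new SharedArrayBuffer(256)); // shared backing store

console.log(abDst.buffer instanceof ArrayBuffer);        // true
console.log(sabDst.buffer instanceof SharedArrayBuffer); // true

// The identical call takes different paths depending on the target's store:
const src = Buffer.alloc(256, 0x42);
src.copy(abDst);  // fast path per the numbers above
src.copy(sabDst); // penalized path per the numbers above
```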
Failed to start CI
⚠ Commits were pushed since the last approving review:
  - buffer: always use _copy for copy
✘ Refusing to run CI on potentially unsafe PR
https://github.com/nodejs/node/actions/runs/22516891493
I am seeing a significant regression in non-partial copy performance:
```text
node % ./node benchmark/buffers/buffer-copy.js
buffers/buffer-copy.js n=6000000 partial="true" bytes=8: 37,745,651.53108819
buffers/buffer-copy.js n=6000000 partial="false" bytes=8: 72,230,886.31703596
buffers/buffer-copy.js n=6000000 partial="true" bytes=128: 37,269,250.926878504
buffers/buffer-copy.js n=6000000 partial="false" bytes=128: 70,233,110.24800341
buffers/buffer-copy.js n=6000000 partial="true" bytes=1024: 26,739,253.757964805
buffers/buffer-copy.js n=6000000 partial="false" bytes=1024: 37,699,305.6259145

node % ./node-ronag benchmark/buffers/buffer-copy.js
buffers/buffer-copy.js n=6000000 partial="true" bytes=8: 36,820,630.5993248
buffers/buffer-copy.js n=6000000 partial="false" bytes=8: 37,401,995.084754474
buffers/buffer-copy.js n=6000000 partial="true" bytes=128: 37,360,956.762793645
buffers/buffer-copy.js n=6000000 partial="false" bytes=128: 35,075,651.428067446
buffers/buffer-copy.js n=6000000 partial="true" bytes=1024: 26,155,344.217952482
buffers/buffer-copy.js n=6000000 partial="false" bytes=1024: 22,480,048.984626204
```

Those checks are necessary to hit the V8 .set optimization; it is faster than our native copy. I think we should instead make those checks more strict if you found a case that's slower, as @Renegade334 suggested.

You are right that .concat no longer hits that path, but copy still does.
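To make the trade-off concrete, here is a hypothetical sketch of the kind of dispatch being debated. This is not the actual lib/buffer.js code; `fallbackCopy` merely stands in for the internal _copy binding:

```javascript
// Hypothetical stand-in for the internal _copy binding (illustration only).
function fallbackCopy(source, target, targetStart, sourceStart, sourceEnd) {
  target.set(source.subarray(sourceStart, sourceEnd), targetStart);
  return sourceEnd - sourceStart;
}

// Sketch of the dispatch under discussion: the extra checks are what buy
// entry into the TypedArray.prototype.set fast path for full copies.
function copyImpl(source, target, targetStart, sourceStart, sourceEnd) {
  if (sourceStart === 0 && sourceEnd === source.length &&
      targetStart + source.length <= target.length) {
    target.set(source, targetStart); // full-copy fast path
    return source.length;
  }
  return fallbackCopy(source, target, targetStart, sourceStart, sourceEnd);
}

const target = Buffer.alloc(4);
console.log(copyImpl(Buffer.from([1, 2, 3]), target, 0, 0, 3)); // 3 (fast path)
console.log(copyImpl(Buffer.from([1, 2, 3]), target, 1, 1, 3)); // 2 (fallback)
```

The checks themselves cost a few comparisons per call, which is the overhead mentioned earlier in the thread.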
Cc @ChALkeR
Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1805/ (Results)
Let's hold off until we figure out what's going on.

@ronag Which benchmark is that in the PR description? If it's not in tree, could you add it (or modify buffer-copy.js) to test behavior on:
All of those are important and could give different results. This should be tested with benchmark code. Also, the bench should have 64/128 bytes in its sizes (among other lower/higher ones), as there is a sharp difference there in how the data is stored in some cases.
Also, I'm curious why V8 is slower on SAB there.

Because the JavaScript memory model is super strict.
ChatGPT:
So from what I understand, on ARM the implementation is not allowed to use SIMD since it would invalidate the "no-tear" requirement of the memory model, while on x86 this is not a problem. I think we should maybe document that Buffer.copy (and in general all buffer operations that we implement) does not strictly follow the ECMAScript memory model guarantees. @mcollina
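A small illustration of why tearing matters for SAB-backed buffers (a sketch; a second typed view stands in here for a concurrent thread):

```javascript
// A Buffer over a SharedArrayBuffer is just one view on memory that other
// agents (threads) can observe concurrently. The ECMAScript memory model
// therefore constrains how writes into it may be performed ("no tearing"
// of individual elements), which is what rules out some SIMD copy paths
// as discussed above.
const sab = new SharedArrayBuffer(8);
const asBuffer = Buffer.from(sab);
const asInts = new Int32Array(sab);

asBuffer.fill(0xff, 0, 4);            // byte-wise write via the Buffer view
console.log(asInts[0]);               // -1: visible through the Int32Array view
console.log(asBuffer.buffer === sab); // true: same memory, no copy
```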
AFAIA it's that V8 uses a libatomic-based memmove implementation if the source or target is SAB-backed.
Doc: https://nodejs.org/api/buffer.html#bufcopytarget-targetstart-sourcestart-sourceend

> Copies data from a region of buf to a region in target, even if the target memory region overlaps with buf.

It should be using memmove, not memcpy, then. Those are different under some libs/optimizations.
E.g.: I have not verified that the issue exists yet, but given the overall lack of bounds checks in _copy it needs a recheck.
Upd: No,
This is hallucinated,
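The overlap guarantee quoted from the docs can be exercised directly; this is the memmove-style behavior that a plain forward memcpy would violate:

```javascript
// buf.copy is documented to behave correctly even when the source and
// target regions overlap within the same buffer (memmove semantics).
const buf = Buffer.from('abcde');
buf.copy(buf, 1, 0, 3); // copy bytes [0, 3) onto [1, 4): regions overlap
console.log(buf.toString()); // 'aabce'; a naive forward byte-by-byte copy
                             // would instead propagate the first byte
```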
How often is Buffer.from(SAB) used, and where?
When it's used, it's super important: it's the only way to efficiently use workers in Node. Also, implementing that guarantee needs us to update most buffer operations, which will make things harder to maintain, e.g. rewriting
And TBH it doesn't make any difference for this PR, since sometimes the implementation uses set and sometimes _copy, which makes the whole situation even more confusing.
It's talking about typedarray.set, not _copy.
@ChALkeR I do see you have good intentions here, and you are probably right. But this is becoming too much work for what I feel is a rather straightforward change. We run x86 in production, which is not affected by this, so I can't justify spending much more time on this. Please feel free to take over this PR or open a different one. I'm happy to continue discussing, but getting this over the finish line seems like too big a task for me at the moment. IMHO I would recommend landing this as a revert of a change that caused a perf regression, and then starting over with two different PRs for the memory model and the performance topics (set is faster in certain scenarios).
I think @jasnell might have some insightful opinions regarding the difference between the memory models of
Yes, it's talking about typedarray.set vs memcpy, and the bottom part of that is not relevant, as this PR is really typedarray.set vs _copy, which is memmove (not memcpy), so the memcpy overlapping concerns (the cited part) do not apply to this PR. So that's not the answer.
Perf regression under which usecase?
Also, it's not demonstrated.
If this is about
There might be more non-literal matches, but this hints at low usage anyway. What exactly is affected that justifies regressing
Note: this perf question is orthogonal to the concern that using
I don't think we should land perf-only PRs without benchmarks that demonstrate the improvement
So far, existing benchmarks demonstrate only slowdown
Even if this is landed (in the current form), it is likely to be flipped back in another attempt to optimize under the guidance of maintained benchmarks
Separately, when benchmarks are introduced, we should consider whether that's a usecase that is likely to happen and under what conditions (and whether it is worth a slowdown on other common usecases). I'm not saying that this doesn't have usage, but so far it's unclear from the current PR.
```js
import { bench, run, group, summary } from 'mitata'

const SIZES = [64, 256, 1024, 4096, 16_384, 65_536, 256_000, 1024_000]

for (const size of SIZES) {
  const src = Buffer.alloc(size, 0x42)
  const abDst = Buffer.from(new ArrayBuffer(size))
  const sabDst = Buffer.from(new SharedArrayBuffer(size))
  summary(() => {
    group(`Buffer.copy ${size} bytes`, () => {
      bench(`ArrayBuffer (${size})`, () => {
        src.copy(abDst)
      })
      bench(`SharedArrayBuffer (${size})`, () => {
        src.copy(sabDst)
      })
    })
  })
}

await run()
```
@RafaelGSS maybe you can assist with measuring + benchmarks on this, to possibly unblock the PR?
Thanks for the benchmark! I think it was a mistake to compare only relative SAB/AB perf, not absolute perf with/without the patch.

According to that bench (modified to not use a dep), comparing 24.11.0 to 24.12.0 (which introduced that change in the first place), I expect that this PR is a slowdown for both SAB and ArrayBuffer on sizes <= 256 (or slightly larger). It makes AB and SAB values approximately equal there, so the coefficient is brought to ~1x, but it just makes them equally slow.

For sizes 1k-16k, according to the provided benchmark, this significantly slows down regular Buffer but significantly improves SAB. For sizes >= 65k this has almost no effect on regular Buffer but significantly improves SAB.

This hints at a better fix: not a plain revert (which makes AB worse, and even SAB worse on small buffers), not SAB detection alone, but a mix of SAB + size detection.

24.12.0 (with #60399):
24.11.0 (likely where this revert would bring us):

That said, I have not tested applying this to main yet.
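A sketch of what that "SAB + size detection" mix could look like. The cutoff is illustrative, eyeballed from the numbers in this thread, and would need benchmark-driven tuning; `shouldUseNativeCopy` is a hypothetical helper, not Node internals:

```javascript
// Hypothetical predicate: keep the TypedArray.prototype.set fast path for
// plain ArrayBuffer-backed targets and for small copies, and route larger
// SAB-backed copies to the native _copy path, where SAB shows big wins.
const SAB_CUTOFF = 256; // illustrative threshold, not tuned

function shouldUseNativeCopy(target, length) {
  return length > SAB_CUTOFF && target.buffer instanceof SharedArrayBuffer;
}

console.log(shouldUseNativeCopy(Buffer.from(new SharedArrayBuffer(4096)), 4096)); // true
console.log(shouldUseNativeCopy(Buffer.alloc(4096), 4096));                       // false
console.log(shouldUseNativeCopy(Buffer.from(new SharedArrayBuffer(64)), 64));     // false
```

In-tree code would likely use `util.types.isSharedArrayBuffer` rather than `instanceof`, which can fail across realms.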
I re-tested this PR and compared it to the base branch without it.

In short: the PR description mentions sizes 256 and 4096 with a relative "slowdown" on SAB compared to AB, and claims to improve that relative slowdown. What it does instead is make both AB and SAB slower at size=256 (down to an identical value, though). It does have benefits at higher lengths, but only for Buffer.from(new SharedArrayBuffer( targets. And it also still hurts regular Buffer instances up to ~16k, and Buffer.from(new SharedArrayBuffer( is too rare to justify a noticeable slowdown on regular Buffer.

This needs not just benchmarks, but the implementation fixed to not cause a perf regression.
Base:

```text
ArrayBuffer (64) x 54,529,149 ops/sec @ 18ns/op (0ns..305μs)
SharedArrayBuffer (64) x 53,716,573 ops/sec @ 18ns/op (0ns..38μs)
ArrayBuffer (256) x 52,328,394 ops/sec @ 19ns/op (0ns..41μs)
SharedArrayBuffer (256) x 41,411,669 ops/sec @ 24ns/op (0ns..42μs)
ArrayBuffer (1024) x 38,529,150 ops/sec @ 25ns/op (0ns..36μs)
SharedArrayBuffer (1024) x 18,232,579 ops/sec @ 54ns/op (0ns..41μs)
ArrayBuffer (4096) x 16,600,224 ops/sec @ 60ns/op (0ns..40μs)
SharedArrayBuffer (4096) x 6,146,755 ops/sec @ 162ns/op (42ns..59μs)
ArrayBuffer (16384) x 3,042,573 ops/sec @ 328ns/op (83ns..72μs)
SharedArrayBuffer (16384) x 1,741,612 ops/sec @ 574ns/op (458ns..45μs)
ArrayBuffer (65536) x 1,053,112 ops/sec @ 949ns/op (666ns..70μs)
SharedArrayBuffer (65536) x 451,151 ops/sec @ 2μs/op (2μs..60μs)
ArrayBuffer (256000) x 251,257 ops/sec @ 3μs/op (3μs..10ms)
SharedArrayBuffer (256000) x 116,322 ops/sec @ 8μs/op (8μs..80μs)
ArrayBuffer (1024000) x 66,171 ops/sec @ 15μs/op (14μs..206μs)
SharedArrayBuffer (1024000) x 29,225 ops/sec @ 34μs/op (33μs..135μs)
```
This PR:

```text
ArrayBuffer (64) x 40,335,525 ops/sec @ 24ns/op (0ns..46μs)
SharedArrayBuffer (64) x 39,725,116 ops/sec @ 25ns/op (0ns..80μs)
ArrayBuffer (256) x 35,785,083 ops/sec @ 27ns/op (0ns..45μs)
SharedArrayBuffer (256) x 35,819,652 ops/sec @ 27ns/op (0ns..39μs)
ArrayBuffer (1024) x 27,800,955 ops/sec @ 35ns/op (0ns..41μs)
SharedArrayBuffer (1024) x 27,316,682 ops/sec @ 36ns/op (0ns..57μs)
ArrayBuffer (4096) x 14,335,737 ops/sec @ 69ns/op (0ns..35μs)
SharedArrayBuffer (4096) x 14,335,732 ops/sec @ 69ns/op (0ns..84μs)
ArrayBuffer (16384) x 3,025,528 ops/sec @ 330ns/op (84ns..43μs)
SharedArrayBuffer (16384) x 2,995,333 ops/sec @ 333ns/op (125ns..36μs)
ArrayBuffer (65536) x 1,053,536 ops/sec @ 949ns/op (708ns..23μs)
SharedArrayBuffer (65536) x 1,052,752 ops/sec @ 949ns/op (708ns..67μs)
ArrayBuffer (256000) x 265,626 ops/sec @ 3μs/op (3μs..62μs)
SharedArrayBuffer (256000) x 266,723 ops/sec @ 3μs/op (3μs..55μs)
ArrayBuffer (1024000) x 66,146 ops/sec @ 15μs/op (14μs..170μs)
SharedArrayBuffer (1024000) x 66,000 ops/sec @ 15μs/op (14μs..102μs)
```
If someone wants to do that, go ahead. I still think this PR is better than the current state of things: with it, both have acceptable performance, while currently AB has great performance and SAB has horrible performance. If you look at the numbers, the AB regression you mention is 6 ns (?) while the SAB regression I want to fix is 1000+ ns (?). Also, we should document that difference in the memory model for Buffer vs TypedArray; if no one else feels comfortable doing that, I will try to make time for it at some point.
Digging into
cc @nodejs/v8 |