Datagen fixes for better ingestions/higher throughput #5747

Merged

gz merged 2 commits into main from datagen-fix on Mar 4, 2026
Conversation


@gz gz commented Mar 3, 2026

Describe Manual Test Plan

I ran datagen; it now reaches higher throughput and needs many more datagen workers before buffering kicks in. That said, I have low confidence in the changes because I'm not very familiar with this code.

gz added 2 commits March 3, 2026 14:49
This fixes an issue where the datagen main thread flushed
unstaged batches, which took too long and hence starved the workers.

Another unfortunate problem is that these records were considered
buffered by the pipeline, so the stall wasn't obvious. I haven't fixed
that problem because I wouldn't know where to fix it.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz requested review from blp and ryzhyk and removed request for ryzhyk March 3, 2026 23:58
@gz gz changed the title Datagen fix Datagen fixes for better ingestions/higher throughput Mar 4, 2026
@mihaibudiu (Contributor)
Didn't you write this code?
Is it still parsing the config for every record emitted?

A contributor commented on the diff, near `impl Hasher for HashBytesRecorder`:

so the trick is to hash and stage in one single pass?

@gz (Contributor, Author) replied:

From what I can tell, the trick is to move the parsing into the worker threads... I just did what Leonid told me is the solution.
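The one-pass idea behind `HashBytesRecorder` can be illustrated with a sketch. This is not the PR's actual code; the generic wrapper below (its names and API are my assumptions) shows how a `std::hash::Hasher` can record the bytes it is fed, so each record is hashed and staged for output in a single pass rather than being traversed twice:

```rust
use std::hash::Hasher;

// Hypothetical sketch: a Hasher wrapper that captures the bytes it hashes,
// so hashing and staging happen in one pass over the record.
struct HashBytesRecorder<H: Hasher> {
    inner: H,        // the real hasher
    staged: Vec<u8>, // bytes captured while hashing, ready to flush
}

impl<H: Hasher> HashBytesRecorder<H> {
    fn new(inner: H) -> Self {
        Self { inner, staged: Vec::new() }
    }

    // Finish hashing and hand back both the hash and the staged bytes.
    fn into_parts(self) -> (u64, Vec<u8>) {
        (self.inner.finish(), self.staged)
    }
}

impl<H: Hasher> Hasher for HashBytesRecorder<H> {
    fn write(&mut self, bytes: &[u8]) {
        self.staged.extend_from_slice(bytes); // stage
        self.inner.write(bytes);              // hash
    }

    fn finish(&self) -> u64 {
        self.inner.finish()
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;
    let mut h = HashBytesRecorder::new(DefaultHasher::new());
    h.write(b"record-1");
    let (hash, staged) = h.into_parts();
    println!("hash={hash:016x}, staged {} bytes", staged.len());
}
```

Because the wrapper is itself a `Hasher`, worker threads can use it anywhere a plain hasher is expected and collect the staged bytes afterwards.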


gz commented Mar 4, 2026

Didn't you write this code?

not this piece

@ryzhyk (Contributor) left a comment:

Another unfortunate problem is that these records were considered buffered by the pipeline

The semantics of buffered are that this data has been parsed by the input connector and is ready for ingestion, so this is the intended behavior. We just don't distinguish between staged and unstaged buffers, and there's probably no reason not to implement staging in all connectors.

@mythical-fred (Collaborator) left a comment:

LGTM — moving staging and hashing into worker threads is a clean win. One thing worth verifying given your low-confidence note:

The old code used take_some to split a completion at the transaction boundary. The new code avoids splitting entirely: if total.records + flushed.records > n it pushes the whole completion back. But the total.records > 0 guard means the first completion in a transaction is always taken whole, even if it exceeds n. So with transaction_size = k and batch sizes > k, the first transaction will be oversized. For datagen/benchmarking this is almost certainly fine, just worth knowing.
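The take-whole-or-push-back behavior described above can be sketched as follows. This is a hypothetical reconstruction, not the PR's code; the names `Completion`, `take_up_to`, and the field `records` are my assumptions, chosen to mirror the comment's description:

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
struct Completion {
    records: usize,
}

// Hypothetical sketch of the batching logic described in the review comment:
// completions are taken whole (no take_some-style splitting). A completion
// that would push the running total past the budget `n` is pushed back for
// the next transaction -- except that the `total > 0` guard means the FIRST
// completion is always taken, so it alone may exceed `n`.
fn take_up_to(queue: &mut VecDeque<Completion>, n: usize) -> Vec<Completion> {
    let mut total = 0usize;
    let mut taken = Vec::new();
    while let Some(flushed) = queue.pop_front() {
        // Only a non-first completion is ever deferred.
        if total > 0 && total + flushed.records > n {
            queue.push_front(flushed); // push the whole completion back
            break;
        }
        total += flushed.records;
        taken.push(flushed);
    }
    taken
}

fn main() {
    // With n = 10 and a first batch of 25 records, the whole batch is taken:
    // the transaction is oversized, exactly the edge case noted above.
    let mut q = VecDeque::from(vec![
        Completion { records: 25 },
        Completion { records: 4 },
    ]);
    let taken = take_up_to(&mut q, 10);
    let total: usize = taken.iter().map(|c| c.records).sum();
    println!("took {} completions, {} records", taken.len(), total);
}
```

Running the example shows the first oversized completion being taken whole while the second is deferred, which is the behavior the comment flags as acceptable for datagen/benchmarking.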

@gz gz added this pull request to the merge queue Mar 4, 2026
Merged via the queue into main with commit 772ac9b Mar 4, 2026
7 checks passed
@gz gz deleted the datagen-fix branch March 4, 2026 05:09
4 participants