Apache Beam is a layer over other data processing systems with multiple backends ("runners"). It includes its own implementation of the Nexmark benchmarks. These instructions document how to run them for four of its runners:
- Direct: This is built into Beam. This is meant for checking correctness and is not optimized for performance.
- Flink
- Spark
- Google Cloud Dataflow

Each runner supports both batch and streaming processing. In addition, Beam's Nexmark suite implements each query in as many as three forms. Every query is implemented in Beam's native representation, in terms of transforms and aggregations. Some queries are also implemented in two forms of SQL: "standard" SQL and ZetaSQL.
For a Beam overview, see https://beam.apache.org/get-started/beam-overview/. For information about the Nexmark benchmark for Beam, see https://beam.apache.org/documentation/sdks/java/testing/nexmark/.
Install the Java Development Kit. These instructions were tested with OpenJDK 21.0.2.
You can follow the instructions below to build Nexmark, or run setup.sh in this directory.
- Clone the Beam repository:

      git clone https://github.com/apache/beam.git

  If you wish to benchmark a particular version, check it out:

      (cd beam && git checkout origin/release-2.55.0)

- Apply configurable-spark-master.patch:

      (cd beam && git am < ../configurable-spark-master.patch)

There's no need to run a separate build step. Beam will build the first time you run a benchmark.
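
The manual steps above can be collected into one script. The following is a minimal sketch, not the contents of the setup.sh shipped in this directory. The `BEAM_REF` and `DRY_RUN` variables and the `run` and `setup_beam` helpers are hypothetical names introduced for illustration. By default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch of the manual setup steps; this is NOT the repository's setup.sh.
set -eu

# Assumption: BEAM_REF and DRY_RUN are illustrative variables, not part of
# the original scripts.
BEAM_REF="${BEAM_REF:-origin/release-2.55.0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 prints the commands; set DRY_RUN=0 to execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_beam() {
    # Clone the Beam repository into ./beam.
    run git clone https://github.com/apache/beam.git
    # Check out the version to benchmark.
    run git -C beam checkout "$BEAM_REF"
    # Apply the patch; with -C, the relative path is resolved from inside beam/.
    run git -C beam am ../configurable-spark-master.patch
}

setup_beam
```

Run it from the directory that contains configurable-spark-master.patch. Passing the patch file to `git am` as an argument is equivalent to feeding it on standard input as in the step above.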
Use run-nexmark.sh in the parent directory.
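
The options accepted by run-nexmark.sh are not described here. For orientation, the sketch below prints a Gradle invocation in the form given in the Beam Nexmark documentation linked above, targeting the Direct runner with the small SMOKE suite. It is an assumption that run-nexmark.sh does something comparable; the `nexmark_args` and `nexmark_direct_cmd` helpers are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: print a Gradle command that runs Beam's Nexmark suite on the
# Direct runner. This is not taken from run-nexmark.sh.
set -eu

# Build the Nexmark pipeline arguments.
# $1: "true" for streaming mode, "false" for batch mode.
nexmark_args() {
    printf '%s' "--runner=DirectRunner --streaming=$1 --suite=SMOKE" \
        " --manageResources=false --monitorJobs=true" \
        " --enforceEncodability=true --enforceImmutability=true"
}

# Print the full Gradle command line, to be run from inside the beam checkout.
nexmark_direct_cmd() {
    printf '%s\n' "./gradlew :sdks:java:testing:nexmark:run -Pnexmark.runner=:runners:direct-java -Pnexmark.args=\"$(nexmark_args "$1")\""
}

nexmark_direct_cmd false
```

The printed command is intended to be pasted into a shell inside the beam directory. Other runners need a different `-Pnexmark.runner` project and `--runner` value, as listed in the Beam documentation.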