Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

README.md

Running Nexmark Benchmarks with Apache Beam

Apache Beam is a layer over other data processing systems with multiple backends ("runners"). It includes its own implementation of the Nexmark benchmarks. These instructions document how to run them for four of its runners:

  • Direct: This is built into Beam. This is meant for checking correctness and is not optimized for performance.

  • Flink

  • Spark

  • Google Cloud Dataflow

Each runner supports both batch and streaming processing. In addition, Beam's Nexmark suite includes an implementation of each query in as many as three forms. All of them are implemented using the Beam native representation in terms of transforms and aggregations. Some of them are also implemented in two forms of SQL: "standard" SQL and ZetaSQL.

For a Beam overview, see https://beam.apache.org/get-started/beam-overview/. For information about the Nexmark benchmark for Beam, see https://beam.apache.org/documentation/sdks/java/testing/nexmark/.

Prerequisites

Install the Java Development Kit. These instructions were tested with OpenJDK 21.0.2.

Setting up Beam

You can follow the instructions below to build Nexmark, or run setup.sh in this directory.

  1. Clone the Beam repository:

    git clone https://github.com/apache/beam.git
    

    If you wish to benchmark a particular version, check it out:

    (cd beam && git checkout origin/release-2.55.0)
    
  2. Apply configurable-spark-master.patch:

    (cd beam && git am < ../configurable-spark-master.patch)
    

There's no need to run a separate build step. Beam will build the first time you run a benchmark.

Running the benchmarks

Use run-nexmark.sh in the parent directory.