feat: refactor ConnectedComponents API #803

Open

SemyonSinchenko wants to merge 28 commits into graphframes:main from SemyonSinchenko:775-refactor-cc-api

Conversation

@SemyonSinchenko
Collaborator

What changes were proposed in this pull request?

  • Refactor the API to simplify switching between algorithms
  • Expose the RC (randomized contraction) algorithm
  • Rename the "graphframes" algorithm to "two_phase"
  • A new attempt at adding an AQE-friendly TP (two-phase) implementation

Why are the changes needed?

Close #775
Close #759
Close #758

@SemyonSinchenko
Collaborator Author

On my benchmarks (wiki-Talk, ~2M vertices / ~5M edges):

  • TP + skewedJoin =~ 130 seconds
  • TP + AQE =~ 20 seconds
  • RC =~ 30 seconds

@SemyonSinchenko
Collaborator Author

@james-willis it is very important to review this one carefully :) I tried to keep the existing behavior as is and did not change any default behavior.

cc: @sonalgoyal you may be interested in skipping prepare
cc: @greg-kennedy on the local small-sized benchmark, TP outperforms RC. I am not sure whether that is a problem with my implementation or whether TP is simply a better fit for the Apache Spark model.

@russelljurney-upside
Contributor

@claude review please

@russelljurney-upside
Contributor

@cursor review please, find bugs

@russelljurney-upside
Contributor

Via @claude:

Overview

This PR refactors the Connected Components implementation by:

  1. Extracting the two-phase algorithm into a new TwoPhase.scala object
  2. Adding an AQE-friendly code path (runAQE) that avoids disabling AQE globally
  3. Renaming the "graphframes" algorithm to "two_phase" (with deprecation alias)
  4. Exposing isGraphPrepared to let advanced users skip internal graph preparation
  5. Adding checkpointing support to RandomizedContraction

Critical Issues

  1. Global AQE config mutation is not thread-safe (TwoPhase.scala:189-192)

run() reads spark.sql.adaptive.enabled, sets it to "false", then restores in finally. Concurrent callers on the same SparkSession will race on this config. The runAQE path avoids this — good — but run() is
still the default for broadcastThreshold != -1.
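The hazard can be illustrated without Spark. The sketch below is a plain-Python stand-in (the real code is Scala; `conf` here is a hypothetical dict, not the SparkSession API) showing a deterministic interleaving of two callers running the read/set/restore pattern: the second caller reads the temporary "false" as its "original" value and restores it last, leaving AQE disabled.

```python
# Stand-in for a shared SparkSession conf (illustrative, not the real API).
conf = {"spark.sql.adaptive.enabled": "true"}

def save_and_disable():
    """The read / set-to-"false" pattern used by run()."""
    original = conf["spark.sql.adaptive.enabled"]
    conf["spark.sql.adaptive.enabled"] = "false"
    return original

def restore(saved):
    """The restore-in-finally half of the pattern."""
    conf["spark.sql.adaptive.enabled"] = saved

def race_demo():
    """A deterministic interleaving of two concurrent callers A and B."""
    saved_a = save_and_disable()  # A reads "true", disables AQE
    saved_b = save_and_disable()  # B reads the temporary "false" as its "original"
    restore(saved_a)              # A restores "true"
    restore(saved_b)              # B restores "false": AQE stays disabled
    return conf["spark.sql.adaptive.enabled"]
```

Even with try/finally on each caller, the final state depends on the interleaving, which is why a path that never mutates the shared session conf (as runAQE does) is the safer design.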

  2. Heavy code duplication between run and runAQE

These two methods are ~80-90% identical (~120 lines each). Differences are only join strategy (skewed vs. plain) and checkpoint strategy. Any bug fix must be applied in two places. Should extract shared
iteration logic.
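The suggested extraction can be sketched as follows (a Python toy with hypothetical names; the real loop is Scala operating on DataFrames): the shared fixed-point loop is written once and parameterized by the only parts that differ between the two variants, the join step and the checkpoint strategy.

```python
def iterate_to_fixpoint(init, join_step, checkpoint_step, converged, max_iter):
    """Shared iteration loop, written once; per-variant behavior is injected.

    join_step:       skewed join in run(), plain join in runAQE()
    checkpoint_step: per-variant checkpoint/persist strategy
    """
    curr = init
    for _ in range(max_iter):
        nxt = checkpoint_step(join_step(curr))
        if converged(curr, nxt):
            return nxt
        curr = nxt
    return curr

def min_label_step(labels):
    """Toy join step: every vertex adopts the global minimum label."""
    m = min(labels)
    return [m for _ in labels]
```

With this shape, a bug fix in the loop body applies to both variants automatically; only the injected steps remain variant-specific.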

Medium Issues

  1. Convergence via sum comparison is theoretically fragile (TwoPhase.scala:293, 413)

currSum == prevSum on sum(MIN_NBR) cast to DecimalType(38,0) could theoretically produce false convergence if two different label assignments produce the same sum. Extremely unlikely in practice but worth
documenting.
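A concrete collision is easy to construct (toy integer labels for illustration, not the real column values): two different label assignments over the same four vertices whose sums agree, so a sum-only check cannot distinguish them.

```python
# Two different vertex -> label assignments with identical label sums.
assignment_a = {1: 1, 2: 1, 3: 3, 4: 3}  # components {1,2} and {3,4}; sum = 8
assignment_b = {1: 1, 2: 2, 3: 2, 4: 3}  # a different labeling;       sum = 8

def label_sum(assignment):
    """Stand-in for sum(MIN_NBR) used as the convergence signal."""
    return sum(assignment.values())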

  2. isGraphPrepared silently ignored for GraphX (ConnectedComponents.scala:140-144)

If a user sets isGraphPrepared = true and uses ALGO_GRAPHX, the flag has no effect and no warning is issued. Should warn or throw.

  3. setIsGraphPrepared warns even when setting to false (ConnectedComponents.scala:127-133)

The warning should be conditional on value == true.
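The suggested fix amounts to gating the log call on the flag value. A minimal Python sketch with hypothetical names (the real setter is Scala and logs through the Spark logger):

```python
class CCSettings:
    """Hypothetical settings object; logged_warnings stands in for logWarning."""

    def __init__(self):
        self.graph_prepared = False
        self.logged_warnings = []

    def set_is_graph_prepared(self, value):
        # The suggested fix: warn only when preparation is actually skipped.
        if value:
            self.logged_warnings.append(
                "isGraphPrepared=true: internal graph preparation will be "
                "skipped; results are undefined for an unprepared graph")
        self.graph_prepared = value
        return self
```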

  4. Benchmark still uses deprecated "graphframes" name

ConnectedComponentsBenchmark.scala:23 has @param(Array("graphframes", "graphx")) — should use "two_phase" and consider adding "randomized_contraction".

  5. Missing test coverage
  • isGraphPrepared = true is never tested anywhere
  • randomized_contraction is never tested through the ConnectedComponents routing logic (only via direct RandomizedContraction.run())
  • No test for deprecated "graphframes" → "two_phase" normalization
  • No LDBC integration test for AQE mode (broadcastThreshold = -1)

Low Issues

  1. Naming conventions
  • persisted_df (TwoPhase.scala:299, 419) uses snake_case — should be persistedDf
  • ee/vv are very terse for 100+ line methods
  • minNbrs1/minNbrs2 are opaque
  2. System.gc() call (TwoPhase.scala:279)

Explicit GC is a JVM anti-pattern — it's only a hint and can cause stop-the-world pauses on the driver.

  3. ALGO_GRAPHFRAMES companion object val lacks @deprecated annotation (ConnectedComponents.scala:189)

Only has a Scaladoc tag, which doesn't produce compiler warnings at call sites.

  4. Documentation typos
  • is_direted → is_directed (05-traversals.md:66)
  • cluste → cluster (line 371)
  5. Persist after localCheckpoint may be redundant (TwoPhase.scala:268+283)

localCheckpoint(eager=true) already materializes data; subsequent .persist(intermediateStorageLevel) on the same DataFrame applies a different storage level to already-checkpointed data.

  6. Checkpoint cleanup is incomplete

Only the previous interval's checkpoint is cleaned up — the final checkpoint is never removed.

@SemyonSinchenko
Collaborator Author

1. Global AQE config mutation is not thread-safe (TwoPhase.scala:189-192)

run() reads spark.sql.adaptive.enabled, sets it to "false", then restores in finally. Concurrent callers on the same SparkSession will race on this config. The runAQE path avoids this — good — but run() is still the default for broadcastThreshold != -1.

That is existing code; the problem is out of scope.

2. Heavy code duplication between run and runAQE

These two methods are ~80-90% identical (~120 lines each). Differences are only join strategy (skewed vs. plain) and checkpoint strategy. Any bug fix must be applied in two places. Should extract shared iteration logic.

That was intentional. The goal is to guarantee the existing behavior; last time we were forced to roll back.

3. Convergence via sum comparison is theoretically fragile (TwoPhase.scala:293, 413)

currSum == prevSum on sum(MIN_NBR) cast to DecimalType(38,0) could theoretically produce false convergence if two different label assignments produce the same sum. Extremely unlikely in practice but worth documenting.

That is existing code; the problem is out of scope.

4. isGraphPrepared silently ignored for GraphX (ConnectedComponents.scala:140-144)

If a user sets isGraphPrepared = true and uses ALGO_GRAPHX, the flag has no effect and no warning is issued. Should warn or throw.

That is intentional. Anyone who uses isGraphPrepared knows what they are doing.

5. setIsGraphPrepared warns even when setting to false (ConnectedComponents.scala:127-133)

The warning should be conditional on value == true.

I see zero reason to do it; the comment lacks the context of the method.

6. Benchmark still uses deprecated "graphframes" name

ConnectedComponentsBenchmark.scala:23 has @param(Array("graphframes", "graphx")) — should use "two_phase" and consider adding "randomized_contraction".

That is existing code; the problem is out of scope.

7. Missing test coverage

* isGraphPrepared = true is never tested anywhere

I see zero reason to do it; the comment lacks the context of the method.

* randomized_contraction is never tested through the ConnectedComponents routing logic (only via direct RandomizedContraction.run())

* No test for deprecated "graphframes" → "two_phase" normalization

* No LDBC integration test for AQE mode (broadcastThreshold = -1)

Test coverage is fine.

8. Naming conventions


* persisted_df (TwoPhase.scala:299, 419) uses snake_case — should be persistedDf

* ee/vv are very terse for 100+ line methods

* minNbrs1/minNbrs2 are opaque

That is the only reasonable comment; I will update persisted_df -> persistedDf.

9. System.gc() call (TwoPhase.scala:279)

Explicit GC is a JVM anti-pattern — it's only a hint and can cause stop-the-world pauses on the driver.

Too generic. System.gc() is standard practice in the Spark world.

10. ALGO_GRAPHFRAMES companion object val lacks @deprecated annotation (ConnectedComponents.scala:189)

Only has a Scaladoc tag, which doesn't produce compiler warnings at call sites.

scalafix says otherwise (there was a warning "use scala annotations instead of java annotations")

11. Documentation typos


* is_direted → is_directed (05-traversals.md:66)

* cluste → cluster (line 371)

Out of scope.

12. Persist after localCheckpoint may be redundant (TwoPhase.scala:268+283)

localCheckpoint(eager=true) already materializes data; subsequent .persist(intermediateStorageLevel) on the same DataFrame applies a different storage level to already-checkpointed data.

That is fine, just check the logic.

13. Checkpoint cleanup is incomplete

Only the previous interval's checkpoint is cleaned up — the final checkpoint is never removed.

Not 100% sure what this is about.

@greg-kennedy

  1. Checkpoint cleanup is incomplete
    Only the previous interval's checkpoint is cleaned up — the final checkpoint is never removed.

This is a difficult pattern in Spark in all cases... if you unpersist the final checkpoint before returning, then it's been deleted before the caller can make use of it :)
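One common way out of this bind, sketched below in Python with hypothetical names (the real code is Scala, and the "store" stands in for files under the checkpoint directory): return the result together with a cleanup handle, so deleting the final checkpoint becomes an explicit caller decision rather than something the algorithm does before the caller has consumed the result.

```python
class FakeCheckpointStore:
    """Stand-in for checkpoint files on disk."""

    def __init__(self):
        self.live_checkpoints = 0

def run_connected_components(store):
    """Hypothetical shape: return (result, cleanup). The final checkpoint
    outlives the call until the caller explicitly releases it."""
    store.live_checkpoints += 1  # final checkpoint materialized here
    components = [0, 0, 1]

    def cleanup():
        # Caller invokes this once it has materialized or copied the result.
        store.live_checkpoints -= 1

    return components, cleanup
```

The alternative is for the algorithm to eagerly collect or re-persist the result before removing its last checkpoint, trading an extra materialization for automatic cleanup.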

@russelljurney-upside
Contributor

Sorry if Claude wasn't helpful, I'll direct it to only review changes native to the PR next time.
