feat: refactor ConnectedComponents API #803
SemyonSinchenko wants to merge 28 commits into graphframes:main
Conversation
On my benchmarks (wiki-Talk, 2M vertices / 5M edges):
@james-willis it is very important to review this one carefully :) I tried to keep the existing behavior as is and did not change any default behavior. cc: @sonalgoyal you may be interested in skipping prepare
@claude review please

@cursor review please, find bugs
Via @claude:

Overview

This PR refactors the Connected Components implementation by:
Critical Issues
run() reads spark.sql.adaptive.enabled, sets it to "false", then restores it in a finally block. Concurrent callers on the same SparkSession will race on this config. The runAQE path avoids this, which is good, but run() is not thread-safe.
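The race described above can be reproduced deterministically with a pure-Python stand-in for the session config. Note that `spark_conf`, `begin_run`, and `end_run` are illustrative names, not the real Spark API; the point is only the save/set/restore interleaving:

```python
# Pure-Python sketch of the save/set/restore race on a shared session config.
# "spark_conf" is an illustrative stand-in for SparkSession conf, not real Spark code.
spark_conf = {"spark.sql.adaptive.enabled": "true"}

def begin_run(saved):
    # Each caller saves the current value on entry, then disables AQE.
    saved.append(spark_conf["spark.sql.adaptive.enabled"])
    spark_conf["spark.sql.adaptive.enabled"] = "false"

def end_run(saved):
    # ...and restores whatever it saw on entry (the `finally` block).
    spark_conf["spark.sql.adaptive.enabled"] = saved.pop()

# One possible interleaving of two concurrent callers A and B on the same session:
a_saved, b_saved = [], []
begin_run(a_saved)   # A saves "true", sets "false"
begin_run(b_saved)   # B saves "false" (A's temporary value!), sets "false"
end_run(a_saved)     # A restores "true"
end_run(b_saved)     # B restores "false" -- B's "saved" value was never the original

print(spark_conf["spark.sql.adaptive.enabled"])  # "false": AQE stays disabled
```

After both callers finish, the session is left with AQE permanently disabled, even though it was enabled before either call started.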
These two methods are ~80-90% identical (~120 lines each). The only differences are the join strategy (skewed vs. plain) and the checkpoint strategy. Any bug fix must be applied in two places; the shared logic should be extracted.
Medium Issues
currSum == prevSum on sum(MIN_NBR) cast to DecimalType(38,0) could theoretically produce false convergence if two different label assignments produce the same sum. Extremely unlikely in practice, but worth noting.
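A tiny sketch of the collision this comment describes, using hypothetical label maps (the names `prev_labels`/`curr_labels` are illustrative, not taken from the PR):

```python
# Two different min-neighbor label assignments whose column sums collide,
# so a sum-equality convergence check would falsely report "no change".
prev_labels = {"a": 1, "b": 4, "c": 7}   # sum = 12
curr_labels = {"a": 2, "b": 3, "c": 7}   # sum = 12, but labels for a and b changed

prev_sum = sum(prev_labels.values())
curr_sum = sum(curr_labels.values())

print(prev_sum == curr_sum)        # True  -> the check would say "converged"
print(prev_labels == curr_labels)  # False -> labels are actually still changing
```

Comparing a count of changed rows (or a hash of the assignment) instead of a plain sum would avoid this class of false positive.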
If a user sets isGraphPrepared = true and uses ALGO_GRAPHX, the flag has no effect and no warning is issued. Should warn or throw.
The warning should be conditional on value == true.
ConnectedComponentsBenchmark.scala:23 has @param(Array("graphframes", "graphx")) — should use "two_phase" and consider adding "randomized_contraction".
Low Issues
Explicit GC is a JVM anti-pattern — it's only a hint and can cause stop-the-world pauses on the driver.
Only has a Scaladoc tag, which doesn't produce compiler warnings at call sites.
localCheckpoint(eager=true) already materializes data; subsequent .persist(intermediateStorageLevel) on the same DataFrame applies a different storage level to already-checkpointed data.
Only the previous interval's checkpoint is cleaned up; the final checkpoint is never removed.
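The cleanup pattern this comment describes can be sketched with plain directories. The `checkpoint` helper below is an illustrative stand-in, not GraphFrames code:

```python
# Sketch of the "clean up only the previous checkpoint" pattern:
# after the loop ends, the last checkpoint directory is never removed.
import os
import shutil
import tempfile

checkpoints = []

def checkpoint(i, root):
    d = os.path.join(root, f"iter-{i}")
    os.makedirs(d)
    checkpoints.append(d)
    # Remove only the previous interval's checkpoint, keeping the newest one.
    if len(checkpoints) > 1:
        shutil.rmtree(checkpoints.pop(0))

root = tempfile.mkdtemp()
for i in range(5):
    checkpoint(i, root)

print(sorted(os.listdir(root)))  # ['iter-4'] -- the final checkpoint remains on disk
```

As the follow-up reply notes, this leak is partly by design: the final checkpoint backs the returned DataFrame, so it cannot be deleted before the caller is done with it.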
That is existing code. The problem is out of scope.
That was intentional. The goal is to guarantee the existing behavior. Last time we were forced to do a rollback.
That is existing code. The problem is out of scope.
That is intentional. Anyone who uses
I see zero reason to do it. The comment lacks the context of the method.
That is existing code. The problem is out of scope.
I see zero reason to do it. The comment lacks the context of the method.
Test coverage is fine.
That is the only reasonable comment; I will update the
Too generic.
Out of scope.
That is fine, just check the logic.
Not 100% sure what it is about
This is a difficult pattern in Spark in all cases... if you unpersist the final checkpoint before returning, then it's been deleted before the caller can make use of it :)
Sorry if Claude wasn't helpful; I'll direct it to only review changes native to the PR next time.
What changes were proposed in this pull request?
graphframes -> two_phase

Why are the changes needed?
Close #775
Close #759
Close #758