feat: aggregate nbrs api #792

Open
SemyonSinchenko wants to merge 13 commits into graphframes:main from SemyonSinchenko:785-aggregate-nbrs
Conversation

@SemyonSinchenko
Collaborator

What changes were proposed in this pull request?

New API as described in #785

Why are the changes needed?

Close #785

@SemyonSinchenko SemyonSinchenko self-assigned this Feb 3, 2026
@SemyonSinchenko SemyonSinchenko added the scala, pyspark-classic (GraphFrames on PySpark Classic), and pyspark-connect (GraphFrames on PySpark Connect) labels Feb 3, 2026
@SemyonSinchenko
Collaborator Author

Hi @james-willis! I addressed most of your comments; for some I left my answers. It looks like we are close to a final version. It would be nice if you could take another look, so I can start working on the bindings and docs.

see scaladoc issue
@codecov-commenter

codecov-commenter commented Feb 8, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 70.30303% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.28%. Comparing base (d36a5ac) to head (929a28f).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...park/sql/graphframes/GraphFramesConnectUtils.scala 0.00% 36 Missing ⚠️
...scala/org/graphframes/lib/AggregateNeighbors.scala 89.84% 13 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #792      +/-   ##
==========================================
- Coverage   84.90%   84.28%   -0.62%     
==========================================
  Files          68       69       +1     
  Lines        3444     3672     +228     
  Branches      431      459      +28     
==========================================
+ Hits         2924     3095     +171     
- Misses        520      577      +57     

☔ View full report in Codecov by Sentry.

@SemyonSinchenko
Collaborator Author

I will work on tests, bindings and docs.

@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review February 26, 2026 15:49
@SemyonSinchenko
Collaborator Author

@james-willis this one is ready for review

@james-willis
Collaborator

Looking at this PR, here are my recommendations:

Code Quality & Performance

  1. Memory Management Concerns (AggregateNeighbors.scala:267-268):
    The hardcoded warning for maxHops > 10 should be configurable or include memory estimation logic rather than an arbitrary threshold.

  2. Potential Optimization (AggregateNeighbors.scala:303):
    The repartition by SRC could be expensive for smaller datasets. Consider making this optional or using coalesce when appropriate.

API Design

  1. Required Parameter Validation:
    The API requires either stoppingCondition or targetCondition but this isn't enforced until run(). Consider validation during builder setup for better UX.

  2. Default Accumulator:
    Most use cases will want basic path tracking. Consider adding a convenience method for common path accumulation patterns.
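On the builder-validation point: since a fluent builder cannot know which setters will still be called, an either/or requirement can only be checked once run() is invoked. A minimal plain-Python sketch of that pattern (the class and method names here are illustrative, not the actual GraphFrames API):

```python
# Hypothetical sketch of deferred builder validation. The names
# (AggregateNeighborsBuilder, stopping_condition, target_condition)
# are illustrative only, not the real GraphFrames API.
class AggregateNeighborsBuilder:
    def __init__(self):
        self._stopping_condition = None
        self._target_condition = None

    def stopping_condition(self, expr):
        # Fluent setter; the order of setter calls is not known in advance.
        self._stopping_condition = expr
        return self

    def target_condition(self, expr):
        self._target_condition = expr
        return self

    def run(self):
        # Deferred validation: only here do we know all setters have run.
        if self._stopping_condition is None and self._target_condition is None:
            raise ValueError(
                "Either stoppingCondition or targetCondition must be set")
        return {"stopping": self._stopping_condition,
                "target": self._target_condition}
```

Validating inside each setter would reject legal call sequences, since either condition alone is sufficient and may be set last.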

Testing & Documentation

  1. Edge Case Testing: Add tests for:

    • Very large graphs (performance characteristics)
    • Disconnected components
    • Graphs with cycles under different stopping conditions
  2. Python API Consistency:
    Ensure Python parameter names follow conventions consistently.

Implementation Robustness

  1. Convergence Logic (AggregateNeighbors.scala:400):
    Currently this only checks whether the frontier is empty, but it doesn't handle cases where states exist yet no new edges can be traversed.
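On the convergence point: in a generic frontier-based traversal, the next frontier is built only from edges that can still be traversed, so "states exist but no traversable edges" produces an empty next frontier, and the empty-frontier check terminates the loop. A minimal plain-Python sketch of the idea (not the actual Spark implementation):

```python
# Generic frontier-based traversal sketch (plain Python, not the real
# Spark/GraphFrames code). If no edge can be traversed from the current
# frontier, next_frontier comes out empty and the loop stops, even
# though visited states still exist.
def traverse(edges, start, max_hops):
    # edges: adjacency dict {src: [dst, ...]}
    visited = {start}
    frontier = {start}
    for _hop in range(max_hops):
        next_frontier = set()
        for v in frontier:
            for dst in edges.get(v, []):
                if dst not in visited:
                    next_frontier.add(dst)
        if not next_frontier:  # empty frontier == converged
            break
        visited |= next_frontier
        frontier = next_frontier
    return visited
```

The same check also terminates on cycles, since already-visited vertices never re-enter the frontier.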

Overall, this is a solid implementation that follows GraphFrames patterns well. The main areas for improvement are around performance optimization configuration and edge case handling. The comprehensive test suite and documentation are excellent.

Comment by Claude (AI Assistant)

@SemyonSinchenko
Collaborator Author

@james-willis

  1. It is a warning; I don't understand how a warning can be configurable.
  2. That is false. The join means repartitioning by src anyway, so even for a single step it is better to keep the repartition.
  3. That is impossible to achieve, because we do not know the order of the setter calls.
  4. I do not understand what this means. If it is about adding implementations, I'm going to do that in follow-up PRs. I plan to add at least allPaths, to find all paths between two nodes.
  5. I'm not going to add huge graphs (performance testing) to the unit tests. The tests already take 10+ minutes; do we want 30-minute tests?
  6. All of the Python API follows snake_case.
  7. In that case the DataFrame will be empty, so I don't fully understand the comment...


Labels

pyspark-classic (GraphFrames on PySpark Classic), pyspark-connect (GraphFrames on PySpark Connect), scala

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: aggregate neighbors API

3 participants