Add input validation for Friedman-Popescu H statistic calculation #16760

Open
Copilot wants to merge 4 commits into master from copilot/improve-input-validation

Contributor

Copilot AI commented Feb 4, 2026

getFriedmanPopescusH() throws a NullPointerException (NPE) when called with non-numeric or non-existent column names. The NPE originates in FriedmanPopescusH.filterFrame(), where frame.vecs(cols) returns null for invalid columns.

# This produces a cryptic NPE
h = titanic_xgb.h(train, ['fair', 'age'])  # 'fair' is categorical
# Error: Cannot read field "_key" because "vecs[i]" is null

Changes

  • Added default validation method in FriedmanPopescusHCollector interface

    • Validates vars is non-null and non-empty
    • Validates all columns exist and are numeric
    • Validates columns contain non-constant values
  • Updated implementations to call validation before processing

    • GBMModel.getFriedmanPopescusH()
    • XGBoostModel.getFriedmanPopescusH()
  • Added comprehensive test coverage

    • Java: GBMTest.testFriedmanPopescusHInputValidation()
    • Python: pyunit_h_input_validation.py (tests both GBM and XGBoost)
    • R: runit_GBM_h_statistic_input_validation.R and runit_XGBoost_h_statistic_input_validation.R
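The validation rules listed above can be illustrated with a minimal Python sketch. This is not the Java code from the PR: the function name `validate_h_statistic_vars`, its parameters, the `NUMERIC_TYPES` set, and every message except the "is not numeric" one are assumptions made for illustration only.

```python
# Hypothetical sketch of the validation order described in this PR.
# Column types are modelled as a plain dict; "int" and "real" are
# assumed here to be the numeric type names.
NUMERIC_TYPES = {"int", "real"}
PREFIX = "Calculating H statistics error: "


def validate_h_statistic_vars(vars_, column_types, column_is_constant):
    """Raise ValueError if vars_ cannot be used for an H statistic.

    vars_              -- list of column names requested by the caller
    column_types       -- dict mapping column name to a type string
    column_is_constant -- dict mapping column name to True if constant
    """
    # 1. vars must be non-null and non-empty
    if vars_ is None or len(vars_) == 0:
        raise ValueError(PREFIX + "vars must be a non-empty list of columns.")
    for name in vars_:
        # 2. every column must exist ...
        if name not in column_types:
            raise ValueError(PREFIX + f"column '{name}' does not exist.")
        # ... and be numeric
        if column_types[name] not in NUMERIC_TYPES:
            raise ValueError(
                PREFIX + f"column '{name}' is not numeric. "
                "H statistics can only be calculated for numeric variables."
            )
        # 3. columns must contain non-constant values
        if column_is_constant.get(name, False):
            raise ValueError(PREFIX + f"column '{name}' is constant.")
```

Checking existence before type means a misspelled column name is reported as missing rather than as non-numeric, which is one possible ordering and not necessarily the one the PR uses.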

Error messages now clearly indicate the issue:

Calculating H statistics error: column 'fair' is not numeric. 
H statistics can only be calculated for numeric variables.
Original prompt

This section details the original issue you should resolve

<issue_title>H statistic: Improve input validation</issue_title>
<issue_description>This code throws an NPE:

import h2o
from h2o.estimators import H2OXGBoostEstimator

# Connect to (or start) a local H2O cluster:
h2o.init()

# Import the titanic dataset into H2O:
titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")

# Set the predictors and response; set the response as a factor:
titanic["survived"] = titanic["survived"].asfactor()
predictors = titanic.columns
response = "survived"

# Split the dataset into a train and valid set:
train, valid = titanic.split_frame(ratios=[.8], seed=1234)

# Build and train the model:
titanic_xgb = H2OXGBoostEstimator(booster='dart',
                                  normalize_type="tree",
                                  seed=1234)
titanic_xgb.train(x=predictors,
                  y=response,
                  training_frame=train,
                  validation_frame=valid)

# Eval performance:
perf = titanic_xgb.model_performance()

# Generate predictions on a test set (if necessary):
pred = titanic_xgb.predict(valid)

# Extract feature interactions:
feature_interactions = titanic_xgb.feature_interaction()

# Get Friedman and Popescu's H statistics
h = titanic_xgb.h(train, ['fair','age'])
print(h)
Server error java.lang.NullPointerException:
  Error: Cannot read field "_key" because "vecs[i]" is null
  Request: None

The input parameter validation should be done properly. </issue_description>

<agent_instructions>Validate the input parameters of the Friedman-Popescu H statistic. On the Java backend, find the getFriedmanPopescusH() method, which is implemented for the XGBoost and GBM algorithms. </agent_instructions>

Comments on the Issue (you are @copilot in this section)


Copilot AI and others added 2 commits February 4, 2026 15:54
Co-authored-by: maurever <11465784+maurever@users.noreply.github.com>
Co-authored-by: maurever <11465784+maurever@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve input validation for H statistic" to "Add input validation for Friedman-Popescu H statistic calculation" on Feb 4, 2026
Copilot AI requested a review from maurever February 4, 2026 16:00
Contributor

@maurever maurever left a comment

R and Python tests are missing.

Contributor Author

Copilot AI commented Apr 14, 2026

R and Python tests are missing.

Added Python and R tests for input validation in commit 3365711.

Python test: h2o-py/tests/testdir_hstats/pyunit_h_input_validation.py

  • Tests null/None, empty, non-existent, and non-numeric column parameters
  • Validates both GBM and XGBoost models

R tests:

  • h2o-r/tests/testdir_algos/gbm/runit_GBM_h_statistic_input_validation.R (GBM)
  • h2o-r/tests/testdir_algos/xgboost/runit_XGBoost_h_statistic_input_validation.R (XGBoost)
  • Both test the same validation scenarios

All tests verify proper error messages for invalid input and correct behavior for valid input.
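The scenarios these tests cover can be laid out as a small table of inputs and expected message fragments. The sketch below is not the content of pyunit_h_input_validation.py: `FakeModel`, `expect_error`, `SCENARIOS`, and the message fragments are hypothetical stand-ins so the example runs without an H2O cluster.

```python
def expect_error(call, fragment):
    """Return True if call() raises an exception whose message contains fragment."""
    try:
        call()
    except Exception as exc:
        return fragment in str(exc)
    return False  # no exception raised, so the scenario failed


class FakeModel:
    """Hypothetical stand-in for a trained GBM or XGBoost model."""

    columns = {"age", "fare", "sex"}
    numeric = {"age", "fare"}

    def h(self, frame, variables):
        if not variables:  # covers both None and an empty list
            raise ValueError("vars must be non-empty")
        for v in variables:
            if v not in self.columns:
                raise ValueError(f"column '{v}' does not exist")
            if v not in self.numeric:
                raise ValueError(f"column '{v}' is not numeric")
        return 0.0  # placeholder, not a real H statistic


# (input, expected message fragment) pairs mirroring the listed scenarios
SCENARIOS = [
    (None, "non-empty"),
    ([], "non-empty"),
    (["no_such_column"], "does not exist"),
    (["sex"], "is not numeric"),
]


def run_scenarios(model, frame=None):
    """Run every invalid-input scenario and return one pass/fail flag each."""
    return [
        expect_error(lambda v=variables: model.h(frame, v), fragment)
        for variables, fragment in SCENARIOS
    ]
```

Keeping the scenarios in one list would let the same table run against both model types, in the spirit of the single Python test that covers GBM and XGBoost.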

Copilot AI requested a review from maurever April 14, 2026 16:54
Contributor

@maurever maurever left a comment

LGTM. @tomasfryda or @valenad1, could you review this PR, please?

@maurever maurever marked this pull request as ready for review April 21, 2026 10:53
Collaborator

@valenad1 valenad1 left a comment

It should go to rel-3.46.0 if you want it in the next release, and please fix the "check pull request" test stage. :)

Development

Successfully merging this pull request may close these issues.

H statistic: Improve input validation

3 participants