
HIVE-29451: Optimize MapWork to configure JobConf once per table #6317

Open

hemanthumashankar0511 wants to merge 2 commits into apache:master from hemanthumashankar0511:optimize-mapwork-config

Conversation


@hemanthumashankar0511 hemanthumashankar0511 commented Feb 12, 2026

What changes were proposed in this pull request?
This PR optimizes the configureJobConf method in MapWork.java to eliminate redundant job configuration calls during the map phase initialization.

Modified File: ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java

Logic Change: Introduced a Set of already-processed TableDescs inside the partition iteration loop.

Mechanism: The code now checks if a TableDesc has already been processed before invoking PlanUtils.configureJobConf(tableDesc, job).

Result: The configuration logic, which includes expensive operations like loading StorageHandlers via reflection, is now executed only once per unique table, rather than once per partition.
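The deduplication described above can be sketched as a minimal stand-alone model. Everything below (TableDescStub, PartitionDescStub, the configureCalls counter) is a hypothetical stand-in, not Hive's actual TableDesc/PartitionDesc/PlanUtils API:

```java
import java.util.*;

// Minimal sketch of the per-table deduplication idea. The stub classes and
// configureJobConf below are hypothetical stand-ins for Hive's TableDesc,
// PartitionDesc and PlanUtils.configureJobConf.
public class DedupSketch {
  static int configureCalls = 0;

  static final class TableDescStub { }
  static final class PartitionDescStub {
    final TableDescStub tableDesc;
    PartitionDescStub(TableDescStub t) { this.tableDesc = t; }
  }

  // Stand-in for the expensive call (storage-handler reflection, etc.).
  static void configureJobConf(TableDescStub t) { configureCalls++; }

  static void configure(Collection<PartitionDescStub> partitions) {
    Set<TableDescStub> processedTables = new HashSet<>();
    for (PartitionDescStub part : partitions) {
      // add() returns false when this table was already configured,
      // so the expensive call runs once per unique table object.
      if (processedTables.add(part.tableDesc)) {
        configureJobConf(part.tableDesc);
      }
    }
  }

  public static void main(String[] args) {
    TableDescStub table = new TableDescStub();
    List<PartitionDescStub> partitions = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      partitions.add(new PartitionDescStub(table));
    }
    configure(partitions);
    System.out.println(configureCalls); // one call for 10,000 partitions
  }
}
```

The key detail is that Set.add returns false for a duplicate, so the dedup check and the expensive call are folded into a single branch.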

Why are the changes needed?
Performance Bottleneck in Job Initialization: Currently, the MapWork.configureJobConf method iterates over aliasToPartnInfo.values(), which contains an entry for every single partition participating in the scan. Inside this loop, it calls PlanUtils.configureJobConf for every partition.

The Issue:

Redundancy: If a query reads 10,000 partitions from the same table, PlanUtils.configureJobConf is called 10,000 times with the exact same TableDesc.

Expensive Operations: PlanUtils.configureJobConf invokes HiveUtils.getStorageHandler, which uses Java Reflection (Class.forName) to load the storage handler class. Repeatedly performing reflection and credential handling for thousands of identical partition objects adds significant, avoidable overhead to the job setup phase.

Impact of Fix:

Complexity Reduction: Reduces the configuration complexity from O(N) (where N is the number of partitions) to O(T) (where T is the number of unique tables).

Scalability: Significantly improves the startup time for jobs scanning large numbers of partitions.

Safety: The worst-case scenario (single-partition reads) incurs only the negligible cost of a HashSet instantiation and a single add operation, preserving existing performance for small jobs.

Does this PR introduce any user-facing change?
No. This is an internal optimization to the MapWork plan generation phase. While users may experience faster job startup times for queries involving large numbers of partitions, there are no changes to the user interface, SQL syntax, or configuration properties.

How was this patch tested?
The patch was verified using local unit tests in the ql (Query Language) module to ensure no regressions were introduced by the optimization.

  1. Build Verification: Ran a clean install on the ql module to ensure compilation and dependency integrity.
    mvn clean install -pl ql -am -DskipTests

  2. Unit Testing: Executed relevant tests in the ql module, specifically targeting the planning logic components to verify that MapWork configuration remains correct.
    mvn test -pl ql -Dtest=TestMapWork
    mvn test -pl ql -Dtest="org.apache.hadoop.hive.ql.plan.*"

  3. Logic Verification: Verified that the deduplication logic correctly handles TableDesc objects and that configureJobConf is still called exactly once for each unique table, preserving the correctness of the job configuration while removing redundant calls.


Contributor

abstractdog commented Feb 17, 2026

I believe this patch is fine now, LGTM
In the long run, we might want to find a way to get the distinct TableDesc objects from a MapWork (TODO: could there be more than 1?), which also means rethinking this method:

  public Map<String, PartitionDesc> getAliasToPartnInfo() {
    return aliasToPartnInfo;
  }

which is a bad public getter for a mutable collection; also, its setter counterpart setAliasToPartnInfo is not used at all.
I created https://issues.apache.org/jira/browse/HIVE-29464 as a follow-up

Contributor

@soumyakanti3578 soumyakanti3578 left a comment


LGTM!

Member

@ayushtkn ayushtkn left a comment


I have a question here:

Set<TableDesc> processedTables = new HashSet<>();

Why are we storing the TableDesc object? Can't we just do it via tableDesc.getFullTableName(), i.e., store just the table name instead of the whole TableDesc? The equals & hashCode seem comparatively heavier than a String's.

@hemanthumashankar0511
Author

> I have a question here:
>
>     Set<TableDesc> processedTables = new HashSet<>();
>
> Why are we storing the TableDesc object? Can't we just do it via tableDesc.getFullTableName(), i.e., store just the table name instead of the whole TableDesc? The equals & hashCode seem comparatively heavier than a String's.

I stuck with the TableDesc object to be safe with things like self-joins (e.g., table A join table A). If we just checked the table name, we’d incorrectly skip the second configuration.

Also, since the planner reuses the exact same object instance for partitions, the Set check is mostly just comparing memory addresses. This is actually faster than hashing and comparing Strings.
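The self-join concern can be illustrated with a toy model. DescStub below is a hypothetical stand-in, and the premise (that the planner creates two distinct TableDesc instances for "table A join table A") is exactly the assumption being debated in this thread; granting it, a name-keyed set configures only one of the two objects while an object-keyed set configures both:

```java
import java.util.*;

// Toy illustration of the name-vs-object keying trade-off. DescStub is a
// hypothetical stand-in, not Hive's TableDesc; whether the planner really
// produces two distinct instances for a self-join is an open question here.
public class SelfJoinSketch {
  static final class DescStub {
    final String fullTableName;
    DescStub(String name) { this.fullTableName = name; }
    // Default identity equals/hashCode, mimicking distinct plan objects.
  }

  static int configuredCount(List<DescStub> descs, boolean keyByName) {
    Set<Object> seen = new HashSet<>();
    int configured = 0;
    for (DescStub d : descs) {
      Object key = keyByName ? d.fullTableName : d;
      if (seen.add(key)) {
        configured++; // stand-in for the configureJobConf call
      }
    }
    return configured;
  }

  public static void main(String[] args) {
    // Two distinct descriptor objects carrying the same table name,
    // as hypothesized for "table A join table A".
    List<DescStub> descs = List.of(new DescStub("db.a"), new DescStub("db.a"));
    System.out.println(configuredCount(descs, true));  // keyed by name: 1
    System.out.println(configuredCount(descs, false)); // keyed by object: 2
  }
}
```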

Contributor

abstractdog commented Feb 18, 2026

> > I have a question here:
> >
> >     Set<TableDesc> processedTables = new HashSet<>();
> >
> > Why are we storing the TableDesc object? Can't we just do it via tableDesc.getFullTableName(), i.e., store just the table name instead of the whole TableDesc? The equals & hashCode seem comparatively heavier than a String's.
>
> I stuck with the TableDesc object to be safe with things like self-joins (e.g., table A join table A). If we just checked the table name, we’d incorrectly skip the second configuration.
>
> Also, since the planner reuses the exact same object instance for partitions, the Set check is mostly just comparing memory addresses. This is actually faster than hashing and comparing Strings.

I agree with using as light a comparison as possible, so I believe it's worth exploring String comparison (given that we get the maximum collision rate when comparing the same TableDesc object as many times as we have partitions).

btw: @hemanthumashankar0511 you mentioned self-joins and safety: how is it a risk? In the case of a self-join, is "we’d incorrectly skip the second configuration" actually true? Does it mean there are different TableDesc objects that should each be handled by this configuration method?

regarding "Set check is mostly just comparing memory addresses" <- this can be true, if you mean the "==" comparison:

  public boolean equals(Object o) {
    if (o == this) {
      return true;
    }

so from this point of view, this is fine. However, regarding "This is actually faster than hashing and comparing Strings.": I think hashing will happen anyway in a hash-based collection, and equals() only plays a role within the same bucket, correct me if I'm wrong
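This point can be checked with a small stand-alone experiment. Desc below is a hypothetical class with an instrumented hashCode, not Hive's TableDesc; it shows that a HashSet computes the hash on every add, even when the very same instance is added repeatedly:

```java
import java.util.*;

// Small experiment backing the point above: a hash-based Set computes the
// hash on every add(), even for the very same instance. Desc is a
// hypothetical instrumented class, not Hive's TableDesc.
public class HashCostSketch {
  static int hashCalls = 0;

  static final class Desc {
    @Override public int hashCode() {
      hashCalls++;       // count every hash computation
      return 42;
    }
    @Override public boolean equals(Object o) {
      return o == this;  // identity short-circuit, as in TableDesc.equals
    }
  }

  public static void main(String[] args) {
    Set<Desc> seen = new HashSet<>();
    Desc d = new Desc();
    for (int i = 0; i < 5; i++) {
      seen.add(d); // duplicate adds still hash first, then hit the bucket
    }
    System.out.println(hashCalls); // hashed once per add: 5
  }
}
```

So hashing cost is paid on every lookup regardless of key type; the "memory address" comparison only saves work after the bucket is reached.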
