
Conversation

Contributor

@cgivre cgivre commented Jul 24, 2024

DRILL-8504: Add Schema Caching to Splunk Plugin

Description

Whenever Drill executes a Splunk query, it must retrieve a list of indexes from Splunk. This step can add a considerable amount of time to the planning phase. This PR introduces a simple in-memory cache for the Splunk plugin which caches the list of indexes to avoid having to query Splunk repeatedly to obtain this information.

This PR also makes a few unrelated minor improvements:

  • Updates the test container to Splunk version 9.3, which at the time of writing is the most current version. I had to update some unit tests as a result.
  • Adds a new config option for the maximum number of columns returned from Splunk.
  • Adds the actual SPL sent to Splunk to the query plan, which can be useful for debugging.

Documentation

(Added to README)
For every query that Drill sends to Splunk, Drill must pull schema information from Splunk. If you have a lot of indexes, this step can slow down query planning. To improve planning time, you can configure Drill to cache the index names so that it does not need to make additional calls to Splunk.

There are two configuration parameters for schema caching: maxCacheSize and cacheExpiration. maxCacheSize defaults to 10k bytes and cacheExpiration defaults to 1024 minutes. To disable schema caching, set the cacheExpiration parameter to a value less than zero.
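The sketch below shows one way these two parameters could map onto a Caffeine cache builder. It is illustrative only, not the plugin's actual code: the key scheme and the byte-counting weigher are assumptions.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.time.Duration;
import java.util.List;

public class SplunkIndexCacheSketch {

  // Builds a cache keyed by "pluginName-userName", holding that user's list of
  // Splunk index names. cacheExpiration is in minutes (default 1024) and
  // maxCacheSize is a byte budget (default 10k), per the README text above.
  // (The plugin treats a negative cacheExpiration as "caching disabled";
  // that case is not handled in this sketch.)
  static Cache<String, List<String>> buildCache(long maxCacheSize, long cacheExpiration) {
    return Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(cacheExpiration))
        // Approximate the byte budget by weighing each entry by the total
        // length of its index names (an assumption, not the plugin's weigher).
        .maximumWeight(maxCacheSize)
        .weigher((String key, List<String> indexes) ->
            indexes.stream().mapToInt(String::length).sum())
        .build();
  }
}
```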

Testing

Ran all unit tests and tested manually.

@cgivre cgivre self-assigned this Jul 24, 2024
@cgivre cgivre requested a review from jnturton July 25, 2024 16:29
@cgivre cgivre changed the title Add Index Cache to Splunk Plugin DRILL-8504: Add Schema Caching to Splunk Plugin Jul 29, 2024
@cgivre cgivre marked this pull request as ready for review July 29, 2024 15:00
@cgivre cgivre added the enhancement, doc-impacting, and dependencies labels Jul 30, 2024
<dependency>
  <groupId>com.github.ben-manes.caffeine</groupId>
  <artifactId>caffeine</artifactId>
  <version>2.9.3</version>
Contributor

Can we achieve the same thing using Guava's caching? The reason I ask is that we already have this insanely big dependency tree and Guava is already in it...

https://www.baeldung.com/guava-cache

Contributor

But so is caffeine now that I look! So I guess we can ignore this suggestion.

Contributor Author

@jnturton Is that a +1? I somehow broke the versioning when I rebased on the current master, but I'll fix before merging.

Contributor

Sorry, I got pulled away before I could continue but will complete the review today.

Contributor

Please lift the caffeine.version property from metastore/iceberg-metastore/pom.xml to the root pom and either

  • add caffeine with caffeine.version to dependencyManagement in the root pom and remove version numbers here and in the Iceberg metastore, or
  • make both this pom and the Iceberg metastore pom specify <version>${caffeine.version}</version>.

So that we standardise the version of Caffeine that gets pulled in.

@cgivre cgivre force-pushed the splunk_schema_cache branch from af3d11b to f372ad6 Compare August 5, 2024 05:22
@cgivre cgivre force-pushed the splunk_schema_cache branch from 74f6b90 to 0d8364b Compare August 12, 2024 16:47
Contributor Author

cgivre commented Aug 25, 2024

@jnturton It looks like the GitHub CI is failing on the Hadoop 2 tests with Hive.

@cgivre cgivre force-pushed the splunk_schema_cache branch from 2ff2983 to fd2549a Compare August 26, 2024 13:00
@cgivre cgivre force-pushed the splunk_schema_cache branch from fd2549a to f8bcc82 Compare October 9, 2024 14:05

}
// Clear the index cache.
if (useCache) {
  cache.invalidate(getNameForCache());
Contributor

It feels like it would be more natural (and efficient) for the cache to hold one entry per Splunk index, rather than a single entry containing the list of all indexes. Is there a reason it isn't built that way?

Contributor Author

@jnturton I think that's exactly what it does. invalidate is just the delete method, so the code there removes any cache entries with that key. Also, as an FYI, the cache adds the username to the index name so that if user translation is enabled, users will not see other users' cached entries.

Comment on lines +189 to +192
String nameKey = getNameForCache();
if (useCache) {
  indexList = cache.getIfPresent(nameKey);
}
Contributor

@jnturton jnturton Oct 11, 2024

@cgivre is the cache really storing one Splunk index per key? Here it looks to me like there's a single cache key, derived from the queryUserName and the plugin name, that holds a list of indexes. Or am I just being confused by the Caffeine API?

Contributor Author

@jnturton, I'm sorry, I misspoke. The way I intended the cache to work is that we combine the plugin name + user name to create a key, and the value is a list of indexes. Every time a user adds or drops an index, we have to recreate the cache entry for that plugin/username.

So:

splunk1-cgivre: [index1, index2, index3]
splunk2-cgivre: [index5, index6, index7]
splunk1-jnturton: [index1, index2]

That's what the cache should look like. (In theory)
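A tiny sketch of that key construction (illustrative only; in the PR, getNameForCache() takes no arguments and derives the plugin name and query user internally):

```java
public class CacheKeySketch {

  // Illustrative: combine the storage plugin name and the query user into a
  // single cache key, matching the "splunk1-cgivre" style keys shown above.
  static String getNameForCache(String pluginName, String queryUserName) {
    return pluginName + "-" + queryUserName;
  }

  public static void main(String[] args) {
    System.out.println(getNameForCache("splunk1", "cgivre"));  // splunk1-cgivre
  }
}
```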

Contributor

Okay, in that case, can I propose

  • a Map of caches with one cache per plugin name + query user and one cache key per Splunk index or
  • a unified cache with one cache key per plugin name + query user + Splunk index name?

Contributor Author

@jnturton I'm remembering why I did it this way. When the plugin is accessed, Drill calls the registerTable method, which loads the list of indexes into memory. From my recollection, all similar storage plugins do something similar in that they load the schemata into memory when the plugin is first accessed. The change I made here was to first check the in-memory cache and, if that is not populated, fetch the index list from Splunk.

I'm happy to do this as you suggest but I'm a little confused as to what exactly you're asking for.
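For reference, the flow described above is a cache-aside lookup. A hedged sketch, assuming a hypothetical fetchIndexesFromSplunk() helper in place of the actual Splunk call:

```java
import com.github.benmanes.caffeine.cache.Cache;

import java.util.List;

class IndexLookupSketch {

  // Try the cache first; on a miss, fall back to Splunk and repopulate.
  static List<String> getIndexes(Cache<String, List<String>> cache,
                                 String nameKey, boolean useCache) {
    List<String> indexList = null;
    if (useCache) {
      indexList = cache.getIfPresent(nameKey);
    }
    if (indexList == null) {
      indexList = fetchIndexesFromSplunk();
      if (useCache) {
        cache.put(nameKey, indexList);
      }
    }
    return indexList;
  }

  // Hypothetical stand-in for the Splunk API call that lists indexes.
  static List<String> fetchIndexesFromSplunk() {
    return List.of("index1", "index2", "index3");
  }
}
```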

Contributor

Let's go with this implementation. Fine tuning can come later if it's needed.

@cgivre cgivre force-pushed the splunk_schema_cache branch from 74b29c4 to 1d3a872 Compare November 11, 2024 21:57
Contributor Author

cgivre commented Nov 15, 2024

@jnturton Thanks for the advice about the swap file. We got a clean CI run. Are we good to merge?

@cgivre cgivre merged commit 70d6a95 into apache:master Nov 18, 2024
8 checks passed