DRILL-8504: Add Schema Caching to Splunk Plugin #2929
Conversation
contrib/storage-splunk/pom.xml (Outdated)

    <dependency>
      <groupId>com.github.ben-manes.caffeine</groupId>
      <artifactId>caffeine</artifactId>
      <version>2.9.3</version>
Can we achieve the same thing using Guava's caching? The reason I ask is that we already have this insanely big dependency tree and Guava is already in it...
But so is caffeine now that I look! So I guess we can ignore this suggestion.
@jnturton Is that a +1? I somehow broke the versioning when I rebased on the current master, but I'll fix before merging.
Sorry, I got pulled away before I could continue but will complete the review today.
Please lift the caffeine.version property from metastore/iceberg-metastore/pom.xml to the root pom and either
- add caffeine with `caffeine.version` to dependencyManagement in the root pom and remove the version numbers here and in the Iceberg metastore, or
- make both this pom and the Iceberg metastore pom specify `<version>${caffeine.version}</version>`,

so that we standardise the version of Caffeine that gets pulled in.
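Concretely, the first option above would look roughly like the sketch below. This is illustrative, not the actual patch: 2.9.3 is the version shown in the diff excerpt, and the exact placement within the root pom is an assumption.

```xml
<!-- Root pom: define the Caffeine version once. -->
<properties>
  <caffeine.version>2.9.3</caffeine.version>
</properties>

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.github.ben-manes.caffeine</groupId>
      <artifactId>caffeine</artifactId>
      <version>${caffeine.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Module poms (storage-splunk, iceberg-metastore) then omit the version
     and inherit it from dependencyManagement: -->
<dependency>
  <groupId>com.github.ben-manes.caffeine</groupId>
  <artifactId>caffeine</artifactId>
</dependency>
```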
@jnturton It looks like the GitHub CI is failing on the Hadoop 2 tests with Hive.
contrib/storage-splunk/src/main/java/org/apache/drill/exec/store/splunk/SplunkSchema.java (Outdated; review thread resolved)
    }
    // Clear the index cache.
    if (useCache) {
      cache.invalidate(getNameForCache());
It feels like it would be more natural (and efficient) for the cache to hold one entry per Splunk index, rather than a single entry containing the list of all indexes. Is there a reason it isn't built that way?
@jnturton I think that's exactly what it does. invalidate is just the delete method, so the code there removes any cache entry with that key. Also, as an FYI, the cache adds the username to the index name so that, if user translation is enabled, users will not see other users' cached entries.
    String nameKey = getNameForCache();
    if (useCache) {
      indexList = cache.getIfPresent(nameKey);
    }
@cgivre is the cache really storing one Splunk index per key? Here it looks to me like there's a single cache key, derived from the queryUserName and the plugin name, that holds a list of indexes. Or am I just being confused by the Caffeine API?
@jnturton, I'm sorry, I misspoke. The way I intended the cache to work is that we combine the plugin name + user name to create a key, and the value is a list of indexes. Every time a user adds or drops an index, we have to recreate the cache entry for that plugin/username.
So:
splunk1-cgivre: [index1, index2, index3]
splunk2-cgivre: [index5, index6, index7]
splunk1-jnturton: [index1, index2]
That's what the cache should look like. (In theory)
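That layout can be sketched with a plain map standing in for the Caffeine cache used in the PR. All class and method names here are illustrative, not Drill's actual SplunkSchema code; the point is just the key scheme (plugin name + query user) and the whole-entry invalidation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the cache layout described above: one entry per
// plugin instance + query user, whose value is the full index list.
public class SchemaCacheSketch {
  private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

  // Key combines the plugin name and the query user so that, with user
  // translation enabled, users never see another user's cached indexes.
  public static String cacheKey(String pluginName, String queryUser) {
    return pluginName + "-" + queryUser;
  }

  public void put(String pluginName, String queryUser, List<String> indexes) {
    cache.put(cacheKey(pluginName, queryUser), indexes);
  }

  public List<String> getIfPresent(String pluginName, String queryUser) {
    return cache.get(cacheKey(pluginName, queryUser));
  }

  // Any index add/drop invalidates the whole entry for that plugin/user.
  public void invalidate(String pluginName, String queryUser) {
    cache.remove(cacheKey(pluginName, queryUser));
  }

  public static void main(String[] args) {
    SchemaCacheSketch c = new SchemaCacheSketch();
    c.put("splunk1", "cgivre", List.of("index1", "index2", "index3"));
    c.put("splunk1", "jnturton", List.of("index1", "index2"));
    System.out.println(c.getIfPresent("splunk1", "cgivre"));
  }
}
```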
Okay, in that case, can I propose
- a Map of caches, with one cache per plugin name + query user and one cache key per Splunk index, or
- a unified cache with one cache key per plugin name + query user + Splunk index name?
@jnturton I'm remembering why I did it this way. When the plugin is accessed, Drill calls the registerTable method, which loads the list of indexes into memory. From my recollection, all similar storage plugins do something similar in that they load the schemata into memory when the plugin is first accessed. Here, the change I made was to first check the memory cache and, if that's not populated, then fetch the index list from Splunk.
I'm happy to do this as you suggest but I'm a little confused as to what exactly you're asking for.
Let's go with this implementation. Fine tuning can come later if it's needed.
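The flow that was kept — check the in-memory cache first, fall back to a (slow) call to Splunk for the index list only on a miss — can be sketched as follows. This is a minimal stand-in using only the JDK; the PR itself uses a Caffeine cache, and all names here are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of the lazy-load pattern described above: consult the in-memory
// cache first and only invoke the Splunk call when the entry is missing.
public class LazyIndexLoader {
  private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

  public List<String> indexes(String key, Supplier<List<String>> splunkCall) {
    // computeIfAbsent invokes the Splunk call only on a cache miss.
    return cache.computeIfAbsent(key, k -> splunkCall.get());
  }

  public static void main(String[] args) {
    LazyIndexLoader loader = new LazyIndexLoader();
    // First access "queries Splunk"; the second is served from memory,
    // so the throwing supplier is never invoked.
    List<String> first = loader.indexes("splunk1-cgivre",
        () -> List.of("index1", "index2"));
    List<String> second = loader.indexes("splunk1-cgivre",
        () -> { throw new IllegalStateException("should not be called"); });
    System.out.println(first.equals(second)); // prints "true"
  }
}
```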
@jnturton
DRILL-8504: Add Schema Caching to Splunk Plugin
Description
Whenever Drill executes a Splunk query, it must retrieve a list of indexes from Splunk. This step can add a considerable amount of time to the planning phase. This PR introduces a simple in-memory cache for the Splunk plugin which caches the list of indexes to avoid having to query Splunk repeatedly to obtain this information.
This PR also makes a few unrelated minor improvements:
Documentation
(Added to README)
For every query that you send to Splunk from Drill, Drill will have to pull schema information from Splunk. If you have a lot of indexes, this process can cause slow planning time. To improve planning time, you can configure Drill to cache the index names so that it does not need to make additional calls to Splunk.
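As an illustration, the cache settings might appear in the Splunk storage plugin configuration roughly like this. Only `maxCacheSize` and `cacheExpiration` are the settings this PR adds; the surrounding connection fields are abbreviated placeholders, not the plugin's exact schema.

```json
{
  "type": "splunk",
  "hostname": "localhost",
  "port": 8089,
  "maxCacheSize": 10000,
  "cacheExpiration": 1024,
  "enabled": true
}
```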
There are two configuration parameters for the schema caching: `maxCacheSize` and `cacheExpiration`. The `maxCacheSize` defaults to 10k bytes and the `cacheExpiration` defaults to 1024 minutes. To disable schema caching, simply set the `cacheExpiration` parameter to a value less than zero.
Testing
Ran all unit tests and tested manually.