We are testing Delta Sharing open sharing. The provider is on Databricks and, as the consumer, we have set up open source Spark on a GCP Compute Engine instance. The requirement is to connect to the Delta Sharing server through a proxy for authentication, and to access the files privately through a no_proxy configuration.
This works with the Delta Sharing Python client but does not work with the Spark connector.
Command:
spark-submit \
  --packages io.delta:delta-sharing-spark_2.13:4.2.0 \
  --conf "spark.driver.extraJavaOptions=-Dhttps.proxyHost= -Dhttps.proxyPort=8080 -Dhttp.proxyHost= -Dhttp.proxyPort=8080 -Djava.net.useSystemProxies=true" \
  --conf "spark.executor.extraJavaOptions=-Dhttps.proxyHost= -Dhttps.proxyPort=8080 -Dhttp.proxyHost= -Dhttp.proxyPort=8080 -Djava.net.useSystemProxies=true" \
  --conf "spark.delta.sharing.network.proxyHost=" \
  --conf "spark.delta.sharing.network.proxyPort=8080" \
  delta_test.py
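Because the goal is to proxy authentication traffic while bypassing the proxy for file access, the JVM flags in the command above could be assembled with a small helper. This is a minimal sketch, not a verified fix: `build_proxy_java_options`, the host `proxy.example.internal`, and the bypass patterns are hypothetical placeholders. `http.nonProxyHosts` is the standard JVM property for bypass hosts (pipe-separated, and also used by the JVM's HTTPS handler); whether the Delta Sharing Spark connector's HTTP client honors it is an assumption that is not confirmed here.

```python
def build_proxy_java_options(proxy_host, proxy_port, non_proxy_hosts=()):
    """Return a string of -D flags suitable for spark.*.extraJavaOptions.

    proxy_host and non_proxy_hosts are placeholders supplied by the caller.
    """
    flags = []
    for scheme in ("https", "http"):
        flags.append(f"-D{scheme}.proxyHost={proxy_host}")
        flags.append(f"-D{scheme}.proxyPort={proxy_port}")
    if non_proxy_hosts:
        # The JVM expects bypass patterns separated by "|".
        flags.append("-Dhttp.nonProxyHosts=" + "|".join(non_proxy_hosts))
    return " ".join(flags)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(build_proxy_java_options(
        "proxy.example.internal", 8080, ["*.storage.example.com", "localhost"]
    ))
```

The resulting string contains `|` and `*`, so it needs to stay inside the double-quoted `--conf "..."` argument when passed to spark-submit.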
delta_test.py:
import delta_sharing
from pyspark.sql import SparkSession

# Parentheses allow the builder chain to span multiple lines.
spark = (
    SparkSession.builder
    .appName("DeltaSharing")
    .config("spark.delta.sharing.network.proxyHost", "")
    .config("spark.delta.sharing.network.proxyPort", "8080")
    .config("spark.delta.sharing.network.sslTrustAll", "true")
    .config("spark.jars.packages", "io.delta:delta-sharing-spark_2.13:4.2.0")
    .config("spark.jars.repositories", "https://repo1.maven.org/maven2")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("DEBUG")

profile_file = "/root/oauth_config_cs.share"

# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
tables = client.list_all_tables()  # THIS WORKS
print(tables)

# Read the shared table through the Spark connector.
table_url = profile_file + "#<share.catalog.schema.table>"
df = spark.read.format("deltaSharing").load(table_url)
df.show(10)
spark.stop()
Logs:
26/04/23 12:25:35 INFO DeltaSharingRestClient: DeltaSharingRestClient with endStreamActionEnabled: false, enableAsyncQuery:false, skipFileIdHashVerification:false
26/04/23 12:25:35 DEBUG FsUrlStreamHandlerFactory: Creating handler for protocol http
26/04/23 12:25:35 DEBUG FsUrlStreamHandlerFactory: Unknown protocol http, delegating to default implementation
26/04/23 12:25:35 DEBUG FsUrlStreamHandlerFactory: Creating handler for protocol https
26/04/23 12:25:35 DEBUG FsUrlStreamHandlerFactory: Unknown protocol https, delegating to default implementation
26/04/23 12:25:35 DEBUG RequestAddCookies: CookieSpec selected: default
26/04/23 12:25:35 DEBUG RequestAuthCache: Auth cache not set in the context
26/04/23 12:25:35 DEBUG PoolingHttpClientConnectionManager: Connection request: [route: {s}->https://login.microsoftonline.com:443][total available: 0; route allocated: 0 of 2; total allocated: 0 of 20]
26/04/23 12:25:35 DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 0][route: {s}->https://login.microsoftonline.com:443][total available: 0; route allocated: 1 of 2; total allocated: 1 of 20]
26/04/23 12:25:35 DEBUG MainClientExec: Opening connection {s}->https://login.microsoftonline.com:443
26/04/23 12:25:35 DEBUG DefaultHttpClientConnectionOperator: Connecting to login.microsoftonline.com/20.190.159.130:443
26/04/23 12:25:35 DEBUG SSLConnectionSocketFactory: Connecting socket to login.microsoftonline.com/20.190.159.130:443 with timeout 320000
26/04/23 12:27:46 DEBUG DefaultHttpClientConnectionOperator: Connect to login.microsoftonline.com/20.190.159.130:443 timed out. Connection will be retried using another IP address
26/04/23 12:27:46 DEBUG DefaultHttpClientConnectionOperator: Connecting to login.microsoftonline.com/20.190.159.2:443
26/04/23 12:27:46 DEBUG SSLConnectionSocketFactory: Connecting socket to login.microsoftonline.com/20.190.159.2:443 with timeout 320000
26/04/23 12:29:57 DEBUG DefaultHttpClientConnectionOperator: Connect to login.microsoftonline.com/20.190.159.2:443 timed out. Connection will be retried using another IP address
26/04/23 12:29:57 DEBUG DefaultHttpClientConnectionOperator: Connecting to login.microsoftonline.com/40.126.31.2:443
26/04/23 12:29:57 DEBUG SSLConnectionSocketFactory: Connecting socket to login.microsoftonline.com/40.126.31.2:443 with timeout 320000
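The `route:` entries in these logs can be checked mechanically for a proxy hop. The sketch below assumes Apache HttpClient's route formatting, in which any proxy hosts are printed between `}->` and the target (for example `{tls}->http://proxy:8080->https://target:443`). `route_hops` and `is_direct` are hypothetical helpers written for this illustration. Under that assumption, a route with a single hop, such as `{s}->https://login.microsoftonline.com:443`, is a direct connection with no proxy in the path.

```python
import re

# Captures "{flags}->hop[->hop...]" inside "[route: ...]".
ROUTE_RE = re.compile(r"route: (\{[tls]*\}->\S+?)\]")


def route_hops(log_line):
    """Return the list of hops in the HttpClient route, or None if absent."""
    match = ROUTE_RE.search(log_line)
    if match is None:
        return None
    # Drop the "{flags}" prefix, then split the remaining "hop->hop" chain.
    _, _, chain = match.group(1).partition("}->")
    return chain.split("->")


def is_direct(log_line):
    """True when the route has exactly one hop, meaning no proxy is used."""
    hops = route_hops(log_line)
    return hops is not None and len(hops) == 1


if __name__ == "__main__":
    sample = (
        "DEBUG PoolingHttpClientConnectionManager: Connection request: "
        "[route: {s}->https://login.microsoftonline.com:443][total available: 0]"
    )
    print(route_hops(sample), is_direct(sample))
```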