Issue 86: Add support for Datasource V2: ORC #85
maheshk114 wants to merge 5 commits into qubole:master
Conversation
    colNames, colTypes) _
    //TODO: This is an ugly hack, but it avoids lots of duplicate code.
    val rdd = new HiveAcidRDD(
Reviewer: Are we doing this only to get Splits?
amoghmargoor left a comment:
I have given one pass and see there is a lot of code lifted from the Spark code base. We would like to understand more about why that was required and why we couldn't have just extended those classes, composed them, or used them as is.
    }

    def keyPrefix() : String = {
      "HiveAcidV2"
Reviewer: Would a table created using 'USING HiveAcid' never be able to use DSv2 until recreated?
Author: Yes. Unless we make V2 the default reader.
     * @param identifier - table identifier
     * @param hiveConf - hive conf
     */
    class HiveAcidMetadata(sparkSession: SparkSession,
Reviewer: This is an open API and it can break clients using it.
Author: Looks like the changes are not required; will revert.
      partitions)
    }

    def getReader(requiredColumns: Array[String],
Reviewer: I think we are returning a list of partitions here. We should not be calling it getReader.
Author: Makes sense, shall change it to getPartitionV2.
     *
     * @return - Returns RDD on top of partitioned hive acid table
     */
    def makeRDDForPartitionedTable(hiveAcidMetadata: HiveAcidMetadata,
Reviewer: Any reason for removing partitions from the method and making it a member variable of the class? I think one HiveAcidReader can be used multiple times to read one, two, or more partitions. HiveAcidReader should map to just a table, not to <table, partitions>.
Author: I think the logic is the same. Earlier getTableReader used to create the HiveAcidReader and then call getRDD; now I have just split it into two methods. The readerOptions is already created using the filters passed by the user, and from that the partitions list is generated. So I am not sure we should change the partitions list after the HiveAcidReader object is created.
     * Spark's vectorized.ColumnVector, this column vector is used to adapt Hive ColumnVector with
     * Spark ColumnarVector. This class is a copy of spark ColumnVector which is declared private.
     */
    public class HiveAcidColumnVector extends org.apache.spark.sql.vectorized.ColumnVector {
Reviewer: This looks similar to org.apache.spark.sql.execution.datasources.orc.OrcColumnVector. Should we not be extending it or using it directly here? We should also be format agnostic, but we can use something like a factory to get the ColumnVector based on format.
Author: OrcColumnVector in Spark is in private scope, so it had to be copied.
Reviewer: @maheshk114 : The class org.apache.spark.sql.vectorized.ColumnVector is public in spark 2.4.x. Check this https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java#L36
Author: This file is removed.
     * After creating, `initialize` and `initBatch` should be called sequentially. This internally uses
     * the Hive ACID Vectorized ORC reader to support reading of deleted and updated data.
     */
    public class OrcColumnarBatchReader extends RecordReader<Void, ColumnarBatch> {
Reviewer: This looks similar to org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader? Should we use that instead? What are the extra changes here?
Author: No, it has some code from OrcColumnarBatchReader, but it also has logic to support sargs and the ACID table reader.
Reviewer: @maheshk114 : Can't we set searchArguments somewhere outside, i.e. before initializing this reader? There doesn't seem to be anything specific to the reader that we are setting in the search arguments.
Author: The search arguments are set in the reader options, so it's better to set them with the reader code.
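For context, a minimal sketch of what setting the search arguments alongside the reader looks like (the shaded package names follow the imports in this PR; `readerOptions` is assumed to be an ORC `Reader.Options`):

```scala
import com.qubole.shaded.hadoop.hive.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory}

// Build a sarg equivalent to "id < 100" and attach it to the ORC reader
// options so non-matching row groups can be skipped. Note this is
// row-group-level pruning, not row-level filtering.
val sarg: SearchArgument = SearchArgumentFactory.newBuilder()
  .startAnd()
  .lessThan("id", PredicateLeaf.Type.LONG, java.lang.Long.valueOf(100L))
  .end()
  .build()
// readerOptions.searchArgument(sarg, Array("id"))  // set with the reader code
```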
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types._

    private[hiveacid] class HiveAcidInputPartitionV2(split: HiveAcidPartition,
Reviewer: Instead of calling it HiveAcidInputPartitionV2, we should call it ColumnarHiveAcidPartition.
Author: I have kept the name in line with "org.apache.spark.sql.sources.v2".
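For reference, the Spark 2.4 DSv2 shape the name follows; a skeleton only, since the real class also carries the split, broadcast conf, and schema:

```scala
import org.apache.spark.sql.sources.v2.reader.{InputPartition, InputPartitionReader}
import org.apache.spark.sql.vectorized.ColumnarBatch

// InputPartition is created on the driver and serialized to executors,
// where createPartitionReader() produces the actual ColumnarBatch reader.
class HiveAcidInputPartitionV2(/* split, broadcastedConf, schema, ... */)
    extends InputPartition[ColumnarBatch] {
  override def createPartitionReader(): InputPartitionReader[ColumnarBatch] = ???
}
```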
Force-pushed from d16e9e1 to e7abcb5.
    private def convertV2(relation: HiveTableRelation): LogicalPlan = {
      val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
      if (!serde.equals("org.apache.hadoop.hive.ql.io.orc.orcserde")) {
Reviewer: nit: Use the HiveSerde.sourceToSerDe() API to do the comparison.
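A sketch of the suggested comparison (the Spark class is `org.apache.spark.sql.internal.HiveSerDe`; the helper name here is hypothetical):

```scala
import java.util.Locale
import org.apache.spark.sql.internal.HiveSerDe

// Resolve the canonical ORC serde from Spark instead of hard-coding the
// class name, then compare case-insensitively.
def isOrcSerde(serde: Option[String]): Boolean = {
  val orcSerde = HiveSerDe.sourceToSerDe("orc").flatMap(_.serde)
  serde.map(_.toLowerCase(Locale.ROOT)) == orcSerde.map(_.toLowerCase(Locale.ROOT))
}
```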
    private def convertV2(relation: HiveTableRelation): LogicalPlan = {
      val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
      if (!serde.equals("org.apache.hadoop.hive.ql.io.orc.orcserde")) {
        // Only ORC format is supported as of now. If it's not ORC, then fall back to
Reviewer: Add a log line stating the reason for falling back to v1?
    case relation: HiveTableRelation
      if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
      convert(relation)
    if (spark.conf.get("spark.acid.use.datasource.v2", "false").toBoolean) {
Reviewer: nit: change the config to spark.hive-acid.read.datasource.version. The config will accept "v1" and "v2" as valid values.
Author: Makes sense. This way the config can be extended if we later want to support a "v3" reader. Done.
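A sketch of how the renamed config might be consumed (key and values from the suggestion above; `convert`/`convertV2` are the methods shown in this diff):

```scala
// "v1" stays the default; unknown values fall back to the v1 path.
val readerVersion =
  spark.conf.get("spark.hive-acid.read.datasource.version", "v1")
val plan = readerVersion match {
  case "v2" => convertV2(relation)
  case _    => convert(relation)
}
```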
    class HiveAcidDataSourceV2 extends DataSourceV2 with ReadSupport with Logging {
      override def createReader (options: DataSourceOptions) : DataSourceReader = {
        logInfo("Using Datasource V2 for table" + options.tableName.get)
Reviewer: nit: Change the logging to "Creating datasource v2 reader for table ....."
      options.databaseName.get, options.tableName.get)
    }

    def keyPrefix() : String = {
Reviewer: @maheshk114 : Where is this method used? Since it is not an overridden method, I am not sure it is getting used anywhere in spark/acid.
Author: I have kept it for future use. But I think we should use the same "HiveAcid" for both readers, with the switch based on config.
    //TODO: Need to generalize it to support other kinds of file formats.
    orcColumnarBatchReader.initialize(fileSplit, taskAttemptContext)
    orcColumnarBatchReader.initBatch(readerLocal.getSchema, requestedColIds,
      requiredFields, partitionSchema, partitionValues, isFullAcidTable && !fileSplit.isOriginal)
Reviewer: Shouldn't we be creating orcColumnarBatchReader only if these conditions are met: isFullAcidTable && !fileSplit.isOriginal? Right now orcColumnarBatchReader would work for both acid and non-acid reads, because we are creating a baseRecordReader but creating an AcidRecordReader only if fullAcidScan is true. Here we should only create the vectorized reader for acid reads; for normal reads, we should simply throw an exception.
Author: fullAcidScan is for full ACID tables. The base reader will be used for insert-only tables; in that case there will not be any filtering from the Hive ACID reader, since insert-only tables do not support update/delete. I think the Spark native reader does not support reads from insert-only Hive ACID tables. And if the user has an ACID table with the old file format (original), the HiveAcid reader should still be able to read it. I don't think we should throw an exception.
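A sketch of the selection logic the author describes (the names here are stand-ins; the ACID path in the PR wraps Hive's VectorizedOrcAcidRowBatchReader):

```scala
sealed trait BatchReaderKind
case object AcidBatchReader extends BatchReaderKind // merges delete/update deltas
case object BaseBatchReader extends BatchReaderKind // plain ORC batch reader

// Full ACID splits need delete/update deltas merged; insert-only tables and
// pre-ACID "original" files carry no deltas, so they take the base path
// rather than throwing an exception.
def chooseReader(isFullAcidTable: Boolean, isOriginalFile: Boolean): BatchReaderKind =
  if (isFullAcidTable && !isOriginalFile) AcidBatchReader else BaseBatchReader
```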
    import com.qubole.shaded.hadoop.hive.ql.io.sarg.*;

    /**
     * After creating, `initialize` and `initBatch` should be called sequentially. This internally uses
Reviewer: To avoid code duplication, we can extend Spark's OrcColumnarBatchReader and override the methods required. For example:
- initialize() - where we can initialize our vectorizedAcidRecordReader.
- nextKeyValue() - to read the next batch using vectorizedAcidRecordReader.
- getProgress() - to return the progress of the acid reader.
Author: I think most of the code is required; for example, putRepeatingValues and putNonNullValues are needed because the Spark reader does not take care of the "selected" rows. In the case of a Hive ACID row batch, some of the rows need to be ignored, and I think that logic is not present in Spark currently. Anyway, I have extended it to avoid a few duplicate methods like putDecimalWritables.
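A minimal sketch of the "selected" handling the author refers to (the shaded package path is assumed from this PR's shading pattern):

```scala
import com.qubole.shaded.hadoop.hive.ql.exec.vector.VectorizedRowBatch

// When the ACID reader masks out deleted/updated rows, it sets
// selectedInUse and lists the surviving row indices in batch.selected;
// only those rows may be copied into Spark's column vectors.
def forEachSelectedRow(batch: VectorizedRowBatch)(copyRow: Int => Unit): Unit = {
  var i = 0
  while (i < batch.size) {
    copyRow(if (batch.selectedInUse) batch.selected(i) else i)
    i += 1
  }
}
```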
    extends InputPartitionReader[ColumnarBatch] {
      //TODO : Need to get a unique id to cache the jobConf.
      private val jobConf = new JobConf(broadcastedConf.value.value)
      private val orcColumnarBatchReader = new OrcColumnarBatchReader(1024)
Reviewer: Shouldn't we make the capacity configurable? Currently it is hard-coded to 1024.
Author: I am not sure anybody will use this configuration, but for perf tests and the like I am making it configurable via "spark.hive.acid.default.row.batch.size".
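A sketch of the wiring (the config key is from the reply above; the default mirrors the previously hard-coded 1024):

```scala
import org.apache.hadoop.mapred.JobConf

// JobConf extends Hadoop's Configuration, so the batch size can ride along
// with the broadcast conf and be read on the executor side.
def batchCapacity(jobConf: JobConf): Int =
  jobConf.getInt("spark.hive.acid.default.row.batch.size", 1024)
```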
    hiveAcidMetadata: HiveAcidMetadata) extends Logging {

    def getRdd(requiredColumns: Array[String],
    def getTableReader(requiredColumns: Array[String],
      partitions)
    }

    def getPartitionsV2(requiredColumns: Array[String],
Reviewer: nit: We can rename it to getInputPartitionsInBatch.
Author: I think it is better to keep "V2" in the name, to stay in line with Spark's naming for the v2 reader.
sourabh912: @maheshk114 : I was testing the changes locally. I created a simple unpartitioned table and tried reading data in a range, and I got wrong results. Here is the query I tried: [...] The output was: [...] The table schema is: [...] I get the expected result if I disable the dsv2 reader. Let me know if you need more details regarding the above query. Have you guys tested this PR in your environment?
maheshk114: @sourabh912 thanks for pointing it out. The issue is that ORC does not support row-level filtering, so the filtering has to be done again in Spark. We have done some testing internally, but the Hive/Spark/ORC versions were different. This specific issue was already fixed internally; I forgot to merge the fix into this PR.
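For context, a sketch of the DSv2 contract that makes such a fix possible (Spark 2.4's `SupportsPushDownFilters`; any filter returned from pushFilters is a residual that Spark re-evaluates after the scan):

```scala
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters

trait AcidFilterPushdown extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters  // translated to ORC search arguments for row-group pruning
    filters           // returned in full: ORC sargs are not row-level exact
  }

  override def pushedFilters(): Array[Filter] = pushed
}
```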