
Conversation

Contributor

@hawk9821 hawk9821 commented Jan 22, 2026

Purpose of this pull request

fix: #10374
Resolve null first column in CSV Reader with UTF-8 BOM files

Does this PR introduce any user-facing change?

How was this patch tested?

CsvReadStrategyTest#testUtf8BomCsvWithHeaderRead

Check list

// issues
List<String> cleanedHeaders =
        headers.stream()
                .map(header -> header.replace("\uFEFF", "").trim())
Collaborator


Are there any other characters besides \uFEFF that can cause this problem? In addition, do the text and Excel readers have the same problem?

Contributor Author

@hawk9821 hawk9821 Jan 23, 2026


\uFEFF is the UTF-8 BOM (Byte Order Mark), a special non-printable character that usually appears at the start of files saved with the "UTF-8 with BOM" encoding. (Tools such as Notepad on Windows often add this mark automatically when saving files as UTF-8.)
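For illustration only, here is a minimal, self-contained sketch (not project code; the class name is hypothetical) of why the BOM leaks into the first header when such a file is read as plain UTF-8:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomHeaderDemo {
    public static void main(String[] args) throws Exception {
        // Simulate a "UTF-8 with BOM" file: the content starts with the bytes EF BB BF.
        byte[] csv = ("\uFEFF" + "id,name,age,gender\n9821,hawk,37,M\n")
                .getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(csv), StandardCharsets.UTF_8))) {
            String headerLine = reader.readLine();
            // Java does not strip the UTF-8 BOM, so the first header becomes "\uFEFFid";
            // it no longer matches the column name "id", and the first column maps to null.
            System.out.println(headerLine.split(",")[0].equals("id")); // false
            System.out.println(headerLine.startsWith("\uFEFF"));       // true
        }
    }
}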

Contributor Author


(screenshot)

Collaborator

@chl-wxp chl-wxp left a comment


LGTM

Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a bug where UTF-8 BOM (Byte Order Mark) characters in CSV files cause the first column to be null when csv_use_header_line=true. The fix introduces BOMInputStream from Apache Commons IO to automatically detect and skip BOM markers, and adds additional header cleaning logic to remove any residual BOM characters.

Changes:

  • Added BOMInputStream to detect and handle UTF-8 BOM in CSV files during reading
  • Implemented header cleaning logic to remove BOM characters and trim whitespace from column names
  • Added test case to verify BOM handling with a sample UTF-8 BOM encoded CSV file
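The following is a minimal, self-contained sketch of this approach, assuming Apache Commons IO's BOMInputStream (which detects and skips a UTF-8 BOM by default). It is an illustration, not the actual patch, and the class and method names are hypothetical:

import org.apache.commons.io.input.BOMInputStream;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class BomAwareHeaderReader {

    // Reads the header line of a CSV stream, skipping a leading BOM if present
    // and cleaning any residual BOM character from the column names.
    static List<String> readHeaders(InputStream rawIn, String configuredEncoding) throws IOException {
        BOMInputStream bomIn = new BOMInputStream(rawIn); // UTF-8 BOM detection enabled by default
        // Prefer the charset declared by the BOM; otherwise fall back to the configured encoding.
        Charset charset = bomIn.getBOM() == null
                ? Charset.forName(configuredEncoding)
                : Charset.forName(bomIn.getBOM().getCharsetName());
        BufferedReader reader = new BufferedReader(new InputStreamReader(bomIn, charset));
        String headerLine = reader.readLine();
        // Defensive cleanup in case a BOM character still reaches the header text.
        return Arrays.stream(headerLine.split(","))
                .map(h -> h.replace("\uFEFF", "").trim())
                .collect(Collectors.toList());
    }
}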

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

Changed files:

  • seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java - Updated CSV reading logic to wrap input streams with BOMInputStream for automatic BOM detection, detect the charset based on BOM presence, and clean headers of BOM characters
  • seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategyTest.java - Added test method testUtf8BomCsvWithHeaderRead() to verify correct parsing of UTF-8 BOM CSV files with headers
  • seatunnel-connectors-v2/connector-file/connector-file-base/src/test/resources/csv/utf8_bom.csv - Added test CSV file with UTF-8 BOM marker for testing BOM handling


Comment on lines 123 to 126
Charset charset =
        bomIn.getBOM() == null
                ? Charset.forName(encoding)
                : Charset.forName(bomIn.getBOM().getCharsetName());

Copilot AI Jan 23, 2026


The charset detection logic assumes that if a BOM is detected, the BOM's charset name will be valid. However, there's no null check on bomIn.getBOM().getCharsetName() when bomIn.getBOM() is not null. If getBOM() returns a non-null value but getCharsetName() returns null or an invalid charset name, Charset.forName() will throw an exception. Consider adding defensive checks or using a try-catch to handle potential IllegalCharsetNameException or UnsupportedCharsetException.
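A possible defensive variant along these lines, sketched under the assumption that falling back to the user-configured encoding is acceptable (the class and method names are hypothetical, not part of the PR):

import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

final class CharsetResolver {

    // Resolve the charset from the BOM if one was detected; fall back to the
    // configured encoding when the BOM's charset name is missing or unsupported.
    static Charset resolve(BOMInputStream bomIn, String configuredEncoding) throws IOException {
        ByteOrderMark bom = bomIn.getBOM();
        if (bom == null || bom.getCharsetName() == null) {
            return Charset.forName(configuredEncoding);
        }
        try {
            return Charset.forName(bom.getCharsetName());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Unknown or unsupported BOM charset name: fall back to the configured encoding.
            return Charset.forName(configuredEncoding);
        }
    }
}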

@DanielCarter-stack

Issue 1: BOM Processing Logic Error in Split Scenario (BLOCKER)

Location: CsvReadStrategy.java:102-105

Code:

try (BOMInputStream bomIn = new BOMInputStream(wrapInputStream(inputStream, split));
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(bomIn, getCharset(bomIn)));
        CSVParser csvParser = new CSVParser(reader, getCSVFormat())) {

Related Context:

  • Caller: AbstractReadStrategy.resolveArchiveCompressedInputStream() passes an InputStream that already starts at position split.getStart() in the split scenario
  • wrapInputStream() internally calls safeSlice(resultStream, split.getStart(), split.getLength())
  • safeSlice() definition: protected static InputStream safeSlice(InputStream in, long start, long length) (AbstractReadStrategy.java:510)

Problem Description:

In the split scenario (enableSplitFile=true), the original inputStream already starts from the middle of the file (e.g., from byte 1000). safeSlice() inside wrapInputStream() then skips split.getStart() bytes AGAIN (e.g., another 1000 bytes), causing:

  1. Duplicate skip: actual data start position = original start + split.start, i.e., the offset is applied twice
  2. Data loss: data at the beginning of the file is skipped
  3. BOM detection failure: the BOM sits at the beginning of the file (bytes 0-2), but the split starts from the middle, so the BOM can never be detected

Example Scenario:

File: [BOM 3 bytes][Header 20 bytes][Data1 50 bytes][Data2 50 bytes]...
       0-2           3-22             23-72            73-122

Split1: start=0, length=75 → reads [BOM + Header + Data1]
Split2: start=75, length=50 → reads [Data2]

Split2 processing flow:
1. HDFS returns an InputStream starting at byte 75 (the first 75 bytes are already skipped)
2. wrapInputStream() calls safeSlice(stream, 75, 50)
3. safeSlice skips another 75 bytes → reading actually starts at byte 150
4. Result: Data2 is lost entirely and data from the wrong position is read

Root Cause:
Confusion between two split modes:

  1. HDFS/FileSystem native split: HDFS API can open(path, start), return stream starting from start position (already skipped first start bytes)
  2. Application layer split: Use safeSlice() to trim complete stream

Need to determine whether inputStream is already a split stream.

Potential Risks:

  • Risk 1: Data error (skipping expected data)
  • Risk 2: Array out of bounds (skip exceeds stream length)
  • Risk 3: Silent error (data partially missing but no exception)

Impact Scope:

  • Direct impact: CsvReadStrategy.readProcess() in enableSplitFile=true scenario
  • Indirect impact: All CSV files using split read
  • Impact surface: CSV Connector (common production scenario)

Severity: BLOCKER - Breaks core functionality

Improvement Suggestions:

// Solution 1: Pass whether InputStream is chunked via parameter (recommended)
private InputStream wrapInputStream(InputStream inputStream, FileSourceSplit split, 
                                    boolean isAlreadySliced) throws IOException {
    InputStream resultStream;
    switch (compressFormat) {
        case LZO:
            LzopCodec lzo = new LzopCodec();
            resultStream = lzo.createInputStream(inputStream);
            break;
        case NONE:
            resultStream = inputStream;
            break;
        default:
            log.warn("Csv file does not support this compress type: {}",
                    compressFormat.getCompressCodec());
            resultStream = inputStream;
            break;
    }
    // Only perform slice when InputStream is not chunked
    if (enableSplitFile && split.getLength() > -1 && !isAlreadySliced) {
        resultStream = safeSlice(resultStream, split.getStart(), split.getLength());
    }
    return resultStream;
}

// Caller needs to pass the flag
// Flag source: Check if inputStream is BoundedInputStream or other chunked stream type

or

// Solution 2: Determine by checking InputStream type (more complex implementation)
private InputStream wrapInputStream(InputStream inputStream, FileSourceSplit split) 
                                    throws IOException {
    InputStream resultStream;
    switch (compressFormat) {
        case LZO:
            LzopCodec lzo = new LzopCodec();
            resultStream = lzo.createInputStream(inputStream);
            break;
        case NONE:
            resultStream = inputStream;
            break;
        default:
            log.warn("Csv file does not support this compress type: {}",
                    compressFormat.getCompressCodec());
            resultStream = inputStream;
            break;
    }
    // Check if it's already BoundedInputStream (return type of safeSlice)
    boolean isAlreadySliced = (resultStream instanceof BoundedInputStream);
    if (enableSplitFile && split.getLength() > -1 && !isAlreadySliced) {
        resultStream = safeSlice(resultStream, split.getStart(), split.getLength());
    }
    return resultStream;
}

Rationale:

  • Need to distinguish between "split stream" and "complete stream" to avoid duplicate skip
  • Solution 1 is clearer but requires interface modification or using ThreadLocal to pass state
  • Solution 2 is simple to implement but relies on specific class type (fragile)
  • Need to synchronously modify the same logic in TextReadStrategy and JsonReadStrategy

Issue 2: Bug in Old LZO Compression + Split Scenario Not Explained in PR (CRITICAL)

Location: CsvReadStrategy.java:194 (new code) vs original code line 118

Old Code:

switch (compressFormat) {
    case LZO:
        LzopCodec lzo = new LzopCodec();
        actualInputStream = lzo.createInputStream(inputStream);  // Create decompression stream
        break;
    // ...
}
if (enableSplitFile && split.getLength() > -1) {
    actualInputStream = safeSlice(inputStream, split.getStart(), split.getLength());  // ❌ Use original inputStream instead of actualInputStream
}

New Code:

switch (compressFormat) {
    case LZO:
        LzopCodec lzo = new LzopCodec();
        resultStream = lzo.createInputStream(inputStream);
        break;
    // ...
}
if (enableSplitFile && split.getLength() > -1) {
    resultStream = safeSlice(resultStream, split.getStart(), split.getLength());  // ✓ Use resultStream
}

Problem Description:

Old code has serious bug in LZO compression + split scenario:

  1. First the decompression stream is created: lzo.createInputStream(inputStream) → actualInputStream
  2. Then safeSlice() is called on the original compressed stream inputStream → this overwrites actualInputStream
  3. Result: the middle of the compressed stream is read, not the decompressed stream

Example Impact:

Scenario: LZO-compressed CSV file, split read

Old behavior:
1. actualInputStream = decompressed stream (100 MB after decompression)
2. actualInputStream = safeSlice(compressed stream, start=1GB, length=10MB)
   → slices 10 MB of the compressed data and then parses it as CSV
   → wrong: compressed binary data is parsed instead of CSV text

Expected behavior:
1. resultStream = decompressed stream
2. resultStream = safeSlice(decompressed stream, start=500MB, length=10MB)
   → slices the decompressed data

Potential Risks:

  • Risk 1: Data parsing failure (treating compressed data as CSV)
  • Risk 2: Data corruption
  • Risk 3: Production task silent failure

Impact Scope:

  • Direct impact: Split read of LZO compressed CSV files
  • Indirect impact: All CSV data sources using LZO compression
  • Impact surface: CSV Connector + LZO compression scenario

Severity: CRITICAL - An existing bug is fixed but not explained

Improvement Suggestions:

  1. Clearly explain it in the PR description:

    ### Fixed Bugs
    - Fixed a bug where LZO compressed CSV files with split enabled would parse 
      compressed binary data instead of decompressed CSV text. The old code 
      incorrectly sliced the original compressed stream instead of the decompressed 
      stream.
    
  2. Add a test case:

@Test
public void testLzoCompressedCsvWithSplit() throws Exception {
    // Need to prepare LZO compressed CSV test file
    // Verify correctness of split read
}
  3. Check the other readers in the same way:
    • TextReadStrategy.java:209 - the same bug needs to be fixed
    • JsonReadStrategy.java:119 - the same bug needs to be fixed

Reason:

  • This fixes an existing bug, not just BOM handling
  • It should be stated explicitly in the PR description to ease code review and regression testing
  • The other readers have the same bug and should be fixed together or tracked in a follow-up issue

Issue 3: BOM handling not tested when firstLineAsHeader=false (MAJOR)

Location: CsvReadStrategyTest.java

Problem Description:

The current test only covers the firstLineAsHeader=true scenario:

map.put(FileBaseSourceOptions.CSV_USE_HEADER_LINE.key(), "true");

When firstLineAsHeader=false (the user explicitly provides the schema), the behavior differs:

  • The header is taken from inputCatalogTable, not read from the file
  • BOMInputStream still skips the BOM
  • But the first data row may start with a BOM (if the first column is treated as data)

Potential Risks:

  • Risk 1: when firstLineAsHeader=false, the first column of the first data row may contain the BOM character
  • Risk 2: the user-provided schema contains no BOM character, so the data does not match

Example Scenario:

File: 9821,hawk,37,M
User schema: id(INT), name(STRING), age(INT), gender(STRING)

Current behavior:
1. BOMInputStream skips BOM
2. First line read as: 9821,hawk,37,M ✓ Correct

But if BOM handling fails:
1. First line read as: \uFEFF9821,hawk,37,M
2. Parse 9821 as INT → NumberFormatException or id=\uFEFF9821

Scope of Impact:

  • Direct impact: scenarios using firstLineAsHeader=false with a BOM in the file
  • Indirect impact: none
  • Impact surface: a specific CSV Connector configuration

Severity: MAJOR - Edge case not covered

Improvement Suggestion:

Add a test case:

@Test
public void testUtf8BomCsvWithoutHeaderRead() throws Exception {
    URL resource = CsvReadStrategyTest.class.getResource("/csv/utf8_bom.csv");
    String path = Paths.get(resource.toURI()).toString();
    CsvReadStrategy csvReadStrategy = new CsvReadStrategy();
    LocalConf localConf = new LocalConf(FS_DEFAULT_NAME_DEFAULT);
    csvReadStrategy.init(localConf);
    csvReadStrategy.getFileNamesByPath(path);
    // Do not set CSV_USE_HEADER_LINE, defaults to false
    csvReadStrategy.setPluginConfig(ConfigFactory.empty());
    csvReadStrategy.setCatalogTable(
            CatalogTableUtil.getCatalogTable(
                    "test",
                    new SeaTunnelRowType(
                            new String[] {"id", "name", "age", "gender"},
                            new SeaTunnelDataType[] {
                                BasicType.INT_TYPE,
                                BasicType.STRING_TYPE,
                                BasicType.INT_TYPE,
                                BasicType.STRING_TYPE
                            })));
    TestCollector testCollector = new TestCollector();
    csvReadStrategy.read(path, "", testCollector);
    final List<SeaTunnelRow> rows = testCollector.getRows();
    // Should read 3 lines (including header line as data)
    Assertions.assertEquals(3, rows.size());
    // First line (original header) should parse correctly, first column should not have BOM
    Assertions.assertEquals(9821, rows.get(0).getField(0));
    Assertions.assertEquals("hawk", rows.get(0).getField(1));
    // ...other assertions
}

Reason:

  • Test coverage is insufficient; this edge case is not verified
  • The behavior of BOMInputStream when firstLineAsHeader=false needs to be confirmed
  • Adding the test prevents future regressions

Issue 4: Insufficient explanation of why cleanedHeaders is necessary (MINOR)

Location: CsvReadStrategy.java:122-125

Code:

// Clean up BOM characters (\ uFEFF) in the header to solve occasional BOM residue
// issues
List<String> cleanedHeaders =
        headers.stream()
                .map(header -> header.replace("\uFEFF", ""))
                .collect(Collectors.toList());

Problem Description:

The comment mentions "occasional BOM residue issues", but does not explain:

  1. Under what circumstances can BOMInputStream fail?
  2. If BOMInputStream works correctly, why is replace("\uFEFF", "") still needed?
  3. Is this defensive programming or a known edge case?

Potential Risks:

  • Risk 1: the intent of the code is unclear
  • Risk 2: it may mask a real BOM-handling bug

Scope of Impact:

  • Direct impact: code maintainability
  • Indirect impact: none
  • Impact surface: a single method in CsvReadStrategy

Severity: MINOR - Documentation/comment issue

Improvement Suggestion:

// Although BOMInputStream should handle BOM correctly at the byte stream level,
// we add this defensive cleaning to handle edge cases where:
// 1. BOMInputStream fails to detect certain BOM variants
// 2. BOM characters are embedded in the file content (not at the start)
// 3. Encoding conversion issues (e.g., when BOM is present but encoding mismatch)
List<String> cleanedHeaders =
        headers.stream()
                .map(header -> header.replace("\uFEFF", ""))
                .collect(Collectors.toList());

Or, if the cleanup is really not needed, consider removing it and adding an assertion:

List<String> headers = getHeaders(csvParser);
// Assert: BOMInputStream should have removed BOM, so headers shouldn't contain \uFEFF
if (headers.stream().anyMatch(h -> h.contains("\uFEFF"))) {
    log.warn("BOM character found in CSV headers despite BOMInputStream processing. " +
             "Headers: {}", headers);
}

Reason:

  • Code readability and maintainability
  • Defensive programming needs an explicit rationale
  • If BOMInputStream works correctly, this cleanup is redundant

Issue 5: Lack of documentation for other supported BOM types (MINOR)

Location: CsvReadStrategy.java:199-202

Code:

private Charset getCharset(BOMInputStream bomIn) throws IOException {
    return bomIn.getBOM() == null
            ? Charset.forName(encoding)
            : Charset.forName(bomIn.getBOM().getCharsetName());
}

Problem Description:

The current code can automatically detect several BOM types (UTF-8, UTF-16 LE, UTF-16 BE, UTF-32, etc.), but:

  1. The test only covers the UTF-8 BOM
  2. The documentation does not mention support for other BOM types
  3. The behavior when the user-configured encoding conflicts with the encoding declared by the BOM is not described

Example Scenario:

File: UTF-16 LE BOM file
User config: encoding = "UTF-8"

Current behavior:
1. BOMInputStream detects UTF-16 LE BOM
2. getCharset() returns "UTF-16LE"
3. But user expects UTF-8 → may cause data error

Potential Risks:

  • Risk 1: the user-configured encoding is ignored
  • Risk 2: behavior for other BOM types is untested

Scope of Impact:

  • Direct impact: handling of non-UTF-8 BOM files
  • Indirect impact: none
  • Impact surface: CSV Connector

Severity: MINOR - Feature enhancement/documentation issue

Improvement Suggestions:

  1. Add a test (optional, if other BOM types are supported):
@Test
public void testUtf16BomCsv() throws Exception {
    // Prepare CSV file with UTF-16 BE/LE BOM
    // Verify automatic detection and correct parsing
}
  2. Document the behavior:
/**
 * Detects the character set from BOM if present.
 * 
 * Supported BOM types:
 * - UTF-8 (EF BB BF)
 * - UTF-16 BE (FE FF)
 * - UTF-16 LE (FF FE)
 * - UTF-32 BE (00 00 FE FF)
 * - UTF-32 LE (FF FE 00 00)
 * 
 * If BOM is detected, the charset declared by the BOM takes precedence
 * over the user-configured encoding. This follows the standard behavior
 * of BOMInputStream.
 * 
 * @param bomIn the BOMInputStream to check
 * @return the detected or configured charset
 */
private Charset getCharset(BOMInputStream bomIn) throws IOException {
    return bomIn.getBOM() == null
            ? Charset.forName(encoding)
            : Charset.forName(bomIn.getBOM().getCharsetName());
}
  3. Or consider adding a configuration option:
// If true, respect BOM charset; if false, always use user-configured encoding
private boolean respectBomEncoding = true;

Reason:

  • Feature completeness
  • Users know the exact behavior
  • If other BOM types are not supported, that should be stated explicitly

Issue 6: Formatting error in comments (MINOR)

Location: CsvReadStrategy.java:120

Code:

// Clean up BOM characters (\ uFEFF) in the header to solve occasional BOM residue
// issues

Problem Description: There is a stray space in the comment: (\ uFEFF) should be (\uFEFF).

Scope of Impact: Comment format

Severity: MINOR - Typo

Improvement Suggestion:

// Clean up BOM characters (\uFEFF) in the header to solve occasional BOM residue issues


Successfully merging this pull request may close these issues.

[Bug] [Connector-File] UTF-8 BOM causes the first column of data to be null during CSV parsing
