Skip to content

Fix EPAM Indigo parsing so that the S groups needs to saved before collection block#3531

Open
sapiosciences-dev wants to merge 5 commits intoepam:masterfrom
sapiosciences-dev:yqiao_pr/CHEMBUGS-184
Open

Fix EPAM Indigo parsing so that the S groups needs to saved before collection block#3531
sapiosciences-dev wants to merge 5 commits intoepam:masterfrom
sapiosciences-dev:yqiao_pr/CHEMBUGS-184

Conversation

@sapiosciences-dev
Copy link
Collaborator

Participants consulted

Matthew Sage, Roles Portfolio Engineering Senior Manager, BIOVIA

Rob Brown, VP, Product Design, Sapio

Yechen Qiao, Core Product Developer, Sapio

This pull request fixes the V3000 COLLECTION / SGROUP block ordering interoperability issue described in Indigo issue #2109.

Sapio had previously maintained a local hot patch that swapped the emitted SGROUP and COLLECTION sections during CT export, after observing interoperability issues with other toolkits consuming Indigo-generated V3000 output.

Subsequent discussion under the Pistoia Alliance Chemical Exchange Format Committee (CEFC), including clarification from BIOVIA, confirmed that Indigo’s writer should emit these blocks in the documented CTFile order. In particular, the M V30 BEGIN SGROUP ... END SGROUP block should precede the M V30 BEGIN COLLECTION ... END COLLECTION block in V3000 CTAB output. BIOVIA also noted that this ordering helps minimize forward references that must otherwise be maintained by file readers.

This pull request applies that writer-side fix in Indigo and includes unit test coverage for the corrected behavior. Manual validation was also performed after cherry-picking the change into our environment, and the updated output appears to resolve the original interoperability issue.

This fix resolves the observed MOL V3000 interoperability issue with RDKit. At the same time, the CEFC discussion also suggested a broader interoperability principle for V3000 readers: while writers should follow the documented block order, parsers may also benefit from tolerating reasonable block-order variation where practical.

Related to earlier feedback:
Two V3000 export issues: implict H cause wrong index on s groups, and reversed order for COLLECTON / SGroup Blocks. #2109

…the BEGIN SGROUP is before BEGIN COLLECTION

This is consistent with RDKIt in https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/FileParsers/MolFileParser.cpp
 and that of the manual in https://www.daylight.com/meetings/mug05/Kappler/ctfile.pdf
(Page 85 vs 90)
RDKIT will fail the load if the BEGIN SGROUP is after BEGIN COLLECTION.
Also, while loading CTAB v2000/v3000, the add implicit h “add_implicit_h” adds s groups after the number of s group counts is already written in the header. So if that was ever executed, RKit will have a read failure.
…gen counts added earlier for Issue 1 in epam#2109. Added more unit tests. Tested manually and confirmed original problem went away.
@AlexanderSavelyev
Copy link
Collaborator

this change will affect all the tests and the origin issue seems not to be related to the indigo

image

@sapiosciences-dev
Copy link
Collaborator Author

this change will affect all the tests and the origin issue seems not to be related to the indigo

image

Hi,

This is the part that was missed.

“Collection Block
BIOVIA_CTFileFormats_2026.pdf

A collection block specifies all collection information for objects in the current connection table context. Collection blocks must be provided after the blocks that define the objects included in the collection to minimize the number of forward object references that must be maintained by the file reader.”

Cited Page 22, in this updated manual provided by BIOVIA in CEFC this year.

@sapiosciences-dev
Copy link
Collaborator Author

It is however interesting that in the newest manual, your citation is indeed still there in page 8:
"The atom block and counts line are required. The counts line, atom block, and bond block must appear
in the order indicated. The Sgroup block, 3D block, and link lines can occur in any order after the atom
and bond blocks."

However, I did check with BIOVIA before making the pull request and it is confirmed that the writer fix is correct.

But I think you also raised a good point that the manual here seems unclear when both sentences appear at same time in version in perspective of a reader/implementer. I'd like to also take your feedback back to BIOVIA team for refinement of this "CT File Formats" manual.

Thanks!

@AlexanderSavelyev
Copy link
Collaborator

Collection blocks must be provided after the blocks that define the objects included

so basically, only if collection includes s-groups, it should be provided after the s-group block

I can't find any information in the specification, that collection should be after sgroups in all cases
I understand, that making collection block last in 100% will be easier, but it will break many tests, which we can't do unfortunately. I would approve the fix for the collection goes after s-group only if s-groups used

@sapiosciences-dev
Copy link
Collaborator Author

Collection blocks must be provided after the blocks that define the objects included

so basically, only if collection includes s-groups, it should be provided after the s-group block

I can't find any information in the specification, that collection should be after sgroups in all cases I understand, that making collection block last in 100% will be easier, but it will break many tests, which we can't do unfortunately. I would approve the fix for the collection goes after s-group only if s-groups used

Yes that should work for us, and it would also be in spirit of the latest CT File Format manual. I can refine the code so that the order is switched only if S-block is used.

1. Use the old ordering unless S-Group block is actively being used.
2. Changed unit test on molecule.cpp so that it is no longer testing reaction but individual problematic molecules in reaction.
@sapiosciences-dev
Copy link
Collaborator Author

sapiosciences-dev commented Mar 12, 2026

@AlexanderSavelyev I have made the changes so that S-groups in-use will flip the collection block order, and reverted the logical C++ code order to be the same as before.
I have also adjusted unit tests to have molecule.cpp test file no longer require reaciton and split the current test file into molecules.

It appears there are some tests failing in Python side due to the change of S-group order expectations, though the core tests are now successful.

Do you want to examine those changes yourself first?

At a glance, these are not real failures but script is prescribing these order due to hard coded results. But otherwise satisfies both CT format and parser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants