Fails to parse round-trip string objects from ruamel.yaml #36

@Erotemic

Describe the bug

When loading text from a YAML file with ruamel.yaml, it returns a SingleQuotedScalarString object, which inherits from the str type. Passing this string to the pure-Python lark parser works fine, but passing it to the Cython variant raises a TypeError.

To Reproduce

The following is a MWE that reproduces the issue:

"""
Requirements:
    pip install ruamel.yaml lark-cython lark
"""
import io
import ruamel.yaml


NEW_RUAMEL = 1


class _YamlRepresenter:

    @staticmethod
    def str_presenter(dumper, data):
        # https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        if len(data.splitlines()) > 1 or '\n' in data:
            text_list = [line.rstrip() for line in data.splitlines()]
            fixed_data = '\n'.join(text_list)
            return dumper.represent_scalar('tag:yaml.org,2002:str', fixed_data, style='|')
        return dumper.represent_scalar('tag:yaml.org,2002:str', data)


def _custom_new_ruaml_yaml_obj():
    """
    References:
        https://stackoverflow.com/questions/59635900/ruamel-yaml-custom-commentedmapping-for-custom-tags
        https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another
        https://stackoverflow.com/questions/76870413/using-a-custom-loader-with-ruamel-yaml-0-15-0
    """

    # make a new instance, although you could get the YAML
    # instance from the constructor argument
    class CustomConstructor(ruamel.yaml.constructor.RoundTripConstructor):
        ...

    class CustomRepresenter(ruamel.yaml.representer.RoundTripRepresenter):
        ...

    CustomRepresenter.add_representer(str, _YamlRepresenter.str_presenter)
    yaml_obj = ruamel.yaml.YAML()
    yaml_obj.Constructor = CustomConstructor
    yaml_obj.Representer = CustomRepresenter
    yaml_obj.preserve_quotes = True
    yaml_obj.width = float('inf')
    return yaml_obj


def codeblock(text):
    """
    Create a block of text that preserves all newlines and relative indentation
    """
    import textwrap
    return textwrap.dedent(text).strip('\n')


# For common constructs see:
# https://github.com/lark-parser/lark/blob/master/lark/grammars/common.lark
RESOLUTION_GRAMMAR_PARTS = codeblock(
    '''
    // Resolution parts of the grammar.
    magnitude: NUMBER

    unit: WORD

    numeric_unit: (magnitude WS* unit)
    implicit_unit: unit

    resolved_unit: numeric_unit | implicit_unit

    %import common.NUMBER
    %import common.WS
    %import common.WORD
    ''')

RESOLVED_UNIT_GRAMMAR = codeblock(
    r'''
    // RESOLVED WINDOW GRAMMAR. Eg. 2GSD
    ?start: resolved_unit
    ''') + '\n' + RESOLUTION_GRAMMAR_PARTS


def main():
    yaml_obj = _custom_new_ruaml_yaml_obj()
    file = io.StringIO("{key: '1mGSD'}")
    data = yaml_obj.load(file)
    text = data['key']

    # https://github.com/lark-parser/lark/blob/master/docs/_static/lark_cheatsheet.pdf
    import lark
    try:
        import lark_cython
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr', _plugins=lark_cython.plugins)
    except ImportError:
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr')

    print(f'{type(text)=}')
    print(f'{text.__class__.__mro__=}')

    parser.parse(text)


if __name__ == '__main__':
    """
    CommandLine:
        python ~/code/lark_cython/tests/test_yaml.py
    """
    main()

The type information it prints before it fails is:

type(text)=<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>
text.__class__.__mro__=(<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>, <class 'ruamel.yaml.scalarstring.ScalarString'>, <class 'str'>, <class 'object'>)

I've tested with versions:

3.11.9 (main, May 14 2024, 08:04:54) [GCC 12.2.0]
ruamel.yaml.__version__ = 0.18.6
lark.__version__ = 1.1.9
lark_cython.__version__ = 0.0.15

And

3.11.9 (main, May 13 2024, 14:03:39) [GCC 11.4.0]
ruamel.yaml.__version__ = 0.17.22
lark.__version__ = 1.1.7
lark_cython.__version__ = 0.0.15

My expectation was that Cython would handle a class that inherits from str, but perhaps it doesn't. I'm not sure whether this can be fixed on the lark-cython side, but I figured it was worth reporting.
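
One possible explanation, stated as an assumption rather than a confirmed diagnosis: a Cython function with a `str`-typed argument may check for the exact type instead of using an `isinstance`-style check. The sketch below emulates both kinds of check in pure Python. `QuotedStr` is a hypothetical stand-in for `SingleQuotedScalarString`, and both function names are made up for illustration, so this runs without ruamel.yaml or lark-cython installed.

```python
class QuotedStr(str):
    """Hypothetical stand-in for ruamel.yaml's SingleQuotedScalarString."""


def accepts_subclasses(value):
    # isinstance-style check: any subclass of str passes.
    return isinstance(value, str)


def accepts_exact_only(value):
    # Exact-type check: assumed here to resemble what a Cython function
    # with a `str`-typed argument might enforce. Subclasses fail.
    return type(value) is str


text = QuotedStr('1mGSD')
print(accepts_subclasses(text))       # True
print(accepts_exact_only(text))       # False
print(accepts_exact_only(str(text)))  # True
```

If this is what is happening, converting with `str(text)` before parsing would sidestep the error, which is consistent with the workaround working.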

My current workaround is to do something like this:

        try:
            tree = parser.parse(text)
        except TypeError:
            if isinstance(text, str) and type(text) is not str:
                # We could be in a case where cython is failing to handle
                # overloaded string types. Try casting to a regular str.
                tree = parser.parse(str(text))
            else:
                raise
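
An alternative that avoids catching the TypeError is to normalize the input before it reaches the parser. This is only a sketch: `to_plain_str` is a hypothetical helper name, not part of lark or lark-cython, and `QuotedStr` again stands in for the ruamel.yaml scalar string class.

```python
def to_plain_str(text):
    """
    Return an exact ``str`` for instances of str subclasses.

    Plain strings and non-string values are passed through unchanged.
    """
    if isinstance(text, str) and type(text) is not str:
        # Cast str subclasses (e.g. ruamel.yaml scalar strings)
        # down to the builtin str type.
        return str(text)
    return text


class QuotedStr(str):
    """Hypothetical stand-in for ruamel.yaml's SingleQuotedScalarString."""


plain = to_plain_str(QuotedStr('1mGSD'))
print(type(plain) is str)  # True
print(plain)               # 1mGSD
```

With this in place the call site becomes `tree = parser.parse(to_plain_str(text))`. The trade-off is that the cast happens on every call instead of only after a failure, which seems acceptable for short inputs like these.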
