feat(core/str): implement str::first_char, last_char, split_first_char, split_last_char#154544
feat(core/str): implement str::first_char, last_char, split_first_char, split_last_char#154544F0RREALTHO wants to merge 2 commits intorust-lang:mainfrom
Conversation
|
r? @jhpratt rustbot has assigned @jhpratt. Use Why was this reviewer chosen?The reviewer was selected based on:
|
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
504171b to
e3b58c0
Compare
This comment has been minimized.
This comment has been minimized.
e3b58c0 to
8a6c194
Compare
This comment has been minimized.
This comment has been minimized.
8a6c194 to
3b44c45
Compare
|
Not sure if the |
That's a fair point. I included |
3b44c45 to
5622b1d
Compare
Thank you for the warm welcome! That makes total sense. I’ve updated the PR description to use Tracking issue: #154393 so it stays open as intended. I've also just pushed a refactor that incorporates your suggestions for the |
This comment has been minimized.
This comment has been minimized.
…st_char Implements the API proposed in the tracking issue.
5622b1d to
9fe51b5
Compare
|
I'm sorry to ask but are you using an LLM to generate code and/or interact in this PR? Both your messages and code seem possibly LLM generated to me, so I'm just asking to be sure. In any case, there is plenty of UTF8 decoding machinery in core already, and the chars iterator, may I ask why you are not using those? If it is because of constness have you tried extracting the part of them (for example the logic of next_code_point) that can be made const into a common function? |
library/core/src/str/mod.rs
Outdated
| } | ||
|
|
||
| #[inline] | ||
| const fn decode_utf8_char(bytes: &[u8]) -> char { |
There was a problem hiding this comment.
This function must be marked unsafe as it is unsafe to call on anything but the UTF8 encoding of a single char
|
Thank you for contacting me, @krtab. I would like to make it very clear that I am a second-year math student who is new to Rust and contributing to Open Source. I also used some AI support to learn how to work with the code base. But I've run every test locally, so I know what didn't work. Thank you for your technical advice on:
I'll resolve these problems and upload again tomorrow. Finally, I want to thank @GrigorenkoPV for his previous comments! |
|
I am not against AI per se (I myself use it extensively to understand and dive into new codebases) and I commend you for willing to contribute to open source, especially so early in your career. If however, as I suspect, your PR is mostly vibe coded (which your comment has not really convinced me to the contrary), I think you'll have a hard time finding someone to review and accept it given that -- to the best of my knowledge -- most reviewers here find interacting with someone who merely serves as a proxy to an LLM agent rather unpleasant. |
| #[must_use] | ||
| #[unstable(feature = "str_first_last_char", issue = "154393")] | ||
| #[rustc_const_unstable(feature = "str_first_last_char", issue = "154393")] | ||
| pub const fn split_last_char(&self) -> Option<(char, &str)> { |
There was a problem hiding this comment.
Is there a particular reason for returning (char, &str) instead of (&str, char) here, since the latter would mirror the order in the original string?
There was a problem hiding this comment.
the ordering was (char, &str) used to provide symmetry with split_first_char, so both would return the first character followed by the rest of the string...aslo could argue that having (&str, char) for split_last_char is better because it preserves the natural order of a string.
There was a problem hiding this comment.
Let's keep it as-is for now. I agree that it probably should be swapped, but that's not what the ACP was for.
| // above len | ||
| check_many("hello", 5..=10, 5); | ||
| } | ||
| const _: () = { |
There was a problem hiding this comment.
Why is this done as a const assertion?
| /// # Safety | ||
| /// | ||
| /// `bytes` must be the UTF-8 encoding of exactly one valid Unicode scalar value. | ||
| #[inline] | ||
| const unsafe fn decode_utf8_char(bytes: &[u8]) -> char { | ||
| let ch = match bytes { | ||
| &[a] => a as u32, | ||
| &[a, b] => ((a & 0x1F) as u32) << 6 | (b & 0x3F) as u32, | ||
| &[a, b, c] => ((a & 0x0F) as u32) << 12 | ((b & 0x3F) as u32) << 6 | (c & 0x3F) as u32, | ||
| &[a, b, c, d] => { | ||
| ((a & 0x07) as u32) << 18 | ||
| | ((b & 0x3F) as u32) << 12 | ||
| | ((c & 0x3F) as u32) << 6 | ||
| | (d & 0x3F) as u32 | ||
| } | ||
| // SAFETY: All valid UTF-8 sequences are covered above; this arm is unreachable for valid input. | ||
| _ => unsafe { crate::hint::unreachable_unchecked() }, | ||
| }; | ||
| // SAFETY: the caller must ensure `bytes` contains a valid UTF-8 sequence. | ||
| unsafe { char::from_u32_unchecked(ch) } | ||
| } |
There was a problem hiding this comment.
Can this be moved to either a standalone method or (ideally) a method on char? It would be pub(crate), but this doesn't feel like the best of places for such a method.
| /// # Safety | ||
| /// | ||
| /// `bytes` must be the UTF-8 encoding of exactly one valid Unicode scalar value. | ||
| #[inline] |
There was a problem hiding this comment.
I do agree that is probably best to not be declared #[inline], as it's not something that can trivially be optimized out and is shared between a few methods.
| const unsafe fn decode_utf8_char(bytes: &[u8]) -> char { | ||
| let ch = match bytes { | ||
| &[a] => a as u32, | ||
| &[a, b] => ((a & 0x1F) as u32) << 6 | (b & 0x3F) as u32, |
There was a problem hiding this comment.
Where are these hex numbers coming from? The way UTF-8 is encoded?
| #[must_use] | ||
| #[unstable(feature = "str_first_last_char", issue = "154393")] | ||
| #[rustc_const_unstable(feature = "str_first_last_char", issue = "154393")] | ||
| pub const fn split_last_char(&self) -> Option<(char, &str)> { |
There was a problem hiding this comment.
Let's keep it as-is for now. I agree that it probably should be swapped, but that's not what the ACP was for.
|
@rustbot author |
|
Reminder, once the PR becomes ready for a review, use |
View all comments
Tracking issue: #154393
What this PR does
Implements the four
strmethods proposed in the tracking issue:str::first_char(&self) -> Option<char>str::last_char(&self) -> Option<char>str::split_first_char(&self) -> Option<(char, &str)>str::split_last_char(&self) -> Option<(char, &str)>All methods are
const fnand correctly handle all UTF-8 character widths (1, 2, 3, and 4 bytes).Implementation notes
utf8_char_widthto stayconst-compatiblesplit_first_charandsplit_last_charare the core methods;first_charandlast_chardelegate to them#[must_use]and#[inline]Tests
constevaluation tests proving these work asconst fnOpen question
The tracking issue has an unresolved question about the
_charsuffix. This PR uses_charas proposed in the ACP, but I'm happy to rename if the team decides otherwise.