
fix(scraper.py): double sending articles#48

Merged
Helias merged 6 commits into UNICT-DMI:main from Giuseppe-Tornello:test_1
Nov 9, 2025

Conversation

@Giuseppe-Tornello
Contributor

I figured out that the function soup.find_all('article') at line 178 listed some articles twice; here is the output of that function.
I currently have no clue about the reason for this behavior.

I found that <header class="sow-entry-header"> is used exclusively for articles; here is the output of soup.find_all('header', class_='sow-entry-header').
I tested this with the latest 10 ERSU articles and it works perfectly.

closes #40
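The header-based selection described above can be sketched as follows. This is a minimal reproduction, not the scraper's actual code: the HTML snippet and variable names are illustrative, and only the 'sow-entry-header' class comes from the ERSU site's markup.

```python
from bs4 import BeautifulSoup

# Illustrative HTML mimicking the ERSU article markup (assumed structure).
html = """
<article>
  <header class="sow-entry-header">Article1</header>
</article>
<article>
  <header class="sow-entry-header">Article2</header>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Matching the header element, which appears exactly once per article,
# sidesteps the duplicated <article> matches.
headers = soup.find_all("header", class_="sow-entry-header")
print([h.get_text(strip=True) for h in headers])  # ['Article1', 'Article2']
```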

@Helias
Member

Helias commented Nov 9, 2025

Merging this PR will update the bot; let's monitor the ERSU website and the channel for a month to make sure it keeps publishing the articles.

@Helias Helias merged commit e70f738 into UNICT-DMI:main Nov 9, 2025
6 checks passed
@Helias
Member

Helias commented Nov 9, 2025

btw, good job!

@Giuseppe-Tornello
Contributor Author

Thank you!

@Giuseppe-Tornello
Contributor Author

Giuseppe-Tornello commented Nov 9, 2025

I figured out that the function soup.find_all('article') at line 178 listed some articles twice; here is the output of that function.

I might have found the reason for that behavior.
Looking at the ERSU website, I noticed this article structure:

<article ...>
  News
<article ...>
  Article1
</article>
<article ...>
  Article2
</article>
</article>

Since the function was looking for every <article ...> </article> pair, it returned duplicated values:
first it returned the content of the outer <article>, which in this case was the whole list of articles, and then it returned the single articles.
A simpler way to fix this might have been to write something like:
temp = soup.find('article')
articles = temp.find_all('article')
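A minimal reproduction of the duplication, together with the fix sketched above. The HTML is an illustrative stand-in for the nested structure observed on the ERSU site; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Nested structure as observed on the ERSU site: an outer <article>
# wrapping the individual article elements.
html = """
<article>News
  <article>Article1</article>
  <article>Article2</article>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all('article') matches the outer wrapper as well as each inner
# article, so the inner content is effectively returned twice.
print(len(soup.find_all("article")))  # 3: the wrapper plus 2 inner articles

# Suggested fix: descend into the outer <article> first, then collect
# only the nested ones.
wrapper = soup.find("article")           # the outer "News" container
articles = wrapper.find_all("article")   # only the inner articles
print([a.get_text(strip=True) for a in articles])  # ['Article1', 'Article2']
```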

@Helias
Member

Helias commented Nov 9, 2025

I'm afraid I may have made the mistake myself, due to the non-intuitive HTML structure.

Btw, if the current solution works we can keep it as is; otherwise, feel free to open a new PR.



Development

Successfully merging this pull request may close these issues.

it sends the news twice
