Staging to main: wide and deep to PyTorch and other improvements #2286

Open
miguelgfierro wants to merge 111 commits into main from staging

Conversation

@miguelgfierro
Collaborator

Description

Related Issues

References

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • I have signed the commits, e.g. git commit -s -m "your commit message".
  • This PR is being made to staging branch AND NOT TO main branch.

SimonYansenZhao and others added 30 commits February 2, 2026 21:54
* Rewrite testing workflows using only GitHub-hosted runners instead of AzureML

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Rewrite test_groups.py

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Replace test_groups.py with test_groups.yml

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Rename all workflows

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct paths

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Use GitHub GPU runners

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Enable unit-tests.yml

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct shell command and action names

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct commands and python versions

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct Dockerfile path

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct yq install command

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add entrypoint.sh

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct paths

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Copy repo to be along with dockerfile

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct paths

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct paths

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct yq command

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct paths and yq version

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Drop Python 3.18

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct openjdk version

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Set openjdk<23

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* replace recodatasets with github resource repo

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* replace deeprec info

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* kdd

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* kdd

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* MIND

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* 🐛

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* Update docs

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* update criteo URL (#2260)

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* Merge small test groups and update testing time tally

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct test group name

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try to run on the runner group instead of single runner

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Revert

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove marks for pytest fixture

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try only Python 3.9 for simplification

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update testing time

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Enable tests for Python 3.8, 3.10 and 3.11

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update docker base image

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try self-hosted GPU instead of GitHub-hosted

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Test nvidia-smi

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try self-hosted GPU

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Test nvidia-smi inside container

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Test nvidia-smi inside container

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try container directly instead of docker action

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct path

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct path

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct path

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct conda activation

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add lightgcn model dir

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Disable parallel testing on GPU

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Disable parallel execution on GPU

* Remove pytest-xdist on GPU testing

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct variable substitution

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct if statement

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct typo

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Use CUDA 12.2.2 and cuDNN 8.9

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try self-hosted gpu

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try cuda 13.1.0

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Downgrade tensorflow version

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Test on GPU directly

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Install yq and label with timestamp

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try uv

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try call docker directly

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Install unzip and zip

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Install curl

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct base image and docker commands

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct commands

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Cache uv downloaded python packages

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove docker image when finished

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Keep image and container for debugging

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Rewrite test workflows and Dockerfile

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Rewrite Dockerfile and correct tests.yml

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct env path

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try all tests

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Install zip and unzip

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct test_groups.yml

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct SDKMAN! setup

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try parallel testing

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Disable parallel testing for group_gpu_001

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Disable parallel testing on GPU

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct commands

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Test pr_gate and nightly

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove pytest-xdist

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try GitHub-hosted runners only

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try cpu-nightly on larger runners
* Change GPU image
* Check CPU and memory before tests
* Update testing time

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct labels

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try runner groups

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try using group and labels together

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try again

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add annotations for hardware checks

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try cpu-nightly with Python 3.11 only

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update testing time

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

---------

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Co-authored-by: Miguel Fierro <3491412+miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Fixed issue with Huggingface's etag of MIND dataset
Signed-off-by: Rohit Goyal <sprkgoyal@gmail.com>
Signed-off-by: Rohit Goyal <sprkgoyal@gmail.com>
Signed-off-by: Rohit Goyal <sprkgoyal@gmail.com>
Signed-off-by: Rohit Goyal <sprkgoyal@gmail.com>
Signed-off-by: Rohit Goyal <sprkgoyal@gmail.com>
Signed-off-by: Rohit Goyal <sprkgoyal@gmail.com>
Speed up `lightfm_utils.py:prepare_test_df`
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
The _get_item_feature_similarity function computes cosine similarity
between item feature vectors using the formula dot(f1, f2) / (norm(f1) * norm(f2)).
When either feature vector is a zero vector (norm = 0), this causes a
ZeroDivisionError at runtime.

This handles the zero-norm edge case by returning 0.0 similarity when
either vector has zero magnitude. Cosine similarity is undefined for a
zero vector, and 0.0 is a common convention for that case.
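
The guard described above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function name `cosine_similarity_safe` and the use of plain Python lists are assumptions, and the real `_get_item_feature_similarity` may be structured differently.

```python
import math


def cosine_similarity_safe(f1, f2):
    """Cosine similarity of two equal-length vectors (hypothetical helper).

    Returns 0.0 when either vector has zero norm instead of dividing by zero.
    """
    dot = sum(a * b for a, b in zip(f1, f2))
    norm1 = math.sqrt(sum(a * a for a in f1))
    norm2 = math.sqrt(sum(b * b for b in f2))
    # Without this check, dot / (norm1 * norm2) raises ZeroDivisionError
    # for an all-zero feature vector.
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

With this guard, an item whose features are all zero is treated as dissimilar to every other item rather than aborting the computation.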
…ivision

Fix ZeroDivisionError in item feature cosine similarity
Replace `logs={}` with `logs=None` and add `if logs is None: logs = {}`
guard in all Keras callback methods across multinomial_vae.py and
standard_vae.py.

Using a mutable default argument like `{}` is a well-known Python
anti-pattern (W0102) — the same dict object is shared across all calls,
which can lead to unexpected state leakage between invocations.
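
The state leakage described above, and the `logs=None` guard that avoids it, can be illustrated with a small sketch. The two classes here are hypothetical stand-ins and do not subclass the Keras callback base class; only the method signature pattern mirrors the change in the commit message.

```python
class LeakyCallback:
    """Hypothetical callback using the mutable default (the anti-pattern)."""

    def on_epoch_end(self, epoch, logs={}):
        # The default dict is created once, at function definition time,
        # so every call without `logs` mutates the same object.
        logs[epoch] = "seen"
        return logs


class SafeCallback:
    """Hypothetical callback using the None default plus guard."""

    def on_epoch_end(self, epoch, logs=None):
        if logs is None:
            logs = {}  # a fresh dict on every call
        logs[epoch] = "seen"
        return logs


leaky = LeakyCallback()
first_leaky = leaky.on_epoch_end(0)
second_leaky = leaky.on_epoch_end(1)  # same dict, now holds both epochs

safe = SafeCallback()
first_safe = safe.on_epoch_end(0)
second_safe = safe.on_epoch_end(1)  # independent dict, holds epoch 1 only
```

In the leaky version both calls return the same dictionary, so the entry from epoch 0 is still present after epoch 1; the guarded version returns an independent dictionary each time.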
Remove the SASRec test data from the repo and put it in temp
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
ds-wook and others added 30 commits March 24, 2026 11:05
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
This reverts commit 56e470e.

Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
…elens

Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>

7 participants