[CI][XPU] enable unit test for XPU device #2814
Status: Closed
Changes from all 25 commits:
- a736c41 enable xpu ci test (DiweiSun)
- 7c96ad4 Revert "enable xpu ci test" (DiweiSun)
- d1122fc enable ci test for xpu (DiweiSun)
- 4593d95 Create ci_test_xpu.sh (DiweiSun)
- 7d90b8c Update .github/workflows/pr-test-xpu.yml (DiweiSun)
- e6bc407 Update .github/workflows/pr-test-xpu.yml (DiweiSun)
- c34601f fix for trigger scenarios (DiweiSun)
- 3085c2b port from pytorch repo (DiweiSun)
- d9ab09e Rename action.yml to xpu-action.yml (DiweiSun)
- f87892a update to align with pytorch (DiweiSun)
- c6f07b5 Revert "Rename action.yml to xpu-action.yml" (DiweiSun)
- 544593a Revert "port from pytorch repo" (DiweiSun)
- 2e1dc50 Update .github/workflows/pr-test-xpu.yml (DiweiSun)
- 188a0f8 debug for runner (DiweiSun)
- 421d02c lint format fix (DiweiSun)
- 6f6cd17 format fix (DiweiSun)
- bae7000 format fix (DiweiSun)
- 7bd3d29 format fix (DiweiSun)
- 4fd2909 format fix (DiweiSun)
- 4a7d9af format fix (DiweiSun)
- c3f4384 format fix (DiweiSun)
- e8936cb format fix (DiweiSun)
- 50e56ec trigger by tag only (DiweiSun)
- 5a46341 add xpu label for xpuci (DiweiSun)
- 030121f fix docker path (DiweiSun)
Changed file: ciflow push-tags config (filename not shown in this view), adding the `ciflow/xpu` tag:

```diff
@@ -4,3 +4,4 @@ ciflow_push_tags:
 - ciflow/tutorials
 - ciflow/rocm
 - ciflow/4xh100
+- ciflow/xpu
```
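With `ciflow/xpu` registered, the workflow in this PR triggers on any pushed tag matching `ciflow/xpu/*`. GitHub's own matcher is not bash, but the glob semantics can be sketched with `[[ == ]]` pattern matching (the tag names below are hypothetical examples):

```shell
# Rough bash analogue of GitHub's "ciflow/xpu/*" tag-pattern matching.
matches_xpu() {
  # [[ == pattern ]] performs glob matching, like the workflow's tag filter
  if [[ "$1" == ciflow/xpu/* ]]; then echo "triggers"; else echo "ignored"; fi
}

matches_xpu "ciflow/xpu/2814"    # a tag under the ciflow/xpu/ prefix
matches_xpu "ciflow/rocm/2814"   # a different ciflow prefix
```

In practice the `ciflow/xpu` label on a PR causes the bot to push such a tag, which is what "trigger by tag only" in the commit list refers to.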
New file: `.github/scripts/ci_test_xpu.sh`

```bash
#!/bin/bash

python3 -m pip install torch torchvision torchaudio pytorch-triton-xpu --index-url https://download.pytorch.org/whl/nightly/xpu --force-reinstall --no-cache-dir
python3 setup.py install

pip install pytest expecttest parameterized accelerate hf_transfer 'modelscope!=1.15.0'

cd test/quantization
pytest -v -s *.py
```
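The final `pytest -v -s *.py` relies on the shell expanding the glob, so pytest receives each `.py` file under `test/quantization` as an explicit argument and other files are never collected. A small self-contained sketch with hypothetical file names:

```shell
# Sketch: what the shell glob *.py actually passes to pytest.
tmp=$(mktemp -d)
mkdir -p "$tmp/test/quantization"
touch "$tmp/test/quantization/test_quant_api.py" \
      "$tmp/test/quantization/test_qat.py" \
      "$tmp/test/quantization/README.md"

cd "$tmp/test/quantization"
# echo shows the expanded argument list (sorted by the shell)
selected=$(echo *.py)
echo "$selected"

cd / && rm -rf "$tmp"
```

Only the two test modules appear in the expansion; the README is excluded by the glob, not by pytest.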
New file: `.github/workflows/pr-test-xpu.yml`

```yaml
# TODO: this looks sort of similar to _linux-test, but there are like a dozen
# places where you would have to insert an if statement. Probably it's better to
# just use a different workflow altogether

name: xpu-test

on:
  push:
    tags:
      - ciflow/xpu/*

permissions:
  id-token: write
  contents: read

concurrency:
  group: xpu_ci_test-${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
  cancel-in-progress: true

jobs:
  test:
    # Don't run on forked repos or empty test matrix
    # if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
    timeout-minutes: 60
    runs-on: linux.idc.xpu
    env:
      DOCKER_IMAGE: ci-image:pytorch-linux-jammy-xpu-n-py3
      PYTORCH_RETRY_TEST_CASES: 1
      PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
      XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
    steps:
      # [see note: pytorch repo ref]
      - name: Checkout PyTorch
        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

      - name: Checkout Torchao
        uses: actions/checkout@v4

      - name: Clean all stopped docker containers
        if: always()
        shell: bash
        run: |
          # Prune all stopped containers.
          # If another runner is already pruning on this node, skip.
          nprune=$(ps -ef | grep -c "docker container prune")
          if [[ $nprune -eq 1 ]]; then
            docker container prune -f
          fi

      - name: Runner health check system info
        if: always()
        shell: bash
        run: |
          cat /etc/os-release || true
          cat /etc/apt/sources.list.d/oneAPI.list || true
          cat /etc/apt/sources.list.d/intel-gpu-jammy.list || true
          whoami

      - name: Runner health check xpu-smi
        if: always()
        shell: bash
        run: |
          timeout 30 xpu-smi discovery || true

      - name: Runner health check GPU count
        if: always()
        shell: bash
        run: |
          ngpu=$(timeout 30 xpu-smi discovery | grep -c -E 'Device Name' || true)
          msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
          if [[ $ngpu -eq 0 ]]; then
            echo "Error: Failed to detect any GPUs on the runner"
            echo "$msg"
            exit 1
          fi

      - name: Runner diskspace health check
        uses: pytorch/pytorch/.github/actions/diskspace-cleanup@main
        if: always()

      - name: Runner health check disconnect on failure
        if: ${{ failure() }}
        shell: bash
        run: |
          killall runsvc.sh

      - name: Preserve github env variables for use in docker
        shell: bash
        run: |
          env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
          env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"

      - name: XPU set GPU_FLAG
        shell: bash
        run: |
          # Add render group for container creation.
          render_gid=$(grep render /etc/group | cut -d: -f3)
          echo "GPU_FLAG=--device=/dev/mem --device=/dev/dri --group-add video --group-add $render_gid" >> "${GITHUB_ENV}"
```
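The `XPU set GPU_FLAG` step above extracts the `render` group's gid so the container user can access the GPU device nodes. A minimal sketch of that extraction, run against a hypothetical `/etc/group` snippet instead of the real file:

```shell
# Sketch: derive the render gid for docker's --group-add.
# group_db stands in for /etc/group (hypothetical entries).
group_db='video:x:44:
render:x:993:jenkins'

# Same grep | cut pipeline as the workflow step.
render_gid=$(printf '%s\n' "$group_db" | grep render | cut -d: -f3)
echo "GPU_FLAG=--device=/dev/mem --device=/dev/dri --group-add video --group-add $render_gid"
```

Field 3 of the colon-separated group entry is the numeric gid, which is what `--group-add` expects.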
(`pr-test-xpu.yml`, continued:)

```yaml
      - name: configure aws credentials
        id: aws_creds
        uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
        with:
          role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
          aws-region: us-east-1

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1

      - name: Calculate docker image
        id: calculate-docker-image
        uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
        with:
          docker-image-name: ${{ env.DOCKER_IMAGE }}
          docker-build-dir: pytorch/pytorch/.ci/docker

      - name: Use following to pull public copy of the image
        id: print-ghcr-mirror
        env:
          ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
        shell: bash
        run: |
          tag=${ECR_DOCKER_IMAGE##*:}
          echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"

      - name: Pull docker image
        uses: pytorch/test-infra/.github/actions/pull-docker-image@main
        with:
          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

      - name: Runner health check GPU count
        if: always()
        shell: bash
        run: |
          ngpu=$(timeout 30 clinfo -l | grep -c -E 'Device' || true)
          msg="Please file an issue on pytorch/ao reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
          if [[ $ngpu -eq 0 ]]; then
            echo "Error: Failed to detect any GPUs on the runner"
            echo "$msg"
            exit 1
          fi
```
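The `print-ghcr-mirror` step strips everything up to the last `:` of the ECR image name to get the tag, then maps any remaining `:` to `-` for the public ghcr.io mirror. A sketch of those two parameter expansions, with a hypothetical ECR image name:

```shell
# Sketch of the print-ghcr-mirror step's string handling (bash).
# The image name is a made-up example, not the real ECR registry path.
ECR_DOCKER_IMAGE="0123456789.dkr.ecr.us-east-1.amazonaws.com/ci-image:pytorch-linux-jammy-xpu-n-py3"

# ##*: removes the longest prefix ending in ':', leaving only the tag.
tag=${ECR_DOCKER_IMAGE##*:}

# ${tag/:/-} would replace a ':' inside the tag with '-' (a no-op here).
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
```

Note `${var/pat/repl}` is a bash extension, which is why the step sets `shell: bash`.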
```yaml
      - name: Test
        id: test
        env:
          TEST_COMMAND: .github/scripts/ci_test_xpu.sh
          DOCKER_IMAGE: ci-image:pytorch-linux-jammy-xpu-n-py3
          PR_NUMBER: ${{ github.event.pull_request.number }}
          GITHUB_REPOSITORY: ${{ github.repository }}
          GITHUB_WORKFLOW: ${{ github.workflow }}
          GITHUB_JOB: ${{ github.job }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          GITHUB_RUN_NUMBER: ${{ github.run_number }}
          GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
          SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
        timeout-minutes: 60
        run: |
          set -x

          # detached container should get cleaned up by teardown_ec2_linux
          # Used for GPU_FLAG since that doesn't play nice
          # shellcheck disable=SC2086,SC2090
          container_name=$(docker run \
            ${GPU_FLAG:-} \
            -e PR_NUMBER \
            -e GITHUB_ACTIONS \
            -e GITHUB_REPOSITORY \
            -e GITHUB_WORKFLOW \
            -e GITHUB_JOB \
            -e GITHUB_RUN_ID \
            -e GITHUB_RUN_NUMBER \
            -e GITHUB_RUN_ATTEMPT \
            -e JOB_ID \
            -e BRANCH \
            -e SHA1 \
            --user $(id -u):$(id -g) \
            --ulimit stack=10485760:83886080 \
            --ulimit core=0 \
            --security-opt seccomp=unconfined \
            --cap-add=SYS_PTRACE \
            --shm-size="8g" \
            --tty \
            --detach \
            --name="${container_name}" \
            --user jenkins \
            --privileged \
            -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
            -w /var/lib/jenkins/workspace \
            "${DOCKER_IMAGE}"
          )
          # save container name for later steps
          echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV"
          # jenkins user does not have write permission to mounted workspace;
          # work-around by copying within container to jenkins home
          docker exec -t "${container_name}" sh -c "bash ${TEST_COMMAND}"

      - name: Change permissions
        if: ${{ always() && steps.test.conclusion }}
        run: |
          docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "sudo chown -R jenkins:jenkins test"

      - name: Collect backtraces from coredumps (if any)
        if: always()
        run: |
          # shellcheck disable=SC2156
          find . -iname "core.[1-9]*" -exec docker exec "${CONTAINER_NAME}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \;

      - name: Stop container before exit
        if: always()
        run: |
          # Workaround for multiple runners on same IDC node
          docker stop "${{ env.CONTAINER_NAME }}"

      - name: Store Core dumps on GitHub
        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
        if: failure()
        with:
          name: coredumps-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}
          retention-days: 14
          if-no-files-found: ignore
          path: ./**/core.[1-9]*

      - name: Teardown XPU
        uses: pytorch/pytorch/.github/actions/teardown-xpu@main
```
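The coredump collection step's `-iname "core.[1-9]*"` pattern only matches files named like `core.<pid>`: the character class requires a digit right after `core.`, and the match is anchored to the whole basename. A sketch with hypothetical filenames:

```shell
# Sketch: which files the -iname "core.[1-9]*" pattern picks up.
tmp=$(mktemp -d)
touch "$tmp/core.12345"   # typical kernel coredump name: matches
touch "$tmp/core.txt"     # 't' is not a digit: no match
touch "$tmp/encore.9"     # basename must start with "core.": no match

# Same find pattern as the workflow; print only the basenames.
found=$(find "$tmp" -iname "core.[1-9]*" -exec basename {} \;)
echo "$found"

rm -rf "$tmp"
```

Only `core.12345` is found, so the step runs `gdb` just on real coredumps rather than on every file containing "core".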
Review comment: Can reuse the action in pytorch directly.

Author reply: yes, this is literally ported from pytorch.