<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <!-- Meta tags for social media banners. Fill these in appropriately, as they are your "business card". -->
  <!-- Replace the content tag with appropriate information -->
  <meta name="description" content="ViTHL – Vision‑Transformer‑based Hybrid Localization for Humanoid Soccer Robots">
  <meta property="og:title" content="ViTHL – ViT‑Based Hybrid Localization"/>
  <meta property="og:description" content="Hybrid Vision‑Transformer + UKF‑MCL localization framework accepted at RoboCup 2025 Symposium"/>
  <meta property="og:url" content="https://vithl.github.io"/>
  <!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200×630 -->
  <meta property="og:image" content="static/images/social_banner.png" />
  <meta property="og:image:width" content="1200"/>
  <meta property="og:image:height" content="630"/>

  <!-- Keywords for your paper to be indexed by -->
  <meta name="keywords" content="Robot Localization, Vision Transformer, UKF, Monte Carlo Localization, RoboCup, Humanoid Robot">
  <meta name="viewport" content="width=device-width, initial-scale=1">


  <title>ViTHL</title>
  <link rel="icon" type="image/x-icon" href="static/images/favicon.ico">
  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
  rel="stylesheet">

  <link rel="stylesheet" href="static/css/bulma.min.css">
  <link rel="stylesheet" href="static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
  href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
  <script defer src="static/js/fontawesome.all.min.js"></script>
  <script src="static/js/bulma-carousel.min.js"></script>
  <script src="static/js/bulma-slider.min.js"></script>
  <script src="static/js/index.js"></script>
</head>
<body>


  <section class="hero">
    <div class="hero-body">
      <div class="container is-max-desktop">
        <div class="columns is-centered">
          <div class="column has-text-centered">
            <h1 class="title is-1 publication-title">ViTHL: A ViT-Based Hybrid Localization Method for a Humanoid Soccer Robot</h1>
            <div class="is-size-5 publication-authors">
              <!-- Paper authors -->
              <span class="author-block"><a href="https://www.linkedin.com/in/soheilkhatibi/" target="_blank">Soheil Khatibi</a>,</span>
              <span class="author-block"><a href="https://www.linkedin.com/in/arash--rahmani/" target="_blank">Arash Rahmani</a>,</span>
              <span class="author-block"><a href="https://www.linkedin.com/in/amazadfar/" target="_blank">AmirMasoud Azadfar</a>,</span>
              <span class="author-block"><a href="https://www.linkedin.com/in/pradovinicius/" target="_blank">Vinicius Prado da Fonseca</a>,</span>
              <span class="author-block"><a href="https://www.linkedin.com/in/teado/" target="_blank">Thiago Eustaquio Alves de Oliveira</a></span>
            </div>
            <div class="is-size-5 publication-authors">
              <span class="author-block">Lakehead University · Politecnico di Torino · Memorial University of Newfoundland<br>XXVIII RoboCup International Symposium 2025</span>
            </div>

            <div class="column has-text-centered">
              <div class="publication-links">
                <span class="link-block">
                  <a href="https://vithl.github.io/" target="_blank"
                  class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fab fa-github"></i>
                  </span>
                  <span>Code - Coming Soon</span>
                  </a>
                </span>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>


<!-- Paper abstract -->
<section class="section hero is-light">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>Accurate localization is essential for autonomous performance in humanoid robot soccer, where dynamic environments and partial observations challenge conventional methods. In this paper, we propose ViTHL (Vision Transformer-based Hybrid Localization), a novel localization framework that combines vision-based global estimation with probabilistic filtering to enhance robustness and accuracy. ViTHL utilizes a Vision Transformer (ViT) architecture to process images captured from the robot’s onboard camera and provide a global estimate of the robot’s position. These global observations are fused with kinematic and inertial data through a hybrid filtering scheme that combines an Unscented Kalman Filter (UKF) with a Monte Carlo Localization (MCL) process, which refines the estimate by accounting for motion uncertainty and sensor noise. The proposed method is validated in simulation on an OP3 humanoid robot using the Webots platform. Experimental results demonstrate that our approach outperforms traditional vision-based UKF and MCL methods in localization accuracy and convergence time. Furthermore, the system exhibits robust performance under partial occlusions and changing lighting conditions, which are common in RoboCup scenarios. Our findings highlight the effectiveness of combining deep learning-based perception with probabilistic filtering for real-time humanoid localization in complex, adversarial environments.          </p>
        </div>
      </div>
    </div>
  </div>
</section>
<!-- End paper abstract -->

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Overview</h2>
        <div class="content has-text-justified">
          <h3 class="title is-4">Overall Pipeline of ViTHL</h3>
          <img src="static/images/Fig1.png"
               alt="Overall Pipeline"
               class="center-img"/>
          <p>(a) The Robotis OP3 robot. (b) The image observed by the robot. (c) Position prediction made by the ViT from the robot's image. (d) Odometry estimation. (e) Hybrid localization loop. (f) Pose estimate of the robot on the field, obtained from the hybrid filter.</p>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Methodology</h2>
        <div class="content has-text-justified">
          <h3 class="title is-4">Vision Transformer Front‑End</h3>
          <p>Input RGB frames (224×224) are split into 16×16 patches and projected to a token sequence. Six Transformer encoder blocks with multi‑head self‑attention learn global field context. The CLS token passes through a regression head that outputs a 7‑D pose vector (quaternion + XYZ).</p>
          <img src="static/images/Fig2-1-new.png"
               alt="ViT Architecture"
               class="center-img"/>
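          <p>As a rough illustration of the tokenization described above, the Python sketch below counts the tokens that a 224×224 RGB frame yields with 16×16 patches and splits a nested-list image into flattened patches. The function names <code>token_counts</code> and <code>patchify</code> are hypothetical, the row-major patch order is an assumption, and the learned linear projection and the encoder blocks are omitted.</p>
          <div class="box" style="overflow-x:auto;">
<pre><code>
```python
def token_counts(image_size=224, patch_size=16, channels=3):
    """Return (num_patches, values_per_patch, sequence_length_with_cls)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    values_per_patch = patch_size * patch_size * channels
    # One extra position is reserved for the CLS token.
    return num_patches, values_per_patch, num_patches + 1


def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches (row-major)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


print(token_counts())  # (196, 768, 197)
```
</code></pre>
</div>
          <p>With the stated sizes this gives 14×14 = 196 patches of 768 values each, or 197 tokens once the CLS token is included.</p>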
          <h3 class="title is-4">Hybrid UKF–MCL Back‑End</h3>
          <ul>
            <li><strong>Prediction:</strong> Dead‑reckoning from joint encoders and the IMU updates each particle and the UKF state.</li>
            <li><strong>Update:</strong> High‑confidence ViT observations trigger a UKF update; ambiguous observations trigger an MCL reweight/resample, preserving multimodal beliefs.</li>
            <li><strong>Switch Criterion:</strong> ViT output entropy and field symmetry inform the UKF↔MCL switch.</li>
          </ul>
          <div class="box" style="overflow-x:auto;">
<pre><code>Algorithm:&nbsp;&nbsp;&nbsp;Hybrid&nbsp;Localization&nbsp;Update&nbsp;Per&nbsp;Step
Require:&nbsp;Image I<sub>t</sub>,&nbsp;Odometry&nbsp;u<sub>t</sub>,&nbsp;Previous&nbsp;state&nbsp;s<sub>t-1</sub>
1:&nbsp;&nbsp;ŝ<sub>t</sub>&nbsp;←&nbsp;ViT_Predict(I<sub>t</sub>)
2:&nbsp;&nbsp;s̄<sub>t</sub>&nbsp;←&nbsp;MotionModel(s<sub>t-1</sub>,&nbsp;u<sub>t</sub>)
3:&nbsp;&nbsp;if&nbsp;Confidence(ŝ<sub>t</sub>)&nbsp;&gt;&nbsp;τ&nbsp;then
4:&nbsp;&nbsp;&nbsp;&nbsp;s<sub>t</sub>&nbsp;←&nbsp;KalmanUpdate(s̄<sub>t</sub>,&nbsp;ŝ<sub>t</sub>)
5:&nbsp;&nbsp;else
6:&nbsp;&nbsp;&nbsp;&nbsp;s<sub>t</sub>&nbsp;←&nbsp;MCLUpdate(s̄<sub>t</sub>,&nbsp;ŝ<sub>t</sub>)
7:&nbsp;&nbsp;end&nbsp;if
8:&nbsp;&nbsp;return&nbsp;s<sub>t</sub>
</code></pre>
</div>
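          <p>The control flow of the algorithm above can be sketched in a few lines of Python. This is a deliberately simplified toy, not the method's implementation: the state is a 1-D position, a scalar Kalman update stands in for the UKF update, the confidence score is passed in rather than derived from ViT output entropy, and the names <code>hybrid_step</code>, <code>kalman_update</code>, <code>mcl_update</code> as well as the default values of <code>tau</code>, <code>motion_var</code> and <code>z_var</code> are all assumptions.</p>
          <div class="box" style="overflow-x:auto;">
<pre><code>
```python
import math
import random


def kalman_update(mean, var, z, z_var):
    """Scalar Kalman measurement update (a stand-in for the UKF update)."""
    gain = var / (var + z_var)
    return mean + gain * (z - mean), (1.0 - gain) * var


def mcl_update(particles, z, z_std, rng):
    """Reweight particles by a Gaussian likelihood of z, then resample."""
    weights = [math.exp(-0.5 * ((p - z) / z_std) ** 2) for p in particles]
    if sum(weights) == 0.0:  # every likelihood underflowed: keep the prior set
        return list(particles)
    return rng.choices(particles, weights=weights, k=len(particles))


def hybrid_step(mean, var, particles, u, z, confidence, tau=0.8,
                motion_var=0.01, z_var=0.04, rng=random):
    """One predict/update cycle on a 1-D position, gated by confidence."""
    # Prediction: apply odometry u to the Gaussian state and to every particle.
    mean, var = mean + u, var + motion_var
    noise_std = math.sqrt(motion_var)
    particles = [p + u + rng.gauss(0.0, noise_std) for p in particles]
    if confidence > tau:
        # Confident observation: Gaussian (Kalman-style) correction.
        mean, var = kalman_update(mean, var, z, z_var)
    else:
        # Ambiguous observation: reweight and resample the particle set.
        particles = mcl_update(particles, z, math.sqrt(z_var), rng)
        mean = sum(particles) / len(particles)
        var = sum((p - mean) ** 2 for p in particles) / len(particles)
    return mean, var, particles
```
</code></pre>
</div>
          <p>In this toy version a confident observation pulls the Gaussian estimate toward the ViT prediction, while a low‑confidence one only reshapes the particle set, so several hypotheses can survive until a clearer frame arrives.</p>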
          <h3 class="title is-4">Training &amp; Sim‑to‑Real</h3>
          <p>The training set consists of 15k synthetic frames collected in Webots with domain randomization (lighting, textures, camera jitter). The ViT is trained with Adam (learning rate 1e‑4, batch size 64, 100 epochs) to minimize the sum of squared errors (SSE) on 6‑DoF poses, and is then fine‑tuned on real footage.</p>
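          <p>For concreteness, a sum‑of‑squared‑errors loss over a batch of pose vectors could look like the following plain‑Python sketch. The helper name <code>sse_loss</code> is hypothetical, the component order (quaternion first, then XYZ) is an assumption, and a real training loop would compute this on tensors in a deep learning framework.</p>
          <div class="box" style="overflow-x:auto;">
<pre><code>
```python
def sse_loss(predicted, target):
    """Sum of squared errors over a batch of equal-length pose vectors."""
    if len(predicted) != len(target):
        raise ValueError("batch sizes differ")
    total = 0.0
    for pred_pose, true_pose in zip(predicted, target):
        if len(pred_pose) != len(true_pose):
            raise ValueError("pose vectors differ in length")
        total += sum((p - t) ** 2 for p, t in zip(pred_pose, true_pose))
    return total


# One 7-D pose per sample: identity quaternion followed by an XYZ position.
predicted = [[1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0]]
target = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]]
print(sse_loss(predicted, target))  # 0.5
```
</code></pre>
</div>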
        </div>
      </div>
    </div>
  </div>
</section>

  <footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">

          <p>
            This page was built using the <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Academic Project Page Template</a>, which was adapted from the <a href="https://nerfies.github.io" target="_blank">Nerfies</a> project page.
            You are free to borrow the source code of this website; we just ask that you link back to this page in the footer. <br> This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative
            Commons Attribution-ShareAlike 4.0 International License</a>.
          </p>

        </div>
      </div>
    </div>
  </div>
</footer>

  </body>
  </html>