SLAM-TT: Visualize Ping Pong in 3D!

CCS 1L Fall Project. View real ping pong games in 3D.

A real game of ping pong in 3D

I finished this project last quarter and I lowkey forgot to write a post. Here's the repo:

GitHub: ccs-cs1l-f24/SLAM-TT — transform monocular footage of table tennis into a 3D scene, based on TTNet and WHAM.

Looking back, the video is cringey and too long. Too bad we're not making one for this quarter; I had some fire ideas for my VR Chinese game.

How I Built It

Since a lot of the details are already written up in the README of my GitHub repo, I'll just give a very high-level overview here.

TTNet

TTNet provides the dataset and PyTorch code to train a vision model that detects the ball position in each frame. It also outputs the table boundaries (as a segmentation mask) and spots events, like whether the ball is currently bouncing on the table.
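Roughly, the inference loop looks like the sketch below. Note that `predict` here is a hypothetical wrapper around the repo's multi-task model (the real inference code lives in the TTNet repo), and the stack size and bounce threshold are just illustrative:

```python
import cv2

def predict(model, frames):
    """Hypothetical wrapper around TTNet's multi-task heads.

    Returns the ball position (pixels) and event probabilities for a
    stack of frames; dummy values here so the sketch runs.
    """
    return (0.0, 0.0), {"bounce": 0.0}

model = None  # load the trained TTNet checkpoint here
STACK = 9     # TTNet consumes a short stack of consecutive frames

cap = cv2.VideoCapture("rally.mp4")
window, ball_track, bounces = [], [], []
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    window.append(frame)
    if len(window) > STACK:
        window.pop(0)
    if len(window) == STACK:
        ball_xy, events = predict(model, window)
        ball_track.append((frame_idx, ball_xy))
        if events["bounce"] > 0.5:   # threshold is my choice
            bounces.append((frame_idx, ball_xy))
    frame_idx += 1
cap.release()
```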

I use this model to detect all ball bounces, then use a homography to map those frame coordinates into real-world coordinates relative to the table. Since the ball sits on the table plane at the moment of each bounce, the homography is valid exactly at those points, and from the bounce sequence we can reconstruct the ball position at every frame.
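Concretely, it's a plane-to-plane homography. A minimal sketch with OpenCV (the pixel coordinates are made up; the table dimensions are the standard 2.74 m × 1.525 m):

```python
import cv2
import numpy as np

# The four table corners in frame (pixel) coordinates, e.g. taken from
# TTNet's segmentation mask. These numbers are made up.
corners_px = np.float32([[412, 388], [868, 392], [1010, 560], [270, 552]])

# The same corners in table coordinates (meters). A table tennis table
# is 2.74 m long and 1.525 m wide.
corners_m = np.float32([[0, 0], [2.74, 0], [2.74, 1.525], [0, 1.525]])

H, _ = cv2.findHomography(corners_px, corners_m)

# Map a detected bounce from pixels to table coordinates. This is only
# valid because the ball is on the table plane at the bounce.
bounce_px = np.float32([[[640, 470]]])        # (1, 1, 2) shape for OpenCV
bounce_m = cv2.perspectiveTransform(bounce_px, H)
print(bounce_m)  # position on the table plane in meters
```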

Ball detection + Homography
Paper: TTNet: Real-time temporal and spatial video analysis of table tennis

WHAM

WHAM takes video of a human and estimates their 3D pose and position in world coordinates. I use it to accurately capture each player's movements and strokes.

Luckily, the result can be exported into Blender; from there I load the animation and re-export it into Unity for final processing.
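The Blender step is just a scripted import and re-export. A minimal sketch (run inside Blender's Python console; the paths are placeholders, and I'm assuming the WHAM output has already been converted to an FBX with baked animation):

```python
import bpy

# Bring the WHAM result into the Blender scene.
bpy.ops.import_scene.fbx(filepath="/path/to/wham_result.fbx")

# Re-export with settings Unity is happy with: bake the animation
# and skip Blender's extra leaf bones.
bpy.ops.export_scene.fbx(
    filepath="/path/to/player_for_unity.fbx",
    bake_anim=True,
    add_leaf_bones=False,
    apply_unit_scale=True,
)
```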

Video to 3D Model
Project: WHAM (World-grounded Humans with Accurate Motion)

Unity

The Unity project loads the ball bounce data and player animations and assembles the final 3D scene!
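The actual scene code is C#, but the ball logic boils down to something like this Python sketch: between two consecutive bounces, slide the ball linearly across the table plane and give it a ballistic height. This ignores the racket hit mid-flight, so treat it as the simplest possible model:

```python
G = 9.81  # m/s^2

def ball_position(p0, p1, t0, t1, t):
    """Ball position at time t, given a bounce p0 at t0 and the next at t1.

    p0 and p1 are (x, y) table-plane coordinates in meters; the height z
    follows projectile motion with z(t0) = z(t1) = 0.
    """
    s = (t - t0) / (t1 - t0)           # 0..1 along the segment
    x = p0[0] + s * (p1[0] - p0[0])    # linear in the table plane
    y = p0[1] + s * (p1[1] - p0[1])
    T = t1 - t0
    vz = 0.5 * G * T                   # launch speed so z(T) == 0
    dt = t - t0
    z = vz * dt - 0.5 * G * dt * dt    # parabolic height
    return x, y, z

# e.g. a bounce at (0.6, 0.7) at t=0 s and the next at (2.1, 0.9) at t=0.5 s
print(ball_position((0.6, 0.7), (2.1, 0.9), 0.0, 0.5, 0.25))
```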

Final Result

Conclusion

If I feel like it, I'll update this post with more visuals. I put a lot more effort into this project than it seems at first glance. Check out the video I made for a more in-depth showcase.