![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image14-1000x714.png)
It has become common to point out that biomechanical data is the logical “next frontier” of baseball analytics in the search for an edge against other teams. Organizations have spent millions of dollars acquiring systems to collect such data and hiring analysts and biomechanists to translate it into valuable insights. Given the novelty of the field, most of this data is held privately as teams attempt to preserve their competitive edge. That has left perpetually hungry baseball fans with empty stomachs, settling for the data points that the MLB Stats API and Baseball Savant have graciously offered to the public.
Luckily, sites like Baseball Savant not only offer data points … they offer video as well. In this new age of artificial intelligence, advanced neural networks can process images and video frames directly. This is the domain of Computer Vision, a field dedicated to replicating human vision through advanced modeling techniques, and its models have made turning raw video into data points possible. While the exact outputs vary with the task at hand, a prominent use case is extracting 3D Pose (or biomechanical structure) data of the human body. Utilizing these models, it is now possible for the average MLB fan to gain access to cutting-edge biomechanical data. With it, fans can hopefully extract new insights into the sustainability and level of performance of their favorite teams and players.
Assembly of the Pipeline
In many computer vision problems, implementing a proper pipeline is key to ensuring that the long translation from individual pixel values ends in genuine, precise data. For this problem, the start and end points are fairly obvious: a broadcast feed goes in, and a prediction and visualization of 3D points for given moments in the video comes out. What needs clarifying is the path from A to B.
To get these real 3D points, the 2D points of the actual object are needed. These can be extracted with 2D Pose Models, which estimate a given number of key points on an object. For the human pose, this is commonly done in the H36M format, which consists of 17 key points across the body. However, only certain labeled poses are needed. To solve this, a detection model is used to identify an object's bounding box and its label. Done properly, this yields rectangular xy bounding box coordinates that pinpoint a certain object within an image. To keep only the correct poses, pose predictions should be accepted only if they fall within the bounding box of a labeled object. This leaves labeled 2D key points, which then need to be adjusted for the next step: the 3D Keypoint Model. 2D Pose Models are often not directly compatible with 3D Pose Models, so the numbers usually have to be translated into a format the model accepts. With the input correct, a pre-trained model that is familiar with the movement of the human body produces a 3D positional estimate of the H36M points.
The raw 3D points are now present, but their formatting is usually less than desirable for actual use. Moving the points into an array and shifting them into positive space is crucial for plotting and analysis. Each 3D model varies in its output, so it is important to read the documentation so that both static and dynamic transformations can be applied to place the objects in a real 3D space. With these real 3D points, true biomechanical analysis is possible.
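As a minimal sketch of that adjustment step (assuming the raw output has already been gathered into a NumPy array of shape (frames, 17, 3); the function name is illustrative):

```python
import numpy as np

def to_positive_space(raw_points: np.ndarray) -> np.ndarray:
    """Shift raw 3D keypoints so every coordinate is non-negative.

    Assumes raw_points has shape (frames, 17, 3) in H36M order; subtracting
    the per-axis minimum places the skeleton in the positive octant,
    which simplifies plotting and downstream analysis.
    """
    return raw_points - raw_points.min(axis=(0, 1), keepdims=True)
```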
While this serves as a basic summary of the bottom-up process in this use case, the flow chart below represents the entire pipeline:
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image1-2.png)
Proposed Pipeline for Extracting 3D Pose Data from Broadcast Feeds
Additional functionality was added for the types (and number) of video inputs, with the output providing both a visual reference and the actual 3D points; together, these offer a rough visual gauge of accuracy. Each stage of this pipeline is broken down below.
Video Input
In choosing the right path for this modeling pipeline, the actual video input is a crucial consideration. Prioritizing accessibility, home broadcast video feeds from Baseball Savant were utilized. An example can be seen below:
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image8.png)
This choice was primarily due to the massive quantity of videos available and the clean cutting and labeling of the clips. Almost any clip of a recent pitch can be found on the site and downloaded with nothing but its Play ID. The clips are trimmed so that nearly all of the video covers the pitch itself, which removes any need for editing software when analyzing a game or building variety into training data. Because the clips correspond to Play IDs, any play information (the players involved, count, play result, etc.) for a given video can be retrieved. This makes it easy to build tools that only access videos under certain conditions, allowing for faster processing of multiple clips.
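To illustrate how a clip might be pulled programmatically, here is a minimal sketch; the page and video URL patterns are assumptions about Savant's current layout rather than a documented API:

```python
import re
import requests

def download_savant_clip(play_id: str, out_path: str) -> None:
    """Fetch the broadcast clip for a given Play ID from Baseball Savant.

    Assumes the clip page lives at the sporty-videos URL and exposes the
    video as an mp4 <source> tag; both patterns may change without notice.
    """
    page_url = f"https://baseballsavant.mlb.com/sporty-videos?playId={play_id}"
    page = requests.get(page_url, timeout=30)
    page.raise_for_status()

    match = re.search(r'<source src="([^"]+\.mp4)"', page.text)
    if match is None:
        raise ValueError(f"No video source found for play {play_id}")

    video = requests.get(match.group(1), timeout=60)
    video.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(video.content)
```

For instance, `download_savant_clip("1faa20cf-cd2f-44ae-b118-6d152a5aff4b", "pitch.mp4")` would target the play pictured later in this piece.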
While the Savant video has plenty of pros, it comes with a good number of limitations. The main drawback is the restriction to a single broadcast angle. While the site offers home and away feeds, these are often the same angle and provide no variety in examining the players. A variety of angles helps Computer Vision models generalize objects from multiple vantage points. The video is also limited in both quality and frame rate. These videos are delivered at 1080p, a lower resolution than most high-end cameras used for sports capture, and resolution is crucial for these models, as fewer pixels lead to incorrect assumptions about the visual nature of an object. The frame rate varies slightly from video to video but generally sits between 59 and 60 frames per second, which is not ideal for tracking fast-moving body parts. Higher frame rates render fast-moving objects more clearly, which is crucial for models articulating where an object is in a frame.
Although some specifics of these source videos could be improved, they clear the minimum threshold at which accurate predictions can be made and translated into real-life data. While better source video would likely lead to more accurate estimations, you can only play the cards you are dealt.
Player Detection
To ultimately get 3D points for a given player, we need to know where that player is in the frame. To solve this problem, a pre-trained YOLOv8 object detection model was utilized. YOLO (You Only Look Once) models are special because they view an image in a single pass, allowing for faster processing; other object detection architectures must view an image multiple times through different tasks to generate a prediction. This one-pass prediction is achieved through a unique architecture combining a pyramid-shaped backbone (responsible for extracting features from the images) connected at multiple layers to the head (responsible for taking those features, detecting the object, and generating bounding box predictions). The full architecture is advanced, but it is included below in case the reader wants to better understand how these predictions are made.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image7.png)
YOLOv8 Model Architecture
This architecture, combined with easy pre-built fine-tuning tools, has made the model incredibly efficient and accurate at detecting certain objects within a frame. From there, a choice was required between differently sized models, with larger models performing better but taking longer to predict (the exact metrics can be accessed here). Weighing the marginal gain in performance against the marginal increase in inference time for the problem at hand is crucial to solving it properly. Based on prior trials, the large model was chosen.
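As a sketch of what inference with the pre-trained large model looks like through the ultralytics package (the frame path is illustrative):

```python
from ultralytics import YOLO

# Load the pre-trained large detection model as the starting point.
model = YOLO("yolov8l.pt")

# Single-pass detection on one broadcast frame; each result carries
# bounding boxes, class indices, and confidence scores.
results = model.predict("broadcast_frame.png", conf=0.5)
for box in results[0].boxes:
    print(box.xyxy, box.cls, box.conf)
```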
Pre-trained models are great at detecting general objects, but a fine-tuned version is better suited to this kind of unique player detection problem. To fine-tune a model, the types of objects the user wants to track need to be specified. This is done through a process called annotation: manually marking objects in an image to give the model the ground-truth location of what it is trying to locate. For this model, the pitcher, hitter, and catcher were marked with bounding boxes across 4,090 images. This is on the relatively low end for annotations and will likely be improved in future versions. An example image with these annotations applied is pictured below:
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image14.png)
Example Annotation for the Pitcher / Hitter / Catcher Detector Model
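For reference, YOLO-style annotations store one object per line in a text file paired with each image; a hypothetical label line for a frame like the one above, using the class indices assumed for this dataset, might look like:

```python
# Assumed class mapping: 0 = pitcher, 1 = hitter, 2 = catcher.
# Box coordinates are normalized to image width and height.
label_line = "0 0.512 0.604 0.083 0.210"  # class x_center y_center width height
cls, xc, yc, w, h = label_line.split()
print(f"class={cls} center=({xc}, {yc}) size=({w}, {h})")
```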
With these annotations, fine-tuning is now possible. This is done with a wide array of hyperparameters: pre-defined configurations for how a machine learning model trains, which can ultimately affect how well it performs. To optimize these configurations within a given search space, a tuning script iterated through 40 models for a short five epochs (training cycles) each, which took about 10.1 hours on a T4 GPU instance. The overall fitness of the model as it was tuned is shown:
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image13.png)
Model Fitness Measure as Tuning Progressed
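In code, the run described above maps onto the ultralytics built-in tuner; a sketch, with the dataset file and final epoch count as illustrative assumptions:

```python
from ultralytics import YOLO

# Start from the pre-trained large model; data.yaml points at the
# annotated pitcher/hitter/catcher dataset (paths illustrative).
model = YOLO("yolov8l.pt")

# Search 40 candidate hyperparameter configurations, each trained for
# a short five epochs, mirroring the tuning run charted above.
model.tune(data="data.yaml", epochs=5, iterations=40)

# A full fine-tune would then train with the best configuration found.
model.train(data="data.yaml", epochs=50, imgsz=640)
```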
Pictured below is the relative performance of the model for each hyperparameter selection within the search space being investigated.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image12.png)
Results for Each Hyperparameter Through the Search
While the exact specifics of each hyperparameter are not important for understanding the model (they can be found here), it is worth noting that several optimal choices (at least relative to the given search space) were found. Others needed further searching, as some results sat at the minimum or maximum of the range set before the search. Due to limits on available GPU hours, the model unfortunately could not be tuned for more iterations. Even so, the tuning did lead to some improvement. The most prominent metric for these models is mAP50-95, which measures the average area under the precision-recall curve (the trade-off between the percentage of correct positive predictions and the percentage of actual positives identified) across different intersection-over-union thresholds (the amount of overlap between a prediction and the ground-truth bounding box). On a scale from 0 to 1, the model improved from 0.8 to 0.817 after hyperparameter tuning. While this may seem minute, the benchmark large model requires an additional 24.5 million parameters, in the form of the extra-large model, to improve its mAP50-95 by 0.01 (less than the improvement seen here): gains on already-strong models are incredibly difficult.
This marginal performance gain was visible in the detector's predictions.
It is worth noting that the struggles were not equal across classes; the pitcher seemed much easier to detect than the hitter or catcher. This is likely due to the proximity of the bounding boxes, as the hitter and catcher occupy a relatively similar space and are sometimes partially occluded by the pitcher. Additional images and further fine-tuning would likely address this issue.
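For reference, the intersection-over-union quantity underlying mAP50-95 is simple to compute; a minimal sketch:

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union for two (x1, y1, x2, y2) boxes; mAP50-95
    averages precision over IoU thresholds from 0.50 to 0.95."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```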
2D Keypoint Estimation
Given that the labeled bounding boxes are in place for the different positions, human pose key points can be extracted for those players. For the 2D Keypoint Model, the RTMPose Medium model was chosen. The model takes a top-down approach to inference: it starts from a general human bounding box detection and crops the image to a uniform scale before extracting the pose key points. It achieves this with a robust architecture of Convolutional Neural Networks and Gated Attention Units (GAUs) that extract high-level features and then focus on the most relevant information, ultimately treating the x and y coordinates of the key points as a classification problem. A visual is included for those aiming to understand the model architecture further, although it is not needed for a general understanding of this pipeline.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image4.png)
Model Architecture of RTMPose
The most important takeaway from this model choice is its intended use. While most pose estimation research chases overall accuracy at a heavy cost in prediction time, this model was designed for real-world industrial situations, prioritizing the trade-off between inference speed and accuracy. Given that broadcast feeds contain a large amount of video, even a minimal gain in speed has a large-scale impact on production. Paired with the model's easy accessibility and pre-built libraries that make inference less time-consuming, it provides a great partial solution for this unique problem.
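As an illustration of those pre-built libraries, inference through mmpose's high-level interface might look like the sketch below; the 'rtmpose-m' alias and the result layout are assumptions about the current API, so consult the mmpose documentation before relying on them:

```python
from mmpose.apis import MMPoseInferencer

# Assumed alias for the medium RTMPose model; see the mmpose model zoo.
inferencer = MMPoseInferencer(pose2d="rtmpose-m")

# The inferencer yields results lazily; each result carries per-person
# keypoints and confidence scores for the supplied frame.
result = next(inferencer("broadcast_frame.png"))
for person in result["predictions"][0]:
    print(person["keypoints"], person["keypoint_scores"])
```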
In applying this solution, the model first predicts on the full image with a cap on the number of human poses to detect. For the broadcast feeds, that maximum is set to four, keeping the most confident human pose predictions (one more than the three players of interest, to allow for error). After these predictions are made, the Pitcher / Hitter / Catcher Detector predicts the bounding boxes of the players (with a maximum of one detection per class), as mentioned in the detection phase. From there, the top three poses that fall at least 70% inside a bounding box are kept. The 70% benchmark is somewhat arbitrary, although the lower thresholds that were tested produced incredibly misleading results. This allows the 2D key points of a given player to be measured with a high degree of confidence, which can be seen in this example.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image6.png)
Box Detection and 2D Pose Prediction on Play ID 1faa20cf-cd2f-44ae-b118-6d152a5aff4b
As visualized in the image, the bounding box for a given player is set with the vast majority of the key points inside it. In the interest of honesty, while this is a perfect example, a failure of the PHC Detector to recognize a given player also means the 2D key points go undetected, which ultimately causes the model to fail for that player in that frame. The success of the aforementioned detector is crucial to this entire process, which is why it continues to undergo further development. With pose points detected, the next problem is translating them into 3D space.
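Before moving on, here is one plausible implementation of the 70% containment test described above, reading it as the share of a pose's key points that land inside a player's box (function names hypothetical):

```python
import numpy as np

def fraction_inside(keypoints: np.ndarray, box) -> float:
    """Share of (n, 2) keypoints that fall inside an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    inside = (
        (keypoints[:, 0] >= x1) & (keypoints[:, 0] <= x2)
        & (keypoints[:, 1] >= y1) & (keypoints[:, 1] <= y2)
    )
    return float(inside.mean())

def assign_pose(poses, box, threshold=0.7):
    """Pick the pose best contained by a player's box; return None when no
    candidate clears the benchmark, so the frame fails for that player."""
    best = max(poses, key=lambda p: fraction_inside(p, box))
    return best if fraction_inside(best, box) >= threshold else None
```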
3D Keypoint Estimation
Compared to 2D Pose Estimation, 3D Pose Estimation is a relatively new endeavor still undergoing major work toward the most accurate systems. Despite that, workable models are publicly available that allow these video-based pose points to be “lifted” into a 3D world. The model utilized for this project was MotionBERT, a monocular system pre-trained by taking ground-truth 2D poses and corrupting them with random noise, allowing a MotionEncoder and a fully connected layer to recover the underlying 3D motion. It was then fine-tuned with the MotionEncoder to identify and perform well on certain sports movements. This visual represents the framework of the system.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image5.png)
Framework Overview of the MotionBERT system
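As a rough illustration of that pretraining idea (noise and masking parameters invented for the example), the corruption step amounts to jittering ground-truth 2D poses and zeroing out random joints before asking the network to recover the clean 3D motion:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_2d(pose_2d: np.ndarray, noise_std: float = 0.02,
               mask_prob: float = 0.15) -> np.ndarray:
    """Jitter and mask a ground-truth 2D pose sequence of shape
    (frames, 17, 2), mimicking the corruption used in pretraining."""
    noisy = pose_2d + rng.normal(0.0, noise_std, size=pose_2d.shape)
    mask = rng.random(pose_2d.shape[:2]) < mask_prob  # drop whole joints
    noisy[mask] = 0.0
    return noisy
```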
Once again, understanding the background of this model is not absolutely necessary. One key piece to take away from its design, however, is its monocular nature: while some models need multiple camera views of a skeleton to reconstruct a 3D pose, MotionBERT needs only a single camera view to estimate the points. A monocular approach allows for a much more straightforward pipeline, directly converting single-view 2D estimations into 3D space.
In using this model for prediction, note that it accepts the 2D points as inputs, although the RTMPose model's output does not naturally line up with MotionBERT's expected format. The sequence of pose points is extracted from the 2D model and padded with values so that it matches the expected input. The 3D model then acts as a pose lifter from the 2D points, outputting the unadjusted raw data this whole project is after. In this specific use case, the output includes a long string of numbers corresponding to key points, the detector box, and the confidence scores of the detection. Having raw data is a plus, but to gain any insight, it has to be adjusted.
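A sketch of that translation step, packing each 2D point and its confidence score into a (frames, joints, channels) array for the lifter; the exact shape MotionBERT's inference wrapper expects may differ, so treat this as illustrative:

```python
import numpy as np

def to_lifter_input(keypoints: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Combine (frames, 17, 2) pixel coordinates with (frames, 17)
    confidence scores into the (frames, 17, 3) layout used for lifting."""
    return np.concatenate([keypoints, scores[..., np.newaxis]], axis=-1)
```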
Adjusting and Utilizing Key Points
Workable versions of the key points are now available; all that remains is some translation so they can be visualized and labeled in a 3D space. While the 3D output is already normalized to the size of the video, plotting these points directly gives a relatively unsatisfactory view: in most videos, right-handed pitchers would be rendered from a back-side angle (with only left-handers clearly visible). To address this, the x-axis of each player was inverted to a right-handed standard. After translation, the points also needed to be restructured against the Pitcher / Hitter / Catcher detector's output, so that the key points sit under given player labels rather than as unlabeled pose detections for spots of interest. With these transformations, the points were plotted as such:
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image9.png)
Example of the 3D Pose Estimation Visual
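The handedness fix described above is itself a one-line transform; a minimal sketch, assuming the points arrive as a (frames, 17, 3) array:

```python
import numpy as np

def mirror_to_right_handed(points: np.ndarray) -> np.ndarray:
    """Invert the x-axis of (frames, 17, 3) keypoints so every pitcher
    is viewed from the same right-handed standard angle."""
    mirrored = points.copy()
    mirrored[..., 0] *= -1.0
    return mirrored
```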
Using a single camera angle, preliminary 3D points were extracted! For reference, the points in the image follow the format of the H36M dataset.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image10.png)
Human3.6M Points for Body Parts
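For working with these points in code, the 17 labels used by this pipeline's output, in the order they appear in the JSON example later in this piece, are:

```python
# The 17 H36M-format keypoints as labeled in this pipeline's output.
H36M_JOINTS = [
    "Center Hip", "Right Hip", "Right Knee", "Right Ankle",
    "Left Hip", "Left Knee", "Left Ankle", "Thorax",
    "Neck", "Head", "Nose", "Left Shoulder",
    "Left Elbow", "Left Wrist", "Right Shoulder",
    "Right Elbow", "Right Wrist",
]
```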
In considering these 3D predictions, one main question persists: how accurate are these points? The general model is evaluated using a metric called Mean Per Joint Position Error (MPJPE), which measures the Euclidean distance between the predicted and ground-truth key points. On the standardized datasets, it was one of the best-performing publicly available models. Unfortunately, gauging the exact score of this specific prediction is not possible, because ground-truth 3D points are not publicly available. It is possible, however, to perform a subjective “eye test” comparing the 2D points on a given video with the 3D points beside them. For this evaluation, even a single image can help decode the accuracy of the system.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image11.png)
2D and 3D Predictions Visualized Side-by-Side
Accounting for the pose being mirrored, the points line up fairly well with the frame at this particular moment. For the pitcher, the foot locations and body structure appear accurate, with the arm slightly behind where the release actually is at that moment. The hitter's structure looks almost perfect, with the arms and body pose exactly mimicking the video. The catcher's pose is less accurate (likely due to the crouch itself), with a predicted part of the leg connecting to the glove. While the model struggled to extract the catcher's bottom half, the upper body seems correct. While not perfect, these preliminary results suggest there is some signal to work with.
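Should ground-truth 3D points ever become available for these clips, the quantitative check is straightforward; a minimal MPJPE sketch:

```python
import numpy as np

def mpjpe(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Mean Per Joint Position Error: Euclidean distance between predicted
    and ground-truth joints, averaged over joints and frames; both arrays
    have shape (frames, 17, 3)."""
    return float(np.linalg.norm(predicted - ground_truth, axis=-1).mean())
```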
Usage Considerations
Utilizing Predictions
With all of the necessary parts in place, it is now possible to utilize these extracted points as a tool. While not the final product, the video below demonstrates the working prototype that could be generated for every single pitch with broadcast data.
The descriptions are relatively basic, although any needed information about the pitch could be accessed through the MLB Stats API (a nicer-looking format is currently in development). As mentioned in the pipeline overview, a file is also outputted with 3D point data for each of the respective positions. The current output is a CSV file containing the frame number and a JSON string of the corresponding data (JSON being the preferred format in many data processing pipelines), although the ultimate product would likely pack the entire data for a given play into a single JSON string that could be appended to a data frame similar to one generated by PyBaseball or baseballr. The current format for a single player during a single frame looks like this:
"pitcher": { "Center Hip": {"x": -0.0, "y": 0.0, "z": 0.7882807850837708}, "Right Hip": {"x": -0.05186059698462486, "y": -0.11169151216745377, "z": 0.7761267423629761}, "Right Knee": {"x": -0.024043701589107513, "y": 0.03424089029431343, "z": 0.37822145223617554}, "Right Ankle": {"x": 0.004068847745656967, "y": 0.16503921151161194, "z": 0.0}, "Left Hip": {"x": 0.05172949284315109, "y": 0.11062503606081009, "z": 0.801512598991394}, "Left Knee": {"x": 0.007803955115377903, "y": 0.15864424407482147, "z": 0.38779690861701965}, "Left Ankle": {"x": 0.013590332120656967, "y": 0.23970089852809906, "z": 0.02020728588104248}, "Thorax": {"x": -0.011980713345110416, "y": -0.03061842918395996, "z": 1.007973313331604}, "Neck": {"x": -0.031169677153229713, "y": -0.0776493027806282, "z": 1.2440235614776611}, "Head": {"x": -0.1498754918575287, "y": -0.046518050134181976, "z": 1.348961353302002}, "Nose": {"x": -0.1529199182987213, "y": -0.13321222364902496, "z": 1.3743559122085571}, "Left Shoulder": {"x": 0.05527563393115997, "y": 0.027419473975896835, "z": 1.244412899017334}, "Left Elbow": {"x": 0.05129101499915123, "y": 0.24122972786426544, "z": 1.0844157934188843}, "Left Wrist": {"x": -0.17926855385303497, "y": 0.18774807453155518, "z": 1.1955029964447021}, "Right Shoulder": {"x": -0.09562744200229645, "y": -0.18671827018260956, "z": 1.2085908651351929}, "Right Elbow": {"x": -0.26944994926452637, "y": -0.12566475570201874, "z": 1.0434675216674805}, "Right Wrist": {"x": -0.22061987221240997, "y": 0.02009918913245201, "z": 1.2051341533660889} },
The labeling of the points corresponds to the H36M pose estimation format, so standardized methods of pose evaluation can be applied in a baseball setting. The data points are also relative to their space in the image, so they will likely need to be translated into real-world coordinates based on the known distances between static baseball objects visible in the broadcast feeds.
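A sketch of consuming the current output, with the file and column names as assumptions about the format described above:

```python
import json
import pandas as pd

# Assumed file/column names for the frame-number + JSON-string output.
df = pd.read_csv("play_keypoints.csv")
df["parsed"] = df["keypoints_json"].apply(json.loads)

# Example analysis hook: the pitcher's right wrist height (z) per frame.
wrist_z = df["parsed"].apply(lambda d: d["pitcher"]["Right Wrist"]["z"])
```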
Even though these data points are not real-world measurements, assumptions about the human body can still be made that allow biomechanical analysis with the help of other public data (such as a player's height and weight serving as a proxy for body-part lengths). With access to both the raw data and the comparison video, analysis can be conducted and checked for accuracy, making 3D MLB biomechanical data available to the public.
Acknowledging Limitations
Providing something relatively new in this space is exciting. However, it is necessary to acknowledge the drawbacks of the system and be realistic about what it currently accomplishes and how it generally performs. Most of the system's problems stem from one place: the overall fragility of the pipeline.
It was noted earlier that a failure of the PHC Detector to recognize a player leads to points not being picked up at all. The potential for failure does not end there: a misread box or an oversized prediction also breaks the entire system. This includes the hitter and catcher detections being switched, mislabeling the biomechanical data so that some of the output is erroneous. Umpires also sometimes fall inside a bounding box above the 70% threshold, causing pose predictions to be returned for them instead of the actual player. While inaccurate data can be filtered based on the predictions of prior frames, the ideal outcome would be avoiding these errors in the first place.
Beyond the detector, an inaccurate 2D reading of a given frame automatically corrupts the resulting 3D output. While the RTMPose model is stronger than most alternatives (YOLO Pose was significantly inferior to the current implementation), it is not perfect by any means, sometimes picking up incorrect points on the body. When that happens, the fragility of the prediction mechanisms is exposed. An ideal world would have the model be more specific to baseball movements, although that would be a time-consuming (if still achievable) endeavor.
Future Development
As mentioned throughout this post, while predicting relatively accurate 3D data points from a single camera angle is valuable, several aspects of this project warrant future development in the hope that such a system could be applied more accurately at the major-league level and later scaled for amateur use. These considerations vary in both scale and reasoning, and each is elaborated upon below.
Construction of Direct Video Database and Pipeline
The main hindrance in the development of this project was the reliance on an external source for video. Processing could not even begin until a given video was flagged for download, which then required a request to Baseball Savant to return the video. While this pipeline does work currently, it is not ideal for long-term processing.
In a complete, full-scale version of this project, thousands of videos (one per pitch) would need to be processed every day during the season. The seemingly minuscule time it takes to retrieve a video from the site would quickly add up, making the collection and storage of a video database a necessity that saves hours each day of retrieving videos. In the hope of extending this project beyond MLB video, amateur data is less easily retrieved and would likely need its own solution so that videos carry some form of labeling within a database.
Further Training on the PHC Detector
The imperfections of the Pitcher / Hitter / Catcher detector were noted throughout this piece, with specific emphasis on the model's confusion in classifying catchers and some hitters. Reviewing the most prominent weaknesses, the model struggled to detect catchers at all and was confused when catchers stood up during a play, seemingly mimicking hitters. These shortcomings are likely due to the limited quantity and variety of training images, which did not capture the catcher in a wide enough range of situations.
The original detector had a little over 4,000 images; it is estimated that over 10,000 would be needed to properly detect the players. The images also likely need to be processed at a higher resolution, as the current 640×640 input resolution does not seem sufficient for this task. To keep inference speeds reasonable, the model will stay the same size for now, although the extra-large model will likely be explored to achieve the highest degree of accuracy and ensure the entire pipeline functions correctly.
Increased Variety of Training Images
While an increased variety of training images would benefit both the PHC detection model and the RTMPose 2D Keypoint model, the approaches differ. For the PHC model, a script was originally used to filter through many Baseball Savant videos and take some plays throughout each game. The inherent problem is that getting a great variety of training frames requires pulling a great deal of video while taking a minimal number of frames per video, as sketched below. The degree to which this was done was less than ideal, which is why it is being improved in future iterations of the model.
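A minimal sketch of that sampling approach with OpenCV, grabbing a handful of evenly spaced frames per downloaded clip:

```python
import cv2

def sample_frames(video_path: str, n_frames: int = 3) -> list:
    """Grab n_frames evenly spaced frames from a clip for annotation,
    trading many downloads for minimal frames per video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```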
The RTMPose 2D Keypoint model would benefit from additional fine-tuning on baseball-specific images. It was likely not exposed to these kinds of figures during training and is therefore relying solely on the generalization it developed across a wide range of human poses. Although human key point annotation is relatively time-consuming, pre-trained pose models could likely be integrated into annotation software, letting the annotator simply adjust the points on inaccurate images to hasten the process.
Altering / Improving 3D Model Predictions
While the MotionBERT 3D Pose Model utilized here was generally one of the best performers on public benchmark datasets, those metrics are not specific to estimating poses for baseball motions. Given that these models perform differently when faced with different actions, the model would likely benefit from additional fine-tuning with ground-truth 2D and 3D pose key points (in other words, the true locations of the points in a given space). This process would be somewhat time-consuming, although fine-tuning would likely increase the degree of accuracy for baseball poses and provide general quantitative metrics for evaluating the model.
To perform this task, ground-truth 2D points would be collected with manual annotation, and ground-truth 3D points would need to be collected using marker-based motion capture technology in a variety of environments.
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image2-2.jpg)
Example of Marker-Based Capture for Biomechanical Data (University of Nebraska – Omaha)
Video would also need to be taken during the capture, as the model would need to train on the additional video data in attempting to reconstruct the 3D Poses for those baseball situations. Such an endeavor would likely yield favorable results if conducted properly by an independent company or team, although such a pursuit would likely not be beneficial to an individual user in terms of the needed investment and time.
Moving Player Predictions into a 3D Landscape
The current 3D key point outputs are coordinates relative to their space within a given RTMPose detection bounding box. This is satisfactory for basic biomechanical analysis, but not for recreating the layout of the field and each player's position relative to the others. The desired result would likely resemble MLB's new Gameday 3D system, which uses each stadium's Hawk-Eye camera setup to track player and object movement throughout a game.
Visual of Gameday 3D System
The main difference between these systems is that the Hawk-Eye setup uses 12 cameras, while the proposed methodology uses one. While more camera angles would likely be necessary to reach a similar degree of coverage (the entire field is accessible through Gameday 3D), a fine-tuned 3D Pose model could return comparable results with a fraction of the equipment.
Addition of 6D Pose Estimation in Considering Objects
In an era where object tracking has only just been achieved through 2D Computer Vision methods and depth estimation from multiple angles, the logical next step in making this technology more practical and accessible is integrating 6D Pose models to estimate the moving parts of the game, such as the catcher's glove or the ball. For those unfamiliar, 6D Pose Estimation is the process of using RGB imagery of an object to estimate its six degrees of freedom: its position along the x, y, and z axes, plus its rotation around each of those axes. This type of modeling is most notably used in robotics and self-driving cars, where gauging the exact presence of objects is crucial. Accordingly, a great deal of research has gone into making the technology usable in industrial applications, making it a great fit for a potential application to baseball. Shown below is an example of 6D Pose Estimation:
![](https://www.baseballprospectus.com/wp-content/uploads/2024/09/image3-1.png)
6D Pose Estimation Example (“Occlusion Resistant Object Rotation Regression from Point Cloud Segments”)
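To make the six degrees of freedom concrete: a 6D pose is simply a 3D translation paired with a 3D rotation. The values below are purely illustrative:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Example: an object sitting 1.2m right, 0.4m up, and 16.5m out from the
# camera, rotated 30 degrees about the vertical axis (values invented).
translation = np.array([1.2, 0.4, 16.5])
rotation = Rotation.from_euler("xyz", [0.0, 30.0, 0.0], degrees=True)

# A 6D pose model regresses both quantities for each object, each frame.
pose_6d = {"t": translation, "R": rotation.as_matrix()}
```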
While implementing such models would significantly boost what can be tracked with minimal camera equipment, the main hurdles involve fine-tuning, accuracy, and heavy computational cost and inference times. Unlike simple 2D annotation, annotating 6D objects is incredibly time-consuming due to the increased number of points required per object and the difficulty of measuring them accurately. Even assuming that hurdle is cleared, occlusion of the object creates further challenges, along with the number of objects, the backgrounds behind them, and countless other considerations that make this problem incredibly difficult. And even if a model can be developed, predicting 6D points is significantly more computationally expensive and time-consuming than predicting 2D points. To earn its place in a system like this, the extra value would need to be significant.
Conclusion
When I originally pursued applying computer vision to baseball video, the motive was to find an alternative to bat tracking data, which was not public at the time. The team at Savant had been hinting at results from this data, but nothing was available; shortly after I began, they released that data into the world. Biomechanical data is a different story: there has been no indication that this type of data will be publicly accessible anytime soon.
The above outlines a pipeline that takes available broadcast video and turns it into biomechanical data points. The pipeline consists of three main models: the PHC Detector, the RTMPose Estimator, and the MotionBERT 3D Pose Lifter, all predicting from an input of broadcast video. The Detector recognizes players and sets constraints for the 2D Pose Estimator, which then feeds its predictions into MotionBERT for 3D points. Each model's output plays into the next, and any break in the chain causes a partial or total failure of the entire system. The pipeline can then output both a biomechanical visual of the play and the raw data points, so individual analysts can perform their own analysis. While the system has its fair share of shortcomings, it provides a great start to harnessing new models to extract data from sources never before considered.
If made public and eventually scaled and fine-tuned for the amateur space, this would completely change how player analysis is conducted. Assuming video exists, low evaluation costs would let players have biomechanical data dating back to Little League. That would allow a player's mechanics to be tracked over a long period, providing insights for the individual as well as information on long-term macro trends in youth-to-professional player development. The implications of properly utilizing these new models are large, potentially leading to enhanced research on training programs, injury prevention, and projection modeling, among other aspects of the game.
While the majority of this project is not open-source (at least right now), this piece was written as part of a greater initiative to expand computer vision into the baseball space. This is being done through BaseballCV, a repository co-founded by Baseball Prospectus writer Carlos Marcano and myself that provides baseball-specific computer vision tools, models, and datasets. The models include the Pitcher / Hitter / Catcher detector, which can now have its performance documentation reviewed and its weights directly downloaded for further fine-tuning or prediction. With write-ups like these to serve as examples, the hope is that others could pursue similar projects with these publicly available tools to ultimately advance research in the sport. As this is a somewhat grand aspiration, I’ll settle for a few people reading about another use case for Computer Vision within baseball.
Thank you for reading