Built an MLB Projection Engine from scratch discussed in Betting Systems/Gambling at Wizard of Vegas

obsidic

Threads: 1
Posts: 4

Joined: Feb 22, 2026

February 22nd, 2026 at 7:58:17 PM permalink

I've been lurking here for a while and figured I'd share what I've been working on. I have been building an MLB projection model that doesn't try to predict the winner directly; instead, it simulates full nine-inning games from the pitch level up.

The basic idea:

Baseball decomposes into discrete events better than any other sport. One pitcher throws to one batter, and the outcome of that matchup depends on a relatively contained set of factors. So instead of building one model to predict "who wins," I built a pipeline of interconnected models that each handle a narrow piece of the game, then connected them through thousands of simulated plate appearances.

The pipeline has four stages:

Player profiles: Built from 3.4M+ pitches of Statcast data (2021-2025). The key innovation is separating skill from noise using batted-ball physics rather than outcomes. A batter making consistently hard contact at good angles who's batting .220 is projected more favorably than a .280 hitter surviving on soft contact and fortunate placement. Expected metrics (xBA, xSLG, etc.) are derived from contact quality, not box-score results.

H2H engine: A dedicated ML model that takes a specific batter and specific pitcher and produces a probability distribution across all plate appearance outcomes (K, BB, 1B, 2B, 3B, HR, outs). It learns the non-linear interactions between player types. A batter who struggles with high-velocity fastballs is projected differently against a power arm vs. a control artist, even if both pitchers have similar overall lines. Platoon splits are player-specific rather than applying a blanket adjustment.

Environment layer: Real-time weather (temp, humidity, barometric pressure, wind speed + direction) fed into a physics model for batted-ball carry. This enters at the batted-ball level, not as a simple run multiplier. Park factors are per-outcome (a park can suppress HRs but boost doubles). Umpire zone tendencies are included when assignments are known.

Monte Carlo simulation: 5,000 full nine-inning games simulated per matchup. Each sim plays out every plate appearance with the game state evolving naturally. Runners advance based on empirically calibrated probabilities (runner speed, outfield arm, hit type). ~140K individual events per game.
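To make the simulation stage concrete, here is a minimal sketch of a plate-appearance-level Monte Carlo loop. It is not the actual engine: the outcome probabilities in `PA_PROBS` are made-up placeholders (in the pipeline above they would come from the H2H engine per matchup), every runner advances exactly as many bases as the batter, both teams always bat nine times, and extra innings are played without a placed runner. All function names are hypothetical.

```python
import random

# Hypothetical outcome probabilities for one plate appearance (they sum to 1.0).
PA_PROBS = {"OUT": 0.46, "K": 0.22, "BB": 0.09,
            "1B": 0.14, "2B": 0.045, "3B": 0.005, "HR": 0.04}
HIT_BASES = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}


def advance(bases, outcome):
    """Apply a walk or hit to (first, second, third); return (new_bases, runs)."""
    first, second, third = bases
    if outcome == "BB":
        # Runners move up only when forced; a run scores with the bases loaded.
        runs = 1 if (first and second and third) else 0
        return (True, first or second, third or (first and second)), runs
    n = HIT_BASES[outcome]
    new_bases, runs = [False, False, False], 0
    occupied = [i + 1 for i, on in enumerate(bases) if on]
    for start in occupied + [0]:  # 0 stands for the batter
        end = start + n  # naive rule: everyone advances n bases
        if end >= 4:
            runs += 1
        else:
            new_bases[end - 1] = True
    return tuple(new_bases), runs


def simulate_half_inning(rng, probs):
    """Play plate appearances until three outs; return runs scored."""
    outcomes, weights = zip(*probs.items())
    outs, runs, bases = 0, 0, (False, False, False)
    while outs < 3:
        outcome = rng.choices(outcomes, weights=weights)[0]
        if outcome in ("OUT", "K"):
            outs += 1
        else:
            bases, scored = advance(bases, outcome)
            runs += scored
    return runs


def simulate_game(rng, away_probs, home_probs, innings=9):
    """Return (away_runs, home_runs); ties go to extra innings."""
    away = sum(simulate_half_inning(rng, away_probs) for _ in range(innings))
    home = sum(simulate_half_inning(rng, home_probs) for _ in range(innings))
    while away == home:
        away += simulate_half_inning(rng, away_probs)
        home += simulate_half_inning(rng, home_probs)
    return away, home


def simulate_matchup(n_sims=5000, seed=0, away_probs=PA_PROBS, home_probs=PA_PROBS):
    """Simulate n_sims games with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [simulate_game(rng, away_probs, home_probs) for _ in range(n_sims)]


sims = simulate_matchup(n_sims=1000, seed=42)
home_win_share = sum(h > a for a, h in sims) / len(sims)
print(f"home win share over {len(sims)} sims: {home_win_share:.3f}")
```

A real version would swap the single `PA_PROBS` table for a distribution per batter-pitcher pair and replace the fixed advancement rule with calibrated probabilities, but the game-state loop would have the same shape.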

Why simulate instead of using formulas?
A formula can predict expected runs or win probability, but it can't capture cascading dependencies. When a leadoff batter reaches base, the simulation plays out the subsequent at-bats with a runner on, where each outcome has different consequences than with the bases empty. Everything, including win probability, run distributions, O/U lines at any number, and player props, falls out of the same simulation, so the outputs are internally consistent.
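As a rough illustration of the internal-consistency point, the sketch below derives a win probability, an over probability at an arbitrary total, and a run-total distribution from one shared list of simulated final scores. The four scores in `toy_sims` are invented for the example, and `summarize` is a hypothetical helper, not part of the actual model.

```python
from collections import Counter


def summarize(sims, total_line):
    """Derive several markets from the same list of (away_runs, home_runs)."""
    n = len(sims)
    totals = [away + home for away, home in sims]
    return {
        "home_win_prob": sum(home > away for away, home in sims) / n,
        "over_prob": sum(t > total_line for t in totals) / n,
        "push_prob": sum(t == total_line for t in totals) / n,
        "total_distribution": Counter(totals),
    }


# Invented simulated scores, purely for illustration.
toy_sims = [(3, 5), (2, 1), (4, 6), (0, 1)]
summary = summarize(toy_sims, total_line=7.5)
print(summary["home_win_prob"], summary["over_prob"])
```

Because every number is computed from the same simulated games, the moneyline and total probabilities cannot contradict each other the way two separately fitted formulas can.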

For thin samples (rookies, callups, early season):
Bayesian regression blends each player's observed rates toward population baselines in proportion to sample size. The crossover point, where individual signal overtakes the prior, is around 100 PA. This prevents wild early-season projections while allowing breakout players to emerge as data accumulates.
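One common way to get this behavior is simple shrinkage with a pseudo-count prior; the sketch below assumes that form. Treating the ~100 PA crossover as a prior strength of 100 (the point where the observed rate and the prior get equal weight) is my reading of the description, not a confirmed detail of the model, and `shrink` is a hypothetical name.

```python
def shrink(observed_rate, n_pa, prior_rate, prior_strength=100.0):
    """Blend an observed rate toward a population prior.

    prior_strength acts like a count of "phantom" plate appearances at the
    prior rate. With prior_strength=100, the player's own data gets exactly
    half the weight at 100 PA and more than half beyond that.
    """
    weight = n_pa / (n_pa + prior_strength)
    return weight * observed_rate + (1.0 - weight) * prior_rate


# A hypothetical call-up hitting .400 against a .300 population baseline.
for n_pa in (0, 20, 100, 400):
    print(n_pa, round(shrink(0.400, n_pa, 0.300), 4))
```

In practice the prior strength would differ by statistic, since strikeout rate stabilizes much faster than something like batting average on balls in play.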

Some honest limitations I'm still working on:
Bullpen modeling is aggregate rather than pitcher-by-pitcher for relief innings (~35-40% of total innings)
Individual defensive positioning isn't modeled yet
Baseball's irreducible randomness means even a perfect model would be "wrong" roughly 1 in 3 games

Here are the results from my testing on the 2025 season:

Overall winner accuracy: 64.5% (2,416 games)
High-confidence picks (≥60% win prob): 71.7% (1,267 games)
Brier score: 0.22 (lower is better; measures probabilistic accuracy)
Run total bias: −0.19 (nearly zero; slight under-projection)
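For anyone unfamiliar with the Brier score, it is the mean squared difference between the forecast probability and the 0/1 outcome. A short sketch follows; the sample forecasts are invented. A useful reference point is that always forecasting 50% scores exactly 0.25.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 results."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Invented forecasts for the picked side, with 1 = the pick won.
forecasts = [0.70, 0.55, 0.62, 0.80]
results = [1, 0, 1, 1]
print(round(brier_score(forecasts, results), 4))

# Baseline: a coin-flip forecast on every game.
print(brier_score([0.5] * 4, results))
```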

Happy to answer any questions about the methodology. Still improving this thing every week and hoping to have another big season.

A couple of questions:
For those who've built projection systems, how are you handling bullpen modeling? That's my biggest gap right now.
Is anyone else using batted-ball physics for expected stats rather than Statcast's public xBA/xSLG?

AutomaticMonkey


Threads: 21
Posts: 1568

Joined: Sep 30, 2024

February 22nd, 2026 at 8:10:48 PM permalink

Yes, I have a very simple thing I wrote which gives me weather information for each ballpark. But it's about more than batted balls. Low atmospheric pressure makes breaking balls not break, and you know what happens to them then. That's why a lot of good pitchers get hit in the first inning before they realize they need more spin and adjust.
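The physical link here is air density: in the standard aerodynamic model, both drag and the Magnus force that makes a pitch break scale with it. Below is a minimal sketch that computes moist-air density from temperature, pressure, and relative humidity using the ideal gas law and the Tetens approximation for saturation vapor pressure. The function names are hypothetical, and the pressure input should be the actual station pressure at the ballpark, not the sea-level-adjusted figure most weather reports quote.

```python
R_DRY = 287.058    # J/(kg*K), specific gas constant of dry air
R_VAPOR = 461.495  # J/(kg*K), specific gas constant of water vapor


def saturation_vapor_pressure_pa(temp_c):
    """Tetens approximation for saturation vapor pressure over water, in Pa."""
    return 610.78 * 10 ** (7.5 * temp_c / (temp_c + 237.3))


def air_density(temp_c, pressure_pa, rel_humidity):
    """Density of moist air in kg/m^3; rel_humidity is a fraction in [0, 1]."""
    temp_k = temp_c + 273.15
    p_vapor = rel_humidity * saturation_vapor_pressure_pa(temp_c)
    p_dry = pressure_pa - p_vapor
    return p_dry / (R_DRY * temp_k) + p_vapor / (R_VAPOR * temp_k)


# Illustrative conditions: a cool sea-level night vs. a hot day at low pressure.
print(round(air_density(15.0, 101325.0, 0.5), 4))
print(round(air_density(32.0, 84000.0, 0.3), 4))
```

Lower pressure, higher temperature, and higher humidity all reduce density, so the same spin produces less movement and the same contact carries farther.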

How does your overall winner accuracy compare to using a very simple predictive model: Average runs scored by each team divided by ERA of the opponent's starting pitcher for 6 innings, ERA of the opponent's bullpen for 3 innings? We can all just pick "the better team" and get results over 50%, but not good enough to beat the odds and vig. Just pick the Yankees and Dodgers to win every game and you'll have something close to 60%. If I leave out the games against really good teams or pitchers I'll be over 64% for sure.

DougGander


Threads: 0
Posts: 159

Joined: Oct 30, 2025

February 23rd, 2026 at 2:13:36 AM permalink

Very interesting post. One thing I would be very concerned about with this is that every sports bettor in the world just got tools to create sophisticated models for $20 a month through AI, so I would expect your model to be less successful going forward. I have heard multiple people say lines are becoming more efficient. There is a small but non-trivial chance baseball will be "solved" and no longer exploitable.

Most variables do not make any significant difference. You need to be very careful not to include things which are random noise. Obvious advice, I know, but even quite sophisticated APs can get hung up on over-complication. Mostly it comes down to a few core variables; most people improve their systems with pruning and paring rather than the opposite.

obsidic


Threads: 1
Posts: 4

Joined: Feb 22, 2026

February 23rd, 2026 at 7:41:44 AM permalink

Yeah, weather is a huge factor. The more micro you can get, the more you can differentiate yourself. I've seen people try to predict the weather down to the base, but that's mostly BS. You may be able to predict wind in each area of a stadium, depending on the stadium, but everything else is just fluff.

For your second question, I use a LightGBM model that takes the actual confirmed lineup (not team averages) and simulates each batter vs. the actual starting pitcher using pitch-level Statcast data such as exit velocity, launch angle, barrel rate, whiff rates by pitch type, chase rate, and zone contact. Then it runs 5,000 sims per game to generate probability distributions, not point estimates. That's roughly 350,000 simulated plate appearances per game, which is what determines the projected winner. The pick could very well be the Vegas favorite if that's how the 5,000 sims come out. I could go back and find how many winners were underdogs; that could be interesting, but I would need to backtest with odds data.

obsidic


Threads: 1
Posts: 4

Joined: Feb 22, 2026

February 23rd, 2026 at 7:46:06 AM permalink

Quote: DougGander
Very interesting post. One thing I would be very concerned about with this is that every sports bettor in the world just got tools to create sophisticated models for $20 a month through AI, so I would expect your model to be less successful going forward. I have heard multiple people say lines are becoming more efficient. There is a small but non-trivial chance baseball will be "solved" and no longer exploitable.

Most variables do not make any significant difference. You need to be very careful not to include things which are random noise. Obvious advice, I know, but even quite sophisticated APs can get hung up on over-complication. Mostly it comes down to a few core variables; most people improve their systems with pruning and paring rather than the opposite.

I agree, but that could be said for most things. An AI is only as good as its input. If we instructed it to build an MLB model, I am sure it could create a great model based on averages, etc. But once you start getting into xBA/xSLG, umpire strike zones, and so on, it's a different ball game ;)

SOOPOO


Threads: 125
Posts: 12406

Joined: Aug 8, 2010

February 23rd, 2026 at 9:24:11 AM permalink

You report 64.5% winners. As you know, that is meaningless without knowing the odds for each game. A better way to present your results...
One unit bet on 1000 games yielded a 50 unit win...

It seems like you were doing money line bets.
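Reporting in units is straightforward once each pick is paired with its American odds. A minimal sketch, assuming flat one-unit stakes; the three bets are invented and the function names are hypothetical. In these terms, 50 units won on 1000 one-unit bets is a 5% ROI.

```python
def profit_per_unit(american_odds, won):
    """Profit on a one-unit stake at American odds (a loss costs the stake)."""
    if not won:
        return -1.0
    if american_odds > 0:
        return american_odds / 100.0
    return 100.0 / abs(american_odds)


def units_and_roi(bets):
    """bets is a list of (american_odds, won) pairs, one unit staked on each."""
    units = sum(profit_per_unit(odds, won) for odds, won in bets)
    return units, units / len(bets)


# Invented record: an underdog win, a favorite win, and an underdog loss.
bets = [(157, True), (-120, True), (136, False)]
units, roi = units_and_roi(bets)
print(f"{units:+.2f} units, ROI {roi:+.1%}")
```

This also shows why raw win rate misleads: a 64.5% record made up of heavy favorites can still lose units.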

Your work may be more valuable on prop bets like O/U pitcher K's, batter HRs, etc.

And welcome to the forum. Feel free to post some picks before the games are played!

obsidic


Threads: 1
Posts: 4

Joined: Feb 22, 2026

February 23rd, 2026 at 12:49:47 PM permalink

Quote: SOOPOO
You report 64.5% winners. As you know, that is meaningless without knowing the odds for each game. A better way to present your results...
One unit bet on 1000 games yielded a 50 unit win...

It seems like you were doing money line bets.

Your work may be more valuable on prop bets like O/U pitcher K's, batter HRs, etc.

And welcome to the forum. Feel free to post some picks before the games are played!

I agree. My model doesn't take spring training data into account, but I have still been running it. Here is what it gave yesterday. It ended up 1.89 units and hit 75% of the underdogs.

Category   Matchup  Play    Model Prob  Odds  Book  Implied Prob  Edge
MONEYLINE  CLE@ATH  CLE ML  66.8%       +157  BETO  38.9%         +27.9%
MONEYLINE  NYM@NYY  NYM ML  64.1%       +160  BU    38.5%         +25.6%
MONEYLINE  CHC@SF   SF ML   76.7%       -120  BR    54.5%         +22.2%
MONEYLINE  WSH@MIA  MIA ML  76.7%       -130  DK    56.5%         +20.2%
MONEYLINE  BAL@DET  BAL ML  56.4%       +136  FD    42.4%         +14.0%
MONEYLINE  LAD@SD   SD ML   66.6%       -114  BR    53.3%         +13.3%
MONEYLINE  COL@TEX  COL ML  50.5%       +165  BETO  37.7%         +12.8%
MONEYLINE  STL@HOU  HOU ML  65.1%       -125  DK    55.6%         +9.5%
MONEYLINE  SEA@CIN  SEA ML  61.7%       -112  DK    52.8%         +8.9%
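For reference, the implied probability and edge columns follow from the odds and the model probability. The sketch below reproduces the first and third rows; the function names are hypothetical. Note that a probability implied by a single price still contains the book's vig, so the edge against a no-vig line would be somewhat different.

```python
def implied_prob(american_odds):
    """Break-even win probability for a bet at the given American odds."""
    if american_odds > 0:
        return 100.0 / (american_odds + 100.0)
    return -american_odds / (-american_odds + 100.0)


def edge(model_prob, american_odds):
    """Model probability minus the book's implied probability."""
    return model_prob - implied_prob(american_odds)


# CLE ML at +157 with a model probability of 66.8%.
print(f"{implied_prob(157):.1%}", f"{edge(0.668, 157):+.1%}")
# SF ML at -120 with a model probability of 76.7%.
print(f"{implied_prob(-120):.1%}", f"{edge(0.767, -120):+.1%}")
```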

Here is today:

Category   Matchup  Play    Model Prob  Odds  Book  Implied Prob  Edge
MONEYLINE  NYM@TOR  NYM ML  70.0%       +155  BU    39.2%         +30.8%
MONEYLINE  MIL@SD   MIL ML  70.4%       +100  BETO  50.0%         +20.4%
MONEYLINE  PHI@WSH  PHI ML  61.5%       +128  BETO  43.9%         +17.6%
MONEYLINE  AZ@CLE   AZ ML   61.3%       +118  FD    45.9%         +15.4%
MONEYLINE  CHC@KC   KC ML   69.9%       -122  FD    55.0%         +14.9%
MONEYLINE  MIA@STL  MIA ML  63.4%       -105  BETO  51.2%         +12.2%
MONEYLINE  SEA@LAD  LAD ML  65.4%       -118  BETO  54.1%         +11.3%
MONEYLINE  BOS@TB   BOS ML  51.8%       +145  BETO  40.8%         +11.0%
MONEYLINE  MIN@DET  DET ML  70.2%       -210  BETO  67.7%         +2.5%

Boston just lost in the bottom of the 9th :( but take all this with a grain of salt since it is spring training.
