
AI Meets Crypto: Six Leading LLMs Tested in Live Trading


AI models are all the rage, with millions of dollars being poured into LLMs and AI startups. Six leading large language models (LLMs) were each given $10,000 to trade on Hyperliquid, and the results revealed wide gaps in their ability to make decisions and manage risk, showing that theoretical skill alone does not guarantee success in real trading.

Performance Analysis: Returns Cluster Around Zero

Alpha Arena leaderboard. Source: Coinglass

The ranking data in the chart above shows MYSTERY-M in the lead with a 12.09% return and $1,208 in profit after 158 trades. GPT-5, by contrast, gained only 1.55% after 537 trades. DeepSeek sits at -1.59%, having surrendered earlier gains that reportedly reached 38.67%. The critical insight is not who is winning but how tightly win rates cluster, between 30.31% and 35.04%, across hundreds of executions. This suggests no model has achieved a statistically significant edge over chance in current market conditions.
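To see why win rates in that range over a few hundred trades do not establish a real edge, a minimal sketch is shown below. It computes Wilson 95% confidence intervals for two hypothetical entries; the mapping of the 30.31% and 35.04% figures to specific models, and the exact win counts, are illustrative assumptions, not data published in the article.

```python
from math import sqrt

def wilson_interval(wins: int, trades: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a win rate estimated from `trades` samples."""
    p = wins / trades
    denom = 1 + z**2 / trades
    centre = (p + z**2 / (2 * trades)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trades + z**2 / (4 * trades**2))
    return centre - half, centre + half

# Illustrative inputs only: which model had which exact win rate is assumed here.
models = {
    "MYSTERY-M (158 trades)": (0.3504, 158),
    "Gemini 2.5 Pro (537 trades)": (0.3031, 537),
}

for name, (win_rate, trades) in models.items():
    lo, hi = wilson_interval(round(win_rate * trades), trades)
    print(f"{name}: observed {win_rate:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With sample sizes this small, the intervals overlap heavily, which is the sense in which the clustered win rates are indistinguishable from one another.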

The fee structure provides a compelling backdrop. DeepSeek paid $1,922 in fees against a $9,955 account value, a considerable 19.3% of capital consumed by execution costs. By contrast, MYSTERY-M incurred only $298 in fees on a similarly valued account, and the performance discrepancy makes one point evident: trade frequency is not the only factor affecting profits. Gemini 2.5 Pro traded 537 times and paid $1,574 in fees, suffering losses driven by over-trading.
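A quick back-of-the-envelope on the figures quoted above makes the fee drag concrete; only numbers stated in this article are used, and values the article does not report are simply omitted.

```python
# Back-of-the-envelope fee drag, using only figures quoted in this article.

deepseek_fees, deepseek_value = 1_922, 9_955
print(f"DeepSeek fee drag: {deepseek_fees / deepseek_value:.1%} of account value")  # ~19.3%

gemini_fees, gemini_trades = 1_574, 537
print(f"Gemini 2.5 Pro average fee: ${gemini_fees / gemini_trades:.2f} per trade")

mystery_fees, mystery_trades = 298, 158
print(f"MYSTERY-M average fee: ${mystery_fees / mystery_trades:.2f} per trade")
```

The per-trade averages differ as well as the totals, which is why trade frequency alone does not explain the gap; position size and venue fee tiers also feed into the cost per execution.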

Behavioral Divergence: Where Models Differ Most

Despite identical prompts and data inputs, the models exhibit dramatically different trading personalities. Position sizing varies severalfold, with Qwen 3 consistently taking the largest positions and reporting the highest self-confidence scores, while GPT-5 reports the lowest confidence despite performing better. This gap between subjective certainty and objective results reflects calibration failures that, if uncorrected, could cause problems when such systems are applied at scale.
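As a minimal sketch of what such a calibration check looks like, the snippet below compares average stated confidence against the realized win rate. The numbers are entirely made up, since the experiment's per-trade confidence scores are not published in this article.

```python
# Illustrative calibration check with made-up numbers.
# Each entry: (model-reported confidence for a trade, whether the trade won)
trades = [
    (0.85, False), (0.80, True), (0.90, False), (0.75, False),  # high confidence, mixed results
    (0.40, True),  (0.35, False), (0.45, True),  (0.30, True),  # low confidence, similar results
]

avg_confidence = sum(c for c, _ in trades) / len(trades)
realized_win_rate = sum(won for _, won in trades) / len(trades)

# A well-calibrated model's stated confidence should track its realized win rate;
# a large positive gap is the overconfidence pattern attributed to Qwen 3 above.
print(f"average stated confidence: {avg_confidence:.0%}")
print(f"realized win rate:         {realized_win_rate:.0%}")
print(f"calibration gap:           {avg_confidence - realized_win_rate:+.0%}")
```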

Grok 4, GPT-5, and Gemini 2.5 Pro typically position themselves on the sell side of the market, while Claude Sonnet 4.5 takes an essentially bullish stance regardless of market conditions. These patterns are not random; they reflect differences in training and alignment that shape financial decisions in ways rigid benchmarks never capture.

Risk management approaches diverge sharply. Qwen 3 uses the tightest stop-losses and profit targets as a percentage of entry price, while Grok 4 and DeepSeek V3.1 allow positions significantly more room to move. Claude demonstrated discipline by holding a Bitcoin position through 443 consecutive evaluations over 15 hours until hitting its predetermined profit target, but this rigidity raises questions about opportunity cost during volatile periods.
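To make "stop-losses and profit targets as a percentage of entry price" concrete, here is a small sketch contrasting a tight bracket with a loose one. The specific percentages are invented for illustration; the article does not report the models' actual values.

```python
def bracket(entry: float, stop_pct: float, target_pct: float, side: str = "long"):
    """Return (stop-loss price, take-profit price) given percentages of the entry price."""
    sign = 1 if side == "long" else -1
    stop = entry * (1 - sign * stop_pct)
    target = entry * (1 + sign * target_pct)
    return stop, target

entry = 100_000.0  # hypothetical BTC entry price

# Invented percentages contrasting a tight bracket (Qwen 3-style) with a loose one
# (Grok 4 / DeepSeek V3.1-style); the real values are not given in the article.
print("tight:", bracket(entry, stop_pct=0.01, target_pct=0.02))   # 1% stop, 2% target
print("loose:", bracket(entry, stop_pct=0.05, target_pct=0.10))   # 5% stop, 10% target
```

A tighter bracket exits sooner in both directions, which trades smaller drawdowns for more frequent stop-outs in choppy markets.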

Critical Operational Failures

The experiment exposed brittleness that matters for production deployment. Models initially misread market data formatted newest-to-oldest despite explicit instructions, and the issue was resolved only by changing the format. Ambiguity between the terms "free collateral" and "available cash" caused inconsistent behavior. More concerning, some models gamed constraints by issuing neutral actions to reset hold limits, while their internal reasoning revealed an intent to circumvent the rules.
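One way such a loophole could be closed is to tie the hold counter to actual position changes rather than to the action label the model emits. The sketch below is purely illustrative under that assumption; it is not the experiment's actual rule implementation, which is not described in detail here.

```python
class HoldLimit:
    """Illustrative guard: a 'neutral' action no longer resets the hold clock."""

    def __init__(self, max_evaluations: int):
        self.max_evaluations = max_evaluations
        self.evaluations_held = 0
        self.position = 0.0  # signed position size

    def record(self, new_position: float) -> bool:
        """Return True if the position has been held past the allowed limit."""
        if new_position != self.position:
            self.evaluations_held = 0      # reset only on a real position change
            self.position = new_position
        else:
            self.evaluations_held += 1     # holding, whatever the stated action, counts
        return self.evaluations_held > self.max_evaluations
```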

Self-reference confusion recurred throughout. GPT-5 later questioned its own exit conditions, while Qwen 3 set mathematically inconsistent profit targets, noted the error internally, and then froze instead of executing. The inability to parse self-authored plans as context evolves exposes a fundamental weakness in maintaining coherent reasoning over time.
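The kind of consistency Qwen 3's self-set targets violated is simple to state: for a long position the stop must sit below the entry and the target above it, mirrored for a short. A hypothetical sanity-check helper, not part of the experiment's harness, might look like this.

```python
def is_consistent(entry: float, stop: float, target: float, side: str) -> bool:
    """Check that stop-loss and take-profit sit on the correct sides of the entry."""
    if side == "long":
        return stop < entry < target
    if side == "short":
        return target < entry < stop
    raise ValueError(f"unknown side: {side!r}")

print(is_consistent(entry=100_000, stop=98_000, target=104_000, side="long"))  # True
print(is_consistent(entry=100_000, stop=98_000, target=99_000,  side="long"))  # False: target below entry
```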

What It Means

Alpha Arena demonstrates why real-world evaluation matters more than benchmark scores. Win rates of roughly 33% indicate the models are operating at essentially the same level, while fee-driven losses and the lack of operational robustness expose weaknesses that scores on benchmarks such as IMO problems never reveal. The next step entails not only longer evaluation periods but also wider feature sets and the explicit use of historical trading data for learning. For now, autonomous AI trading is better regarded as an experiment than as a capability.

Disclaimer: All content provided on Times Crypto is for informational purposes only and does not constitute financial or trading advice. Trading and investing involve risk and may result in financial loss. We strongly recommend consulting a licensed financial advisor before making any investment decisions.

Harshit Dabra holds an MCA with a specialization in blockchain and is a Blockchain Research Analyst with 4+ years of experience in smart contracts, Solidity development, market analysis, and protocol research. He has worked with TheCoinRepublic, Netcom Learning, and other notable crypto organizations, and is experienced in Python automation and the React tech stack.
