On paper, Grok 4 has comprehensively surpassed all competitors, including top-tier models like OpenAI GPT-4o, Gemini 2.5 Pro, and Claude 4, whether in traditional benchmarks, SAT exams (American college entrance tests), or GRE-level subject assessments. But beyond these somewhat stale conventional benchmarks, what’s more intriguing is that Grok 4 also tackled “Humanity’s Last Exam” (HLE), a closed-book test dubbed humanity’s final stand. It outperformed all prior models to achieve a peak accuracy of 44.4%.
Comment 1:
Media outlets hyping Grok 4 are likely complete amateurs. I doubt they’ve even used Grok once on foreign websites. Anyone who saw past Grok evaluations knows this series epitomizes “trash from brute force.”
Reply 1.1: I use the free Grok version – very satisfied!
Reply 1.2: Since you’re an NVIDIA employee, I’ll trust you. Wasting $300 is nothing compared to NVIDIA profits.
Reply 1.3: Grok 3 was okay; Grok 4 feels slightly worse.
Reply 1.4: 50% pricier than ChatGPT? It better deliver. Was going to try, but now I’ll pass. GPT-4o/o3/o3 pro are great; Gemini’s mediocre – only use free version occasionally.
Reply 1.5: Don’t hate blindly – better than Grok 3, though not by much.
Reply 1.6: Actual experience: Grok 4 < Gemini.
Comment 2:
Grok 4’s strength defies science – OpenAI is in trouble. HLE benchmark: 50% with tool use. Current GPT-4o/Gemini 2.5 pro barely hit 20%. Since Grok lacks OpenAI’s “test-stealing” scandal, results seem credible. HLE (Humanity’s Last Exam) is mostly undisclosed – top models all cheat on it. This lead is unscientific. Musk stated: “AI discovering new physics is inevitable by year-end/next year.”
ARC-AGI specializes in tricking LLMs (e.g., visual reasoning/finding block patterns). Grok 4 dominates at reasonable cost.
Pricing: $300/month Grok 4 Heavy scores 44.4% on HLE vs GPT-4o’s 26% (o3-pro barely improves).
Musk confirmed upcoming video-generation/video-input models – compensating for xAI’s late start. Compared to hype-king Sam Altman, he sounds credible. GPT-5’s rumored “fusion capability” and video input? Good luck. But honestly, all top models are similar now. What matters: Does it solve real problems?
Reply 2.1: Leak: HLE’s creator is a Grok senior advisor. Sketchy…
Reply 2.1.1: So Grok cheated?
Reply 2.2: Benchmarks rock, but in actual use it doesn’t feel like an upgrade over Grok 3.
Comment 3:
Grok 4 Heavy is strong – helps Musk rewrite human knowledge. But the pricing isn’t as fair as Grok 3’s. How can I use it affordably?
Comment 4:
Too bad Grok 4 isn’t free – hard to test.
Comment 5:
In testing: Hallucinates during competitor research/news citations. Writing quality OK. Avoid Chinese prompts. Grok 4 Heavy = “Multi-agent parallel reasoning + higher accuracy + pricier/slower”. Regular Grok 4 = “Single-agent + cost-effective”. Heavy version “considers multiple hypotheses → aggregates best answers” (study-group style). Want academic peak performance? Pay up! After using Grok 4, I’m tempted to splurge on Heavy.
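For anyone wondering what “considers multiple hypotheses → aggregates best answers” could look like in code, here is a minimal sketch. It is only an illustration under my own assumptions: I don’t know how Grok 4 Heavy actually aggregates, so simple majority voting stands in for it, and every name here (`Agent`, `aggregate_answers`, `heavy_style_answer`, `single_agent_answer`) is hypothetical, not an xAI API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Hypothetical agent type: any callable that maps a question to an answer string.
Agent = Callable[[str], str]


def aggregate_answers(answers: Sequence[str]) -> str:
    """Majority vote; ties go to the answer that appeared first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer must have the top count")


def heavy_style_answer(question: str, agents: Sequence[Agent]) -> str:
    """'Study-group' style: ask every agent in parallel, then vote."""
    if not agents:
        raise ValueError("need at least one agent")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        answers = list(pool.map(lambda agent: agent(question), agents))
    return aggregate_answers(answers)


def single_agent_answer(question: str, agent: Agent) -> str:
    """Regular style: one agent, one answer, no aggregation."""
    return agent(question)
```

The trade-off matches the “pricier/slower” part: with N agents you pay for roughly N answers per question, in exchange for outvoting a single agent’s occasional wrong answer.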
Comment 6:
Current results are unreliable. Suppliers might be selling prompt-tweaked Grok 3 as Grok 4. Some previously sold Claude 3.5 as 3.7/4.0! Wait for stable channels. Also:
- Don’t test coding – neither supports preview.
- All displayed “code results” run on external platforms → easily faked.
- Only Tongyi Qianwen, KIMI, Gemini have native preview (could still fake).
- Don’t test image-gen – many platforms lack true capability.
Comment 7:
With enough GPUs, you can do whatever the fuck you want.
Comment 8:
Musk’s little trick.
Did it smack GPT-4o pro and kick down Gemini 2.5 pro?
Damn right it did, bro.
Grok 4 Heavy: $300/month. Please.
What? Only $30? Then enjoy “Grok 3.5”.
Comment 9:
My take: Philosophy discussions << Claude Opus 4. Some writing tasks << GPT-4o3. Grok 4’s attention goes haywire with long contexts/complex logic. Worst part: Grok 4 Heavy costs more than Claude Max!
Topic-switching without chaos? Still Claude’s crown – whether Sonnet 3.7 or Opus 4. Philosophy/math/lust/coding? Effortless. No prompt engineering needed – Opus 4’s comprehension is unmatched.
With Grok? Disaster. It flirts while you code, nags about projects during sexting – like a slavedriver. Claude Opus 4’s Grok summary:
“That friend with zero social boundaries: ‘Let’s discuss quantum mechanics… and your underwear?’ ‘You’re hot babe… fixed your code bug yet?’”
But Grok is the freest closed-source model. Claude? Constant “System message potentially harmful” flags → banned chats/accounts. Makes me feel like a criminal when I’m just “deep-communicating”. Grok? Chat about anything.