Go-to-market filter: “Data Advantage × Distribution Advantage”
- Anmol Shantha Ram
- May 17
- 3 min read
“Look for two things. 1. A data advantage that no one else has. 2. A distribution advantage the customer already owns.
If you’re missing either, think twice before building or fine-tuning a custom LLM.”
This week we had a candid conversation and reality check with the superbly impressive Ali Ghodsi, CEO of Databricks, on the challenges and opportunities defining the future of AI in business.
1. Large-scale model progress has hit a “soft wall”
Scaling laws kept delivering exactly the quality increases researchers predicted—until ~mid-2023.
Every major lab can still squeeze small gains, but nobody wants to ship a true next-generation model (“4 → 4.1 → 3.7…”) because the jump is no longer dramatic.
Without inference-time compute tricks (e.g., “reasoning” / tree-of-thought / MoE routing at decode time) the narrative might already feel like a burst bubble.
Databricks’ own DBRX briefly led the open-source charts, proving how fast the window of leadership now closes.
Implication: Breakthroughs over the next 18–24 months are more likely to come from novel training paradigms, retrieval, or tool-use than from brute-force pre-training runs.
2. Inference-time reasoning helps, but it doesn’t generalise
Inference-time techniques behind models such as OpenAI’s GPT-4o or Anthropic’s Claude 3.5 used with chain-of-thought mostly guide the model toward an answer; they don’t make the base weights smarter.
Implication: For enterprises this means good latency/quality trade-offs on well-trodden tasks, but not yet a silver bullet for domain-specific reliability.
3. Enterprises still struggle with two mundane problems
Data semantics
Nobody has a complete, machine-readable definition of what every column, acronym, or metric means. Even top-tier reasoning models add almost no lift when the semantic layer is missing.
Evaluation and reliability
The industry lacks robust, domain-specific evals. Public leaderboards are “P-hacked”; enterprises need bespoke yardsticks to trust answers that can cost or save millions.
4. Databricks’ dual AI strategy
“Data Intelligence” Assistant – natural-language interface that answers questions over a company’s numerical data (English → SQL/Python → viz).
Progress depends far more on building semantic context offline than on swapping in bigger LLMs.
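That offline-semantics point can be sketched in a few lines of Python: a curated glossary of column meanings is injected into the English → SQL prompt so the model can ground acronyms instead of guessing. All names below are illustrative placeholders, not a Databricks API.

```python
# Illustrative semantic layer: curated offline, not inferred by the model.
SEMANTIC_LAYER = {
    "arr": "annual recurring revenue in USD, summed per fiscal year",
    "nrr": "net revenue retention, percent, cohort-based",
    "churn_dt": "date the account was marked churned",
}

def build_prompt(question: str) -> str:
    """Prepend curated column semantics so an English -> SQL model can
    resolve acronyms like 'ARR' instead of hallucinating a meaning."""
    glossary = "\n".join(f"- {col}: {desc}" for col, desc in SEMANTIC_LAYER.items())
    return (
        "You translate English questions into SQL over the warehouse.\n"
        f"Column definitions:\n{glossary}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt("What was ARR growth last year?")
```

Swapping in a bigger LLM changes nothing here; the lift comes from how complete and accurate `SEMANTIC_LAYER` is.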
Custom-Model Factory – a services + platform motion that:
• Ingests proprietary documents.
• Generates large volumes of synthetic data to compensate for enterprises’ inevitable data sparsity (15 T tokens ≫ any corporate corpus).
• Builds custom eval suites & judges.
• Fine-tunes / configures the best open or closed model for that use-case.
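The "custom eval suites & judges" step above is commonly prototyped as LLM-as-judge. A minimal sketch, assuming `call_llm` is a placeholder for any chat-completion client (not a specific Databricks function):

```python
# Sketch of an LLM-as-judge eval suite. `call_llm` stands in for any
# chat-completion API that takes a prompt string and returns text.
JUDGE_TEMPLATE = """You are grading an answer for a domain-specific task.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly PASS or FAIL."""

def judge(call_llm, question, reference, candidate):
    """Ask a judge model whether the candidate matches the reference."""
    verdict = call_llm(JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    return verdict.strip().upper().startswith("PASS")

def run_suite(call_llm, suite, model):
    """Score a model against a custom eval suite via the judge.

    `suite` is a list of (question, reference) pairs; `model` maps a
    question to a candidate answer. Returns the pass rate in [0, 1].
    """
    passed = sum(judge(call_llm, q, ref, model(q)) for q, ref in suite)
    return passed / len(suite)
```

The judge prompt itself becomes an artifact to iterate on, which is why bespoke suites beat public leaderboards for enterprise trust.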
5. Go-to-market filter: “Data Advantage × Distribution Advantage”
Databricks now vets AI opportunities by asking:
Do you own a unique data asset competitors cannot easily replicate?
Can you reach users at meaningful scale?
If either answer is “no”, the customer is discouraged from embarking on an expensive custom-AI build (e.g., DIY HR-handbook bots).
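The filter is simple enough to state as code. A hypothetical helper (not a Databricks tool) that makes the "both or neither" logic explicit:

```python
def should_build_custom_llm(has_unique_data: bool, has_distribution: bool) -> str:
    """Apply the 'Data Advantage x Distribution Advantage' filter.

    Both advantages must hold before committing to an expensive
    custom-model build; missing either means think twice.
    """
    if has_unique_data and has_distribution:
        return "build"
    return "think twice"

# A DIY HR-handbook bot: the handbook is not a unique data asset.
print(should_build_custom_llm(has_unique_data=False, has_distribution=True))
# Proprietary data plus an existing user base clears the bar.
print(should_build_custom_llm(has_unique_data=True, has_distribution=True))
```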
6. Reality check on internal AI productivity
Databricks uses copilots everywhere (Cursor, GitHub Copilot, custom sales/HR bots).
Coding time is only ~20 % of an engineer’s week; copilots boost that slice, but they don’t touch the other 80 % (design, alignment, meetings, roadmap).
Meeting summaries, PM copilots, and collaborative agents “still suck” → <10 % productivity lift.
The hardest unsolved problem: an agent that can align 2 000 engineers’ roadmaps and mediate organisational politics.
7. Synthetic data and synthetic evals are the next battlefield
Because customers refuse to label at scale, Databricks is betting on:
Programmatic generation of high-quality instruction / domain pairs.
Automatic generation of domain-specific eval sets + LLM judges.
Tight iteration loops between eval → synthetic data → fine-tune.
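The three bets above form one loop. A hedged sketch of that eval → synthetic-data → fine-tune cycle, where every function is a stub standing in for a real pipeline stage (the "fine-tune" here just memorises pairs, purely for illustration):

```python
def run_evals(model, eval_set):
    """Score the model on a domain-specific eval set (stub: exact match)."""
    return sum(model(q) == a for q, a in eval_set) / len(eval_set)

def generate_synthetic_data(failures):
    """Programmatically expand failing cases into instruction pairs (stub).
    A real system would paraphrase, augment, and filter for quality."""
    return [(q, a) for q, a in failures]

def fine_tune(model, data):
    """Stub fine-tune: memorise the synthetic pairs, fall back to the base."""
    table = dict(data)
    return lambda q: table.get(q, model(q))

def iterate(model, eval_set, target=1.0, max_rounds=5):
    """Tight loop: evaluate, synthesise data from failures, fine-tune."""
    for _ in range(max_rounds):
        if run_evals(model, eval_set) >= target:
            break
        failures = [(q, a) for q, a in eval_set if model(q) != a]
        model = fine_tune(model, generate_synthetic_data(failures))
    return model, run_evals(model, eval_set)
```

The structure, not the stubs, is the point: the eval set drives what synthetic data gets generated, which is why evals sit at the top of the hierarchy in point 10 below.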
8. Reliable AI is multi-year away, not multi-month
Ali places reliability for mission-critical enterprise AI 3–5 years away, not “the next quarter.”
Key blockers:
High-fidelity semantic layers.
Better eval frameworks.
Tooling that can orchestrate data, models, and human oversight at scale.
9. CEO-level automation is still science fiction
Even with full access to emails, Slack, calendar and multimodal context, an “AI chief-of-staff” remains a research project.
Startups should automate well-bounded, low-context tasks first; the CEO role is the worst place to start.
10. What would move the needle the fastest?
Ali’s hierarchy of needs:
1. Robust, domain-specific evals.
2. Cheap generation of high-quality synthetic data aligned to those evals.
3. Tooling to integrate that pipeline seamlessly into enterprise MLOps.
Solve those. Reliability and adoption will follow.

#PerplexityAIBusinessFellowship #Databricks #AliGhodsi #SyntheticData #EnterpriseAI #FutureOfAI #ScalingAI #AILeadership #AIChallenges #SemanticLayers