Go-to-market filter: “Data Advantage × Distribution Advantage”
- Anmol Shantha Ram
- May 17
- 3 min read
“Look for two things. 1. A data advantage that no one else has. 2. A distribution advantage the customer already owns.
If you’re missing either, think twice before building or fine-tuning a custom LLM.”
This week we had a candid conversation and reality check with the superbly impressive Ali Ghodsi, CEO of Databricks, on the challenges and opportunities defining the future of AI in business.
1. Large-scale model progress has hit a “soft wall”
Scaling laws kept delivering exactly the quality increases researchers predicted—until ~mid-2023.
Every major lab can still squeeze small gains, but nobody wants to ship a true next-generation model (“4 → 4.1 → 3.7…”) because the jump is no longer dramatic.
Without inference-time compute tricks (e.g., “reasoning” / tree-of-thought / MoE routing at decode time) the narrative might already feel like a burst bubble.
Databricks’ own DBRX briefly led the open-source charts, proving how fast the window of leadership now closes.
Implication: Breakthroughs over the next 18–24 months are more likely to come from novel training paradigms, retrieval, or tool-use than from brute-force pre-training runs.
2. Inference-time reasoning helps, but it doesn’t generalise
Inference-time techniques behind models such as OpenAI’s GPT-4o or Anthropic’s Claude 3.5 used with chain-of-thought mostly guide the model toward an answer; they don’t make the base weights smarter.
Implication: For enterprises this means good latency/quality trade-offs on well-trodden tasks, but not yet a silver bullet for domain-specific reliability.
3. Enterprises still struggle with two mundane problems
Data semantics
Nobody has a complete, machine-readable definition of what every column, acronym, or metric means. Even top-tier reasoning models add almost no lift when the semantic layer is missing.
Evaluation and reliability
The industry lacks robust, domain-specific evals. Public leaderboards are “P-hacked”; enterprises need bespoke yardsticks to trust answers that can cost or save millions.
4. Databricks’ dual AI strategy
“Data Intelligence” Assistant – natural-language interface that answers questions over a company’s numerical data (English → SQL/Python → viz).
Progress depends far more on building semantic context offline than on swapping in bigger LLMs.
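That offline-semantics point can be sketched in a few lines of Python: a curated glossary of column meanings is injected into the English → SQL prompt so the model can ground acronyms instead of guessing. All names below are illustrative placeholders, not a Databricks API.

```python
# Illustrative semantic layer: curated offline, not inferred by the model.
SEMANTIC_LAYER = {
    "arr": "annual recurring revenue in USD, summed per fiscal year",
    "nrr": "net revenue retention, percent, cohort-based",
    "churn_dt": "date the account was marked churned",
}

def build_prompt(question: str) -> str:
    """Prepend curated column semantics so an English -> SQL model can
    resolve acronyms like 'ARR' instead of hallucinating a meaning."""
    glossary = "\n".join(f"- {col}: {desc}" for col, desc in SEMANTIC_LAYER.items())
    return (
        "You translate English questions into SQL over the warehouse.\n"
        f"Column definitions:\n{glossary}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt("What was ARR growth last year?")
```

Swapping in a bigger LLM changes nothing here; the lift comes from how complete and accurate `SEMANTIC_LAYER` is.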
Custom-Model Factory – a services + platform motion that:
• Ingests proprietary documents.
• Generates large volumes of synthetic data to compensate for enterprises’ inevitable data sparsity (15 T tokens ≫ any corporate corpus).
• Builds custom eval suites & judges.
• Fine-tunes / configures the best open or closed model for that use-case.
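The "custom eval suites & judges" step above is commonly prototyped as LLM-as-judge. A minimal sketch, assuming `call_llm` is a placeholder for any chat-completion client (not a specific Databricks function):

```python
# Sketch of an LLM-as-judge eval suite. `call_llm` stands in for any
# chat-completion API that takes a prompt string and returns text.
JUDGE_TEMPLATE = """You are grading an answer for a domain-specific task.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly PASS or FAIL."""

def judge(call_llm, question, reference, candidate):
    """Ask a judge model whether the candidate matches the reference."""
    verdict = call_llm(JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    return verdict.strip().upper().startswith("PASS")

def run_suite(call_llm, suite, model):
    """Score a model against a custom eval suite via the judge.

    `suite` is a list of (question, reference) pairs; `model` maps a
    question to a candidate answer. Returns the pass rate in [0, 1].
    """
    passed = sum(judge(call_llm, q, ref, model(q)) for q, ref in suite)
    return passed / len(suite)
```

The judge prompt itself becomes an artifact to iterate on, which is why bespoke suites beat public leaderboards for enterprise trust.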
5. Go-to-market filter: “Data Advantage × Distribution Advantage”
Databricks now vets AI opportunities by asking:
Do you own a unique data asset competitors cannot easily replicate?
Can you reach users at meaningful scale?
If either answer is “no”, the customer is discouraged from embarking on an expensive custom-AI build (e.g., DIY HR-handbook bots).
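The filter is simple enough to state as code. A hypothetical helper (not a Databricks tool) that makes the "both or neither" logic explicit:

```python
def should_build_custom_llm(has_unique_data: bool, has_distribution: bool) -> str:
    """Apply the 'Data Advantage x Distribution Advantage' filter.

    Both advantages must hold before committing to an expensive
    custom-model build; missing either means think twice.
    """
    if has_unique_data and has_distribution:
        return "build"
    return "think twice"

# A DIY HR-handbook bot: the handbook is not a unique data asset.
print(should_build_custom_llm(has_unique_data=False, has_distribution=True))
# Proprietary data plus an existing user base clears the bar.
print(should_build_custom_llm(has_unique_data=True, has_distribution=True))
```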
6. Reality check on internal AI productivity
Databricks uses copilots everywhere (Cursor, GitHub Copilot, custom sales/HR bots).
Coding time is only ~20 % of an engineer’s week; copilots boost that slice, but they don’t touch the other 80 % (design, alignment, meetings, roadmap).
Meeting summaries, PM copilots, and collaborative agents “still suck” → <10 % productivity lift.
The hardest unsolved problem: an agent that can align 2 000 engineers’ roadmaps and mediate organisational politics.
7. Synthetic data and synthetic evals are the next battlefield
Because customers refuse to label at scale, Databricks is betting on:
Programmatic generation of high-quality instruction / domain pairs.
Automatic generation of domain-specific eval sets + LLM judges.
Tight iteration loops between eval → synthetic data → fine-tune.
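The three bets above form one loop. A hedged sketch of that eval → synthetic-data → fine-tune cycle, where every function is a stub standing in for a real pipeline stage (the "fine-tune" here just memorises pairs, purely for illustration):

```python
def run_evals(model, eval_set):
    """Score the model on a domain-specific eval set (stub: exact match)."""
    return sum(model(q) == a for q, a in eval_set) / len(eval_set)

def generate_synthetic_data(failures):
    """Programmatically expand failing cases into instruction pairs (stub).
    A real system would paraphrase, augment, and filter for quality."""
    return [(q, a) for q, a in failures]

def fine_tune(model, data):
    """Stub fine-tune: memorise the synthetic pairs, fall back to the base."""
    table = dict(data)
    return lambda q: table.get(q, model(q))

def iterate(model, eval_set, target=1.0, max_rounds=5):
    """Tight loop: evaluate, synthesise data from failures, fine-tune."""
    for _ in range(max_rounds):
        if run_evals(model, eval_set) >= target:
            break
        failures = [(q, a) for q, a in eval_set if model(q) != a]
        model = fine_tune(model, generate_synthetic_data(failures))
    return model, run_evals(model, eval_set)
```

The structure, not the stubs, is the point: the eval set drives what synthetic data gets generated, which is why evals sit at the top of the hierarchy in point 10 below.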
8. Reliable AI is multi-year away, not multi-month
Ali places reliability for mission-critical enterprise AI 3–5 years away, not “the next quarter.”
Key blockers:
High-fidelity semantic layers.
Better eval frameworks.
Tooling that can orchestrate data, models, and human oversight at scale.
9. CEO-level automation is still science fiction
Even with full access to emails, Slack, calendar and multimodal context, an “AI chief-of-staff” remains a research project.
Startups should automate well-bounded, low-context tasks first; the CEO role is the worst place to start.
10. What would move the needle the fastest?
Ali’s hierarchy of needs:
1. Robust, domain-specific evals.
2. Cheap generation of high-quality synthetic data aligned to those evals.
3. Tooling to integrate that pipeline seamlessly into enterprise MLOps.
Solve those. Reliability and adoption will follow.

#PerplexityAIBusinessFellowship #Databricks #AliGhodsi #SyntheticData #EnterpriseAI #FutureOfAI #ScalingAI #AILeadership #AIChallenges #SemanticLayers