TL;DR
It's obvious that models perform differently, but the size of the gap surprised us when we tested them in a real-life application.
We ran a simple test in collaboration with an SEO expert; jump ahead to “Our Testing Journey” to see the results.
We've been thinking a lot about which AI models to use for our content needs. It seems obvious in hindsight, but most companies (ourselves included, at first) just default to OpenAI's latest release without really questioning whether it's the best fit.
Our marketing team spent months wrestling with various off-the-shelf AI tools like ChatGPT. The pattern was always the same: spend days iteratively prompting the AI to get the content closer to what we wanted. And even after multiple rounds of refinement, we'd still need a human for that final polish.
Frustrating and inefficient, right?
Seeing others face the same challenges, we spotted a chance to solve this problem. When we set out to build our own tool, we immediately reached for complex solutions like vector databases and RAG systems, but we realized model selection is the foundation: we need to get that right before layering on complexity.
This blog post shares what we learned about choosing the right model for specific AI content creation needs, with insights from our in-house SEO expert. It's probably the easiest yet most overlooked way to get better results from AI. We learned that different models have distinct strengths and weaknesses that significantly impact content quality, often in ways no amount of prompt engineering can overcome.
To move quickly and get real results into human hands as fast as possible, we took a deliberately lightweight approach. Instead of building a complex UI that might slow us down, we created a simple API endpoint that could generate content from different models.
Bonus Tip: To help us move even faster, we leveraged OpenRouter, a platform that provides access to multiple AI models through a single API. This allowed us to rapidly test different models without integrating with multiple providers.
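For illustration, here's a minimal sketch of that fan-out approach. It assumes OpenRouter's OpenAI-compatible endpoint; the model slugs, system prompt, and the generate_drafts helper are illustrative stand-ins, not our production code.

```python
# Minimal sketch: send one content brief to several models through OpenRouter's
# OpenAI-compatible API. Model slugs and prompts here are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder, not a real key
)

MODELS = [
    "deepseek/deepseek-r1",
    "anthropic/claude-3.5-sonnet",
    "openai/gpt-4o-mini",
    "google/gemini-pro-1.5",
]

def generate_drafts(brief: str) -> dict[str, str]:
    """Send the same brief to each model and collect one draft per model."""
    drafts = {}
    for model in MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an SEO content writer."},
                {"role": "user", "content": brief},
            ],
        )
        drafts[model] = response.choices[0].message.content
    return drafts
```

Because OpenRouter speaks the same API as OpenAI, swapping a model in or out of the comparison is a one-line change to the list.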
We tested four popular models:
DeepSeek R1
Claude 3.5 Sonnet
GPT-4o-mini
Gemini
Here's what we found at a high level:
DeepSeek R1 gave us the strongest intro paragraphs and CTAs of the models we tested. For comparison, GPT-4o-mini produced a very brief introduction that reads more like a statement than a true introductory paragraph.
However, DeepSeek tended to pack information densely into paragraphs that would benefit from better spacing and structure.
The sections between the introduction and conclusion often lacked depth, giving the content an unbalanced feel. While the bones of good content were there, the sections would need more development to create a truly comprehensive piece.
Claude 3.5 Sonnet had good structure with well-defined H2 sections, but struggled with overall readability. Our SEO specialist noticed signs of keyword stuffing, which could hurt SEO performance rather than help it. While Claude created logically organized content, it had trouble working keywords into the text naturally.
What stood out about Claude was its distinctly human tone. Among all models tested, it produced content with nuanced phrasing that felt most authentic. Our SEO expert particularly noted this natural writing style during blind evaluations.
GPT-4o-mini created detailed sections with good information density, but completely missed the introduction that was explicitly requested in the brief. Our SEO specialist also found the content structure problematic, with an awkward "Why It Matters/Business Impact" format that disrupted the reading flow. While the overview paragraphs were strong, later sections lacked depth and simply restated the brief without adding value.
Beyond those issues, GPT-4o-mini came across as an all-rounder: a typical "jack of all trades" model that isn't specialized for content marketing tasks the way it excels in other areas like coding and mathematics.
Gemini produced the cleanest formatting of the group but struggled significantly with content development. Paragraphs often lacked cohesion, and readability suffered from abrupt transitions between ideas. Our SEO specialist noted that, despite the polished presentation, Gemini's content would need the most substantial rewriting of all the models tested before it would be suitable for publication.
To an untrained eye, the outputs from the different models looked similar, but our SEO specialist spotted crucial variations in structure, tone, depth, and keyword usage.
None of the models produced publication-ready content in a single generation—each article would require substantial rewriting or further prompting.
In blind testing sessions, our SEO expert preferred Claude's output, citing its "human-friendly tone" as the determining factor. The natural language patterns and nuanced phrasing made Claude's content stand out, despite its tendency toward keyword stuffing.
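If you want to replicate the blind setup, the anonymization step is easy to script. A quick sketch, reusing the hypothetical drafts dict from the earlier example:

```python
import random

def anonymize_for_blind_review(drafts: dict[str, str]) -> tuple[list[str], dict[int, str]]:
    """Shuffle drafts and strip model names so the reviewer sees only numbered texts."""
    items = list(drafts.items())
    random.shuffle(items)
    answer_key = {i: model for i, (model, _) in enumerate(items)}  # kept by the test runner
    texts = [text for _, text in items]  # handed to the reviewer, unattributed
    return texts, answer_key
```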
Each model demonstrated distinct strengths and weaknesses:
DeepSeek R1: strong introductions and CTAs, but dense paragraphs and underdeveloped middle sections
Claude 3.5 Sonnet: the most human-sounding tone, but prone to keyword stuffing
GPT-4o-mini: information-dense sections, but missed the requested introduction and used an awkward structure
Gemini: the cleanest formatting, but the weakest cohesion and depth
This testing highlights an important reality: different AI models excel at different aspects of content creation. The best model for your needs depends on what you value most—whether that's human-like tone, strong introductions, or logical structure.
Based on our SEO specialist's feedback, we're focusing on addressing these common weaknesses:
Inconsistent formatting: Developing standardized prompt templates to ensure consistent structure across all generated content (see the sketch after this list).
Brief sections lacking added value: Adding web scraping to gather supporting context the models can draw on.
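To make the first fix concrete, here's a hypothetical template sketch; the section counts, word ranges, and field names are our illustrative assumptions, not a prescription:

```python
# Hypothetical prompt template: a fixed outline injected into every request
# so each draft follows the same structure. All specifics below are examples.
ARTICLE_TEMPLATE = """Write an article on: {topic}

Follow this exact structure:
1. Introduction: hook the reader and state the problem (2-3 paragraphs).
2. Three to five H2 sections, each 150-250 words, each adding one concrete insight.
3. Conclusion with a clear call to action.

Work these keywords in naturally, without stuffing: {keywords}
"""

def build_prompt(topic: str, keywords: list[str]) -> str:
    """Fill the shared outline with a topic and its target keywords."""
    return ARTICLE_TEMPLATE.format(topic=topic, keywords=", ".join(keywords))
```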
Our next phase of development will focus on addressing these weaknesses; we'll share the details and results in a follow-up post.
The key takeaway from our experiment is clear:
Model selection matters significantly more than most companies realize.
It's often the overlooked foundation that determines the quality and effectiveness of your AI-generated content. By understanding each model's unique strengths and weaknesses, you can make more informed decisions about which AI to use for your specific content needs.
Stay tuned for our next round of tests to see how the AI output improves.