Writing literature reviews is time-consuming and labor-intensive. A new wave of AI tools claims to make the process more efficient by transforming how we search for and synthesize academic content. Industry observers have recently sung the praises of agentic search as OpenAI, Google, Perplexity, and Grok roll out variants of multi-step research tools. These systems generally share common components: an LLM creates a research plan, connects to search engines to find and read relevant web pages, and synthesizes its findings into a report-style output.
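That plan-search-read-synthesize loop is simple enough to sketch. Here is a minimal, product-agnostic sketch of the pattern, where the `llm` and `search` callables are placeholders I'm assuming for illustration rather than any vendor's actual API:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces, not any specific product's API:
# `llm` maps a prompt to generated text; `search` maps a query to
# (title, url, page_text) results.
LLM = Callable[[str], str]
Search = Callable[[str], list[tuple[str, str, str]]]

@dataclass
class ResearchReport:
    plan: list[str]
    notes: list[str]
    report: str

def deep_research(question: str, llm: LLM, search: Search, max_queries: int = 5) -> ResearchReport:
    # 1. Plan: the LLM breaks the question into a handful of search queries.
    plan_text = llm(
        f"Break this research question into at most {max_queries} web search "
        f"queries, one per line:\n{question}"
    )
    queries = [q.strip() for q in plan_text.splitlines() if q.strip()][:max_queries]

    # 2. Search and read: each result is summarized with respect to the question.
    notes = []
    for query in queries:
        for title, url, text in search(query):
            note = llm(
                f"Summarize the parts of this page relevant to: {question}\n\n"
                f"Source: {title} ({url})\n\n{text[:4000]}"
            )
            notes.append(f"{title} ({url}): {note}")

    # 3. Synthesize: a final pass turns the notes into a report-style answer.
    report = llm(
        "Write a cited, report-style synthesis of these research notes:\n\n"
        + "\n\n".join(notes)
    )
    return ResearchReport(plan=queries, notes=notes, report=report)
```

The products themselves layer more on top of this skeleton, but the basic plan-then-search-then-synthesize shape is the one described above.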
Alongside these general-purpose tools, specialized LLM-based systems for academic research have emerged. Products like Consensus, Elicit, and Scite aim to combine traditional literature search capabilities with LLM-powered assistance and synthesis. But evaluating these tools presents a unique challenge—unlike coding tasks with objectively correct outputs, synthesizing academic literature involves subjective interpretation and requires grounding in high-quality sources.
My evaluation therefore focuses more on qualitative assessment than quantitative metrics—which tools produce the most well-written synthesis of the most relevant literature, based on expert judgment. To test these systems in a domain I know well, I asked them to synthesize research on effective headline writing, an area where I've published multiple academic papers.
Methodology
For this test, I wrote a prompt asking each tool to analyze research on headline practices that increase click-through rates, with specific requirements for empirical validation, news site focus, and methodological rigor. Here's the full prompt I used:
Analyze academic research, marketing studies, and industry reports from reputable sources on headline writing practices that demonstrably increase click-through rates. Specifically:
1. Summarize the most empirically validated headline techniques from A/B testing and controlled studies
2. Focus on effectiveness for news sites (as opposed to, e.g., social media or search)
3. Identify headline characteristics with statistically significant impact on engagement metrics
4. Note any contradictory findings or limitations in the research
5. Provide example headlines illustrating best practices based on the evidence
Present findings in a concise format with citations to original research where available, prioritizing studies with larger sample sizes and methodological rigor.
I tested this prompt across seven AI research tools: four general-purpose LLM search products (ChatGPT Deep Research, Gemini Deep Research, Perplexity Deep Research, and Claude Search) and three academic-specific research tools (Scite, Consensus, and ScholarQA). To ensure fair comparison, I accepted each tool's first output unless it explicitly requested clarification.
To evaluate the quality of these outputs, I ran a round of blind-ish side-by-side comparisons, à la Chatbot Arena, and ranked the tools (only blind-ish because I had to copy-paste each output from its source).
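To make the ranking step concrete, here is a minimal sketch of how pairwise judgments can be turned into an ordering. It uses simple net-win counting rather than the Elo-style ratings Chatbot Arena computes, which would be overkill for seven tools and a single judge, and the judgments listed are placeholder examples, not my actual comparison log:

```python
from collections import Counter

# Placeholder judgments for illustration only: each tuple is one blind
# side-by-side comparison, recorded as (winner, loser). My actual
# comparisons were done by hand, not in code.
judgments = [
    ("ScholarQA", "Perplexity Deep Research"),
    ("Gemini Deep Research", "ChatGPT Deep Research"),
    ("ScholarQA", "Claude Search"),
]

def rank_by_net_wins(judgments: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Order tools by wins minus losses across all pairwise comparisons."""
    score: Counter[str] = Counter()
    for winner, loser in judgments:
        score[winner] += 1
        score[loser] -= 1
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

for tool, net_wins in rank_by_net_wins(judgments):
    print(f"{tool}: {net_wins:+d}")
```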
Results
You can read the full output from each tool in this gist, but this exercise left me with three main takeaways:
ScholarQA and Gemini Deep Research performed the best. What set these two apart from the other options came down to sourcing and writing style. ScholarQA offered the strongest grounding in academic literature, while Gemini did the best job of blending research studies with industry insights. In addition, both tools produced write-ups that struck a good balance between specific headline-writing techniques, limitations and contrary findings, and broader context around headline writing online—exactly the elements I was looking for! Here, for example, is an excerpt from ScholarQA's response:
Headlines in news sites serve multiple functions simultaneously, from summarizing content and attracting attention to signaling the publication's voice and optimizing for search engines. In the online environment, headlines have become increasingly important as they often represent the only visible part of articles in social media feeds, microblog posts, and news aggregation sites (Szymanski et al., 2017). This heightened importance has complicated the work of news editors tasked with crafting optimal headlines for multiple contexts.
Other Deep Research/search tools suffered from poor sourcing. Claude, Perplexity, and OpenAI all failed to identify good sources for this prompt. Claude's search is brand new and is not marketed as a research tool, so I'm not surprised by this. However, after reading so much praise for OpenAI's Deep Research, I was surprised to see that ChatGPT did not dig into any relevant academic literature. Rather, it focused on industry publications and advertising and marketing blogs that only tangentially addressed my query. These missteps really drive home just how crucial sourcing and ranking are for this research paradigm.
Academic-specific tools are valuable for deep dives. From this exploration, I also became more familiar with the role that tools like Scite and Consensus might play in a literature review. These tools offered the richest interfaces by far for exploring the academic literature, pairing search results for a query with a brief synthesis that ties them together. That makes them a great jumping-off point for situating a lit review in an unfamiliar area, though they don't have fully agentic search and synthesis capabilities (yet).
Verdict: A good starting point for exploring the literature
After testing these AI research tools on a topic I know well, my assessment is pragmatic: they're useful starting points, not complete solutions. The standout performers—ScholarQA and Gemini Deep Research—demonstrate what's possible when these systems get both search and synthesis right.
What separates effective research tools from mediocre ones comes down to source quality. Even advanced LLMs produce shallow analysis when working with subpar sources, as seen with some of the general-purpose tools that failed to incorporate relevant academic literature.
For researchers, these systems offer the most value as discovery tools—mapping unfamiliar territory, surfacing contradictions, and identifying potentially valuable sources. They complement rather than replace the critical evaluation and synthesis that remains fundamentally human work.
The technology continues to evolve rapidly, but for now, consider these tools as research accelerators rather than automated literature reviewers.