Building Semantic Search for News: A Technical Teardown of My NLWeb Implementation
A Developer's Guide to Transforming NLWeb from General Purpose to News-Focused
Last month, I published a post demonstrating how Microsoft's NLWeb framework could bring semantic search to news publishers. In this post, I'll give a more detailed teardown of the key changes I made to get NLWeb running for news content, along with considerations for anyone thinking about implementing something similar.
The Schema Challenge: Teaching NLWeb About News
The biggest technical hurdle—and the most important change I made—was implementing proper schema support for news articles. NLWeb uses schema.org definitions to understand different types of content and the metadata associated with them. Think of schemas as structured templates that tell the system "this is what a podcast looks like" or "this is what a recipe contains."
Out of the box, NLWeb doesn't include a schema for news articles. When I first pointed it at an RSS feed from a news site, the system defaulted to treating articles as podcast episodes. This meant that the metadata didn’t make any sense, and the LLM kept referring to search results as “episodes” or “shows.”
The fix required implementing the NewsArticle schema from schema.org. Here's what that looked like in practice:
import xml.etree.ElementTree as ET
from typing import Any, Dict, List, Optional

def parse_rss_2_0_as_news(root: ET.Element, feed_url: Optional[str] = None) -> List[Dict[str, Any]]:
    """Parse an RSS 2.0 feed into schema.org NewsArticle dictionaries."""
    articles: List[Dict[str, Any]] = []
    channel = root.find("channel")
    publisher = channel.findtext("title", default="") if channel is not None else ""
    items = channel.findall("item") if channel is not None else []
    for item in items:
        headline = item.findtext("title", default="")
        pub_date = item.findtext("pubDate", default="")
        # Create NewsArticle schema
        article = {
            "@type": "NewsArticle",
            "headline": headline,
            "datePublished": pub_date,
            "publisher": publisher,
        }
        # Extract author information
        author_elem = item.find("author")
        if author_elem is not None and author_elem.text:
            article["author"] = {"@type": "Person", "name": author_elem.text}
        # Extract category/section
        category_elem = item.find("category")
        if category_elem is not None and category_elem.text:
            article["articleSection"] = category_elem.text
        articles.append(article)
    return articles

But I didn't want to completely eliminate podcast support — many news organizations produce both articles and audio content. So I added intelligent feed detection that checks for iTunes namespaces and audio enclosures to automatically route content to the appropriate parser. This means a publisher can ingest both their main news feed and their podcast feed without manual configuration.
Frontend Flexibility: Building for Different Reader Needs
One of NLWeb's biggest strengths is that it's "headless"—the API makes it possible to build whatever interface makes sense for your audience. The backend provides the intelligence; you control the experience.
For my prototype, I implemented three key features that demonstrate this flexibility:
Display mode toggle: Readers can choose between three different ways to interact with search results:
List results: A traditional ranked list for readers who want to browse sources
Summarize: A hybrid view with an AI-generated summary followed by source articles
Generate answer: A direct conversational response with citations
The implementation was straightforward — just a radio button that changes a single API parameter:
if (generateMode === 'list') {
    articles = data.results || [];
} else if (generateMode === 'summarize') {
    articles = data.results || [];
    if (data.summary && data.summary.message) {
        summaryContent.textContent = data.summary.message;
        summarySection.style.display = 'block';
    }
} else if (generateMode === 'generate') {
    articles = (data.nlws && data.nlws.items) ? data.nlws.items : [];
    if (data.nlws && data.nlws.answer) {
        generatedAnswerContent.textContent = data.nlws.answer;
        generatedAnswerSection.style.display = 'block';
    }
}

Relevance score filtering: Every search result comes back with a relevance score between 0 and 1. I exposed these scores in the UI and added a slider for filtering. This gives power users fine-grained control while still working out of the box for casual readers. You could imagine different approaches here—maybe a backend threshold that only shows results above 0.7, or using scores to determine which articles get premium placement.
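The backend-threshold variant is a one-liner. A minimal sketch, assuming each result dict carries a 0-to-1 `score` field as in my prototype's API responses:

```python
from typing import Any, Dict, List

def filter_by_relevance(results: List[Dict[str, Any]], threshold: float = 0.7) -> List[Dict[str, Any]]:
    """Keep only results at or above the relevance threshold.
    The 'score' key is an assumption from my prototype's responses."""
    return [r for r in results if float(r.get("score", 0.0)) >= threshold]
```

Whether this runs server-side or in the UI slider is purely a product decision; the scores are the same either way.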
Conversation history: I implemented basic conversation memory so the system maintains context across queries. Ask about “structured outputs” and then follow up with “what about for small LMs?”—the system understands you're still talking about structured outputs. This could be expanded into a full chatbot experience or stripped out entirely for simpler one-off searches.
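On the wire, conversation memory is just the client echoing prior queries back to the server so the backend can decontextualize the follow-up. A sketch of building that request — the endpoint path and parameter names (`query`, `prev`, `mode`) reflect my deployment, so verify them against your own NLWeb setup:

```python
from typing import List
from urllib.parse import urlencode

def build_ask_url(base_url: str, query: str, history: List[str], mode: str = "list") -> str:
    """Build an /ask request URL that carries previous queries so the
    server can resolve follow-ups like 'what about for small LMs?'.
    Parameter names are assumptions from my prototype, not a spec."""
    params = {"query": query, "mode": mode}
    if history:
        # Previous queries sent as a comma-separated list
        params["prev"] = ",".join(history)
    return f"{base_url}/ask?{urlencode(params)}"
```

The client simply appends each submitted query to `history`; stripping conversation memory out means passing an empty list.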
Managing Token Limits for Embeddings
Embedding models often have much shorter context windows than LLMs: while OpenAI’s GPT-4.1 can hold over a million tokens in context, its text-embedding-3 models max out at roughly 8K tokens.
To deal with this, I added a helper function to truncate text before embedding, using tiktoken:
import tiktoken

def _truncate_text_by_tokens(text: str, model: str, max_tokens: int = 8142) -> str:
    """
    Truncate text to fit within the token limit for the given model.

    Args:
        text: The text to truncate
        model: The model name to get the appropriate encoding
        max_tokens: Maximum number of tokens (default: 8192 - 50 = 8142)

    Returns:
        Truncated text that fits within the token limit
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to cl100k_base encoding if the model is not recognized
        encoding = tiktoken.get_encoding("cl100k_base")

    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text

    # Truncate tokens and decode back to text
    return encoding.decode(tokens[:max_tokens])

This ensures that articles can be ingested without errors, though there's definitely room for smarter truncation strategies.
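One such strategy: rather than silently dropping everything past the limit, split long articles into overlapping token windows and embed each chunk separately, so later sections stay searchable. A sketch — the overlap size is a guess, and the encoding object is whatever tiktoken returns, passed in so the function stays testable:

```python
from typing import List

def chunk_text_by_tokens(text: str, encoding, max_tokens: int = 8142, overlap: int = 200) -> List[str]:
    """Split text into overlapping token windows instead of truncating.
    'encoding' is any object with encode/decode (e.g. a tiktoken
    Encoding); the overlap value is an illustrative default."""
    tokens = encoding.encode(text)
    chunks = []
    # Step forward by the window size minus the overlap
    step = max(1, max_tokens - overlap)
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Each chunk then gets its own embedding, typically tagged with the parent article's URL so results can be deduplicated at query time.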
Features Worth Exploring
While implementing the core news search functionality, I also experimented with a couple of NLWeb features that publishers might find valuable:
Memory hooks: NLWeb can be configured to remember user preferences and context over time. If a reader mentions they're interested in local politics or environmental issues, that context could inform all future searches. The framework provides the hooks; you just need to wire them up to a database with user IDs. Given the push for personalization in news products, this could be a powerful retention tool.
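The database side of that wiring is straightforward. A minimal sketch — the table layout is my own invention for illustration, not an NLWeb schema, and a real implementation would hook these functions into the framework's memory callbacks:

```python
import sqlite3
from typing import List

def open_preference_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create a tiny per-user preference store (hypothetical layout)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS preferences (user_id TEXT, topic TEXT)")
    return conn

def remember_topic(conn: sqlite3.Connection, user_id: str, topic: str) -> None:
    # Called when a reader states an interest (e.g. from a memory hook)
    conn.execute("INSERT INTO preferences VALUES (?, ?)", (user_id, topic))
    conn.commit()

def topics_for(conn: sqlite3.Connection, user_id: str) -> List[str]:
    rows = conn.execute("SELECT topic FROM preferences WHERE user_id = ?", (user_id,))
    return [r[0] for r in rows]
```

Remembered topics could then be appended to query context, used to boost matching sections, or surfaced in the UI as saved interests.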
Relevance pre-assessment: The framework includes an optional pre-search step where an LLM assesses whether a query is even relevant to your content. The thinking is to short-circuit spam or off-topic queries before expensive embedding searches. In my testing, this was overly aggressive (flagging legitimate news queries as irrelevant), but with better prompting it could help manage costs for high-traffic sites.
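"Better prompting" here mostly means biasing the gate toward letting queries through. As a sketch, the prompt wording below is entirely my own (NLWeb ships its own, which I found too aggressive); the idea is to make RELEVANT the default answer and reserve rejection for clear spam:

```python
def relevance_precheck_prompt(query: str, site_description: str) -> str:
    """Build a yes/no gating prompt for an LLM relevance pre-check.
    The wording is illustrative, not NLWeb's shipped prompt."""
    return (
        "You are screening search queries for a news site.\n"
        f"Site coverage: {site_description}\n"
        f"Query: {query}\n"
        "Answer RELEVANT unless the query is clearly spam or entirely "
        "unrelated to news. When in doubt, answer RELEVANT.\n"
        "Answer with exactly one word: RELEVANT or IRRELEVANT."
    )
```

The one-word output keeps the pre-check cheap: it can run on a small, fast model, and only queries it passes reach the embedding search.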
Closing Thoughts: Building on Solid Foundations
After spending time with NLWeb, I can say it's not a "dump in your content and it magically works" solution. You'll need to:
Understand your content structure and map it to appropriate schemas
Make decisions about UI/UX that fit your audience
Handle edge cases like long articles or mixed content types
Think through personalization and privacy tradeoffs
But it's still far less engineering effort than building semantic search from scratch. You're not writing embedding logic, building vector databases, or implementing retrieval algorithms. You're configuring and customizing a pipeline built to leverage best practices.
The real value is in the control you maintain. Your data stays in your database. Your choice of LLM providers. Your UI decisions. Your user experience. This framework feels like a practical path for publishers who want to experiment with semantic search without going all in on a single vendor's vision of the future.
For newsrooms considering semantic search, NLWeb offers a compelling middle ground: sophisticated enough to deliver real value to readers, flexible enough to evolve with your needs, and open enough to avoid platform lock-in.
The code for my implementation is available on GitHub, and I'm happy to answer questions about specific implementation details. The journalism industry needs more open tools like this — frameworks that offer genuinely useful functionality while respecting our content and independence.

