# Reaching SOTA on SimpleQA with a Specialized Web Extraction Model
A common problem in web scraping is that specialized scrapers do not generalize across websites, while general-purpose LLMs are far too expensive and slow to run at scale with a high bar for quality. To solve this, we trained [Schematron](https://inference.net/blog/Schematron), a family of models that specialize in one specific task: extracting structured JSON from HTML given a schema.
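To make the task concrete, here is a minimal sketch of the input and output shape. The schema, HTML, and example output below are illustrative placeholders, not Schematron's actual API.

```typescript
// Illustrative only: the schema and HTML are hypothetical placeholders.
const schema = {
  type: "object",
  properties: {
    productName: { type: "string" },
    priceUsd: { type: "number" },
    inStock: { type: "boolean" },
  },
  required: ["productName"],
};

const html = `<html><body><h1>Acme Widget</h1><span class="price">$19.99</span></body></html>`;

// Given the schema and raw HTML, the model returns schema-conforming JSON, e.g.:
// { "productName": "Acme Widget", "priceUsd": 19.99 }
```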
### The LLM-as-Judge Challenge
After training the Schematron models, our next task was to evaluate them effectively. Because there are no existing benchmarks for this specific task, we first turned to LLM-as-a-Judge. The problem is that LLM-as-a-Judge has known limitations that make it difficult to directly compare models with similar scores. For one, integer scoring (1-5) loses granularity when the true accuracy falls somewhere between two integer scores. It's also sensitive to the exact evaluation prompt, and in long-context evaluations the judge model may hallucinate or be susceptible to [context rot](https://research.trychroma.com/context-rot).
### SimpleQA as the Answer
A more appropriate evaluation method is SimpleQA, a benchmark from OpenAI that measures the factuality of LLMs. It consists of short, fact-seeking questions whose answers are available on the web. For base LLMs, the benchmark largely tests memorization. When the models are given web search tools, however, it instead evaluates how well they craft queries and synthesize information.
In our case, we wanted to see whether extracting structured web data relevant to the query would improve factuality, and how factuality changes when we use frontier models for extraction versus Schematron. This has practical implications for web retrieval: context rot and high token costs per web query are common problems in retrieval pipelines, so showing that extracting only the query-relevant data from each page can enhance factuality would mean much of that cost is unnecessary.
Our pipeline to answer these two questions was the following:
For each 'problem' in the SimpleQA dataset, we use a primary frontier LLM to generate a search query. The prompt looks like this:
```
You are a search query generator. Convert the user's question into an effective search query for a web search engine. Return only the search query, nothing else. DO NOT TRY TO ANSWER THE QUESTION. DO NOT INCLUDE ANY COMMENTARY.
```
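As a rough sketch, the query-generation call might look like the following, assuming an OpenAI-compatible chat completions client; the model name and helper function are placeholders rather than our exact setup.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical helper: turn a SimpleQA question into a web search query.
async function generateSearchQuery(question: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-5-nano", // placeholder primary frontier model
    messages: [
      {
        role: "system",
        content:
          "You are a search query generator. Convert the user's question into an effective search query " +
          "for a web search engine. Return only the search query, nothing else. " +
          "DO NOT TRY TO ANSWER THE QUESTION. DO NOT INCLUDE ANY COMMENTARY.",
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content?.trim() ?? question;
}
```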
At the same time, we use the same primary frontier LLM to generate a schema that will be used for web extraction. The prompt is:
```
Write a JSON schema that can be used to allow an LLM to extract structured data from an HTML page that answers a user question which will be listed below. Your job is to consider all of the different types of properties that would be relevant to the question AND also be rendered on web pages that may be returned from the user's query. The JSON object that you return should be a valid http://json-schema.org schema spec. ONLY RESPOND WITH THE JSON SCHEMA. DO NOT REPLY WITH ANY COMMENTARY.
Schemas can include the following properties:
${schemaSpec}
<BEGIN-USER-QUERY>
${userQuery}
<END-USER-QUERY>
```
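The schema-generation call follows the same pattern, reusing the `client` from the previous sketch. Below is a sketch plus the kind of schema the model might return for a question like "Who won the 2014 Tour de France?"; the helper and the example schema are illustrative.

```typescript
// Hypothetical helper: ask the primary LLM for an extraction schema.
// The prompt is abbreviated here; the full template is shown above.
async function generateSchema(question: string, schemaSpec: string): Promise<object> {
  const prompt =
    "Write a JSON schema that can be used to allow an LLM to extract structured data from an HTML page " +
    "that answers a user question which will be listed below. ONLY RESPOND WITH THE JSON SCHEMA.\n" +
    `Schemas can include the following properties:\n${schemaSpec}\n` +
    `<BEGIN-USER-QUERY>\n${question}\n<END-USER-QUERY>`;
  const completion = await client.chat.completions.create({
    model: "gpt-5-nano", // placeholder primary frontier model
    messages: [{ role: "system", content: prompt }],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}

// For "Who won the 2014 Tour de France?", a plausible generated schema:
// {
//   "type": "object",
//   "properties": {
//     "eventName": { "type": "string" },
//     "year": { "type": "integer" },
//     "winnerName": { "type": "string" },
//     "winnerNationality": { "type": "string" }
//   },
//   "required": ["winnerName"]
// }
```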
We retrieve 10 pages for the query with a search engine (either a SERP or Exa) and extract JSON matching our schema from each page with either Schematron or another LLM.
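Here is a sketch of the search-and-extract step, assuming the Exa JavaScript SDK and an OpenAI-compatible endpoint serving Schematron. The endpoint, model identifier, and the way the schema is passed are all assumptions rather than the documented Schematron interface.

```typescript
import Exa from "exa-js";
import OpenAI from "openai";

const exa = new Exa(process.env.EXA_API_KEY);

// Placeholder: any OpenAI-compatible endpoint serving Schematron.
const schematron = new OpenAI({
  baseURL: process.env.SCHEMATRON_BASE_URL,
  apiKey: process.env.SCHEMATRON_API_KEY,
});

async function searchAndExtract(query: string, schema: object): Promise<object[]> {
  // Fetch 10 results with page contents from Exa.
  // (Exa returns parsed page text; a SERP-based pipeline would fetch raw HTML instead.)
  const { results } = await exa.searchAndContents(query, { numResults: 10, text: true });

  // Extract schema-conforming JSON from each page in parallel.
  return Promise.all(
    results.map(async (page) => {
      const completion = await schematron.chat.completions.create({
        model: "schematron-8b", // placeholder model identifier
        messages: [
          { role: "system", content: `Extract JSON matching this schema:\n${JSON.stringify(schema)}` },
          { role: "user", content: page.text ?? "" },
        ],
      });
      return JSON.parse(completion.choices[0].message.content ?? "{}");
    })
  );
}
```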
Then we pass this JSON to the primary frontier LLM as context to answer the problem, and evaluate the correctness of that answer.
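And a sketch of the final answer step, reusing the `client` from the query-generation sketch; grading then compares the answer against the SimpleQA gold answer using the benchmark's grader, which labels answers correct, incorrect, or not attempted.

```typescript
// Hypothetical helper: answer the SimpleQA question using only the extracted JSON as context.
async function answerWithContext(question: string, extracted: object[]): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-5-nano", // placeholder primary frontier model
    messages: [
      {
        role: "system",
        content:
          "Answer the question using only the structured context below.\n\n" +
          JSON.stringify(extracted, null, 2),
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```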
This way we can test Schematron's actual task performance instead of subjective quality.

### Results
We found that GPT-5 Nano (a tiny LLM) wasn't very factual without a search tool, but with Exa as a search tool and Schematron-8B for page extraction, its accuracy on SimpleQA rose from 8.54 to 64.15. Interestingly, the search engine used mattered more than the extraction LLM: while Schematron-8B marginally outperformed Gemini 2.5 Flash for extraction when the primary model and search engine were held constant (about a 2% improvement), switching from a SERP to Exa for search improved results by around 19%.

These results suggest that Schematron-8B performs at least as well as Gemini 2.5 Flash on this task, if not better, although the comparison isn't perfectly apples to apples: Schematron doesn't have a prompt as a hyperparameter, while Gemini 2.5 Flash was used with an unoptimized prompt.
We found this to be a positive result given that Schematron-8B is 10x cheaper than Flash and several times faster (full benchmarks [in the announcement blog](https://inference.net/blog/Schematron)).
### Future Work
While evaluating Schematron on SimpleQA demonstrated proficiency at this task and showed that intelligent extraction from web documents can ground LLMs in a more token-efficient manner, there are many improvements that could be made.
For one, generating a structured schema for every query is slow, and rewriting the query for every user request adds further latency to the retrieval step. While parallelizing these calls helps (see the sketch below), the remaining latency still makes this method unsuitable for many production settings, including those with users waiting on a response.
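For example, the query rewrite and schema generation can run concurrently, so only the slower of the two calls contributes to latency. A sketch using the hypothetical helpers from above:

```typescript
// Run query rewriting and schema generation concurrently; total added latency
// is roughly max(queryGen, schemaGen) instead of their sum.
// `question` and `schemaSpec` are assumed to be defined as in the earlier sketches.
const [searchQuery, schema] = await Promise.all([
  generateSearchQuery(question),
  generateSchema(question, schemaSpec),
]);
```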
Instead of this process of query->schema->JSON extraction, we can imagine a model that extracts directly from a page based on a query.
If you're interested in training models to solve real-world problems like this, come work with us!