
Gemini 3 Benchmark Test: 174% Iteration Efficiency Improvement, 97% Accuracy

2025.11.28

Reproducing factors from research reports has long been a standard way for researchers to expand their analytical frameworks and identify new alpha-generating opportunities. The traditional approach relies on researchers reading reports word by word and coding factors manually, which is time-consuming, labor-intensive, and prone to omissions. With the development of large language models, the industry has begun experimenting with using AI to reproduce research reports automatically. However, inaccurate code logic and long processing times have consistently kept this application stuck between experimentation and production.

Starfish AI is an intelligent research-report factor mining and analysis tool developed by DolphinDB. Built around a high-performance large language model (DeepSeek), it combines natural language processing with the business logic of quantitative research, streamlining the entire workflow from uploading research reports, parsing factors, and generating code through backtesting and evaluation to report output, providing efficient and precise intelligent assistance for quantitative research. While Starfish AI has demonstrated solid performance, two issues surfaced during actual production use. First, iteration took too long: in previous versions, an average of 6.8 iterations was required to obtain runnable code. Second, the accuracy of generated code was insufficient: only 20% of generated code both conformed to the research report's actual logic and was executable. In addition, the model's misreading of report language caused some factors to be missed entirely.

With the release of Gemini 3 in November, Starfish AI's performance after integrating the new model has been impressive. We selected 20 research reports covering different styles as a test suite to evaluate core capabilities such as factor reproduction, code generation, and accuracy. The results showed that with Gemini 3, Starfish AI achieved a factor code pass rate (syntax correctness) of 97%, code logic accuracy of 50%, and an average iteration efficiency improvement of 174%, representing a significant advancement over previous LLMs. In this article, we take a closer look.

01 Gemini 3 Benchmark Test: Starfish AI Factor Mining Performance Leap

In Starfish AI, users can directly upload PDF research reports, and the system automatically performs factor identification and code generation.

We used DeepSeek V3.1, a model with strong all-around performance, as the comparison baseline. After optimizing prompts and workflows against DeepSeek V3.1, we selected 20 out-of-sample research reports (covering styles such as momentum, value, and event-driven) as the test set. Under identical hardware environments, data sources, and backtesting frameworks, we ran both Gemini 3 and DeepSeek V3.1, recording core metrics including factor mining success rate, iteration count, and code accuracy.

Test Results: Gemini 3 Leads Across Factor Mining Success and Efficiency

The results show that, compared to DeepSeek V3.1, Gemini 3 delivered significant breakthroughs along the two core dimensions of factor mining: success rate and efficiency.

  • Factor Coverage Completeness: Gemini 3 identified 247 factors across the 20 research reports, mining approximately 96% (247/250) of the factors, while DeepSeek V3.1 identified only 76% of factors.
  • Code Generation: Of the 247 identified factors, Gemini 3 successfully generated code for 240, a 97% success rate, nearly 8 percentage points above DeepSeek V3.1's 89% (172/192), greatly reducing the risk of outright task failure.
  • Iteration Efficiency: The average number of iterations for code generation with Gemini 3 decreased to 2.33, compared to 6.3 with DeepSeek V3.1, representing a 174% efficiency improvement.
  • Actual Factor Reproduction: Moving from "being able to generate code" to "correctly reproducing research report logic" involves multiple hurdles, including data alignment, parameter calibration, and logic verification. We manually sampled one factor from each of 15 research reports for testing. Gemini 3 achieved an actual success rate of approximately 50%, a significant breakthrough from DeepSeek's approximately 20%.

Looking at the details, Gemini 3's advantage lies in its rigor in staying true to the original text. When reproducing factor formulas, it replicates the mathematical expressions in research reports more faithfully, avoiding DeepSeek's common tendency to simplify formulas. In code reproduction, Gemini 3 more often produces logic that is highly consistent with the research report. Its most common errors stem from unfamiliarity with the argument counts or argument formats of certain DolphinDB functions, and these are typically fixed within 1–2 iterations.

In rare scenarios, Gemini 3 may introduce unnecessary rolling windows, fail to use built-in DolphinDB functions such as mbeta or mcorr (reconstructing the same computation through cumbersome manual steps instead), or occasionally misread sophisticated grouping logic such as context by combined with interval. For such complex formulas, a "human + AI" collaborative model is still required: in quantitative research, human experts must retain the final checkpoint for creative judgment and risk validation.
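To illustrate the kind of simplification these built-ins enable, here is a hedged sketch in Python/pandas (not DolphinDB script) of a rolling beta factor. The one-line version using rolling covariance and variance mirrors what a single mbeta call does in Dlang, while the loop version mirrors the cumbersome manual reconstruction an LLM sometimes produces; both compute the same values.

```python
import numpy as np
import pandas as pd

def rolling_beta_builtin(y: pd.Series, x: pd.Series, window: int) -> pd.Series:
    """Rolling beta via built-in rolling cov/var -- analogous to a single mbeta call."""
    return y.rolling(window).cov(x) / x.rolling(window).var()

def rolling_beta_manual(y: pd.Series, x: pd.Series, window: int) -> pd.Series:
    """The same beta rebuilt window by window -- the cumbersome manual path."""
    out = np.full(len(y), np.nan)
    for i in range(window - 1, len(y)):
        ys = y.iloc[i - window + 1 : i + 1]
        xs = x.iloc[i - window + 1 : i + 1]
        out[i] = np.cov(ys, xs, ddof=1)[0, 1] / np.var(xs, ddof=1)
    return pd.Series(out, index=y.index)

# Synthetic example: y is built with a true beta of 1.5 plus small noise.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=100))
y = 1.5 * x + pd.Series(rng.normal(scale=0.1, size=100))
a = rolling_beta_builtin(y, x, 20)
b = rolling_beta_manual(y, x, 20)
```

Both paths agree numerically; the built-in form is shorter, faster, and harder to get wrong, which is exactly why generated code should prefer functions like mbeta over hand-rolled loops.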

02 Dlang Script Code Generation: Model Capabilities Enhance Tool Performance

The core capability of Starfish AI lies in converting natural-language factor descriptions into high-performance, executable Dlang formula code. Dlang, DolphinDB's scripting language, is known for its high performance and vectorized computation, so the underlying LLM's ability to understand factor logic and generate correct Dlang directly determines the tool's user experience.
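As a hedged illustration of what converting a natural-language factor into vectorized code looks like (sketched in Python/pandas rather than Dlang, with a hypothetical toy panel), a report description such as "the 20-day return, skipping the most recent day, computed per stock" collapses into one grouped, vectorized expression; the per-symbol grouping plays the role that the context by clause plays in DolphinDB script.

```python
import numpy as np
import pandas as pd

# Hypothetical panel of daily closes for two symbols, for illustration only.
df = pd.DataFrame({
    "symbol": ["A"] * 30 + ["B"] * 30,
    "close": np.concatenate([np.linspace(100, 130, 30), np.linspace(50, 45, 30)]),
})

# "20-day momentum, skipping the most recent day": close[t-1] / close[t-21] - 1,
# computed independently per stock (the role `context by` plays in Dlang).
grouped = df.groupby("symbol")["close"]
df["mom_20_1"] = grouped.shift(1) / grouped.shift(21) - 1
```

The entire factor is two vectorized lines with no explicit date loop; a correct Dlang translation has the same shape, which is why vectorized generation is the capability being benchmarked here.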

After switching to Gemini 3, we conducted rigorous benchmark tests on the logical accuracy of generated Dlang code. Across 1,481 test questions covering various financial computing scenarios, Gemini 3 improved code-logic accuracy from DeepSeek R1's 17% to 34%, roughly doubling it. This leap is precisely the source of Starfish AI's performance breakthrough. In short, Gemini 3 has roughly a one-in-three chance of producing directly runnable, correct code; in the remaining cases it still provides a highly usable logical framework.

It is important to note that 34% is not a theoretical ceiling. As DolphinDB injects more Dlang best practices and financial-computing paradigms into the training process as domain knowledge, this figure will continue to rise: the leap in tool efficiency ultimately stems from the upgrade in the underlying model's capabilities.

03 Starfish AI: The Evolution of "Understanding"

The introduction of Gemini 3 has significantly enhanced Starfish AI's factor mining: deeper semantic understanding, more accurate code logic, and higher iteration efficiency. Factor mining, however, is just one facet of Starfish AI's broader capabilities. Starfish AI is an end-to-end quantitative research solution built by DolphinDB, covering factor calculation, evaluation and analysis, strategy backtesting, performance attribution, and workflow management, forming a complete closed loop from factor research to strategy execution. On this foundation, its AI capability matrix delivers intelligent upgrades such as automated factor code generation, one-click strategy logic conversion, and intelligent data-analysis script writing.

The deep integration of DolphinDB with cutting-edge LLMs has substantially expanded the efficiency boundaries of quantitative research. Starfish AI is currently available for trial to professional financial institutions. Interested parties are welcome to apply for a trial at: https://dolphindb.cn/product