
A new AI coding challenge just published its first results – and they aren’t pretty

A new AI coding challenge has revealed its first winner and set a new bar for AI-powered software engineers.

On Wednesday at 5 p.m., the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.

“We’re glad we built a benchmark that is actually hard,” said Konwinski. “Benchmarks should be hard if they’re going to matter,” he continued, adding: “Scores would be different if the big labs had entered with their biggest models. But that’s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that.”

Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.

Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a test of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12. The K Prize organizers then built the test using only GitHub issues flagged after that date.

The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier “Verified” test and 34% on its harder “Full” test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he said, “because we expect people to adapt to the dynamics of competing on this every few months.”

It might seem like an odd place to fall short, given the wide range of AI coding tools already available to the public, but with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

Princeton researcher Sayash Kapoor put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell whether the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop,” he says.

For Konwinski, it’s not just a better benchmark but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”

2025-07-24 00:00:00