Day: October 1, 2025

The tasks resemble those that lawyers, doctors, financial analysts, and management consultants solve for a living. One asks for a diagnosis of a six-year-old patient based on nine pieces of multimedia evidence; another asks for legal advice on a musician’s estate; a third calls for a valuation of part of a healthcare technology company.

Mercor, which claims to supply “expert data” to every top AI company, says that it spent more than $500,000 to develop 200 tasks that test whether AIs “can perform knowledge work with high economic value” across law, medicine, finance, and management consulting. The resulting AI Productivity Index (APEX), published Wednesday, lists among its co-authors a former global managing director of McKinsey, a former dean of Harvard Business School, and a Harvard Law School professor, who advised on the design and scope of the tasks in their respective domains, according to Mercor. APEX is “focused on going very deep,” says Brendan Foody, the company’s 22-year-old CEO. “How do we get very comprehensive about what it means to be a consultant or a banker or a doctor or lawyer?”

To create the tasks, Mercor contracted white-collar professionals whose former employers include top banks (Goldman Sachs, JPMorgan), consulting firms (McKinsey, Boston Consulting Group), law firms (Latham & Watkins) and hospitals (Mount Sinai). They average 7.25 years of professional experience, and their pay at Mercor is competitive with that of their previous, highly prestigious employers. Mercor’s website advertises an average rate of $81 per hour, reaching over $200 per hour (equivalent to an annual salary of about $400,000) for “Senior Domain Experts,” who need at least four years of professional experience to apply.

“It’s hard to imagine a better hourly job from a pay perspective,” says Matt Seck, a former investment banking analyst at Bank of America, who is contracted by Mercor to write finance tasks similar to those included in the paper.

Benchmarks have long been used to assess AI capability, but directly quantifying AI models’ ability to do economically useful work represents a “paradigm shift,” says Osvald Nitski, one of the paper’s authors. On Mercor’s benchmark, “getting 100% would mean that you’d basically have an analyst or an associate in a box that you could go and send tasks to, and then they deliver it to the requirements of a partner, or an MD, or whoever would be grading the work of that person,” says Nitski.

The models aren’t there yet, but they are improving fast. OpenAI’s GPT-4o, released in May 2024, scored 35.9% on the benchmark. GPT-5, released just over a year later, achieved 64.2%, the top score on the benchmark. Getting 64.2% on the benchmark doesn’t mean that GPT-5 is delivering 64.2% of the value of a human worker: work that doesn’t hit 100% “might be effectively useless,” the paper’s authors write. GPT-5 got full marks on only two of the 200 tasks, one in law and one in investment banking, which “primarily involve basic reasoning, simple calculations, and a lot of basic information searching,” according to Mercor.

Even if a model hits 100% on Mercor’s benchmark, it would probably make a poor substitute for human professionals. The tasks in Mercor’s benchmark focus on “well scoped deliverables,” such as making diagnoses or building financial models, rather than more open-ended tasks which might admit multiple right answers. This requires that the task descriptions include numerous assumptions needed to ensure that the desired output is well specified. The AIs’ outputs are entirely text-based, meaning that the benchmark doesn’t test AIs’ ability to use a computer, the way that a human worker would. (Mercor says that future versions of APEX will address these limitations.) And drafting the lengthy prompts needed for models to complete the tasks “would be more tedious than just doing it yourself,” says Seck.

Still, there are signs that AI models are becoming competitive with humans. Another benchmark, published Thursday, Sept. 25, by OpenAI, showed that expert human evaluators preferred an AI’s work to human work 47.6% of the time on 220 tasks including designing a sales brochure for a property and assessing images of a skin lesion. OpenAI also found that the performance of its models has increased substantially in a short space of time, more than doubling in their “win rate” against humans between June 2024 and Sept. 2025.

As model capability has grown, so have the complexity of the tasks that they’re being tested on and the human skill needed to create sufficiently challenging tasks. Earlier tests measured relatively abstract capabilities on reasoning puzzles and exam-style questions. Benchmarks before the 2022 release of ChatGPT often sourced data from crowdworker services, which paid workers a few dollars an hour. By 2023, Ph.D. students were being asked to create challenging multiple-choice questions in biology, physics and chemistry. In September, xAI reportedly laid off 500 of its “generalist” data workers as part of an “expansion and prioritization” of the company’s “specialist” data workers. To be sure, low-paid data workers still contribute to the development of AI models, but the upper bound of skill and compensation needed to develop AI benchmarks is increasing rapidly.

Directly measuring the utility of AI models on economically valuable tasks is “very hard to pull off,” says Nitski. The success criteria in domains such as finance and consulting are harder to define than, for example, in software engineering. Even with the perfect criteria in hand, marking an AI’s output on a large scale is harder than in software engineering, where automated tests can check whether a piece of code runs correctly. This explains, in part, why tests aiming to measure the real-world utility of AI models have existed for software engineering since at least 2023, but have lagged in other white-collar domains. However, as AIs have improved, they have helped solve the problem of grading complex tasks. The success criteria for Mercor’s tasks are written by human experts, but the marking is done by AIs, which Mercor says agreed with human graders 89% of the time, helping to scale the evaluations.

Developing benchmarks isn’t just about knowing how good models are. In AI, as in business, “what gets measured gets done”—good tests often precipitate AI progress on those tests. “It’s ultimately the same data type for both evaluation and training,” says Foody. Evaluating performance in games such as Go is straightforward; AI was beating Go masters by 2016. In 2023, benchmarks began evaluating AIs on real-world tasks in software engineering. Two years later, the labor statistics for junior programmers look dubious.

“AI got its Ph.D.,” says Foody. “Now it’s starting to enter the job market.”

Selected Articles

Utah lawmaker wants to rename Harvey Milk Blvd. in honor of Charlie Kirk

Post author By Mike Nova
Post date October 1, 2025

The “vast majority of Utahns” would say Harvey Milk has no “connection to Utah whatsoever” — “but Charlie Kirk does,” the Republican said.

Selected Articles

Photos show the unique watch collections at Rolliefest, an invite-only gathering of watch enthusiasts

Post author By Mike Nova
Post date October 1, 2025

Luxury watches were on full display during Rolliefest.

Troy Barmore

Watch enthusiasts gathered in New York City for the biennial Rolliefest event.
This year’s Rolliefest featured over 200 collectors and $25 million in watches, one attendee said.
The event showcased both if-you-know-you-know brands and iconic ones like Rolex, Cartier, and Hermès.

Watch enthusiasts descended on New York City in September for an event that attendees have dubbed the Super Bowl of watches, and they brought their collections with them.

Rolliefest, which has taken place every other year since 2019, is a place for collectors to gather, display rare finds, and geek out over their expensive hobby of buying watches. It was founded by Geoff Hess, the global head of watches at Sotheby’s.

The invitation-only event attracts watch collectors from around the world. This year, a ticket cost $1,600. Rolliefest showcased hundreds of watches from more than 200 collectors, and hosted gatherings at the Aspire at One World Observatory, the Metropolitan Museum of Art, and the Waldorf Astoria.

“I’d estimate easily $25 million worth of vintage watches were displayed on the table over a lunch of chicken and waffles,” Joshua Ganjei, who attended Rolliefest and is the CEO of the marketplace European Watch Company, told Business Insider.

The trove of timepieces included more obscure brands that are only easily recognizable to the experienced eye and some of the most iconic labels, like Rolex, Cartier, and Hermès. Eager collectors had the chance to view rare and highly coveted watches, including a Rolex with a dragon dial, a Patek Philippe Triple Calendar Moonphase, and a colorful Hermès Arceau Les folies du ciel.

“It was what I imagine Comic Con is like, but for all of us watch freaks,” Ganjei said.

These are some of the other timepieces that Ganjei and his fellow watch lovers saw at Rolliefest this year.

The Rolex GMT-Master and Submariner were common sights.

The Rolex GMT-Master and Submariner were photographed in several collections and spotted on many wrists at Rolliefest.

Rolex’s Submariner is a classic dive watch — a piece built to withstand underwater pressure — valued for its rarity and timeless style. It is one of the most sought-after steel sports models, watch dealer Paul Altieri previously told Business Insider. A Rolex Submariner Ref. 6536-1, or “Small Crown,” which was seen on an attendee’s wrist, is listed for over $56,000 on watch marketplace Chrono24.

The GMT-Master II is known for its adaptability, making it suitable for both business and casual looks. Several collectors at Rolliefest displayed cases of the watch in its famous color variations. The GMT-Master II had a resale value of $20,595 as of July, according to data from marketplace Bob’s Watches.

Omega is a watch brand popular with billionaire Jeff Bezos.

An Omega Speedmaster Professional Alaska — named after NASA’s Alaska Project — was showcased at Rolliefest this year. Similar models go for around $20,000 on Chrono24, though watch site Hodinkee reported that rare ones can fetch much more.

Omega, the Swiss luxury watchmaker founded in 1848, earned its place in history when Apollo 13 astronauts used the Speedmaster to help navigate their safe return to Earth. More recently, the Omega Speedmaster went on a Blue Origin mission on the wrist of billionaire Jeff Bezos.

Patek Philippe got attention at Rolliefest.

Patek Philippe is a family-owned watchmaker that has been around for nearly 200 years. It’s considered one of the most prestigious brands in the industry.

One Rolliefest invitee was seen wearing a diamond-set Patek Philippe Ellipse. Another variation of the piece sold for around $64,000 at Sotheby’s in July.

Longines offers Swiss watches at a lower entry-level price than others on display.

Unlike some of the other watches on display at Rolliefest, the Longines models photographed appeared modest — devoid of features like diamond settings and gold accents.

Since the 19th century, Longines has stood for Swiss innovation. It most notably debuted the first wristwatch with a rotatable bezel. Longines’ entry-level price point is below that of fellow Swiss brands Rolex and Omega, starting at less than $1,000 as of September.

Piaget is known for using precious materials in its watchmaking.

A Piaget “Warhol” with a Tiger’s Eye dial, a recently released option, was spotted during Rolliefest. A similar version is for sale on the Piaget website for $55,000.

Piaget, founded in 1874 in Switzerland, is known for its ultra-thin watch movements and blending technical innovation with high jewelry production.

Hermès isn’t just known for exclusive handbags.

One watch enthusiast at Rolliefest paired their engraved watch with another famous Hermès product: an exclusive Kelly handbag. The watch — a rare Hermès Arceau Les folies du ciel — was first introduced at the international luxury watch showcase Watches and Wonders in 2022.

Its enamel dial is inspired by Hermès artist Loïc Dubigeon’s scarf motif, designed in 1984.

Cartier is a brand worn by royals, powerful executives, and celebrities.

Founded in Paris in 1847, Cartier has grown into one of the world’s most renowned names in jewelry and watchmaking. Though there weren’t many Cartiers photographed at the event, the brand remains a popular staple among collectors and public figures, like Taylor Swift.

A similar model to the Cartier Tank Americaine Dual Time Zone shown at Rolliefest is listed for nearly $24,000 on luxury watch marketplace Chrono24.

Read the original article on Business Insider

Selected Articles

Meghan Markle’s dad trapped in massive Philippines earthquake, sister blasts ‘evil’ duchess for turning blind eye

Post author By Mike Nova
Post date October 1, 2025

“Shame on my disgusting evil f–king sister forever putting our father in this position. I hope she is cursed,” Samantha Markle tweeted Tuesday.