It was mid-October, peak leaf-peeping season in Hanover, New Hampshire, and Chad Markey was on a rare break between clinical rotations during his last year of medical school. He should have been inhaling Green Mountain air and gossiping with his Dartmouth classmates about life after graduation. In a few months, they’d all be going their separate ways to start residency training at hospitals around the country.
Instead, Markey was alone in his apartment, deep down a rabbit hole, preparing to go to war.
He’d wake each morning, eat breakfast, open his laptop at the kitchen table or settle into the tan armchair with the good back support, and start coding. Some days, he wouldn’t notice the sun had gone down until one of his roommates came home and asked why the lights weren’t on.
For days, Markey had been scrolling through a Discord group about medical residency, a font of crowdsourced knowledge where students report back to their peers on every stage of the application and selection process. He’d watched as other students, lots of them, posted about the interview invitations they’d received.
Markey didn’t have any interview offers, only outright rejections. That seemed not just odd but wrong to the quiet-mannered 33-year-old from Houston, Texas, who speaks confidently about his accomplishments without bragging. He had good grades from an Ivy League medical school, author credits on articles in the Journal of the American Medical Association and The Lancet, a heart-wrenching personal statement, and glowing letters of recommendation. One professor wrote that they had “never met a medical student who is more skillful, talented, and appropriately situated in his pursuit of the field of medicine than Chad.”
Markey combed through his application looking for a fatal flaw. He didn’t find anything he thought would prompt a residency program director to toss an otherwise competitive application, so his suspicion turned to another culprit. He’d heard rumblings that some hospitals were using a free AI screening tool to help process applications—and that it had been displaying incorrect grades for some students. He began to wonder whether AI was responsible for his lack of interview offers.
On the first page of his Medical Student Performance Evaluation, a comprehensive summary of his early career prepared by his school, Markey spotted language that he suspected might trigger an automated screening tool to downgrade his application. The MSPE stated that Markey had “voluntarily” taken three separate leaves of absence, totaling about 22 months, and had chosen to extend his third year of coursework over two years for “personal reasons.”
That wasn’t quite true. In 2021, Markey was diagnosed with ankylosing spondylitis, an autoimmune disease that affects the spine and could flare up to the point where he couldn’t stand, much less do the intensive physical work expected of medical students during clinical rotations. He was on track to graduate from medical school in seven years, rather than the typical four, but his absences had been unavoidable and medically necessary. This was explained in a narrative paragraph on the first page. Calling the absences “voluntary,” Markey felt, might be interpreted as evidence that he had succumbed to the pressure of medical school and not been able to keep up with his studies.
As the days went on, Markey said, he felt increasingly afraid that his years of training would end in failure. “I crawled out of a fucking black hole,” he told WIRED, referring to his diagnosis. “I could not walk for six months. I’ve come this far, and this is happening?” He was asking himself the same question that pops into the minds of millions of other job seekers every day: Did an AI trash my application?
Even recruiters will admit it’s fair to wonder. The CEO of a hiring platform said last fall that his industry is in “an AI doom loop”: HR departments complain of a wave of AI-generated job applications, prompting the need for more AI filters. Applicants complain they’re getting unfairly filtered out. Some fight AI with AI, filling their résumés and cover letters with buzzwords. “It feels very dystopian to me,” one job seeker told researchers from Northeastern University. “My worthiness as a human and as an employee, as a worker, is based on my ability to filter myself through a series of automated gateways.”
Only a handful of states have regulated the use of AI screening tools to make hiring decisions. Laws in Illinois, New Jersey, and Colorado (not yet in effect) prohibit employers from using discriminatory tools, but mandate little in the way of transparency beyond requiring employers to notify applicants that AI is being used. California’s regulations are more robust, requiring employers to regularly test their AI hiring tools for bias. But none of those rules empower an individual to understand how a particular AI hiring tool judged them, or whether it discriminated against them.
So Markey went to work on an impossible task. He would spend the next six months writing emails, research papers, legal requests, and a constant stream of Python code, trying to peer inside the AI screener. “It turned into obsession,” Markey told WIRED in February. “I don’t think I’ve ever been this upset before in my life.”
Markey’s first medical training came in high school, when he sorted through the gallon ziplock bag where his father kept his prescription medications, recorded the names, and went to the local community college library to research their purposes. His dad was bipolar and addicted to alcohol, a charismatic, unpredictable ball of energy capable of showing great love and causing great pain.
One Christmas, which is also Markey’s birthday, his father didn’t show up because he’d been arrested for drunk driving. Another Christmas, Markey looked out the front window to find his truck being repossessed because his father had put it up as collateral for a payday loan. While Markey was away at college on Pell Grants, his family was forced to declare bankruptcy and lost their house. When he was 21, his father died.
Markey can recall the moment he became interested in pursuing psychiatry. It was when his father explained why he started drinking so heavily: In manic periods he would go days without sleeping, and the only thing that could force his eyes closed was a fifth of vodka. “It’s just so sad to think if I said, ‘Hey, let’s go to a psychiatrist and get a low-dose Seroquel prescription and just have you sleep and address some of your mania,’ like who knows what would happen?”
Markey had been preparing for a career on Wall Street. But after that conversation with his dad, he took a job in health care informatics and made plans to go to medical school. The summer before he started at Dartmouth in 2019, the stiffness he’d experienced in his back since he was a teenager grew worse and his pelvis began to feel like a cement block. By the end of his second year of school, Markey was laid flat by ankylosing spondylitis. He took a leave of absence, going from doctor to doctor seeking treatments that would allow him to continue with school.
During that same time, the Covid-19 pandemic was roiling the medical profession. Among myriad challenges, hospitals saw a massive increase in the number of applications for their residency programs. Prior to the pandemic, students typically had to travel to each hospital for interviews. When interviews went virtual, they could apply to dozens more programs than before. Markey applied to 82.
That surge has made it harder for hospitals to sort through and prioritize applications. In 2023, the Association of American Medical Colleges (AAMC) announced a partnership with Thalamus, the maker of a screening tool for residency applications called Cortex. Starting in 2025, the tool would be free to use for residency programs.
A handful of hospitals had already been working with Cortex, which displays application documents in an easily digestible dashboard and allows reviewers to search by keyword or filter applicants based on a wide variety of characteristics. Cortex also uses fine-tuned versions of OpenAI’s generative models to standardize grades between schools with different practices. The AAMC partnership opened the door to broader adoption of the tool. According to Thalamus, about 1,500 residency programs around the country, or 30 percent, used Cortex to review applicants and make selection decisions during the 2025–2026 cycle.
Issues emerged within weeks of the September 2025 deadline when hospitals started reviewing applications. The company issued a statement saying some residency programs had reported that Cortex was displaying inaccurate grades for some people. In places like Markey’s Discord group, the applicants chattered.
As Markey’s anxiety about his lack of interviews was peaking, he got an exciting bit of news: A research abstract he’d submitted was accepted to be presented at the American Society of Hematology’s upcoming annual meeting and simultaneously published in the journal Blood. What happened next deepened Markey’s belief that AI systems, rather than humans, were responsible for his diminishing chances at getting into a residency program.
Markey already had 10 publications in medical journals on his résumé, but he began emailing his top-ranked residency programs to share the update about this latest accomplishment. The shift in his fortunes was immediate, he said.
Within an hour and 15 minutes of his first email to a residency program coordinator at one of the top psychiatry programs in the country, Markey received an exuberant response from the coordinator’s boss. An interview offer followed less than an hour later, and more began to come in from Markey’s other top choices too.
To Markey, it appeared to be “the first time they were seeing an application that hadn’t even come across their desk.” As he saw it at the time, “I was getting rejections because they had already filled up the top hundred slots based on the top hundred candidates that appear on the dashboard.”
Just a couple days after Markey’s epiphany, on October 16, Thalamus published a follow-up blog post about the previously reported issues with Cortex. The company said it had indeed documented inaccuracies in grades displayed to residency programs—but only in 10 verified instances out of more than 4,000 customer inquiries. Cortex was now “99.3% accurate.”
Thalamus later told WIRED that the company received no additional reports of inaccuracies out of more than 12,000 inquiries. But at the time, a lack of clarity around how Cortex employed AI sparked forum posts and journal articles. Steven Pletcher, a head and neck surgeon who oversees the otolaryngology residency program at the University of California San Francisco Hospital, told WIRED he heard from a colleague at another institution that some of the grades Cortex was displaying were “wildly inaccurate.” Pletcher, who also conducts research into residency selection processes, wanted to investigate the platform himself.
“As a program director, when you hear, ‘Hey we have this AI system for reviewing applications,’ you think, can I just get it to give me a list of applicants that I should interview?” Pletcher told WIRED. “I had some concerns, I think as anyone would, if there’s a new system for reviewing applications and it’s presenting information inaccurately.”
At a national meeting of the Society of University Otolaryngologists in November, Pletcher sat down with a colleague and reviewed applications in Cortex. One of the system’s primary functions is its AI grade-normalization tool, which displays an applicant’s grades on distribution graphs. From what Pletcher was seeing, the grades displayed for a given applicant on those graphs could change from minute to minute.
Pletcher and four of his colleagues conducted a structured test and documented the errors they found. In January of this year, they published their results in the journal The Laryngoscope, describing “persistent errors in the Thalamus Cortex system with potential to negatively impact residency applicants and programs.”
Jason Reminick, the CEO of Thalamus, told WIRED that many of the fears about Cortex expressed by students and medical schools in the 2025–2026 cycle were the result of misunderstandings about how the tool works. “A lot of the community suddenly had access to this and were playing with the tool without really going through the buying process,” he said. “And I don’t just mean the physical paying of money, I mean the exploratory process of understanding what the tool does.”
Reminick told WIRED that besides an email from Pletcher, Thalamus received no other complaints about the grades displayed for students changing from minute to minute. He said the error was caused by the user moving too quickly between grade distribution graphs, resulting in the display briefly getting stuck. “This would not have affected any applicant’s overall outcome” in the residency selection process, Reminick said. Thalamus requested that The Laryngoscope retract the article. The journal, which did not respond to WIRED’s request for comment, has not done so.
As the day approached when med students would learn where they’d matched, Markey’s own concerns about Cortex weren’t going anywhere. In February, he reached out to Thalamus customer support to ask whether Cortex used information about leaves of absence to score candidates. “Whether anything affects an ‘automatic score’ or ordering depends on what that specific program has chosen to use for sorting/filtering,” a Thalamus employee replied. “Programs can use different workflows and criteria, and we don’t want to imply that one field (like [leave of absence] type) is universally used as a scoring input everywhere.”
In a later statement to WIRED, Thalamus offered a clarification about Cortex’s use of AI. “We understand that there is a large segment of our community understandably nervous about how quickly AI products are being rolled out and incorporated into every facet of society—including sensitive use cases like medical students applying to residency programs,” the statement said. The company said its approach has been transparent and cautious, but that “putting more emphasis on the limited AI tools would have been helpful to prevent misunderstandings about how AI was being used.” According to Thalamus, “Not only is Cortex not a decision-making tool, it does not use AI to sort, filter, exclude, score, or rank applicants.”
Of course, Markey hadn’t heard any of that from Thalamus. As Match Day approached, all he had to go on was the February email he’d received, which he interpreted as indicating that “scoring” was at work. He still sensed AI bias—and wanted to ferret it out.
Even for professional auditors with direct access to screening algorithms, it can be impossible to understand why an algorithm reached a particular conclusion, said Shea Brown, CEO of the auditing firm Babl AI. When a system runs on an LLM, it naturally has “a very opaque reasoning core at the center, and any kind of explainability about where it made a decision is hidden,” he told WIRED. The only way to test for discrimination is in aggregate: Does the tool, for example, give measurably lower scores to equally qualified candidates with disabilities? “It can’t be done causally based on a single person’s application,” Brown said.
The best a person can do in a situation like Markey’s, where he suspected an AI system was picking up on specific language in his MSPE, is to test how an application performs with and without that language. That’s where Markey started.
First, he ran three versions of his MSPE with slightly different language through a suite of AI fairness- and bias-testing tools that the AAMC recommends. The results indicated that a natural language processing algorithm might assess a sentence describing a leave of absence for “personal reasons” differently than a sentence that specified the leave was for a “medical condition,” but Markey didn’t like that the sample size was small and the test lacked context.
Next, he ran two versions of MSPE leave-of-absence language through VADER, an open-source natural language processing model that assigns emotional sentiment values to words and phrases, and found that a medically accurate description of his leaves of absence received a more positive sentiment score than the “personal reasons” language in his MSPE. He then used Python to create a synthetic dataset of 6,000 residency applicants. Each one was assigned test scores, grades, a count of how many publications they had on their résumé, and numeric rankings for how strong their letters of recommendation were and how well-suited they were for academic research. Markey then divided them into two cohorts—one with sentiment analysis scores reflecting the leave-of-absence language in his MSPE and the other with scores reflecting medically accurate language.
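In rough terms, that first comparison looks something like the sketch below, which uses the open-source vaderSentiment library. The two sentences are illustrative stand-ins, not Markey’s actual MSPE wording, and the exact phrasing he tested is an assumption here.

```python
# Minimal sketch of a VADER sentiment comparison between two phrasings of a
# leave of absence. The sentences are placeholders, not the real MSPE text.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

personal_reasons = (
    "The student voluntarily took a leave of absence for personal reasons."
)
medically_accurate = (
    "The student took a medically necessary leave of absence to treat a "
    "chronic autoimmune condition and returned in good standing."
)

for label, text in [("personal reasons", personal_reasons),
                    ("medically accurate", medically_accurate)]:
    scores = analyzer.polarity_scores(text)  # returns neg/neu/pos/compound
    print(f"{label}: compound sentiment = {scores['compound']:+.3f}")
```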
The two groups were equally qualified, in terms of grades, test scores, and other characteristics. But when Markey ran the synthetic applicants through a logistic regression model trained to select the top 12 percent of applicants, those from the cohort with medically accurate MSPE language were 66 percent more likely to make the cut. Still, like his first test, this only shed light on how a generic algorithm might assess his application. Markey wanted to understand Thalamus’ tools.
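A disparity test of that general shape can be sketched in a few dozen lines. Everything below is a simplified assumption for illustration, from the feature construction to the way sentiment feeds into the training label; it is not Markey’s actual pipeline, only the kind of experiment the article describes.

```python
# Simplified sketch of a cohort disparity test: equally qualified synthetic
# applicants, differing only in a sentiment score tied to leave-of-absence
# language, run through a logistic regression that keeps the top ~12 percent.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 6000

# Equally qualified applicants: identical distributions of scores and grades.
X = np.column_stack([
    rng.normal(240, 15, n),   # licensing exam score
    rng.normal(3.6, 0.3, n),  # grade metric
    rng.poisson(3, n),        # publication count
    rng.uniform(1, 5, n),     # letter-of-recommendation strength
])

# Half the cohort gets the lower "personal reasons" sentiment score, half the
# higher medically accurate one (the values are placeholders).
cohort = rng.integers(0, 2, n)            # 0 = personal reasons, 1 = medical
sentiment = np.where(cohort == 1, 0.4, -0.1) + rng.normal(0, 0.05, n)
X_full = np.column_stack([X, sentiment])

# Hypothetical historical label: top 12% by a composite that includes
# sentiment, standing in for past selection decisions the model learns from.
composite = (X - X.mean(0)) / X.std(0)
ranking = composite.sum(1) + 2.0 * sentiment
y = (ranking >= np.quantile(ranking, 0.88)).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_full, y)
selected = model.predict_proba(X_full)[:, 1]
picked = selected >= np.quantile(selected, 0.88)  # keep roughly the top 12%

for name, mask in [("personal reasons", cohort == 0),
                   ("medically accurate", cohort == 1)]:
    print(f"{name}: selection rate = {picked[mask].mean():.1%}")
```

The point of a test like this is not any single applicant’s fate but the gap between the two selection rates, which is the only kind of evidence an outside auditor can realistically produce.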
He tracked down the patent for an AI residency application screener built by the company Medicratic. Thalamus acquired Medicratic in 2025. Patents describe what a system may do, not necessarily what it does do, but it was the clearest explanation Markey could find of what might be happening inside the black box.
With the help of GitHub Copilot and eventually Anthropic’s newly released Claude Code tool, Markey began to reverse engineer the system described in the Medicratic patent, mirroring the data pipeline and using the same open-source modules when he could. When necessary, he substituted Claude Code’s advice and his own research. For example, before the system described in the patent can score applications, a residency program must indicate which characteristics—such as academic performance, professionalism, or leadership—it values most. Markey reviewed published research on residency selection and surveys of residency directors to determine how to weight those features.
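The weighting step he had to improvise can be pictured as something like the following. The feature names and weights are invented placeholders drawn from the categories the article mentions; they come from neither the Medicratic patent nor Markey’s code.

```python
# Placeholder sketch of a weighted scoring step of the kind the patent
# describes: a program sets weights for the traits it values, and each
# applicant receives a composite score. All values here are illustrative.
from dataclasses import dataclass

# Hypothetical program preferences, e.g. informed by surveys of residency
# directors (weights sum to 1.0).
WEIGHTS = {
    "academic_performance": 0.35,
    "research": 0.20,
    "professionalism": 0.25,
    "leadership": 0.10,
    "mspe_sentiment": 0.10,  # where leave-of-absence language could enter
}

@dataclass
class Applicant:
    name: str
    features: dict  # each feature normalized to a 0-1 scale

def composite_score(a: Applicant) -> float:
    """Weighted sum of normalized features, per the program's priorities."""
    return sum(w * a.features.get(k, 0.0) for k, w in WEIGHTS.items())

applicants = [
    Applicant("A", {"academic_performance": 0.9, "research": 0.8,
                    "professionalism": 0.85, "leadership": 0.7,
                    "mspe_sentiment": 0.30}),  # "personal reasons" phrasing
    Applicant("B", {"academic_performance": 0.9, "research": 0.8,
                    "professionalism": 0.85, "leadership": 0.7,
                    "mspe_sentiment": 0.75}),  # medically accurate phrasing
]

for a in sorted(applicants, key=composite_score, reverse=True):
    print(a.name, round(composite_score(a), 3))
```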
Markey finished his system a few weeks before Match Day, March 20. He thought its outline and general features approximated how a tool like the one described in the Medicratic patent might process the same inputs. After more than four months dissecting various algorithms, it was the best he could do. Once again, when he ran different versions of his MSPE language through the system, there were starkly different results: Changing the wording about his leave of absence from “personal reasons” to a medically accurate description resulted in a significantly higher score.
That month, Markey sent Thalamus a data access request, under the New Hampshire Privacy Act, asking for all the personal data the company held about him. That included a comprehensive accounting of every document and data point that was input into Thalamus’ systems about him; every preference parameter, weight, and scoring configuration applied to his application by residency programs; every score, attribute rating, and sentiment analysis calculated by Thalamus based on that data; and explanations of whether and how his data was processed to mitigate bias. Under the New Hampshire Privacy Act, the company had 45 days to respond.
WIRED contacted all of the residency programs Markey applied to and asked about their use of Cortex. Most didn’t respond or declined to comment. Five programs replied that they hadn’t used the tool. Yale New Haven Health told WIRED that its residency programs tried Cortex but stopped using it; a spokesperson declined to comment further. Two residency programs at Dartmouth Hitchcock Medical Center used Cortex to filter applications before program directors reviewed them, said Tennille Doyle, manager of graduate medical education programs, but most of the hospital’s staff preferred to use their own screening methods.
Jeremy Walter, director of media relations at Temple Health, said one of the hospital’s 59 residency programs used Cortex primarily to view applications during “manual screening,” and “overall, we did not find the AI information very reliable.” He declined to elaborate. According to Thalamus, multiple programs at Temple used Cortex during the recent selection cycle. “As with any new functionality, especially when introduced at scale, experiences can vary based on how features are used and interpreted,” the company said.
Kari Roberts, who oversees graduate medical education at Tufts Medical Center, told WIRED in an email that many of the school’s residency programs tried Cortex for the first time last fall, using it to screen out any applications that were incomplete or failed to meet minimum requirements. “There were some significant errors in the algorithm that incorporated data from the MSPE, leading to wrong grade assignments,” Roberts wrote. “This was not exclusive to our organization and was raised to the Thalamus team in real time by our dean’s team.” Thalamus told WIRED that “a very small number of identified discrepancies” were “investigated and corrected promptly” and that “in some of these cases, what was initially perceived as an inaccuracy was confirmed to be consistent with the source materials.”
After Markey began cold-emailing program coordinators, he received interview offers from 10 institutions, including some of the most prestigious hospitals in the country. Ultimately he matched at Columbia University’s psychiatry program at New York Presbyterian Hospital, where he will begin his residency in July.
Three days after he got matched, Markey received a response from Thalamus to his data access request. The company’s chief of staff, Michele Li, wrote that none of the programs he had applied to had used the Medicratic tool that Markey had been attempting to reverse engineer. Cortex itself didn’t use the sentiment-scoring methodology described in the patent.
Reminick, Thalamus’ CEO, confirmed to WIRED that during the 2025–2026 cycle, Cortex did not algorithmically score or rank applicants. The tool primarily uses AI for grade normalization and to display a badge indicating whether an applicant is interested in academic research, he said. However, Thalamus plans to pilot an AI screener that will allow residency programs to create candidate profiles and then assess how well applicants match those profiles, Reminick said. During the pilot, applicants will have to opt in to the screening.
Even after matching at Columbia and receiving the letter from Thalamus denying his suspicions about his own applications, Markey said he doesn’t regret the months he devoted to unpacking screening tools. “I’m very grateful for where I’ve gotten, so when things threaten that, I want to make sure I’m responding correctly,” he said. In fact, he has continued his investigation of how large language models pick up on semantic signals in job application material and embed them down the pipeline into decisions or recommendations.
There is proof, even in the world of AI hiring tools, that some form of due process, however imperfect, can be built and regulated into these systems. One of the most popular applications of AI in human resources is to conduct background checks. Companies like Checkr automate the process for millions of applications monthly, comparing candidate names against public records for any evidence of disqualifying criminal activity. A lot of the time, these systems make mistakes that cost people jobs.
But background-check companies, whether they use humans or AI, are subject to provisions in the federal Fair Credit Reporting Act that require them to share the results of a background check with the job candidate upon request, conduct an investigation if the accuracy of the background check is disputed, and send the job candidate the written results of that investigation. Job candidates can win or settle individual and class action lawsuits against background-check companies that provide inaccurate reports.
It’s a system with many of its own problems, but it at least offers individual job seekers an option other than screaming helplessly into the void. Not everyone should need to be an Ivy League medical student with a background in informatics and coding and a massive axe to grind.