Introducing ICML 2026 policy for LLMs in reviews

By ICML 2026 Program Chairs Alekh Agarwal, Miroslav Dudik, Sharon Li, Martin Jaggi; Integrity Chair Weijie Su; and Associate Integrity Chair Buxin Su.
Posted December 10, 2025.

The use of LLMs in peer review has provoked a wide range of reactions among researchers. Some see it as an opportunity to improve peer review; others see it as a grave danger to the future of academia. To address these varied attitudes, we are trying something new at ICML 2026: a two-policy framework for LLM use in reviewing.

We introduce the following two policies:

  • Policy A (Conservative):
    Use of LLMs for reviewing is strictly prohibited.
     
  • Policy B (Permissive):
    Allowed: Using LLMs to help understand the paper and related work, and to polish reviews. Submissions can be fed to privacy-compliant* LLMs.
    Not allowed: Asking LLMs about strengths/weaknesses, asking them to suggest key points for the review, to suggest an outline for the review, or to write the full review.

    *By “privacy-compliant”, we refer to LLM tools that do not use logged data for training and that place limits on data retention. This includes enterprise/institutional subscriptions to LLM APIs, consumer subscriptions with an explicit opt-out from training, and self-hosted LLMs. (We understand that this is an oversimplification.)

Under both policies, reviewers are always responsible for the full content of their reviews.

Reviewers declare which policy they want to follow, and authors declare whether they require their papers to be reviewed under Policy A, or allow them to be reviewed under Policy B. Any reviewer who is an author on a paper that requires Policy A must also be willing to follow Policy A. Submissions are matched with compatible reviewers. Each reviewer is told which policy to follow on all of their assigned papers. For full details, see ICML 2026 Policy for LLM use in reviewing.
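
For concreteness, the sketch below (a simplified reading, written in Python; the names and helpers are hypothetical and are not the actual conference tooling) illustrates the compatibility rule: papers that require Policy A are matched only with reviewers willing to follow Policy A, while papers that allow Policy B can be matched with anyone.

```python
# Hypothetical sketch of the compatibility rule described above; "A" = conservative,
# "B" = permissive. Field names and helpers are ours, not the actual ICML tooling.

def is_compatible(paper_policy: str, reviewer_policy: str) -> bool:
    """A paper that requires Policy A may only be matched with reviewers willing to
    follow Policy A; a paper that allows Policy B may be matched with any reviewer."""
    return reviewer_policy == "A" if paper_policy == "A" else True

def policy_to_follow(paper_policy: str, reviewer_policy: str) -> str:
    """One possible reading: the reviewer follows Policy A whenever either side
    asks for it, and Policy B only when both sides allow it."""
    return "A" if "A" in (paper_policy, reviewer_policy) else "B"

papers = {"paper1": "A", "paper2": "B"}    # declared by authors
reviewers = {"alice": "A", "bob": "B"}     # declared by reviewers

for paper, p_policy in papers.items():
    eligible = [r for r, r_policy in reviewers.items() if is_compatible(p_policy, r_policy)]
    print(paper, "->", eligible)
# paper1 -> ['alice']
# paper2 -> ['alice', 'bob']
```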
 

Where does this framework come from?

The initial idea arose in discussions among program chairs and several senior community members. Program chairs and the integrity chair then conducted two surveys of past reviewers. The survey results were used to inform the final design.

At the start of the whole process, program chairs and position paper chairs met to align on high-level goals regarding LLM use. We agreed that the top priority is to ensure that the evaluation of papers and the composition of reviews are not delegated to LLMs. However, we were open to the use of LLMs to help understand the paper and polish the final reviews. While we were deliberating on the LLM policy, several senior community members reached out to us. Some were concerned about the lack of realism of previous conference policies that entirely ban LLMs, while others emphasized the damage done to the peer-review process when reviewers delegate reviewing to LLMs, which both lowers review quality and undermines authors' trust in the process.

Based on these conversations, we formulated the following goals:

  • Retain human assessment as a crucial part of the reviewing process.
  • Meet reviewers where they are in their use of AI systems to do their work.
  • Respect authors who do not want to have their papers fed to LLMs.
  • Uphold the integrity and transparency of the peer-review process.

These seemingly contradictory goals led us to our current design. Instead of adopting a uniform reviewing policy, which would alienate large portions of the community regardless of what we chose, we decided to have two policies: one more conservative, similar to the previous ICML policy, and one more permissive, allowing the use of LLMs in a way that still keeps human judgment at the center. We then just needed to fill in the details of the two policies.
 

Community surveys

The two policies were designed using two community surveys of ICML 2025 reviewers (many of whom were also authors, because of the ICML 2025 reciprocal reviewing policy); both surveys were anonymous and were conducted in November 2025.

The first survey was sent to 1,100 randomly chosen reviewers from that pool; we heard back from 150. We asked reviewers to state their preferences between two policies. The proposed conservative policy (less conservative than Policy A above) allowed only limited use of LLMs and did not allow the input of submissions into LLMs. The permissive policy (similar to Policy B above) allowed the input of submissions into LLMs. After collecting the responses, we noted that several respondents asked why we were not considering a policy that bans LLMs altogether. Hence, we decided to run a second survey.

In the second survey, we reached out to a disjoint set of 500 randomly chosen past reviewers from the same pool; we heard back from 74. In this case, we considered the following two policies (similar to our final versions above):

  • Policy A (Conservative): Use of LLMs for reviewing is strictly prohibited.
  • Policy B (Permissive): Reviewers may input the submission text into privacy-compliant LLMs. However, the assessment of the paper and the writing of the review must not be delegated to LLMs.

We asked the respondents which policy they would prefer as reviewers and as authors (with a reminder of the reciprocity requirement: that any reviewer who is also an author of a paper that requires Policy A must be willing to follow Policy A):

These results suggest that the community is quite evenly divided (even accounting for large error bars!). As reviewers, ∼40% of respondents strongly prefer Policy A, and ∼30% of respondents strongly prefer Policy B. As authors, ∼50% of respondents require Policy A, and the remaining ∼50% allow Policy B.
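
As a rough illustration of how wide those error bars are (our own back-of-the-envelope calculation, not part of the survey analysis), a 95% Wilson interval around the ∼40% figure from the 74 responses of the second survey spans more than 20 percentage points:

```python
from math import sqrt

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (k successes out of n)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Roughly 40% of the 74 respondents in the second survey strongly preferred Policy A.
lo, hi = wilson_interval(round(0.40 * 74), 74)
print(f"~40% of 74 respondents -> 95% CI [{lo:.0%}, {hi:.0%}]")  # about [30%, 52%]
```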

We were also curious about the risk of non-compliance if we asked reviewers who prefer Policy B to review under Policy A:

These results should be taken with a giant grain of salt (the questions involve a hypothetical, and respondents may wish to present themselves as more compliant than they would actually be, and may also be more likely to comply than non-respondents). Nonetheless, they suggest that non-compliance would be non-negligible, but not rampant. They also suggest that a non-trivial fraction of our reviewers would be hampered if everyone were asked to follow Policy A. This means that a uniform LLM ban is likely not the right approach for engaging a community of volunteer reviewers.

One of the remaining design considerations was which activities we should allow under Policy B. We were clear that we did not want to delegate paper judgment and critique to LLMs. We were much less sure whether we should allow the use of LLMs for polishing the reviews. This decision has non-trivial enforcement implications: once we allow the use of LLMs for polishing reviews, we can no longer rely on certain LLM-detection tools to flag non-compliance, so permitting such use should not be done lightly.

To help with these decisions, we asked in both surveys how likely the respondents are to use LLMs for various activities:

The top activity, by far, was “Polishing reviewer-written text,” with ∼70% of respondents saying they are likely or very likely to use LLMs for this task. The other two top activities were “Concept/background clarification” and “Related-work search.”

We also dug into the use of LLMs as writing assistants in a separate question:

The answers highlight that ∼40% of respondents would be hampered, and might opt out of reviewing altogether, if they were not able to use LLMs to help with writing.
 

Final policy design

Since large portions of both reviewers and authors strongly prefer one policy or the other, we decided to implement the dual-policy framework. Key evidence (from the survey results above):

  • ∼30% of reviewers strongly prefer using LLMs,
  • ∼40% of reviewers strongly prefer no LLMs,
  • ∼50% of authors allow LLMs.

Since a large fraction of reviewers already use LLMs when writing, and a large fraction would be hampered by a ban on such use, we decided to allow the use of LLMs to polish reviews. Key evidence (from the survey results above):

  • ∼70% of reviewers say they are likely to use LLMs for polishing,
  • ∼40% of reviewers would be hampered by a ban.

Besides “Polishing reviewer-written text,” the other two top activities mentioned by the survey respondents (“Concept/background clarification” and “Related-work search”) are also allowed by Policy B. The next most common activity (“Summarization”) is not allowed, because it runs counter to our goal of retaining human assessment as a crucial part of the reviewing process.
 

Implementation issues

There are many ways in which our well-intentioned design could go wrong.

First, we need to ensure that there is a sufficient number of reviewers who are willing to follow Policy A. We do this by extending our existing reciprocal reviewing policy. That policy includes the requirement that each submission must nominate one of its authors as a reviewer. We augment it to require that any reviewer who is an author on a paper that requires Policy A must also be willing to follow Policy A. Our survey suggests that this leads to a sufficient number of reviewers willing to follow Policy A. Moreover, this reciprocal requirement is universally viewed as fair:

We hope that this means that reviewers will be more likely to buy into this policy (which will improve compliance).
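
To make the reciprocity requirement concrete, here is a minimal validation sketch (in Python, with hypothetical field names; not the actual conference system): it flags reviewers who author a Policy-A submission but have not declared willingness to follow Policy A.

```python
# Hypothetical sketch of the reciprocity check described above; not actual ICML tooling.

def reciprocity_violations(authored_submissions, requires_policy_a, willing_to_follow_a):
    """authored_submissions: reviewer id -> set of submission ids they author.
    requires_policy_a: set of submission ids whose authors require Policy A.
    willing_to_follow_a: set of reviewer ids who declared they will follow Policy A.
    Returns reviewers who author a Policy-A submission but did not opt into Policy A."""
    return [
        reviewer
        for reviewer, subs in authored_submissions.items()
        if (subs & requires_policy_a) and reviewer not in willing_to_follow_a
    ]

authored = {"alice": {"sub1"}, "bob": {"sub2"}}
print(reciprocity_violations(authored, requires_policy_a={"sub1"},
                             willing_to_follow_a={"alice"}))
# [] -> no violations; if "alice" had not opted into Policy A, she would be flagged.
```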

The second important issue is enforceability. In the case of Policy A, we face a challenge similar to that of other major machine learning conferences whose policies were similar to Policy A (like ICML 2025 and NeurIPS 2025). However, we expect that our situation regarding LLM use under Policy A will be less adversarial, since reviewers are (largely) opting in to follow Policy A.

In the case of Policy B, we expect that the key challenge will be to get across which LLM uses are allowed and which are not. We are planning to insert salient parts of the policy at various points in the reviewing workflow and also to do our best to communicate them to the broader community (this blog post is the initial step). That said, we acknowledge that various parts of the policy are not realistically enforceable (like the requirement to use privacy-compliant LLMs). Here, to a degree, we trust that reviewers are honest and follow our policies, much as we have always trusted them to maintain the confidentiality of reviewing. As with other academic integrity policies, any reported violations will be penalized.

When it comes to proactive detection of violations, we are planning to use automated tools that help detect LLM use, while respecting the confidentiality of the peer-review process. Such flagging does not immediately imply a policy violation (both because of false positives and because many LLM uses are allowed under Policy B). However, instances of hallucinated content constitute a clear violation of our LLM policy, and instances of low-quality reviews constitute a clear violation of our reciprocal reviewing policy (and will be viewed as abuse of the peer-review system).

Ultimately, we believe that as a community we need to experiment with ways of improving peer review. The policy for ICML 2026 is anchored in the belief that it is important to respect the authors’ expectation of a fair peer review (without LLMs if so desired) while also supporting reviewers where they are, so they can do their best work. We hope that this will result in a more collaborative and less adversarial peer review environment. Even if the dual-policy framework does not succeed in this goal, we hope that we will learn something in the process that will allow us and other conferences to do better in the future.