**ICLR 2024 Reviewer Guide**

Thank you for agreeing to serve as an ICLR 2024 reviewer. Your contribution as a reviewer is paramount to creating an exciting and high-quality program. We ask that:

- Your reviews are timely and substantive.
- You follow the reviewing guidelines below.
- You adhere to our __Code of Ethics__ in your role as a reviewer. You must also adhere to our __Code of Conduct__.

This guide is intended to help you understand the ICLR 2024 decision process and your role within it. It contains:

- An outline of the __main reviewer tasks__
- Step-by-step __reviewing instructions__ (especially relevant for reviewers who are new to ICLR)
- __Review examples__
- An __FAQ__

**We’re counting on you**

As a reviewer you are central to the program creation process for ICLR 2024. Your Area Chairs (ACs), Senior Area Chairs (SACs) and the Program Chairs (PCs) will rely greatly on your expertise and your diligent and thorough reviews to make decisions on each paper. Therefore, your role as a reviewer is critical to ensuring a strong program for ICLR 2024.

High-quality reviews are also very valuable for helping authors improve their work, whether it is eventually accepted by ICLR 2024, or not. Therefore it is important to treat each valid ICLR 2024 submission with equal care.

**As a token of our appreciation for your essential work, top reviewers will be acknowledged permanently on the ICLR 2024 website. Furthermore, top and high-quality reviewers will receive special acknowledgement during the opening ceremony and free registration to ICLR 2024.**

**Main reviewer tasks**

The main reviewer tasks and dates are as follows:

- Create or update your OpenReview profile (September 23 2023)
- Bid on papers (September 30 2023 - October 04 2023)
- Write a constructive, thorough and timely review (October 09 2023 - October 30 2023)
- Initial paper reviews released (November 10 2023)
- Discuss with authors and other reviewers to clarify and improve the paper (November 10 2023 - November 21 2023)
- Flag any potential CoE violations and/or concerns (by November 21 2023)
- Provide a recommendation to the area chair assigned to the paper (by November 21 2023)
- Attend a virtual meeting with the AC if a paper you reviewed is a borderline case (November 21 2023 - December 14 2023)
- Provide a final recommendation to the area chair assigned to the paper (after Virtual Meeting)

**Code of Ethics**

All ICLR participants, including reviewers, are required to adhere to the ICLR Code of Ethics (__https://iclr.cc/public/CodeOfEthics__). All reviewers are required to read the Code of Ethics and adhere to it. The Code of Ethics applies to all conference participation, including paper submission, reviewing, and paper discussion.

As part of the review process, reviewers are asked to raise potential violations of the ICLR Code of Ethics. Note that authors are encouraged to discuss questions and potential issues regarding the Code of Ethics as part of their submission. This discussion is not counted against the maximum page limit of the paper and should be included as a separate section.

**Reviewing a submission: step-by-step**

Summarized in one sentence, a review aims to determine whether a submission will bring sufficient value to the community and contribute new knowledge. The process can be broken down into the following main reviewer tasks:

**Read the paper:** It’s important to carefully read through the entire paper, and to look up any related work and citations that will help you comprehensively evaluate it. Be sure to give yourself sufficient time for this step.

**While reading, consider the following:**

- Objective of the work: What is the goal of the paper? Is it to better address a known application or problem, draw attention to a new application or problem, or to introduce and/or explain a new theoretical finding? A combination of these? Different objectives will require different considerations as to potential value and impact.
- Strong points: Is the submission clear, technically correct, experimentally rigorous, and reproducible? Does it present novel findings (e.g. theoretically, algorithmically, etc.)?
- Weak points: Is it weak in any of the aspects listed under strong points?
- Be mindful of potential biases and try to be open-minded about the value and interest a paper can hold for the entire ICLR community, even if it may not be very interesting for you.

**Answer four key questions for yourself, to make a recommendation to Accept or Reject:**

- What is the specific question and/or problem tackled by the paper?
- Is the approach well motivated, including being well-placed in the literature?
- Does the paper support the claims? This includes determining if results, whether theoretical or empirical, are correct and if they are scientifically rigorous.
- What is the significance of the work? Does it contribute new knowledge and sufficient value to the community? Note that this does not necessarily require state-of-the-art results. Submissions bring value to the ICLR community when they convincingly demonstrate new, relevant, impactful knowledge (including empirical, theoretical, or practitioner-oriented knowledge).

**Write and submit your initial review, organizing it as follows:**

- Summarize what the paper claims to contribute. Be positive and constructive.
- List strong and weak points of the paper. Be as comprehensive as possible.
- Clearly state your initial recommendation (accept or reject) with one or two key reasons for this choice.
- Provide supporting arguments for your recommendation.
- Ask questions you would like answered by the authors to help you clarify your understanding of the paper and provide the additional evidence you need to be confident in your assessment.
- Provide additional feedback with the aim of improving the paper. Make it clear that these points are here to help, and not necessarily part of your decision assessment.

**Complete the CoE report:** ICLR has adopted the following __Code of Ethics__ (CoE). When submitting your review, you’ll be asked to complete a CoE report for the paper. The report is a simple form with two questions. The first asks whether there is a potential violation of the CoE. The second is relevant only if there is a potential violation and asks the reviewer to explain why there may be a potential violation. In order to answer these questions, it is therefore important that you read the CoE before starting your reviews.

**Engage in discussion:** The discussion phase at ICLR is different from most conferences in the AI/ML community. During this phase, reviewers, authors and area chairs engage in asynchronous discussion, and authors are allowed to revise their submissions to address concerns that arise. It is crucial that you are actively engaged during this phase. Remain open to changing your initial recommendation, to either a more positive or a more negative rating.

**Borderline paper meeting:** Similarly to last year, the ACs are encouraged to (virtually) meet and discuss with reviewers **only for borderline cases**. ACs will reach out to schedule this meeting and will facilitate the discussion. This is to ensure active discussion among reviewers and well-thought-out decisions. For a productive discussion, it is important to familiarize yourself with the other reviewers' feedback prior to the meeting. Please note that we will be leveraging information on reviewers who fail to attend this meeting (excluding emergencies).

**Provide final recommendation:** Update your review, taking into account the new information collected during the discussion phase and any revisions to the submission. State your reasoning and what did or didn’t change your recommendation throughout the discussion phase.

**For in-depth advice on reviewing, see these resources**:

- Daniel Dennett, __Criticising with Kindness__.
- Comprehensive advice: __Mistakes Reviewers Make__
- Views from multiple reviewers: __Last minute reviewing advice__
- Perspective from instructions to Area Chairs: __Dear ACs__.

**Review Examples**

Here are two sample reviews from previous conferences that illustrate what we consider a good review: one leaning to accept and one leaning to reject.

**Review for a Paper where Leaning-to-Accept**

This paper proposes a method, Dual-AC, for optimizing the actor (policy) and critic (value function) simultaneously which takes the form of a zero-sum game resulting in a principled method for using the critic to optimize the actor. In order to achieve that, they take the linear programming approach of solving the Bellman optimality equations, outline the deficiencies of this approach, and propose solutions to mitigate those problems. The discussion on the deficiencies of the naive LP approach is mostly well done. Their main contribution is extending the single step LP formulation to a multi-step dual form that reduces the bias and makes the connection between policy and value function optimization much clearer without losing convexity by applying a regularization. They perform an empirical study in the Inverted Double Pendulum domain to conclude that their extended algorithm outperforms the naive linear programming approach without the improvements. Lastly, there are empirical experiments done to conclude the superior performance of Dual-AC in contrast to other actor-critic algorithms.

Overall, this paper could be a significant algorithmic contribution, with the caveat for some clarifications on the theory and experiments. Given these clarifications in an author response, I would be willing to increase the score.

For the theory, there are a few steps that need clarification and further clarification on novelty. For novelty, it is unclear if Theorem 2 and Theorem 3 are both being stated as novel results. It looks like Theorem 2 has already been shown in “Randomized Linear Programming Solves the Discounted Markov Decision Problem in Nearly-Linear Running Time”. There is a statement that “Chen & Wang (2016); Wang (2017) apply stochastic first-order algorithms (Nemirovski et al., 2009) for the one-step Lagrangian of the LP problem in reinforcement learning setting. However, as we discussed in Section 3, their algorithm is restricted to tabular parametrization”. Is your Theorem 2 somehow an extension? Is Theorem 3 completely new?

This is particularly called into question due to the lack of assumptions about the function class for value functions. It seems like the value function is required to be able to represent the true value function, which can be almost as restrictive as requiring tabular parameterizations (which can represent the true value function). This assumption seems to be used right at the bottom of Page 17, where U^{pi*} = V^*. Further, eta_v must be chosen to ensure that it does not affect (constrain) the optimal solution, which implies it might need to be very small. More about conditions on eta_v would be illuminating.

There is also one step in the theorem that I cannot verify. On Page 18, how is the square removed for the difference between U and Upi? The transition from the second line of the proof to the third line is not clear. It would also be good to more clearly state on page 14 how you get the first inequality, for || V^* ||_{2,mu}^2.

**For the experiments, the following should be addressed.**

1. It would have been better to also show the performance graphs with and without the improvements for multiple domains.

2. The central contribution is extending the single step LP to a multi-step formulation. It would be beneficial to empirically demonstrate how increasing k (the multi-step parameter) affects the performance gains.

3. Increasing k also comes at a computational cost. I would like to see some discussions on this and how long dual-AC takes to converge in comparison to the other algorithms tested (PPO and TRPO).

4. The authors concluded the presence of local convexity based on hessian inspection due to the use of path regularization. It was also mentioned that increasing the regularization parameter size increases the convergence rate. Empirically, how does changing the regularization parameter affect the performance in terms of reward maximization? In the experimental section of the appendix, it is mentioned that multiple regularization settings were tried but their performance is not mentioned. Also, for the regularization parameters that were tried, based on hessian inspection, did they all result in local convexity? A bit more discussion on these choices would be helpful.

**Minor comments:**

1. Page 2: In equation 5, there should not be a 'ds' in the dual variable constraint

**Review for a Paper where Leaning-to-Reject**

This paper introduces a variation on temporal difference learning for the function approximation case that attempts to resolve the issue of over-generalization across temporally-successive states. The new approach is applied to both linear and non-linear function approximation, and for prediction and control problems. The algorithmic contribution is demonstrated with a suite of experiments in classic benchmark control domains (Mountain Car and Acrobot), and in Pong.

This paper should be rejected because (1) the algorithm is not well justified either by theory or practice, (2) the paper never clearly demonstrates the existence of problem they are trying to solve (nor differentiates it from the usual problem of generalizing well), (3) the experiments are difficult to understand, missing many details, and generally do not support a significant contribution, and (4) the paper is imprecise and unpolished.

**Main argument**

The paper does not do a great job of demonstrating that the problem it is trying to solve is a real thing. There is no experiment in this paper that clearly shows how this temporal generalization problem is different from the need to generalize well with function approximation. The paper points to references to establish the existence of the problem, but for example the Durugkar and Stone paper is a workshop paper and the conference version of that paper was rejected from ICLR 2018 and the reviewers highlighted serious issues with the paper—that is not work to build upon. Further the paper under review here claims this problem is most pressing in the non-linear case, but the analysis in section 4.1 is for the linear case.

The resultant algorithm does not seem well justified, and has a different fixed point than TD, but there is no discussion of this other than section 4.4, which does not make clear statements about the correctness of the algorithm or what it converges to. Can you provide a proof or any kind of evidence that the proposed approach is sound, or how its fixed point relates to TD?

The experiments do not provide convincing evidence of the correctness of the proposed approach or its utility compared to existing approaches. There are so many missing details it is difficult to draw many conclusions:

- What was the policy used in exp1 for policy evaluation in MC?
- Why Fourier basis features?
- In MC with DQN how did you adjust the parameters and architecture for the MC task?
- Was the reward in MC and Acrobot -1 per step or something else?
- How did you tune the parameters in the MC and Acrobot experiments?
- Why so few runs in MC, none of the results presented are significant?
- Why is the performance so bad in MC?
- Did you evaluate online learning or do tests with the greedy policy?
- How did you initialize the value functions and weights?
- Why did you use experience replay for the linear experiments?
- In MC and Acrobot why only a one layer MLP?

Ignoring all that, the results are not convincing. Most of the results in the paper are not statistically significant. The policy evaluation results in MC show little difference to regular TD. The Pong results show DQN is actually better. This makes the reader wonder if the results with DQN on MC and Acrobot are only worse because you did not properly tune DQN for those domains, whereas the default DQN architecture is well tuned for Atari and that is why your method is competitive in the smaller domains.

The differences in the “average change in value plots” are very small if the rewards are -1 per step. Can you provide some context to understand the significance of this difference? In the last experiment linear FA and MC, the step-size is set equal for all methods—this is not a valid comparison. Your method may just work better with alpha = 0.1.

**The paper has many imprecise parts, here are a few:**

- The definition of the value function would be approximate not equals unless you specify some properties of the function approximation architecture. Same for the Bellman equation
- equation 1 of section 2.1 is neither an algorithm nor a loss function
- TD does not minimize the squared TD. Saying that is the objective function of TD learning is not true
- end of section 2.1 says “It is computed as” but the following equation just gives a form for the partial derivative
- equation 2, x is not bounded
- You state TC-loss has an unclear solution property, I don’t know what that means and I don’t think your approach is well justified either
- Section 4.1 assumes linear FA, but it's implied up until paragraph 2 that it has not assumed linear
- treatment of n_t in alg differs from appendix (t is no time episode number)
- Your method has a n_t parameter that is adapted according to a schedule seemingly giving it an unfair advantage over DQN.
- Over-claim not supported by the results: “we see that HR-TD is able to find a representation that is better at keeping the target value separate than TC is”. The results do not show this.
- Section 4.4 does not seem to go anywhere or produce any tangible conclusions

**Things to improve the paper that did not impact the score:**

- It’s hard to follow how the prox operator is used in the development of the alg, this could use some higher level explanation
- Intro p2 is about bootstrapping, use that term and remove the equations
- It’s not clear why you are talking about stochastic vs deterministic in P3
- Perhaps you should compare against a MC method in the experiments to demonstrate the problem with TD methods and generalization
- Section 2: “can often be a regularization term” >> can or must be?
- update law is an odd term
- “tends to alleviate” >> odd phrase
- section 4 should come before section 3
- Alg 1 is not helpful because it just references an equation
- section 4.4 is very confusing, I cannot follow the logic of the statements
- Q learning >> Q-learning
- Not sure what you mean with the last sentence of p2 section 5
- where are the results for Acrobot linear function approximation
- appendix Q-learning with linear FA is not DQN (table 2)

**FAQ for Reviewers**

**Q:** The review I just submitted is not appearing! Did it go through?

**A:** There seems to be some delay in the system before you can see your reviews. Give it a bit of time and refresh again.

**Q:** What is the **borderline paper meeting** between AC and reviewers about?

**A:** This year we require ACs to meet with reviewers for all borderline papers (defined and announced by PCs once all reviews are received). There will be one meeting per borderline paper. Once we receive all reviews, we will let ACs know which papers fall into this bucket. ACs will then schedule and run the meetings with the reviewers of the papers in question.

**Q:** What if I **cannot attend the meeting**?

**A:** Please work with your AC to find a meeting schedule that works for everyone. All reviewers agreed to participate in review meetings when they accepted the reviewer invitations. An AC has the right to downweight the reviews of reviewers who refuse to attend the meeting.

**Q:** How should I use **supplementary material**?

**A:** It is not necessary to read supplementary material, but such material can often answer questions that arise while reading the main paper, so consider looking there before asking authors.

**Q:** How should I handle a **policy violation**?

**A:** To flag a CoE violation related to a submission, please indicate it when submitting the CoE report for that paper. The AC will work with the PC and the ethics board to resolve the case. To discuss other violations (e.g. plagiarism, double submission, paper length, formatting, etc.), please contact either the AC/SAC or the PC as appropriate. You can do this by sending a confidential comment with the appropriate readership restrictions.

**Q:** Am I allowed to ask for **additional experiments**?

**A:** You can ask for additional experiments. New experiments should not significantly change the content of the submission. Rather, they should be limited in scope and serve to more thoroughly validate existing results from the submission.

**Q:** If a submission does not achieve **state-of-the-art results**, is that grounds for rejection?

**A:** No, a lack of state-of-the-art results does not by itself constitute grounds for rejection. Submissions bring value to the ICLR community when they convincingly demonstrate new, relevant, impactful knowledge. Submissions can achieve this without achieving state-of-the-art results.

**Q:** Are authors expected to cite and **compare with very recent work?** What about non-peer-reviewed (e.g., arXiv) papers? (updated on 7 November 2022)

**A:** We consider papers contemporaneous if they are published (available in online proceedings) within the last four months. That means, since our full paper deadline is September 28, if a paper was published (i.e., at a peer-reviewed venue) on or after May 28, 2023, authors are not required to compare their own work to that paper. Authors are encouraged to cite and discuss all relevant papers, but they may be excused for not knowing about papers not published in peer-reviewed conference proceedings or journals, which includes papers exclusively available on arXiv. Reviewers are encouraged to use their own good judgement and, if in doubt, discuss with their area chair.
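The four-month cutoff in the answer above can be checked with a short script. This is a minimal sketch, not part of any ICLR tooling: the helper names `months_before` and `is_contemporaneous` are hypothetical, and the deadline year (2023) is an assumption inferred from the May 28, 2023 date in the answer.

```python
import calendar
from datetime import date


def months_before(d: date, months: int) -> date:
    """Return the date `months` calendar months before `d` (hypothetical helper).

    The day is clamped to the last day of the target month when needed.
    """
    total = d.year * 12 + (d.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


# Assumption: the full paper deadline of September 28 falls in 2023.
FULL_PAPER_DEADLINE = date(2023, 9, 28)
CUTOFF = months_before(FULL_PAPER_DEADLINE, 4)


def is_contemporaneous(published: date) -> bool:
    """True if a paper published on `published` falls on or after the cutoff."""
    return published >= CUTOFF


print(CUTOFF.isoformat())
print(is_contemporaneous(date(2023, 6, 15)))
print(is_contemporaneous(date(2023, 5, 27)))
```

Under these assumptions the computed cutoff matches the May 28, 2023 date quoted in the answer. The check covers only the publication date; whether a paper counts as published at a peer-reviewed venue still requires the reviewer's judgement.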