Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models (2024)


Cong Lu1,2 (conglu@cs.ubc.ca)
Shengran Hu1,2 (srhu@cs.ubc.ca)
Jeff Clune1,2,3 (jclune@gmail.com)
1University of British Columbia
2Vector Institute
3Canada CIFAR AI Chair

Abstract

Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems, built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challenging problems including Atari games and robotic control, but requires manually designing heuristics to guide exploration (i.e. determine which states to save and explore from, and what actions to consider next), which is time-consuming and infeasible in general. To resolve this, we propose Intelligent Go-Explore (IGE), which greatly extends the scope of the original Go-Explore by replacing these heuristics with the intelligence and internalized human notions of interestingness captured by giant pretrained foundation models (FMs). This provides IGE with a human-like ability to instinctively identify how interesting or promising any new state is (e.g. discovering new objects, locations, or behaviors), even in complex environments where heuristics are hard to define. Moreover, IGE offers the exciting and previously impossible opportunity to recognize and capitalize on serendipitous discoveries that cannot be predicted ahead of time. We evaluate our algorithm on a diverse range of language-based tasks that require search and exploration. In Game of 24, a problem testing multistep mathematical reasoning, IGE reaches a 100% success rate 70.8% faster than the best classic graph search baseline. Next, in BabyAI-Text, a challenging partially observable gridworld where an agent has to follow language instructions, IGE exceeds the previous state-of-the-art with orders of magnitude fewer online samples. Finally, in TextWorld, a rich text game, we show the unique ability of IGE to succeed in settings requiring long-horizon exploration where prior state-of-the-art FM agents like Reflexion completely fail. Overall, Intelligent Go-Explore combines the tremendous strengths of FMs and the powerful Go-Explore algorithm, opening up a new frontier of research into creating more generally capable agents with impressive exploration capabilities. All our code is open-sourced at: https://github.com/conglu1997/intelligent-go-explore.

1 Introduction

Foundation models (FMs, [5, 31, 7, 36, 38]) trained on giant internet-scale datasets have demonstrated strong general capabilities in reasoning [40] and understanding [9]. As such, these models have been increasingly employed as autonomous agents [28, 43, 39, 42, 33, 4] in decision-making tasks, showcasing the ability to adapt in-context [12, 30] to unseen tasks. However, a significant challenge remains: foundation model agents often struggle in environments that require deep exploration over extended time horizons [28]. Overcoming this limitation would enable us to realize their potential as autonomous assistants in more open-ended domains like scientific discovery and innovation [20]. This paper introduces Intelligent Go-Explore (IGE), a novel approach that combines the intelligence of foundation models with the powerful Go-Explore [13, 14] framework to substantially increase the exploration capabilities of FM and reinforcement learning (RL, [34]) agents.

[Figure 1: Overview of Intelligent Go-Explore (top) and classic Go-Explore (bottom), showing the three stages: select a state from the archive, explore from that state, and update the archive.]

Go-Explore is a popular family of algorithms in deep RL based on maintaining an archive of “interestingly new” discovered states and then iteratively returning to and exploring from the most promising states (see Figure 1 for an overview of the three stages). This framework has led to superhuman performance in a range of hard-exploration problems, including long-horizon Atari games and robotic control. However, success in these domains has largely relied on carefully hand-designed heuristics at all three stages to guide exploration. For example, in Montezuma’s Revenge [2], an Atari game that was previously a grand challenge of exploration in deep RL, (1) saved states in the archive were returned to with probability proportional to factors like the number of times a state had been sampled before, (2) exploration was purely via random action sampling, and (3) the criteria for which states were considered interestingly new enough to be added to the archive depended on domain-specific factors like whether the agent visited a new location, or did so with more keys.

These rigid, domain-specific choices are in stark contrast to human-like exploration of a new game, where players can often intuitively judge the value or interestingness of any particular state [10]. More importantly, it is often impossible to know what is interesting or possible ahead of time in complex domains. In the words of Isaac Asimov: “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny.’” With this motivation, IGE stands on the shoulders of giant foundation models, and uses their intelligence (1) to act as a judge that identifies the most promising states to return to and explore from, (2) to select the best actions to take when exploring from a selected state, and (3) to identify serendipitous discoveries when they happen (e.g. finding new objects, locations, or other novelties) and decide whether a new state is interestingly new enough to be added to the archive as a stepping stone for future exploration (Figure 1, top).

We demonstrate IGE’s ability to reliably improve the exploration capabilities of FM agents on a diverse range of language-based tasks that require search and exploration. These settings include tasks that require commonsense reasoning, long-term planning and memory, and handling partial observability. IGE integrates well with various agent strategies, including few-shot and chain-of-thought-based prompting, and will only get better as the capabilities of foundation models improve further. While IGE performs strongly all-around, some highlights from our evaluation include: IGE reaches a 100% success rate on Game of 24 [42], a standard mathematical reasoning and search problem, 70.8% faster than classic graph search. Additionally, on the TextWorld [11] Coin Collector domain, IGE is the only algorithm that succeeds in discovering long-horizon optimal solution paths, where prior state-of-the-art FM agent frameworks like Reflexion [33] fail.

Intelligent Go-Explore simultaneously empowers foundation model agents to reliably explore, and reimagines the scope of Go-Explore to tackle virtually any type of problem, without being limited to hand-designed heuristics. These abilities will substantially improve our ability to develop more generally capable agents, and increase the range of tasks they can learn how to solve.

2 Background

2.1 Go-Explore for Hard-Exploration Problems

Go-Explore [14, 13] is a family of algorithms designed to solve hard-exploration [24] problems based on the principle of remembering and returning reliably to promising states. The classic setting builds an “archive” of novel states it discovers in an environment, where similar states are grouped in a single “cell”. These cells are defined by heuristics like having the same visual observation when downsampled to low resolution. In the beginning, the archive only contains the initial state. We describe the overall structure of the algorithm in the same order as Figure 1 (bottom): at each iteration, (1) promising states are selected from the archive through domain-specific heuristics, e.g. probabilistically sampling states proportional to their progress through the environment or potential to lead to new states. The agent returns to that state, by resetting using the simulator or via a goal-conditioned policy, and (2) a sequence of random actions is taken to explore from that state. (3) All discovered states deemed interestingly new by the cell representation heuristics are added to the archive, and the process repeats. The strength of Go-Explore is due to addressing two critical impediments to exploration: forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment).
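
To make the three stages concrete, here is a minimal sketch of the classic loop under the assumptions above. The `cell_key` downsampling, the visit-count sampling weights, and the gym-style `env.get_state`/`env.set_state`/`env.action_space` interface are illustrative stand-ins, not the exact heuristics or API of the original work.

```python
import random
from collections import defaultdict

def cell_key(observation):
    """Illustrative cell heuristic: coarsen the observation so that visually
    similar states collapse into the same archive cell (e.g. downsampled pixels)."""
    return tuple(v // 32 for v in observation)  # placeholder downsampling

def classic_go_explore(env, n_iterations=1000, n_explore_steps=20):
    """Archive-based exploration loop with hand-designed heuristics."""
    visits = defaultdict(int)
    obs = env.reset()
    archive = {cell_key(obs): env.get_state()}   # cell -> simulator snapshot

    for _ in range(n_iterations):
        # (1) Select a promising cell, here favouring rarely-visited cells.
        cells = list(archive)
        weights = [1.0 / (1 + visits[c]) for c in cells]
        cell = random.choices(cells, weights=weights)[0]
        visits[cell] += 1

        # "Go": return to the cell by restoring the simulator snapshot.
        obs = env.set_state(archive[cell])

        # (2) "Explore": take random actions from the restored state.
        for _ in range(n_explore_steps):
            obs, reward, done, info = env.step(env.action_space.sample())
            if done:
                break
            # (3) Archive any state that maps to a previously unseen cell.
            key = cell_key(obs)
            if key not in archive:
                archive[key] = env.get_state()
    return archive
```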

This approach leads to a collection of high-return trajectories being discovered, which may then be fed into an imitation learning [19] algorithm to produce a policy that generalizes and is robust to stochasticity. We adopt similar assumptions as the original setting, by assuming an agent can return to a previously discovered state by restoring in the simulator. This assumption may readily be relaxed by training a policy to return to a given state, or in the foundation model case, by simply prompting the model with a past trajectory.

2.2 Large Language and Multimodal Foundation Models

The combination of model scaling and training over internet-scale data has resulted in a wide variety of foundation models [5] that exhibit generalist capabilities. In this paper, we consider autoregressive large language models (LLMs, [7, 31, 38]) which learn to generate text completions by modeling the conditional probability of a new token given the preceding tokens, $p(x_t \mid x_{<t}; \theta)$. This framework enables LLMs to not only generate coherent text but crucially also exhibit human-like abilities, including on commonsense knowledge questions [35] and complex reasoning tasks [40]. These models may also be extended to other input modalities such as images by tokenizing these inputs into the same space as the text [48]. When prompting an FM with an instruction, the user may decide to do so with no related examples (zero-shot), with a few successful examples on related problems (few-shot, [7]), or ask for a chain of reasoning (chain-of-thought, [40]) before responding.
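
For reference, the per-token conditional above composes into the standard autoregressive factorization of a length-$T$ token sequence:

$p(x_{1:T}; \theta) = \prod_{t=1}^{T} p(x_t \mid x_{<t}; \theta)$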

3 Driving Exploration with Giant Foundation Models

In this section, we propose Intelligent Go-Explore (IGE), which reimagines the classic Go-Explore algorithm described in Section 2 with the intelligence of giant pretrained foundation models. Specifically, we introduce FM intelligence to select which archived state to return to and explore from, to choose which action to take from each state, and to decide whether a state is interestingly new and should be archived. IGE’s use of foundation models is closely related to FM-as-a-judge [47], which shows that foundation models are a good proxy for human judgment when evaluating the output of generative models. Here, instead of judging synthetic output, the foundation model makes choices to determine the best way to explore an environment. We illustrate the resulting algorithm at the top of Figure 1 and provide full pseudocode in Algorithm 1.

Wherever we query the foundation model, we introduce the overall strategy of Go-Explore alongside a brief description of the current environment in the “system message” (high-level directive) displayed below. The brief descriptions for each environment we evaluate on in Section 4 are listed in Appendix B. In the following sections, we detail our prompting techniques at each stage of IGE. The previous prompt history is visible to the agent, which enables each component of IGE to communicate with the others. We provide precise details on how we parse responses in Section C.1.

3.1 Select State From Archive

The power to easily store and return to promising discovered states is crucial to Go-Explore’s ability to reliably solve long-horizon exploration problems. IGE leverages the foundation model’s internalized notions of interestingness [44] to select the most promising state to return to from the archive (Figure 1, left). This is far more flexible than classic Go-Explore, which relied on hardcoded, hand-crafted heuristics to determine cell sampling probabilities. An example prompt is shown below.

Examples of the discovered states are given in Table 1. We assign indices to these states in a list and ask the FM to select a numerical index. We define a budget of $N_{\text{state}}$ “state expansions”. Each state expansion is followed by a sequence of exploratory actions, which we describe in the next section.
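
As a concrete illustration, the state-selection query might be implemented roughly as below. The prompt wording is paraphrased and `fm.query` is a hypothetical wrapper around the underlying chat API, not the exact prompt or code released with the paper.

```python
import json
import random

def select_state(fm, archive, system_message):
    """Ask the FM to pick the index of the most promising archived state."""
    listing = "\n".join(f"{i}: {state}" for i, state in enumerate(archive))
    prompt = (
        "Here are the states currently saved in the archive:\n"
        f"{listing}\n"
        "Reply with a JSON object of the form {\"choice\": <index>}, giving the "
        "index of the state that is most promising to return to and explore from."
    )
    reply = fm.query(system_message, prompt)   # hypothetical FM wrapper
    try:
        idx = int(json.loads(reply)["choice"])
        if not 0 <= idx < len(archive):
            raise ValueError(idx)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        idx = random.randrange(len(archive))   # failsafe: random valid choice
    return idx
```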

3.2 Explore From State

In order to effectively explore from a state selected in the previous section, we leverage the power of foundation model agents [28, 18] to choose how to act in an environment. This vastly improves on the original Go-Explore’s use of random action sampling. One of the key strengths of IGE is that it is a strict improvement on top of any FM agent framework, including zero-shot, few-shot, or even chain-of-thought-based prompting [43]. We demonstrate this flexibility in Section 4.

One point of departure from the classic Go-Explore is that we additionally maintain a state-conditional action history for each archived state, so that IGE can avoid repeating previously tested options. While this information may already be available in the entire history, this helps avoid any recency bias that can occur with longer contexts [45]. The action history can be easily reiterated in the prompt, or the prompt could display the remaining untested actions. We define a budget of exploratory actions per state expansion, $N_{\text{action}}$, which is typically far shorter than the full horizon of the environment and represents a small number of trial actions. An example prompt is shown here.
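
A rough sketch of this exploration phase is given below, assuming the same restorable-simulator interface as in Section 2.1; `env.valid_actions` and `fm.query` are hypothetical helpers, and the exact bookkeeping of the state-conditional action history may differ in the released code.

```python
def explore_from_state(fm, env, snapshot, action_history, n_action, system_message):
    """Take a short budget of FM-chosen exploratory actions from a restored state,
    steering the FM away from actions already tried from the same observation."""
    obs = env.set_state(snapshot)            # "go": restore in the simulator
    discoveries = []
    for _ in range(n_action):
        tried = action_history.setdefault(obs, [])
        untried = [a for a in env.valid_actions(obs) if a not in tried]
        prompt = (
            f"Current observation: {obs}\n"
            f"Actions already tried from this state: {tried}\n"
            f"Untested actions: {untried}\n"
            "Choose the most promising action to explore next."
        )
        action = fm.query(system_message, prompt)   # hypothetical FM wrapper
        tried.append(action)
        obs, reward, done, info = env.step(action)
        discoveries.append((obs, env.get_state()))  # candidate states for the archive
        if done:
            break
    return discoveries
```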

3.3 Update Archive

IGE queries the foundation model to judge whether any newly discovered state is interestingly new and sufficiently different from prior states to qualify to be added to the archive. Intuitively, we should only save the most relevant stepping stones, and discard those that are unlikely to lead to new discoveries.

Whilst the original Go-Explore required extensive domain knowledge to determine interestingness, IGE avoids this requirement and manual labor, critically gaining the ability to recognize serendipitous discoveries that could not have been predicted ahead of time. In practice, we propose two options to filter discovered states after a sequence of exploratory actions. The first is to iterate through every new state and ask whether each one is interestingly new and should be added to the archive. The second is to first add all states, and then ask the foundation model to remove the uninteresting states. We discuss this choice later in Section 4.3; the second form is preferable in larger environments, where there is more need to explicitly deprecate earlier discoveries that have become irrelevant, so as not to overload the archive. An example prompt for the first option is shown below.
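
A minimal sketch of the first, acceptance-based option follows; the prompt is paraphrased and `fm.query` is again a hypothetical wrapper. A similar sketch of the second, rejection-based option appears in Appendix C.2.

```python
def acceptance_filter(fm, archive, candidates, system_message):
    """Acceptance-based filtering: ask the FM, one candidate at a time, whether
    the newly discovered state is interestingly new relative to the archive."""
    for obs, snapshot in candidates:
        prompt = (
            f"States already saved in the archive: {[o for o, _ in archive]}\n"
            f"Newly discovered state: {obs}\n"
            "Is this state interestingly new and worth saving as a stepping stone "
            "for future exploration? Answer yes or no."
        )
        reply = fm.query(system_message, prompt)   # hypothetical FM wrapper
        if reply.strip().lower().startswith("yes"):
            archive.append((obs, snapshot))
    return archive
```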

By default, IGE applies the foundation model at all three stages of Go-Explore, but we rigorously analyze the relative importance of each component in Section 5. In this paper, we focus on the discovery of solutions to hard-exploration problems. However, these solutions could then be used for downstream reinforcement learning, or even to improve the foundation model on the next task via in-context learning, thus allowing an agent to bootstrap its own learning indefinitely.

4 Empirical Evaluation

In this section, we evaluate Intelligent Go-Explore across a diverse set of text environments that require search and exploration. We demonstrate IGE’s ability to handle partially observable and complex observation spaces, discover solutions involving long chains of actions, and effectively improve the ability of FM agents to explore. For all our experiments, we use GPT-4 [31], one of the current SOTA LLMs, as our foundation model. We compare IGE to random action sampling, a naïve LLM baseline, and two SOTA FM agents, ReAct [43] and Reflexion [33]. All methods use the same number of environment steps and receive the same observations for a fair comparison. The naïve LLM baseline simply queries the LLM for an action conditioned on the interaction history. ReAct prompts the agent to output its reasoning before making a decision. Building on ReAct, Reflexion further conditions the agent on the previously attempted episode, asking the agent to learn from its mistakes. We provide an overview of our environments in Table 1. Full hyperparameters are detailed in Appendix D.

Table 1: Overview of the environments used in our evaluation.

Game of 24
  Problem type: mathematical reasoning and search
  Text observation: "Current state: (2 8 8 14)"
  Next actions: 2 + 8 = 10 Next: (8 10 14); 8 / 2 = 4 Next: (4 8 14); 14 + 2 = 16 Next: (8 8 16)
  Task horizon: 3

BabyAI-Text
  Problem type: partially observable gridworld with language instructions
  Text observation: "Goal: unlock the red door. You see a wall 4 steps forward, You see a yellow box 2 steps left."
  Next actions: turn left; turn right; go forward
  Task horizon: 64 or 128

TextWorld
  Problem type: partially observable game requiring long-term memory and planning, exploration, and common sense
  Text observation: "You arrive in a pantry… You see a shelf. The shelf is wooden. On the shelf you can see flour…"
  Next actions: go east; cook potato with oven; unlock door with key
  Task horizon: 25, 40 or 80

4.1 Game Of 24

We first demonstrate the effectiveness of IGE on a mathematical reasoning task, Game of 24 [42]. The goal is to perform basic arithmetic operations $(+, -, \times, /)$ starting from 4 numbers to obtain 24. For example, given the input $(4, 9, 10, 13)$, a possible solution is $(10-4)\times(13-9)=24$. We formulate the problem as an MDP [34], where actions represent a reduction of two numbers by an arithmetic operation; i.e., the above solution would be represented as the sequence of state transitions $(4,9,10,13) \xrightarrow{10-4=6} (6,9,13) \xrightarrow{13-9=4} (6,4) \xrightarrow{6\times 4=24} (24)$. Therefore, IGE uses the FM to iteratively expand possible solution paths and archive promising ones to return to. The action space is the range of possible next operations, displayed in the same manner as in Yao et al. [42].
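
For illustration, the action space at each state can be enumerated as every reduction of two numbers by one arithmetic operation. The sketch below is a hypothetical helper showing that enumeration, not the exact 'propose' formatting of Yao et al. [42] used in the experiments.

```python
from itertools import combinations

def next_operations(numbers):
    """Enumerate all one-step reductions: pick two numbers, apply an operation,
    and return (action description, successor state with one fewer number)."""
    actions = []
    for (i, a), (j, b) in combinations(enumerate(numbers), 2):
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        results = [(f"{a} + {b}", a + b), (f"{a} * {b}", a * b),
                   (f"{a} - {b}", a - b), (f"{b} - {a}", b - a)]
        if b != 0:
            results.append((f"{a} / {b}", a / b))
        if a != 0:
            results.append((f"{b} / {a}", b / a))
        for desc, value in results:
            actions.append((f"{desc} = {value}", tuple(sorted(rest + [value]))))
    return actions

# For example, next_operations((2, 8, 8, 14)) includes ("8 / 2 = 4.0", (4.0, 8, 14)),
# matching the "Next actions" column of Table 1 up to float formatting.
```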

[Figure 2: Success rate on 100 hard Game of 24 test problems as a function of the number of operations, for IGE, the LLM agent baselines, and graph search.]

We evaluate IGE across 100 hard test problems in Figure 2, and additionally include the standard (unweighted) graph search algorithms depth-first search (DFS) and breadth-first search (BFS) as reference. Since the combinatorial complexity of the problem is at most $\binom{4}{2}\cdot\binom{3}{2}\cdot 4^{3}=1152$, graph search is guaranteed to find a solution within that many actions. The system prompts for both IGE and the LLM baselines contain few-shot examples with correct calculations on different starting numbers. IGE rapidly reaches a 100% success rate, on average 70.8% quicker than the next best baseline, depth-first search (DFS); this improvement is statistically significant ($\chi^{2}$ test, $p<0.05$) at 150 operations, where IGE has solved all problems. This success may be attributed to the fact that language models have internalized mathematical intuition and are likely to be able to identify promising pairs like $(6,4)$ that could easily be multiplied together for a solution.

All LLM agent baselines (naïve LLM, ReAct, Reflexion) eventually plateau and are even beaten by the unintelligent DFS. This highlights the need for diverse action selection, which IGE enables. A final point of comparison we make is to Tree of Thoughts (ToT, [42]), which achieved 74% on Game of 24 within their evaluation budget. We emphasize that our evaluation setting is very different, as IGE selects from the list of valid options rather than doing the math in context. However, we note the key difference to our method is that ToT evaluates and expands multiple reasoning paths following a tree structure, whereas IGE can easily jump around the search space; this is a crucial advantage in more complex environments (like those in the following sections), where it takes many coordinated actions to get from one state to another interesting state.

4.2 BabyAI-Text

[Figure 3: Final success rates on the five BabyAI-Text task families for IGE and the baselines.]

Next, we show that IGE scales to the BabyAI-Text environment from Carta et al. [8], which is a procedurally generated, partially observable 2D gridworld with text-based observations. The agent is given a textual goal instruction, which could correspond to one or more instructions in a sequence, e.g. “pick up X and then go to Y”. As we can see from the observations in Table 1, the task is challenging even for humans to complete and requires forming a model of the world from partial text descriptions. This kind of state observation would make it hard to define heuristics to determine how good any particular state is, as in classic Go-Explore. The optimal path to a solution may include moving blocking objects as well as finding keys to open doors. We consider 5 different task families of increasing difficulty: “go to”, “pick up”, “pick up then go to”, “open door”, and “put next to”, which are described fully in Section B.2.

We omit the Reflexion baseline in this environment due to the high cost of querying GPT-4 with 128-step episodes in the context. Due to the complexity of this environment, we use chain-of-thought prompting in all three components of IGE. This allows the FM to deliberate on the state of the game before making decisions. We show that IGE can find solutions to these problems with only a tiny budget of 250 environment steps per task (divided into rollouts of 10 exploratory actions each) and visualize the final performance in Figure 3. IGE and ReAct vastly outperform the prior RL-trained language model approach, GLAM [8], with orders of magnitude fewer samples (GLAM used 1.5M online steps) and without requiring any training whatsoever. IGE achieves the best or close to the best performance on every task. The gap between IGE and the second-best method grows with task difficulty, with a statistically significant 36% improvement ($\chi^{2}$ test, $p<0.05$) on “put next to”.

4.3 TextWorld

Finally, we show IGE’s ability to tackle tasks requiring long-horizon memory and planning, exploration, and common sense in TextWorld [11], a classic text-based agent benchmark. We consider three challenging games in TextWorld: Treasure Hunter, The Cooking Game, and Coin Collector. In each game, the agent needs to complete the task while navigating a maze of different rooms, while only seeing the current room’s description in text. The agent interacts with the world using free-form natural language commands, such as “go east” or “cook potato with oven.” In Treasure Hunter [32], the agent has to find a specific item by exploring, finding keys, and unlocking doors and containers. In The Cooking Game, the agent must find a recipe, locate and process (e.g., dice, cut, chop) ingredients, and cook them according to the recipe using various kitchen appliances (e.g., oven, pan). In Coin Collector, the agent must find a coin randomly located in the maze, testing its navigation and exploration skills. We set each game to hard difficulty; details of the game customizations are provided in Section B.3. As in the previous section, we use chain-of-thought prompting in all three components of IGE. Because the state archive in this environment grows significantly, we implement rejection-based archive filtering, which we describe in Section C.2.

[Figure 4: Success rates on the three TextWorld games (Treasure Hunter, The Cooking Game, Coin Collector) for IGE and the baselines.]

We present the success rates achieved on the three games by IGE and the baselines in Figure 4. We observe that IGE outperforms all other baselines, with a statistically significant ($\chi^{2}$ test, $p<0.05$) performance gap between IGE and the second-best method in the harder Cooking Game and Coin Collector. In The Cooking Game, IGE outperforms the second-best agent, ReAct, by a large margin of 36%, demonstrating IGE’s advantage in hard-exploration problems. In Coin Collector, IGE is the only method that can find the solution in the maze, with all other methods completely failing. Interestingly, we observe that IGE exhibits BFS-like behavior, intelligently selecting rooms with unexplored directions and iteratively removing rooms with exhausted directions. This results in IGE almost always finding the shortest path to the target, while other methods fail to navigate the maze.

We highlight that Reflexion does not improve over ReAct in any of the games we tested. Although Reflexion should in theory improve over ReAct given the experience from previous attempts, it tends to decrease performance. We hypothesize that in long-horizon environments, the history becomes too long after the initial episode, and prevents Reflexion from effectively utilizing knowledge from the previous episode. In contrast, IGE uses the FM to iteratively filter interesting states in the archive, which keeps the context length under control. This helps IGE truly make use of the cumulative knowledge gained through exploration.

5 Analysis

In this section, we analyze (1) the importance of FM intelligence for each of the three key components of Go-Explore, (2) how IGE’s selectivity produces a smaller (and thus more efficient) archive, and (3) how IGE’s performance improves as the FM’s size/intelligence increases. We take a representative sample of environments from the previous section: Game of 24, Put Next To (PN) from BabyAI-Text, and The Cooking Game (CG) from TextWorld. Hyperparameters are listed in Appendix D.

Table 2: Success rate (%) of IGE variants.

| Variant of IGE | Game of 24 | BabyAI (PN) | TextWorld (CG) |
| --- | --- | --- | --- |
| Standard | 100 ± 0.0 | 84 ± 14 | 92 ± 10 |
| ✗ Intelligent action selection | 68 ± 9.0 | 24 ± 16 | 0 ± 0 |
| ✗ Intelligent state selection | 96 ± 3.5 | 48 ± 20 | 76 ± 16 |
| ✗ Intelligent archive filtering | 93 ± 5.0 | 64 ± 20 | 64 ± 20 |
| ✗ All 3 above | 61 ± 9.5 | 4 ± 6 | 0 ± 0 |
| ✗ State-conditional action history | 33 ± 9.0 | 72 ± 16 | 72 ± 16 |

How Important is Foundation Model Intelligence at Each Step? First, we analyze the impact of FM intelligence on each component of Intelligent Go-Explore. We ablate replacing state and action selection with uniform random sampling, replacing archive filtering with saving everything to the archive, and not maintaining a state-conditional action history. We use these unintelligent choices because it would be very time-consuming to attempt to design the right heuristics based on the rich text observations in Table 1. In Table 2, we observe that where the intelligence of FMs is most valuable varies by environment. Since the environment horizon is only 3 in the Game of 24, the most important factor is ensuring that the actions tried are diverse and intelligently selected. This hypothesis is confirmed: the largest performance drops occur when removing either FM action selection or the action history. In the longer-horizon BabyAI-Text and TextWorld environments, different IGE components are most helpful: intelligent state selection and archive filtering make a big impact, showcasing the strength of enabling IGE to return to promising discovered states. There are smaller performance drops when removing the action history, likely because in larger environments many more unique states are discovered, so there is less gain from preventing repeated actions from frequently returned-to states. In both environments, we also observe a drastic decrease when switching to random actions, as in classic Go-Explore. This underscores the substantial benefits IGE provides in harnessing FMs for action selection.

Finally, we note the need for intelligent archive filtering across all our environments. Not only does archive filtering improve performance, but it also drastically cuts down the number of uninteresting states in the archive, as shown in Table 3 (left). As we use rejection-based archive filtering on TextWorld, we quote the average size of the archive throughout each episode. In BabyAI-Text, we observe the archive becoming around 8× larger without filtering. These metrics demonstrate IGE’s innate ability to capture promising discoveries as they occur and focus attention on them, without the need for any manual heuristics.

What is the Effect of Foundation Model Choice? We also analyze the dependence of our algorithm on the strength of the foundation model by replacing GPT-4 with an earlier variant, GPT-3.5, in Table 3 (right). There is a considerable difference between the two, which suggests that our environments are non-trivial to solve, and that future advancements in foundation models are likely to readily scale the performance of IGE to even harder problems.

Table 3 (left): Number of states in the archive with and without archive filtering.

| Archive Filtering | Game of 24 | BabyAI (PN) | TextWorld (CG) |
| --- | --- | --- | --- |
| No Filter | 18.5 ± 3.2 | 203.5 ± 56.7 | 22.4 ± 15.3 |
| With Filter | 15.6 ± 2.3 | 25.5 ± 5.2 | 4.4 ± 2.8 |

Table 3 (right): Success rate (%) with different foundation models.

| Foundation Model | Game of 24 | BabyAI (PN) | TextWorld (CG) |
| --- | --- | --- | --- |
| GPT-4 | 100 ± 0 | 84 ± 14 | 92 ± 10 |
| GPT-3.5 | 57 ± 10 | 0 ± 0 | 0 ± 0 |

6 Related Work

FM-as-judge. We employ FM guidance at all stages of IGE to drive exploration. FMs as judges [46, 6] have already seen use in decision-making tasks: OMNI [44] considers FM guidance in multi-task settings to select the most promising next task to train on. However, focusing on the broader task could miss interesting behavior that happens at a more granular level, and thus IGE greatly expands the integration of FM intelligence into decision-making. RL from AI Feedback [1, 25, 21] considers training RL agents using reward functions derived from FM preferences. This similarly guides agents towards preferred states, but without the intelligence of FMs for action selection.

FM Agents. One of the key strengths of IGE is that it is agnostic to the precise agent formulation and thus strictly additive on top of a wide variety of strategies. A common strategy is chain-of-thought-based methods [43, 17], which prompt the FM to output a set of reasoning steps before the answer. We integrate this into the FM guidance in our experiments in Sections 4.2 and 4.3. Reflexion [33] enables an agent to improve over multiple episodes by asking it to reflect on the previously attempted episode and learn from its mistakes. However, we show this can break down in tasks with long horizons, whilst IGE proposes a more efficient way to filter out the vast majority of uninteresting interactions. Another set of agent frameworks related to the idea of exploring diverse solution paths via state connectivity comprises Tree of Thoughts [42] and Graph of Thoughts [3]. In contrast, IGE can exploit search strategies that are not tied to any connectivity between states and can readily jump across the archive of promising saved states. This is particularly important for long-horizon tasks with larger state spaces, as we show in Sections 4.2 and 4.3.

Closely related to exploration, FM agents have also begun to see use in search-based tasks. Stream of Search [16] considers a similar mathematical reasoning task to the Game of 24, and seeks to initially clone the actions of graph search algorithms and then use RL to self-improve. In contrast, IGE already greatly outperforms classic graph search; an exciting future direction could be to first clone the exploratory behavior of Go-Explore and then self-improve with RL, enabling the FM to learn to select better. Lehnert et al. [26] analogously train a language model to mimic the A* algorithm. Finally, Krishnamurthy et al. [22] also consider bootstrapping exploration with an externally summarized action history in bandit problems; our focus is more on the detection of interesting states.

Go-Explore. The original Go-Explore [13, 14] framework enabled superhuman performance in a variety of hard-exploration problems, including applications as diverse as automated game testing [29]. Gallouédec and Dellandréa [15] propose Latent Go-Explore, which similarly aims to address the difficulty of designing exploration heuristics by automatically learning a latent representation and sampling states with a low latent density. However, this requires periodic retraining and could easily miss rare serendipitous discoveries. HuGE [37] guides Go-Explore with humans in the loop by asking for pairwise feedback on which goal to select. In contrast, we take humans out of the loop entirely and apply intelligent FM guidance to all components of Go-Explore.

7 Conclusion and Limitations

In this paper, we demonstrate a new approach to robust exploration in complex environments, Intelligent Go-Explore, which reimagines Go-Explore in the era of giant foundation models. We show that IGE can drive exploration for a diverse set of FM agents, including few-shot and chain-of-thought prompting, across a variety of challenging text-based games. While we only evaluate IGE on simulated text-based environments in this paper, a particularly exciting direction for future work would be domains with multimodal search spaces. This could unlock applications as wide-ranging as scientific discovery in synthetic biology (designing novel drugs or proteins) or materials science. IGE could be readily adapted to these areas, as there is already precedent for multimodal FMs as judges [41]. A further direction that could break the limits of the current state-of-the-art in autonomous decision-making is the (hitherto unsolved by intelligent agents) dungeon crawler NetHack [23]. NetHack requires the discovery of complex strategies, deep game knowledge, and coherent behavior over an extremely long horizon. Küttler et al. [23] noted that for NetHack, classic Go-Explore’s “heuristic of downsampling images of states to measure their similarity to be used as an exploration bonus will likely not work for large symbolic and procedurally generated environments.” IGE represents a sharp departure from these limitations, replacing hard-coded and inflexible exploration heuristics with the dynamic intelligence of giant foundation models.

There remain exciting opportunities to improve IGE’s ability to explore vast state spaces. For example, we currently recall and compare against the entire archive whenever we discover a new state. This could be made much more efficient by using techniques like retrieval-augmented generation [27] and only comparing to the closest previously discovered states. As we consider IGE for real-world settings, we should take steps to ensure the responsible deployment of FMs [5]. Our approach opens up the road to safe and interpretable exploration: through careful prompt engineering or techniques like constitutional AI [1], we could steer the agent away from unsafe behaviors. Furthermore, if we ask or train the FM to explain its choices in each part of IGE, we could gain insight into its rationale for exploring particular paths through an environment [40, 17], improving safety, interpretability, and perhaps one day even our own understanding of how best to explore.

Acknowledgments and Disclosure of Funding

This work was supported by the Vector Institute, the Canada CIFAR AI Chairs program, grants from Schmidt Futures and Open Philanthropy, an NSERC Discovery Grant, and a generous donation from Rafael Cosman. We thank Aaron Dharna, Ben Norman, and Jenny Zhang from our lab at the University of British Columbia for insightful discussions and/or feedback on early drafts of this work.

References

  • Bai etal. [2022]Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, SheerEl Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, SamuelR. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan.Constitutional ai: Harmlessness from ai feedback, 2022.
  • Bellemare etal. [2013]M.G. Bellemare, Y.Naddaf, J.Veness, and M.Bowling.The arcade learning environment: An evaluation platform for general agents.Journal of Artificial Intelligence Research, 47:253–279, June 2013.ISSN 1076-9757.doi: 10.1613/jair.3912.URL http://dx.doi.org/10.1613/jair.3912.
  • Besta etal. [2024a]Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler.Graph of thoughts: Solving elaborate problems with large language models.Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024a.ISSN 2159-5399.doi: 10.1609/aaai.v38i16.29720.URL http://dx.doi.org/10.1609/aaai.v38i16.29720.
  • Besta etal. [2024b]Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, etal.Graph of thoughts: Solving elaborate problems with large language models.In Proceedings of the AAAI Conference on Artificial Intelligence, 2024b.
  • Bommasani etal. [2021]Rishi Bommasani, DrewA. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, MichaelS. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S.Buch, Dallas Card, Rodrigo Castellon, NiladriS. Chatterji, AnnieS. Chen, KathleenA. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, LiFei-Fei, Chelsea Finn, Trevor Gale, LaurenE. Gillespie, Karan Goel, NoahD. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, DanielE. Ho, Jenny Hong, Kyle Hsu, Jing Huang, ThomasF. Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, O.Khattab, PangWei Koh, MarkS. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, XiangLisa Li, Xuechen Li, Tengyu Ma, Ali Malik, ChristopherD. Manning, SuvirP. Mirchandani, Eric Mitchell, Zanele Munyikwa, SurajNair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie, JuanCarlos Niebles, Hamed Nilforoshan, J.F. Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, JoonSung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu Ren, Frieda Rong, YusufH. Roohani, Camilo Ruiz, Jack Ryan, Christopher R’e, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, KrishnaParasuram Srinivasan, Alex Tamkin, Rohan Taori, ArminW. Thomas, Florian Tramèr, RoseE. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, SangMichael Xie, Michihiro Yasunaga, Jiaxuan You, MateiA. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang.On the opportunities and risks of foundation models.ArXiv, 2021.URL https://crfm.stanford.edu/assets/report.pdf.
  • Bradley etal. [2023]Herbie Bradley, Andrew Dai, Hannah Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Grégory Schott, and Joel Lehman.Quality-diversity through ai feedback, 2023.
  • Brown etal. [2020]TomB. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, DanielM. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.Language models are few-shot learners, 2020.
  • Carta etal. [2023]Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer.Grounding large language models in interactive environments with online reinforcement learning.In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 3676–3713. PMLR, 23–29 Jul 2023.URL https://proceedings.mlr.press/v202/carta23a.html.
  • Chang etal. [2024]Yupeng Chang, XuWang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, etal.A survey on evaluation of large language models.ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
  • Cooper [2014]Seth Cooper.A framework for scientific discovery through video games.Morgan & Claypool, 2014.
  • Côté etal. [2018]Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, RuoYu Tao, Matthew Hausknecht, LaylaEl Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler.Textworld: A learning environment for text-based games.CoRR, abs/1806.11532, 2018.
  • Dong etal. [2022]Qingxiu Dong, Lei Li, Damai Dai, CeZheng, Zhiyong Wu, Baobao Chang, XuSun, Jingjing Xu, and Zhifang Sui.A survey on in-context learning.arXiv preprint arXiv:2301.00234, 2022.
  • Ecoffet etal. [2021a]Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth Stanley, and Jeff Clune.First return, then explore.Nature, 590:580–586, 02 2021a.doi: 10.1038/s41586-020-03157-9.
  • Ecoffet etal. [2021b]Adrien Ecoffet, Joost Huizinga, Joel Lehman, KennethO. Stanley, and Jeff Clune.Go-explore: a new approach for hard-exploration problems, 2021b.
  • Gallouédec and Dellandréa [2023]Quentin Gallouédec and Emmanuel Dellandréa.Cell-free latent go-explore, 2023.
  • Gandhi etal. [2024]Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and NoahD. Goodman.Stream of search (sos): Learning to search in language, 2024.
  • Hu and Clune [2024]Shengran Hu and Jeff Clune.Thought Cloning: Learning to think while acting by imitating human thinking.Advances in Neural Information Processing Systems, 36, 2024.
  • Huang etal. [2022]Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch.Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022.
  • Hussein etal. [2017]Ahmed Hussein, MohamedMedhat Gaber, Eyad Elyan, and Chrisina Jayne.Imitation learning: A survey of learning methods.ACM Comput. Surv., 50(2), apr 2017.ISSN 0360-0300.doi: 10.1145/3054912.URL https://doi.org/10.1145/3054912.
  • Jiang etal. [2023]Minqi Jiang, Tim Rocktäschel, and Edward Grefenstette.General intelligence requires rethinking exploration.Royal Society Open Science, 10(6):230539, 2023.
  • Klissarov etal. [2023]Martin Klissarov, Pierluca D’Oro, Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, and Mikael Henaff.Motif: Intrinsic motivation from artificial intelligence feedback, 2023.
  • Krishnamurthy etal. [2024]Akshay Krishnamurthy, Keegan Harris, DylanJ. Foster, Cyril Zhang, and Aleksandrs Slivkins.Can large language models explore in-context?, 2024.
  • Küttler etal. [2020]Heinrich Küttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel.The nethack learning environment.Advances in Neural Information Processing Systems, 33:7671–7684, 2020.
  • Ladosz etal. [2022]Pawel Ladosz, Lilian Weng, Minwoo Kim, and Hyondong Oh.Exploration in deep reinforcement learning: A survey.Information Fusion, 85:1–22, 2022.
  • Lee etal. [2024]Harrison Lee, Samrat Phatale, Hassan Mansoor, KellieRen Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi.RLAIF: Scaling reinforcement learning from human feedback with AI feedback, 2024.URL https://openreview.net/forum?id=AAxIs3D2ZZ.
  • Lehnert etal. [2024]Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul Mcvay, Michael Rabbat, and Yuandong Tian.Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024.
  • Lewis etal. [2020]Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela.Retrieval-augmented generation for knowledge-intensive nlp tasks.In H.Larochelle, M.Ranzato, R.Hadsell, M.F. Balcan, and H.Lin, editors, Advances in Neural Information Processing Systems, volume33, pages 9459–9474. Curran Associates, Inc., 2020.URL https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.
  • Liu etal. [2023]Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, YuGu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, YuSu, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang.Agentbench: Evaluating llms as agents, 2023.
  • Lu etal. [2024]Cong Lu, Raluca Georgescu, and Johan Verwey.Go-explore complex 3-d game environments for automated reachability testing.IEEE Transactions on Games, 16(1):235–240, 2024.doi: 10.1109/TG.2022.3228401.
  • Olsson etal. [2022]Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, etal.In-context learning and induction heads.arXiv preprint arXiv:2209.11895, 2022.
  • OpenAI [2024]OpenAI.Gpt-4 technical report, 2024.
  • Parisotto and Salakhutdinov [2018]Emilio Parisotto and Ruslan Salakhutdinov.Neural map: Structured memory for deep reinforcement learning.In International Conference on Learning Representations, 2018.URL https://openreview.net/forum?id=Bk9zbyZCZ.
  • Shinn etal. [2023]Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao.Reflexion: Language agents with verbal reinforcement learning, 2023.
  • Sutton and Barto [2018]RichardS. Sutton and AndrewG. Barto.Reinforcement Learning: An Introduction.The MIT Press, second edition, 2018.URL http://incompleteideas.net/book/the-book-2nd.html.
  • Talmor etal. [2019]Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant.CommonsenseQA: A question answering challenge targeting commonsense knowledge.In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.doi: 10.18653/v1/N19-1421.URL https://aclanthology.org/N19-1421.
  • Team [2024]Gemini Team.Gemini: A family of highly capable multimodal models, 2024.
  • TorneVillasevil etal. [2023]Marcel TorneVillasevil, Max Balsells IPamies, Zihan Wang, Samedh Desai, Tao Chen, Pulkit Agrawal, and Abhishek Gupta.Breadcrumbs to the goal: Goal-conditioned exploration from human-in-the-loop feedback.In A.Oh, T.Neumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine, editors, Advances in Neural Information Processing Systems, volume36, pages 63222–63258. Curran Associates, Inc., 2023.URL https://proceedings.neurips.cc/paper_files/paper/2023/file/c7c7cf10082e454b9662a686ce6f1b6f-Paper-Conference.pdf.
  • Touvron etal. [2023]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etal.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
  • Wang etal. [2024]Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, XuChen, Yankai Lin, etal.A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):1–26, 2024.
  • Wei etal. [2022]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, EdChi, QuocV Le, Denny Zhou, etal.Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022.
  • Wu etal. [2024]Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, and Lei Zhang.A comprehensive study of multimodal large language models for image quality assessment.arXiv preprint arXiv:2403.10854, 2024.
  • Yao etal. [2023a]Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, ThomasL. Griffiths, Yuan Cao, and Karthik Narasimhan.Tree of Thoughts: Deliberate problem solving with large language models, 2023a.
  • Yao etal. [2023b]Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, KarthikR Narasimhan, and Yuan Cao.React: Synergizing reasoning and acting in language models.In The Eleventh International Conference on Learning Representations, 2023b.URL https://openreview.net/forum?id=WE_vluYUL-X.
  • Zhang etal. [2024]Jenny Zhang, Joel Lehman, Kenneth Stanley, and Jeff Clune.OMNI: Open-endedness via models of human notions of interestingness.In The Twelfth International Conference on Learning Representations, 2024.URL https://openreview.net/forum?id=AgM3MzT99c.
  • Zhao etal. [2021]Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh.Calibrate before use: Improving few-shot performance of language models.In International conference on machine learning, pages 12697–12706. PMLR, 2021.
  • Zheng etal. [2023a]Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, ZiLin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, JosephE Gonzalez, and Ion Stoica.Judging llm-as-a-judge with mt-bench and chatbot arena.In A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine, editors, Advances in Neural Information Processing Systems, volume36, pages 46595–46623. Curran Associates, Inc., 2023a.URL https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.
  • Zheng etal. [2023b]Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, ZiLin, Zhuohan Li, Dacheng Li, EricP. Xing, Hao Zhang, JosephE. Gonzalez, and Ion Stoica.Judging llm-as-a-judge with mt-bench and chatbot arena, 2023b.
  • Zhu etal. [2023]Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023.
  • Zoubir and Iskander [2007] Abdelhak M. Zoubir and D. Robert Iskander. Bootstrap methods and applications. IEEE Signal Processing Magazine, 24(4):10–19, 2007.

Supplementary Material


Appendix A Algorithm Pseudocode

We provide full pseudocode for Intelligent Go-Explore in Algorithm 1. This complements the discussion in Section 3.

Algorithm 1: Intelligent Go-Explore (IGE)

1: Hyperparameters: no. of state expansions $N_{\text{state}}$, no. of exploratory actions $N_{\text{action}}$, foundation model $\mathcal{M}$
2: Initialize: archive of states $\mathcal{S}_{\text{archive}} = \emptyset$, state-conditional action history $\mathcal{A}(\cdot) = \emptyset$
3: $\mathcal{S}_{\text{archive}} \leftarrow s_0$   ▷ Add initial state to archive
4: for $i = 1, \dots, N_{\text{state}}$ do
5:   Query $\mathcal{M}$ for the next state $s_{i,1}$ from $\mathcal{S}_{\text{archive}}$   ▷ See Section 3.1
6:   for $j = 1, \dots, N_{\text{action}}$ do
7:     Query $\mathcal{M}$ for the next action $a_{i,j}$ from $s_{i,j}$, conditional on $\mathcal{A}(s_{i,j})$   ▷ See Section 3.2
8:     $s_{i,j+1} \sim P(s_{i,j}, a_{i,j})$, $\mathcal{A}(s_{i,j}) \leftarrow a_{i,j}$   ▷ Take action and update history
9:     if $\mathcal{M}$ determines that $s_{i,j+1}$ is interesting w.r.t. $\mathcal{S}_{\text{archive}}$ then   ▷ See Section 3.3
10:      $\mathcal{S}_{\text{archive}} \leftarrow s_{i,j+1}$
11:    end if
12:  end for
13: end for
14: Return best discovered trajectory
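
For readers who prefer code, a compact Python rendering of Algorithm 1 is sketched below; `select_state`, `select_action`, and `is_interesting` stand in for the three FM queries of Sections 3.1-3.3, and the restorable-snapshot environment interface is assumed as in Section 2.1. The open-sourced repository, not this sketch, is the authoritative implementation.

```python
def intelligent_go_explore(env, fm, n_state, n_action):
    """Sketch of the IGE loop: FM-guided state selection, exploration, and archiving."""
    obs = env.reset()
    archive = [(obs, env.get_state())]            # line 3: archive the initial state
    action_history = {}                           # state-conditional action history
    trajectories = []

    for _ in range(n_state):                      # lines 4-5: FM selects a state
        obs, snapshot = archive[select_state(fm, archive)]
        env.set_state(snapshot)                   # "go": restore in the simulator
        episode = []
        for _ in range(n_action):                 # lines 6-8: FM-guided exploration
            history = action_history.setdefault(obs, [])
            action = select_action(fm, obs, history)
            history.append(action)
            obs, reward, done, info = env.step(action)
            episode.append((action, obs, reward))
            # lines 9-11: FM judges whether the new state is interestingly new
            if is_interesting(fm, obs, archive):
                archive.append((obs, env.get_state()))
            if done:
                break
        trajectories.append(episode)

    # line 14: return the best trajectory found (here, by total reward)
    return max(trajectories, key=lambda ep: sum(r for _, _, r in ep))
```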

Appendix B Further Details on Environments

We provide further details for each of the environments used in the empirical evaluation in Section 4.

B.1 Game of 24

We use the environment and set of evaluation tasks from https://github.com/princeton-nlp/tree-of-thought-llm, which is released under the MIT License. We include the environment-specific prompt that is appended to the system prompt in Section 3 below. The system prompt contains examples of correct reasoning paths on different problems (few-shot prompting).

The action space at each step is all the valid arithmetic operations, presented analogously to the ‘propose’ step in Yao et al. [42].

B.2 BabyAI-Text

The BabyAI-Text [8] environment comes with five task types, which we list here and visualize in order in Figure 5:

  • Go to <object>, a simple navigation task that requires reasoning abilities to choose the right plan given the object’s position;

  • Pick up <object>, a reasoning task that combines navigation tasks;

  • Pick up <object A> then go to <object B> and Go to <object B> after pickup <object A>, both serving to test reasoning abilities on temporal sequences;

  • Unlock <door>, a task that includes inferring that a key is needed to unlock the door, finding the right key (i.e. the one colored as the door), and eventually using the toggle action with the key on the door;

  • Put <object A> next to <object B>, which requires first reaching <object A>, picking it up, reaching <object B> and finally dropping <object A> next to <object B>.

[Figure 5: Visualizations of the five BabyAI-Text task types, in the order listed above.]

We use the codebase from https://github.com/flowersteam/Grounding_LLMs_with_online_RL, which is released under the MIT License. The action space is discrete and composed of 6 possible actions: turn left, turn right, go forward, pick up, drop, and toggle. The ‘go to’ and ‘pick up’ tasks have a shorter environment horizon of $H=64$, whereas the rest have a horizon of $H=128$. We include the environment-specific prompt that is appended to the system prompt in Section 3 below.

B.3 TextWorld

We evaluate IGE on ‘Treasure Hunter’, ‘The Cooking Game’, and ‘Coin Collector’ from the TextWorld [11] domain. We use the environment code from https://github.com/microsoft/TextWorld, which is released under the MIT License.

B.3.1 Treasure Hunter

For Treasure Hunter, we set the ‘level’ option to the maximum value of 30, resulting in a maze with 20 rooms. Locked doors and containers are added, which may need to be unlocked and opened to find the target object. To further increase the difficulty, we remove the solution description from the original game and filter out tasks that can be completed within 20 steps of the optimal solution. We include the environment-specific prompt that is appended to the system prompt in Section 3 below.

B.3.2 The Cooking Game

In The Cooking Game, we set the number of ingredients to a maximum of 5 and the number of rooms to 13. We enable all challenging additional options: doors need to be opened, food must be processed (e.g., cut, diced, chopped with a knife) and cooked (e.g., grilled with a BBQ, fried on a stove, roasted in an oven). We include the environment-specific prompt that is appended to the system prompt in Section 3 below.

We show a successful example trajectory found by IGE below, from our evaluation in Section 4.3.

B.3.3 Coin Collector

In Coin Collector, we set the number of rooms to 40 and allow distractor rooms to be added along the way. Similar to Treasure Hunter, we remove the solution description from the original game, and the optimal path from the agent’s starting point to the target is set to 20 steps. We include the environment-specific prompt that is appended to the system prompt in Section 3 below.

Appendix C Further Prompt Discussion

C.1 Extracting Choices

By default, in Section 4.1, we prompt the FM to return a JSON object containing just the numerical index of the choice. We choose this because of the ease of parsing the response and validating that it lies within the correct bounds. An example prompt is displayed below.

When using chain-of-thought prompting, as in Section 4.2, we use the following prompt:

For the TextWorld environment in Section 4.3, since the action space is much larger, we ask the FM to directly output a text action that we automatically parse.

We use the regex “> (.*?)(?:|̇$)” (in Perl notation) to parse the command. We note that the failure rate for both of these options is very low, less than 0.1% across our evaluation. Despite this, we include a failsafe that returns a random choice in case of an invalid output.
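
A hedged sketch of the two parsing paths described above is given below; the exact regex and prompt formats in the released code may differ, and the fallback behavior mirrors the failsafe mentioned in the text.

```python
import json
import random
import re

def parse_choice(reply, num_options):
    """Parse a JSON reply of the form {"choice": <index>}; fall back to a random
    valid index if the reply is malformed or out of bounds."""
    try:
        idx = int(json.loads(reply)["choice"])
        if 0 <= idx < num_options:
            return idx
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass
    return random.randrange(num_options)           # failsafe: random valid choice

def parse_command(reply, fallback_actions):
    """Extract a free-form TextWorld command written as "> <command>"; the regex
    here is an approximation of the one quoted above, not a verbatim copy."""
    match = re.search(r"^> (.*?)\s*$", reply, flags=re.MULTILINE)
    if match:
        return match.group(1).strip()
    return random.choice(fallback_actions)         # failsafe: random valid action
```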

C.2 Rejection-based Archive Filtering

The ‘acceptance-based’ archive filter in Section 3.3 iterates through every new state and asks whether each one is interestingly new and should be added to the archive. This can break down in larger environments, where there is more need to explicitly deprecate earlier discoveries that have become irrelevant so as not to overload the archive, for example in Section 4.3. In this environment, we use an alternate version of the prompt, which first adds all states and then asks the foundation model to remove the uninteresting states. An example prompt is shown below.
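
A minimal sketch of this rejection-based variant follows, under the same hypothetical `fm.query` wrapper as the earlier sketches; the prompt wording is a paraphrase, not the actual prompt referenced above.

```python
import json

def rejection_filter(fm, archive, system_message):
    """Rejection-based filtering: all newly discovered states were already added;
    ask the FM which archived states are uninteresting and remove them."""
    listing = "\n".join(f"{i}: {obs}" for i, (obs, _) in enumerate(archive))
    prompt = (
        "Current archive of discovered states:\n"
        f"{listing}\n"
        "Reply with a JSON list of the indices of states that are uninteresting or "
        "redundant and should be removed from the archive."
    )
    reply = fm.query(system_message, prompt)       # hypothetical FM wrapper
    try:
        to_remove = {int(i) for i in json.loads(reply)}
    except (json.JSONDecodeError, TypeError, ValueError):
        to_remove = set()                          # on a parse failure, keep everything
    return [entry for i, entry in enumerate(archive) if i not in to_remove]
```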

Appendix D Hyperparameters

In this section, we provide the hyperparameters for our empirical evaluation in Section 4. We list the hyperparameters for IGE in Table 4. We choose the values for the exploratory rollout length based on the average number of steps needed to make ‘reasonable progress’ in the environment.

Table 4: IGE hyperparameters (TH = Treasure Hunter, TCG = The Cooking Game, CC = Coin Collector).

| Hyperparameter | Game of 24 | BabyAI-Text | TH | TCG | CC |
| --- | --- | --- | --- | --- | --- |
| No. state expansions, $N_{\text{state}}$ | 50 | 25 | 24 | 48 | 125 |
| No. exploratory actions, $N_{\text{action}}$ | 3 | 10 | 5 | 5 | 1 |

We list the sampling parameters for GPT-4 [31], passed via the OpenAI API, in Table 5.

Table 5: GPT-4 sampling parameters passed via the OpenAI API.

| Hyperparameter | Game of 24 | BabyAI-Text | TextWorld |
| --- | --- | --- | --- |
| Temperature | 0.7 | 0.7 | 0.3 |
| Max new tokens | 1000 | 1000 | 1000 |
| Response format | JSON Object | JSON Object | Text |
| Version | Turbo-2024-04-09 | o-2024-05-13 | o-2024-05-13 |

We used GPT-4-Turbo for Game of 24 and GPT-4o for BabyAI and TextWorld. This was purely done to select the version of GPT-4 that was available and the cheapest at the time of running the experiments. The version of GPT-4 is consistent per environment. We use a reduced temperature for the TextWorld domain to reduce the possibility of generating malformed responses, as actions are output in free-form natural language. In our ablations in Section 5, we use the ‘turbo-0125’ variant of GPT-3.5.

D.1 Cost of Experiments

We provide the average cost per task for our algorithm in each environment (the number of seeds is specified in Section 4):

| Environment | API Cost per task (USD) |
| --- | --- |
| Game of 24 | 1.04 |
| BabyAI-Text | 2.01 |
| TextWorld | 1.28 |

We note that the price per token of the ‘o-2024-05-13’ option is half that of ‘Turbo-2024-04-09’, so we could expect to achieve the same level of results on the Game of 24 at half the price. The total cost of API access required to perform the final experiments in this paper was under 2,000 USD. During development, we iterated on IGE with a smaller number of seeds, which adds a small fraction on top of this cost.
