How Autonomous Agents Work: The Concept and Its LangChain Implementation

Jayesh
9 min read · May 24, 2023
Photo by Simon Wilkes on Unsplash

One use of LLMs that has gained a lot of attention is autonomous agents. You must have seen them in action in popular apps like Auto-GPT, BabyAGI, and the many others that come out every other day.

Most of these apps can be understood well by knowing the underlying flow, which is more or less the same across all of them.

😯 Intuitive idea behind using agents

Let’s begin by understanding the flow at a high level first; we’ll introduce the concepts as the need arises.

The idea behind using agents is simple. We already have models that understand human language and can reason decently well. These include models like GPT and other open-source alternatives. So far, these have been limited to just “talking” about how things can be done, whether that's writing the code that builds your app or listing the steps to set up a charity fund. What if we could provide a way for them to also act on this intelligence?

From What is an API? (redhat.com)

In the world of software engineering, any such action is traditionally enabled by what we call APIs (Application Programming Interfaces). These APIs expose the functionality of an app or service that performs an action to other software or to a front-end (for human use). This leads us to think that the model could potentially act on its knowledge if it could use these APIs too. But how do you make a model aware of which APIs are available and, more importantly, what each API can do?

You simply tell it!

An example representation for visual understanding

Thanks to the great natural language understanding of LLMs, it is easy to communicate what an API does and under what circumstances it should be used.

In such a context, all an LLM has to do is output the name of the API to call and the input it should be given. We can then design logic around it that takes this decision, calls the downstream API with the right arguments and returns the output to the LLM for further processing (it would then either call another API or return a natural-language answer to the user). As you can imagine, this process repeats until the loop ends with a final, human-friendly message.
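To make this concrete, here is a minimal sketch of such a loop in plain Python. Nothing here is LangChain-specific; the function names and the JSON protocol are made up purely for illustration.

import json

def run_agent(user_query: str, llm, apis: dict) -> str:
    """Illustrative loop: `llm` maps a prompt string to a text completion,
    `apis` maps an API name to a callable that performs the real action."""
    context = user_query
    while True:
        # Ask the model which API to call next, or for a final answer.
        response = llm(
            f"Available APIs: {list(apis)}\n"
            f"Context so far:\n{context}\n"
            'Reply with JSON {"action": ..., "action_input": ...} '
            'or {"final_answer": ...}.'
        )
        decision = json.loads(response)
        if "final_answer" in decision:
            # The model is done; hand its message back to the user.
            return decision["final_answer"]
        # Otherwise, call the chosen API and feed the result back in.
        result = apis[decision["action"]](decision["action_input"])
        context += f"\nObservation from {decision['action']}: {result}"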

🌊 Diving deeper

You can think of the above situation as a combination of multiple problems that can each be solved independently. Let’s go through the flow again but this time, we’ll be more technical, introducing concepts along the way as they come.

Roughly speaking, the flow above can be broken down into a play between the following components:

  • The LLM Model
  • APIs
  • An orchestrator
Components in the high-level flow

Each of the components above is represented by an abstraction in LangChain (an open-source LLM framework). An abstraction is just an interface, a way to formalize the responsibilities and attributes of an entity. We’ll go through each of them and understand the benefits that such a construct can offer.

🤖 Agents

Agents are essentially wrappers around an LLM and can be called with an input that needs to be passed to the model. They are built on top of something called an LLM chain, which is another word for a pipeline containing the model and some additional elements like a prompt template. If these terms sound alien to you, I’d recommend giving the LangChain documentation on them a read. However, this knowledge is not essential to understanding how agents work, and you can safely skip it as far as the rest of the article is concerned.

So, what is special about agents, and why do you need them instead of just an LLM chain? An agent defines some additional attributes that make execution easier:

  • A way to control the output of the model through pre-defined prompts that are packaged together with the agent. Below is an example of a prompt from a chat agent that conditions the model to return its output in a format that can be parsed later by functions in that agent or further downstream.
# flake8: noqa
SYSTEM_MESSAGE_PREFIX = """Answer the following questions as best you can. You have access to the following tools:"""
FORMAT_INSTRUCTIONS = """The way you use the tools is by specifying a json blob.
Specifically, this json should have a `action` key (with the name of the tool to use) and a `action_input` key (with the input to the tool going here).

The only values that should be in the "action" field are: {tool_names}

The $JSON_BLOB should only contain a SINGLE action, do NOT return a list of multiple actions. Here is an example of a valid $JSON_BLOB:

```
{{{{
"action": $TOOL_NAME,
"action_input": $INPUT
}}}}
```

ALWAYS use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action:
```
$JSON_BLOB
```
Observation: the result of the action
... (this Thought/Action/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question"""
SYSTEM_MESSAGE_SUFFIX = """Begin! Reminder to always use the exact characters `Final Answer` when responding."""
HUMAN_MESSAGE = "{input}\n\n{agent_scratchpad}"
  • Functions that take in the intermediate steps and construct a “scratchpad” (the accumulated context) from them, which is passed along with the prompt above, allowing the model to “think” through the whole process or objective.
def get_full_inputs(
    self, intermediate_steps: List[Tuple[AgentAction, str]], **kwargs: Any
) -> Dict[str, Any]:
    """Create the full inputs for the LLMChain from intermediate steps."""
    thoughts = self._construct_scratchpad(intermediate_steps)
    new_inputs = {"agent_scratchpad": thoughts, "stop": self._stop}
    full_inputs = {**kwargs, **new_inputs}
    return full_inputs

The function above, defined in the Agent class, builds the input to be passed to the model: a scratchpad of thoughts constructed in a way that complies with the initial prompt specific to this agent.
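For reference, here is a simplified sketch of what _construct_scratchpad could look like (an approximation for illustration, not necessarily LangChain’s exact code): it stitches every past (action, observation) pair into a running log of “thoughts”.

def _construct_scratchpad(
    self, intermediate_steps: List[Tuple[AgentAction, str]]
) -> str:
    """Sketch: turn past (action, observation) pairs into a thought log."""
    thoughts = ""
    for action, observation in intermediate_steps:
        thoughts += action.log  # the model's reasoning and chosen action
        thoughts += f"\nObservation: {observation}\nThought: "
    return thoughts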

  • A list of tools that a particular flavor of agent can work with, which helps with validation later.
@classmethod
def _validate_tools(cls, tools: Sequence[BaseTool]) -> None:
    if len(tools) != 2:
        raise ValueError(f"Exactly two tools must be specified, but got {tools}")
    tool_names = {tool.name for tool in tools}
    if tool_names != {"Lookup", "Search"}:
        raise ValueError(
            f"Tool names should be Lookup and Search, got {tool_names}"
        )

Here’s a code snippet from an agent definition in LangChain that validates that the tools passed to the agent are called “Lookup” and “Search”, as expected for an agent of this specific type.

  • Schemas like AgentAction and AgentFinish, which are part of LangChain and help distinguish between the categories of responses. For example, a response of type AgentFinish indicates that the agent has reached its conclusion. A sketch of how these objects are produced follows below.
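To see how these schemas come into play, here is a simplified sketch of an output parser that turns the model’s raw text into one of the two objects, following the chat-agent format shown in the prompt earlier. The parsing here is deliberately naive; LangChain’s real parsers are more robust.

import json

from langchain.schema import AgentAction, AgentFinish

def parse_model_output(text: str):
    """Naive sketch of turning model text into AgentAction or AgentFinish."""
    if "Final Answer:" in text:
        # The model has declared its conclusion, so the loop should stop.
        answer = text.split("Final Answer:")[-1].strip()
        return AgentFinish(return_values={"output": answer}, log=text)
    # Otherwise, pull the JSON blob out of the fenced block in the response.
    blob = text.split("```")[1].strip()
    decision = json.loads(blob)
    return AgentAction(
        tool=decision["action"], tool_input=decision["action_input"], log=text
    )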

Essentially, all of these points add on top of the LLM chain in useful ways to enable a flow where you can iteratively ask a model to achieve an objective that you define.

🧑‍💻 APIs

This doesn’t need a lot of explanation. APIs are what enable the execution of a variety of tasks through external services. In LangChain, you will find a Tool abstraction that lets you define a function which can be called with an input; internally, it uses whatever custom logic is needed to call the external service, get the output back and return it.

class BingSearchRun(BaseTool):
    """Tool that adds the capability to query the Bing search API."""

    name = "Bing Search"
    description = (
        "A wrapper around Bing Search. "
        "Useful for when you need to answer questions about current events. "
        "Input should be a search query."
    )
    api_wrapper: BingSearchAPIWrapper

    def _run(self, query: str) -> str:
        """Use the tool."""
        return self.api_wrapper.run(query)

    async def _arun(self, query: str) -> str:
        """Use the tool asynchronously."""
        raise NotImplementedError("BingSearchRun does not support async")

The code above shows the Bing Search tool. It has the following characteristics:

  • A _run function that contains the logic executed when the tool is used. You can see that it calls the BingSearchAPIWrapper’s run function, which is where the API calls to Bing happen.
  • Every tool comes with a natural language description that lays out the cases in which the tool should be used. This helps the model decide when to choose one tool over another. Plugins in ChatGPT rely on a similar idea. Defining your own tool follows the same pattern, as sketched below.
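Here is a minimal sketch of a custom tool using LangChain’s Tool class; the weather function and its wording are hypothetical placeholders.

from langchain.agents import Tool

def get_weather(city: str) -> str:
    # Hypothetical placeholder: call your weather service of choice here.
    return f"It is currently sunny and 24°C in {city}."

weather_tool = Tool(
    name="Weather",
    func=get_weather,
    description=(
        "Useful for answering questions about the current weather in a city. "
        "Input should be a city name."
    ),
)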

🧙 An Orchestrator

What I mean by orchestrator is something that can control the flow of execution between the agent, the user and the tools. This would involve:

  • Taking the input from the user.
  • Passing it to the agent (model) along with the right inputs, prompt and past memory.
  • Getting the output from it which indicates which tool to use with what inputs.
  • Calling the required tool or API with that input, getting the response and returning it to the model.
  • Taking the natural language response from the model based on the tool output and presenting it to the user.

The AgentExecutor abstraction in LangChain is our orchestrator; it performs all of the tasks above.

It is initialized with the flavor of agent you want and the tools that you want the agent to be able to select from.

from langchain.agents import AgentType, Tool, initialize_agent

# `search` (a search wrapper), `llm` and `manager` are assumed to be defined earlier.
tools = [
    Tool(
        name="Current Search",
        func=search.run,
        description=(
            "useful for when you need to answer questions about current events "
            "or the current state of the world. the input to this should be a "
            "single search term."
        ),
    ),
]

agent_executor = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    callback_manager=manager,
    verbose=True,
)
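Once initialized, you hand it a query and the executor drives the whole loop for you. A usage sketch (assuming the llm, search and manager objects above have been set up; the question is just an example):

# The executor handles the agent/tool back-and-forth and returns the
# final natural-language answer.
answer = agent_executor.run(
    "What major tech conferences are happening this month?"
)
print(answer)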

An AgentExecutor does the following:

  • Functions that drive the loop of agent and tool interaction. The code below shows the _call function, which sets off the loop, and the _take_next_step function, which returns the output from the agent.
def _call(self, inputs: Dict[str, str]) -> Dict[str, Any]:
    """Run text through and get agent response."""
    ...
    # We now enter the agent loop (until it returns something).
    while self._should_continue(iterations, time_elapsed):
        next_step_output = self._take_next_step(
            name_to_tool_map, color_mapping, inputs, intermediate_steps
        )


def _take_next_step(
    ...
) -> Union[AgentFinish, List[Tuple[AgentAction, str]]]:
    """Take a single step in the thought-action-observation loop.

    Override this to take control of how the agent makes and acts on choices.
    """
    # Call the LLM to see what to do.
    output = self.agent.plan(intermediate_steps, **inputs)
    # If the tool chosen is the finishing tool, then we end and return.
    if isinstance(output, AgentFinish):
        return output
    ...
  • Saves intermediate steps and makes them available to the agent (model) on every request, which the model then uses to answer the question.
  • Validates the tools provided against those supported or required by the selected agent.

✨ Conclusion

Knowing these concepts, you can now understand the multiple agent types that are available with LangChain and can start experimenting with different tools to see how all of that plays out.

You can also take this understanding and apply it elsewhere in the newer agent-based applications that we’ve seen recently.

For a brief rundown of the other tools, useful for understanding how they work and potentially modifying their behavior as you need, I’d recommend this blog by LangChain. It lists features and some implementation details for popular apps like Auto-GPT, BabyAGI and more! Happy building 🥳🚀🤘

👋 Connect with me through this link: bio.link/wjayesh
