Everyday tasks like grocery shopping, doing the dishes, and other small chores may seem effortless and instinctive to you. For a robot to carry out those same tasks seamlessly, however, it needs a detailed, comprehensive plan with precise instructions.
MIT’s Improbable AI Lab, part of the Computer Science and Artificial Intelligence Laboratory (CSAIL), has introduced a new multimodal framework named Compositional Foundation Models for Hierarchical Planning (HiP). The framework draws on three distinct foundation models, each trained on vast datasets for tasks such as image generation, language, and robotics, similar to the technology behind OpenAI’s GPT-4.
HiP stands out by using three separate foundation models, each trained on a different data modality, rather than a single model trained jointly on vision, language, and action data. Because each foundation model handles a distinct stage of the decision-making process, HiP does not need access to paired vision, language, and action data, which is expensive to collect, and its decision-making is easier to inspect.
In embodied agent planning, coordinating language, visual, and action data can be costly and complex. Where previous efforts struggled to integrate these diverse modalities effectively, HiP offers an approach that combines linguistic, visual, and environmental intelligence.
According to NVIDIA AI researcher Jim Fan, foundation models do not have to be monolithic. HiP’s three main models, a language reasoner, a physical world model, and an action planner, collaborate to make complex decision-making more manageable and transparent.
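To make that division of labor concrete, here is a minimal Python sketch of how three such components might be chained together. The class names, method signatures, and string-based plan representations below are illustrative assumptions for this article, not HiP’s actual code or interfaces.

```python
# Illustrative sketch of a three-model hierarchical planner in the spirit of HiP.
# All names and interfaces here are hypothetical stand-ins, not HiP's real API.
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Placeholder for the robot's current sensory state (e.g., an image)."""
    description: str


class LanguageReasoner:
    """Stands in for the LLM that splits a goal into symbolic sub-goals."""
    def decompose(self, goal: str) -> List[str]:
        # A real system would prompt a pretrained language model here.
        return [f"step {i + 1} toward: {goal}" for i in range(3)]


class PhysicalWorldModel:
    """Stands in for the visual model that grounds a sub-goal in the scene."""
    def imagine(self, subgoal: str, obs: Observation) -> str:
        # A real system would predict how the environment should evolve.
        return f"predicted trajectory for '{subgoal}' from '{obs.description}'"


class ActionPlanner:
    """Stands in for the action model that turns a trajectory into commands."""
    def infer_actions(self, trajectory: str) -> List[str]:
        return [f"execute: {trajectory}"]


def hierarchical_plan(goal: str, obs: Observation) -> List[str]:
    """Chain the three models: language -> physical world -> action."""
    reasoner, world, planner = LanguageReasoner(), PhysicalWorldModel(), ActionPlanner()
    actions: List[str] = []
    for subgoal in reasoner.decompose(goal):
        trajectory = world.imagine(subgoal, obs)
        actions.extend(planner.infer_actions(trajectory))
    return actions


if __name__ == "__main__":
    for action in hierarchical_plan("stack the red block on the blue block",
                                    Observation("tabletop with two blocks")):
        print(action)
```

Keeping the three stages behind separate interfaces like this is what lets each one be swapped for a different pretrained model without retraining the others.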
The HiP system aims to help robots perform everyday tasks such as putting items away and completing household chores like turning on lights or operating appliances. HiP also shows promise for multistep manufacturing and design tasks, such as sorting and arranging materials efficiently.
The effectiveness of HiP was demonstrated through various manipulation tasks, where the system showcased adaptability and responsiveness to changing scenarios. By outperforming existing planning techniques like Transformer BC and Action Diffuser, HiP proved its ability to adjust plans dynamically based on new information.
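One way to picture that dynamic adjustment is a simple closed loop that replans whenever a new observation contradicts the plan in progress. The sketch below only illustrates the control flow; the plan() and observe() stubs are hypothetical stand-ins, not part of HiP.

```python
# Minimal closed-loop sketch of replanning when the environment changes.
# The plan() and observe() functions are placeholder stand-ins used only to
# illustrate the control flow, not HiP's actual interfaces.
from typing import List


def plan(goal: str, observation: str) -> List[str]:
    """Placeholder for a full hierarchical planner."""
    return [f"{goal}: action {i + 1} given '{observation}'" for i in range(3)]


def observe(step: int, current: str) -> str:
    """Placeholder for reading the robot's sensors after each action."""
    return "a spill appears on the counter" if step == 0 else current


def execute_with_replanning(goal: str) -> None:
    observation = "clean counter"
    actions = plan(goal, observation)
    step = 0
    while step < len(actions):
        print("executing:", actions[step])
        new_observation = observe(step, observation)
        if new_observation != observation:
            # New information arrived: keep what has been done and rebuild
            # the rest of the plan from the updated observation.
            observation = new_observation
            actions = actions[: step + 1] + plan(goal, observation)[step + 1:]
        step += 1


if __name__ == "__main__":
    execute_with_replanning("wipe down the kitchen")
```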
HiP’s three-pronged planning structure pre-trains each component on a different dataset so the models can work in concert. Planning begins with a large language model (LLM) that captures the symbolic information a task requires and generates an abstract plan; drawing on the commonsense knowledge it absorbed from internet text, the LLM breaks the overarching goal into actionable sub-goals.
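As a rough illustration of that first stage, the sketch below prompts a language model to split a household goal into a numbered list of sub-goals and parses the result. The prompt format, the call_llm stub, and the parsing logic are assumptions made for this example, not HiP’s actual prompting scheme.

```python
# Hypothetical sketch of using an LLM to break a household goal into sub-goals,
# as the first stage of a HiP-style planner. The prompt, the call_llm stub,
# and the parsing are illustrative assumptions.
from typing import List

PROMPT_TEMPLATE = (
    "Task: {goal}\n"
    "List the sub-goals needed to complete this task, one per line, "
    "numbered 1., 2., 3., ...\n"
)


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a pretrained language model."""
    # A real system would query an LLM API or a local model here.
    return "1. open the cabinet\n2. take out a pot\n3. fill the pot with water"


def decompose_goal(goal: str) -> List[str]:
    """Ask the LLM for sub-goals and parse the numbered list it returns."""
    response = call_llm(PROMPT_TEMPLATE.format(goal=goal))
    subgoals = []
    for line in response.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            # Drop the leading "N." marker to keep just the sub-goal text.
            subgoals.append(line.split(".", 1)[1].strip())
    return subgoals


if __name__ == "__main__":
    print(decompose_goal("boil water for tea"))
```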
Through iterative refinement, HiP incorporates feedback at each stage to improve the overall plan, much as an editor refines a draft before finalizing it. A visual model complements the LLM’s initial plan by grounding it in the physical environment, giving the system the understanding it needs to execute tasks precisely.
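One common way to realize such stage-wise feedback, sketched below under the assumption that something similar applies here, is to sample several candidate sub-plans and keep the one the downstream visual stage scores as most consistent with the current observation. The candidate generator and scoring function are placeholder stand-ins rather than HiP’s actual models.

```python
# Sketch of stage-wise iterative refinement: sample candidate sub-plans and
# keep the one the downstream (visual) stage rates as most consistent with the
# current observation. Both functions below are illustrative stand-ins.
import random
from typing import Callable, List


def propose_candidates(subgoal: str, n: int = 4) -> List[str]:
    """Placeholder: a real system would sample n plans from a generative model."""
    return [f"{subgoal} (variant {i})" for i in range(n)]


def downstream_consistency(candidate: str, observation: str) -> float:
    """Placeholder: a real system would score the candidate with the next
    stage's model, e.g. the likelihood of a predicted trajectory given the scene."""
    random.seed(hash((candidate, observation)) % (2 ** 32))
    return random.random()


def refine(subgoal: str, observation: str,
           score: Callable[[str, str], float] = downstream_consistency) -> str:
    """Pick the candidate the downstream stage rates as most consistent."""
    candidates = propose_candidates(subgoal)
    return max(candidates, key=lambda c: score(c, observation))


if __name__ == "__main__":
    print(refine("pick up the mug", "mug on the left edge of the table"))
```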
HiP’s current implementation is constrained by the limited quality of available visual foundation models; integrating more advanced models in the future could improve its forecasting of physical sequences and its generation of robot actions. Even so, HiP’s training requirements are minimal, making it a cost-effective way to plan long-horizon tasks with readily available foundation models.
In conclusion, HiP represents a significant advancement in the field of robotic planning by seamlessly integrating diverse data modalities and pre-trained models. The potential applications of HiP extend beyond household tasks to complex real-world challenges, showcasing its versatility and scalability in various domains.