Researchers at the Massachusetts Institute of Technology (MIT) have developed a system that allows a user to request an object through ordinary speech and receive a physical version of that object within minutes.
The project, known as Speech-to-Reality, links natural language recognition, three-dimensional generative design tools, and robotic assembly. The team describes it as a step toward faster and more accessible production methods that require no technical knowledge from the user.
According to the research paper, when a user gives a request such as “I want a simple stool,” the system captures the instruction and moves it through several stages. The spoken sentence is analyzed, a digital outline is generated, and that outline is converted into a three-dimensional structure that can be prepared for assembly.
The team has already used the approach to produce stools, tables, chairs, multi-tier shelves, the letter T, and decorative pieces such as a dog-shaped figure. All of the objects are assembled from modular parts that can be attached, removed, and reused without creating waste.
Alexander Htet Kyaw, a graduate researcher at MIT and member of the Morningside Academy for Design, who led the early development of the project, said the work brings together several fast-moving research areas that rarely meet in practical applications.
“We’re connecting natural language processing, 3D generative AI, and robotic assembly,” he said. “These are rapidly advancing areas of research that haven’t been brought together before in a way that you can actually make physical objects just from a simple speech prompt.”
How the System Turns Spoken Requests Into Physical Structures
The process begins with speech recognition, followed by a large language model that evaluates the spoken sentence and clarifies what object the user intends to produce. Afterward, a generative model creates a three-dimensional mesh of the object, and that mesh is then converted through a voxel-based process into discrete pieces that the robotic equipment can pick up and place.
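The stages described in the paragraph above can be sketched as a simple pipeline. This is an illustrative sketch only, not the MIT code: every function name and data structure here is hypothetical, and real speech recognition, language-model, and mesh-generation components are replaced with trivial stand-ins.

```python
# Hypothetical sketch of a speech-to-voxel pipeline; all names are invented
# for illustration and do not reflect the actual Speech-to-Reality system.

def recognize_speech(audio: str) -> str:
    # Stand-in for the speech-recognition step; here the "audio" is already text.
    return audio.strip()

def interpret_request(utterance: str) -> str:
    # Stand-in for the large language model that clarifies which object
    # the user intends to produce.
    known_objects = {"stool", "table", "chair", "shelf"}
    for word in utterance.lower().split():
        if word in known_objects:
            return word
    return "unknown"

def generate_mesh(object_name: str) -> list[tuple[float, float, float]]:
    # Stand-in for the 3D generative model: returns the eight vertices
    # of a unit cube as a toy "mesh".
    return [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]

def voxelize(vertices, resolution=1.0):
    # Snap mesh vertices onto a voxel grid: each voxel corresponds to one
    # modular cube the robotic arm can pick up and place.
    return sorted({(int(x // resolution), int(y // resolution), int(z // resolution))
                   for x, y, z in vertices})

request = recognize_speech("I want a simple stool")
target = interpret_request(request)        # identifies "stool"
voxels = voxelize(generate_mesh(target))   # discrete pick-and-place units
print(target, len(voxels))
```

The point of the sketch is the hand-off between stages: free-form speech is narrowed to a concrete object, the object becomes continuous geometry, and the geometry is discretized into units a robot can manipulate.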
The digital design is then reviewed against geometric rules that assess whether the structure will remain stable once assembled, taking into account the number of available components, the risk of unsupported overhangs, and whether each part can connect securely. Once the design meets those conditions, a planning system sets the assembly sequence and maps out the robotic arm’s movements before construction begins.
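The feasibility checks and sequencing described above can be illustrated as rules over a set of voxels. This is a minimal sketch under assumed simplifications (cubes on an integer grid, support meaning a cube directly below), not the researchers' actual planner.

```python
# Hypothetical illustration of geometric feasibility checks and bottom-up
# assembly sequencing for a voxel design; not the actual MIT planner.

def is_supported(voxels: set) -> bool:
    # Every cube above the ground layer must rest on another cube,
    # i.e. the design has no unsupported overhangs.
    return all(z == 0 or (x, y, z - 1) in voxels for x, y, z in voxels)

def within_inventory(voxels: set, available: int) -> bool:
    # The design cannot call for more modular cubes than are on hand.
    return len(voxels) <= available

def assembly_sequence(voxels: set):
    # Order placements bottom-up, layer by layer, so each cube is
    # already supported when the robotic arm sets it down.
    return sorted(voxels, key=lambda v: (v[2], v[0], v[1]))

# Example design: a 2x2 base with one cube on top -- stable and buildable.
design = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)}
if is_supported(design) and within_inventory(design, available=20):
    plan = assembly_sequence(design)  # ground layer first, upper cube last
```

A floating cube such as `{(0, 0, 1)}` would fail the support check, which is the kind of design the geometric review is meant to reject before any robot motion is planned.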

The researchers say the use of modular cubes provides a path to faster production than conventional three-dimensional printing, which often requires many hours to complete a single object. They also note that the modular approach supports repeated reuse. Items created for one purpose can be dismantled and reassembled into different forms, which the team sees as an important step toward reducing waste in small-scale manufacturing.
What Comes Next for Speech-to-Reality
The team is now working to strengthen the objects produced through the system. The current connections between the modular components rely on magnets, which limit the weight-bearing ability of the furniture. The group intends to adopt stronger joining methods that still allow the pieces to be attached and removed quickly.
Miana Smith, a graduate student at the MIT Center for Bits and Atoms, said the team is also studying ways to scale the method so it can operate at many sizes: “We’ve also developed pipelines for converting voxel structures into feasible assembly sequences for small, distributed mobile robots, which could help translate this work to structures at any size scale.”
Kyaw, who has prior experience using gesture recognition and augmented reality to control robotics during fabrication, is preparing to merge speech and gesture input. He said this combination could make the interaction more fluid and allow people to guide the assembly process in a more intuitive way.
Part of the long-term vision is to let personal objects be repurposed repeatedly, with items created for one need taken apart and rebuilt for another using the same parts. Kyaw described the goal by referencing fictional technologies that can produce items instantly, saying he drew on ideas from the replicator in Star Trek and the robotic systems in the animated film Big Hero 6.
“I want to increase access for people to make physical objects in a fast, accessible, and sustainable manner,” he said. “I’m working towards a future where the very essence of matter is truly in your control. One where reality can be generated on demand.”