Realizing Magical 2D to 3D Transformation with Artificial Intelligence

Nitish Bhardwaj
7 min read · Jul 2, 2021


We live in a world where every solid object has a 3D shape, yet our eyes capture only 2D images of those objects. The brain combines these 2D pictures to give us a perception of a 3D world. To visualize and understand products properly, however, we often need actual 3D models. Demand for 3D models is high for enhanced viewing experiences in mobile applications, games, movies, VR (Virtual Reality), AR (Augmented Reality), and MR (Mixed Reality). Recently, Microsoft launched Mesh, a mixed-reality platform that allows people in different physical locations to join and work in a collaborative 3D environment. The global 3D mapping and modeling market is expected to grow to 7.6 billion by 2025 (a CAGR of 15%). Technological advancements in 3D scanners, 3D sensors, and depth cameras have also contributed to the availability of 3D content. Many companies, such as Autodesk and Dassault Systèmes, have built CAD tools to create 3D models for different applications.

Visualizing 3D world with VR / AR devices

Most processes in 3D modeling are still manual and require expertise, for example sketch-to-3D conversion, defining and extracting constraints on 3D models, and handling different file formats. Just as AI has changed many other sectors, it has entered the 3D modeling and design world to help designers and other key players at various steps of 3D design. Google’s 3D Scene Understanding predicts objects in a 3D scene using techniques based on Deep Neural Networks. Facebook’s PyTorch3D is an open-source, highly modular, and optimized library designed to make 3D deep learning easier with PyTorch. NVIDIA’s Kaolin is another library, intended to help the research community working in 3D computer vision. Given the high demand for 3D models, one of the challenging problems in this space is converting a 2D image into a 3D model.

In this multi-part blog series, I’ll share my experiences and learnings. In this first blog post, let me introduce you to the different approaches to convert a 2D image to 3D.

Image-to-3D conversion is an important problem because it has applications in many areas, for example 3D views of products on e-commerce platforms, more effective design brainstorming sessions with 3D views of objects, and better visuals from phone camera pictures. A 2D image lacks depth information and details about the parts of the scene hidden from the camera. Today, 3D designers create 3D models by hand in CAD tools, relying on their 3D modeling skills. With recent developments in Deep Neural Network-based generative models, multiple approaches have also emerged for automating image-to-3D conversion.

In this post, I’ll first talk about the different approaches to 3D modeling. Next, I’ll list references for some of the AI solutions. Lastly, I’ll cover popular 3D representations.

Broadly, there are three approaches to 3D modeling:

  • Drawing Image in 3D environment
  • Geometric approach to generate 3D model
  • Deep Neural network-based AI approaches to generate 3D from Images

Drawing Image in 3D environment

With high-end graphics cards that can render with high precision, CAD vendors such as Autodesk and Dassault Systèmes (CATIA) have developed many tools that give designers a lot of flexibility in 3D design. Designers can draw with built-in shapes and brushes in a 3D environment and render the drawing directly as a 3D model. Examples include Autodesk’s Sketchbook, CATIA’s Natural Image, and Blender’s Grease Pencil tool. These tools really help expert designers, but using them requires considerable knowledge and expertise.

Geometric approach to generate 3D model

Some CAD tools also offer the option of uploading multiple images and converting them into a 3D model. This process takes several viewpoints of an object and uses a geometric approach to create a 3D object. An example is the computer graphics pipeline consisting of camera calibration, depth estimation, registration, and material application. Many tools can be used for this purpose. However, the process is limited by its need for multiple images and viewpoint information. In addition, the generated output is often poor for non-symmetrical or non-uniform objects.
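To make the depth estimation step more concrete, here is a minimal sketch of the simplest multi-view case: a rectified stereo pair, where depth follows from Z = f * B / d (focal length times baseline, divided by disparity). It is not taken from any of the tools mentioned above, and the function name and the calibration numbers are assumptions chosen purely for illustration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (in metres) of a point seen in a rectified stereo pair.

    Assumes two identical, calibrated pinhole cameras with parallel image
    planes, separated horizontally by baseline_m.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible point")
    return focal_px * baseline_m / disparity_px


# Hypothetical calibration values, for illustration only.
focal_px = 800.0    # focal length in pixels (from camera calibration)
baseline_m = 0.1    # distance between the two camera centres, in metres

# The same object point appears at x = 420 px in the left image and
# x = 380 px in the right image, so its disparity is 40 px.
disparity_px = 420.0 - 380.0
depth_m = depth_from_disparity(focal_px, baseline_m, disparity_px)
print(f"estimated depth: {depth_m:.2f} m")
```

Nearby points have large disparity and distant points have small disparity, which is one reason the geometric approach depends so heavily on accurate calibration and on finding the same point in more than one image.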

Deep Neural network-based AI approaches to generate 3D from Images

With their ability to learn representations, Deep Neural Networks have been a useful and successful tool for generating images and text, and they now extend to 3D models. Learning-based 3D reconstruction approaches can be categorized by the 3D representation they produce, such as voxels, point clouds, meshes, or implicit representations.

Some of the state-of-the-art solutions for 3D reconstruction are as follows:

Model | Publication | Dataset | Code | Paper | Output
3D-GAN | NeurIPS 2016 | ModelNet, IKEA | Link | Link | 3D voxels
MarrNet | NeurIPS 2017 | Pascal3D+, IKEA | Link | Link | 3D voxels
Pixel2Mesh | ECCV 2018 | ShapeNet | Link | Link | 3D meshes
ShapeHD | ECCV 2018 | Pascal3D+, Pix3D | Link | Link | 3D voxels
GenRe | NeurIPS 2018 | ShapeNet, Pix3D | Link | Link | 3D voxels
Pixel2Mesh++ | ICCV 2019 | ShapeNet | Link | Link | 3D meshes
DISN | NeurIPS 2019 | ShapeNet | Link | Link | 3D meshes
IM-NET | CVPR 2019 | ShapeNet | Link | Link | Implicit Representation
DVR | CVPR 2020 | ShapeNet | Link | Link | Implicit Representation

Most of these models are based on generative networks. Given single-view or multi-view input images, the network tries to learn the 3D features of the object in the image. Learning and generating good-quality 3D models is easier for simple objects seen from multiple viewpoints. Generating a 3D model from a single-view image is much harder, because one image does not show the hidden sides of the object. Therefore, many learning approaches work in stages: they first convert the 2D image to an intermediate form called 2.5D, i.e., a 2D image with depth information, and then convert the 2.5D form to 3D.
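To see what the 2.5D form contains, it helps to lift a depth map into 3D by hand. The sketch below back-projects each pixel through a pinhole camera model, using X = (u - cx) * Z / fx and Y = (v - cy) * Z / fy. It is a purely geometric illustration, not what the networks listed above do internally; the function name, the tiny depth map, and the intrinsics (fx, fy, cx, cy) are assumptions made up for the example.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Turn a depth map (list of rows, in metres) into a list of 3D points.

    Pixels with depth <= 0 are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 3x3 depth map; 0.0 marks pixels with no depth estimate.
depth = [
    [0.0, 2.0, 2.0],
    [2.0, 1.5, 2.0],
    [2.0, 2.0, 0.0],
]
points = backproject_depth(depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(len(points), "points, centre pixel maps to", points[3])
```

The result covers only the surface that faces the camera. Filling in the unseen back of the object is the part a learned model has to infer, which is why the step from 2.5D to 3D is still a learning problem.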

3D models formats and representations for 3D reconstruction

Several 3D representations have been used for learning-based 3D reconstruction, e.g., voxels, point clouds, meshes, and implicit representations (for example, occupancy networks).

  • Voxels are the generalization of pixels to the 3D domain. They divide 3D space into a grid of equally sized cells. Each cell is called a voxel, and its size determines the granularity of the representation. In Deep Learning, voxels are limited by their memory requirement, which grows cubically with resolution.
  • Point clouds are another representation. They are very flexible and computationally efficient, but they carry no connectivity information, so turning them into a surface requires intensive post-processing. Most existing architectures are also limited in the number of points they can reconstruct (typically a few thousand).
  • Meshes are the most common representation and consist of vertices and faces. They are usually stored as collections of triangles and are efficient in both computation and memory. However, learning-based mesh approaches either require a template mesh from the target domain or sacrifice important properties of the 3D output, such as connectivity. They also often represent geometry as a collection of 3D patches, which leads to self-intersections and non-watertight meshes.
  • Implicit representations have gained popularity because they mitigate these problems. By describing 3D geometry and texture implicitly, e.g., as the decision boundary of a binary classifier, they do not discretize space and have a fixed memory footprint. One such representation, Occupancy Networks, represents 3D geometry as the decision boundary of a classifier that learns to separate the object’s inside from its outside. The classifier takes a 3D point as input and outputs its probability of occupancy, which yields a continuous implicit surface that can be queried at any point in 3D space. Watertight meshes can then be extracted from it in a simple post-processing step.
Reference: Occupancy Networks [5]
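The contrast between voxels and implicit representations can be sketched in a few lines. Below, an analytic sphere stands in for a learned occupancy function (a real occupancy network would be a trained neural network returning a probability), and a helper samples it on grids of increasing resolution. The function names, the sphere radius, and the [-1, 1] bounding cube are all assumptions made for illustration.

```python
def sphere_occupancy(x, y, z, radius=0.5):
    """Analytic stand-in for a learned occupancy function.

    Returns 1.0 for points inside the sphere and 0.0 for points outside.
    """
    return 1.0 if x * x + y * y + z * z <= radius * radius else 0.0


def voxelize(occupancy_fn, resolution, threshold=0.5):
    """Sample occupancy_fn at the voxel centres of a grid over [-1, 1]^3."""
    step = 2.0 / resolution
    centres = [-1.0 + (i + 0.5) * step for i in range(resolution)]
    return [[[occupancy_fn(x, y, z) >= threshold for z in centres]
             for y in centres]
            for x in centres]


for resolution in (8, 16, 32):
    grid = voxelize(sphere_occupancy, resolution)
    occupied = sum(cell for plane in grid for row in plane for cell in row)
    # The grid stores resolution**3 cells, so doubling the resolution
    # multiplies the storage by eight.
    print(resolution, resolution ** 3, occupied)
```

The same occupancy function serves every resolution, while the voxel grid has to be rebuilt and grows cubically each time. That is the trade-off the bullets above describe: the implicit form has a fixed footprint and is discretized only when a mesh or grid is actually needed.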

Conclusion: Although the quality of generated 3D models has improved a lot, the processes discussed above still have limitations. Training a deep neural network-based model also requires a large amount of data. The generated 3D models are not yet manufacture-ready and need further processing before they can be sent for manufacturing.

In my next blog, I will talk about AI approaches to convert sketches to 3D. It’s a more challenging problem to work on sketches as they lack color details and it’s difficult to estimate depth from the greyscale images.

Thanks for reading. Please leave your comments, questions, or suggestions.


References

1. 3D-Machine-Learning

2. 3D-Deep-Learning-Zoo

3. 3D Modeling CAD Tools

4. Pixel2Mesh++: Multi-View 3D Mesh Generation via Deformation

5. Occupancy Networks — Learning 3D Reconstruction in Function Space


