This project was conceived with the aim of delivering an immersive experience, allowing users to draw images in the air and witness their creations undergo impressive transformations through the application of AI.
TL;DR
If you want to skip all the documentation and set up the project for yourself or see the final application in action, you can skip to the Final Product section. You can grab all the open-source code in the Code section at the bottom of the page.
Features
- Hand tracking
- Gesture recognition
- Sketch autocompletion
- AI image generation
- Gallery view
- Sharing / Deleting the photo
We utilized several technologies in our tech stack to bring our application to life. To start, we used Android Studio as the development environment to build the application. To implement our user interface, we used Jetpack Compose, which streamlined the styling of components and gave our users a comfortable interface. Next, we used the Android device’s camera together with MediaPipe to capture the user’s finger positions and gestures. We used OpenCV to perform all of the image preprocessing and drawing onto the screen. If the user wants to autocomplete a sketch, the drawn sketch is sent to a backend SketchRNN microservice that returns an autocompleted sketch. When the user saves the final sketch, it is sent to a third-party service to generate an improved, AI-generated image of the sketch. In summary, our application combines a number of technologies to collect the user’s gestures and synthesize them into AI-generated artwork.
Platforms and Technologies:
- Android Application (Internet connection & Camera)
- MediaPipe (by Google) for hand tracking & gesture detection
- OpenCV for image manipulation, preprocessing, and drawing
- APIs for AI models (a custom microservice and a third-party service)
- SketchRNN to convert gesture strokes into sketches
- ClipDrop to convert sketches into finished drawings
- Jetpack Compose for our front-end design and implementation
Our application seamlessly transforms user gestures into art through intuitive controls and modern AI technologies. When the user accesses the drawing page, they simply place their hand within view of the front-facing camera. MediaPipe, by Google, takes charge of gesture detection, providing support for various gesture categories and pinpointing landmarks on the user's hand. The primary gesture that activates drawing mode is pointing the index finger upward. The app tracks the tip of the index finger as the user draws strokes on the canvas.
When the user is ready to convert their strokes into polished artwork, they can make a closed-fist gesture. At this stage, the app sends the collected strokes to an endpoint on our SketchRNN HTTP server. Within the server, the SketchRNN AI model works its magic, auto-completing the sketch based on a chosen category from the dropdown menu. The AI-generated strokes are then returned to the user's Android device via HTTP and displayed on their screen.
For added convenience, a thumbs-down gesture allows the user to clear the screen, offering a fresh canvas if they are not satisfied with their drawing, the autocompleted sketch, or the final piece. Otherwise, a thumbs-up gesture signals the app to send the sketch to ClipDrop, where it undergoes a transformation into a detailed work of art. This involves background removal, grayscale conversion, and the fine-tuning of image characteristics to ensure optimal results from the AI art generator. Ultimately, the completed artwork is saved in the user's gallery, ready to be admired or shared. Through intuitive gestures and modern AI technologies, our app bridges the gap between creative expression and AI-powered artistry, offering users a delightful and interactive artistic experience.
Implementation
View and View Model
The architecture of our application follows a clear separation of concerns through a view and view model pattern. Each view file is responsible for rendering UI components and handling user interactions, keeping the codebase modular and organized. Each view model file encapsulates the functions and data needed by its corresponding activity. Additionally, model files manage our custom data format, further improving the coherence and maintainability of the codebase.
User Interface and Experience (UI/UX)
To design our product's user interface and experience (UI/UX), we leveraged Figma and Jetpack Compose. In Figma, we created each screen of our product and connected them to understand the navigation flow and decide which features to include. We also employed Jetpack Compose, a modern Android UI toolkit, in Android Studio to create components that are easy to build and reuse. Additionally, because Figma can export Jetpack Compose code, we were able to develop UI components rapidly and modularly.
In total, we have four screens:
- Home screen: The user can start their drawing or navigate to their gallery page.
- Draw screen: The user can draw in the air. The app responds to the user's gestures:
Point up: draw strokes.
Closed fist: automatically complete the drawing (this can be done repeatedly).
Thumbs down: clear the screen.
Thumbs up: send the sketch to the external API to get an AI-generated image.
- Gallery screen: The user is able to look through their pictures.
- Individual screen: The user can view a picture at full size and share or delete it.
This was our initial design and logic.
It later evolved into this with auto layout constraints and prototyping implemented.
We used Jetpack Compose to create the frontend design. Here is a snippet of our home page written in Jetpack Compose. Figma's developer mode also helped generate some of the design and layout code.
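Since the original snippet was shared as a screenshot, here is a minimal, hypothetical reconstruction of what a home screen like ours looks like in Jetpack Compose. The labels and the navigation callbacks (onStartDrawing, onOpenGallery) are illustrative assumptions rather than our exact code.

```kotlin
import androidx.compose.foundation.layout.*
import androidx.compose.material3.*
import androidx.compose.runtime.Composable
import androidx.compose.ui.Alignment
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

// Hypothetical reconstruction of the home screen; names and styling are illustrative.
@Composable
fun HomeScreen(
    onStartDrawing: () -> Unit,   // assumed navigation callback to the Draw screen
    onOpenGallery: () -> Unit     // assumed navigation callback to the Gallery screen
) {
    Column(
        modifier = Modifier
            .fillMaxSize()
            .padding(24.dp),
        verticalArrangement = Arrangement.Center,
        horizontalAlignment = Alignment.CenterHorizontally
    ) {
        Text(text = "Draw in the air", style = MaterialTheme.typography.headlineMedium)
        Spacer(modifier = Modifier.height(32.dp))
        Button(onClick = onStartDrawing, modifier = Modifier.fillMaxWidth()) {
            Text("Start Drawing")
        }
        Spacer(modifier = Modifier.height(16.dp))
        OutlinedButton(onClick = onOpenGallery, modifier = Modifier.fillMaxWidth()) {
            Text("Gallery")
        }
    }
}
```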
In the development of our application, we dedicated significant effort to the implementation of a robust Gesture Tracking system. Initially, we explored the possibility of creating our own gesture-tracking solution using OpenCV, coupled with a skin tone calibration system to attempt to isolate the hand and track its center position. Here is the result we got when we implemented this solution along with a snippet of the relevant code.
This approach worked by isolating HSV values that matched the user’s skin tone. The white sections in the middle camera frame indicate a potential hand target. The problem with this solution is that if other light sources, such as ambient light, matched a similar HSV value, they would also be considered part of the hand. To mitigate this, we tried to consider only the largest white blobs on the screen as the target hand, but this meant the solution would not work if the user’s hand was far from the camera, causing it to track the ambient light instead. To improve the HSV calibration, we added a calibration page to fine-tune this light sensitivity. After further testing, we decided that we needed a better approach to hand tracking in order to better track the index finger and account for hand gestures.
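The relevant code was shown as a screenshot in the original post; the snippet below is a simplified sketch of the same idea using the OpenCV Android bindings. The HSV bounds and the largest-blob heuristic are illustrative, not our calibrated values.

```kotlin
import org.opencv.core.*
import org.opencv.imgproc.Imgproc

// Simplified sketch of the HSV-based skin isolation we first tried.
// lowerHsv / upperHsv are illustrative; the real app calibrated them per user.
fun findHandCenter(frameBgr: Mat, lowerHsv: Scalar, upperHsv: Scalar): Point? {
    // Convert the camera frame to HSV so skin tone can be thresholded by hue/saturation/value.
    val hsv = Mat()
    Imgproc.cvtColor(frameBgr, hsv, Imgproc.COLOR_BGR2HSV)

    // Keep only pixels inside the calibrated skin-tone range (the white regions in the mask).
    val mask = Mat()
    Core.inRange(hsv, lowerHsv, upperHsv, mask)

    // Take the largest white blob as the candidate hand.
    val contours = mutableListOf<MatOfPoint>()
    Imgproc.findContours(mask, contours, Mat(), Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE)
    val largest = contours.maxByOrNull { Imgproc.contourArea(it) } ?: return null

    // Track the centroid of that blob as the hand position.
    val moments = Imgproc.moments(largest)
    if (moments.m00 == 0.0) return null
    return Point(moments.m10 / moments.m00, moments.m01 / moments.m00)
}
```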
Then we discovered the powerful capabilities of MediaPipe by Google, which not only simplified our task but also provided reliable gesture recognition across a diverse range of skin tones without the need for calibration procedures. In this section, we discuss the step-by-step process of how we integrated MediaPipe into our Android application, enabling real-time tracking of the user's hand gestures and hand position on the screen. We also outline the transition into 'draw mode,' where the app captures and stores strokes created by the user’s index finger movement. These strokes are then sent to our SketchRNN microservice, which transforms them into compelling sketches.
This is a screenshot of how we set up the MediaPipe dependency in Android and automatically downloaded the hand tracking & gesture recognition model into the app.
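The screenshot itself isn't reproduced here, but the setup amounts to two pieces: adding the MediaPipe Tasks vision dependency to the module's Gradle file, and a build-time task that downloads the gesture recognizer model into the app's assets (the official MediaPipe samples use the de.undercouch download plugin for this). Treat the version numbers and model URL below as illustrative.

```kotlin
// Module-level build.gradle.kts additions (version numbers and model URL are illustrative)
plugins {
    id("de.undercouch.download") version "5.4.0"   // used only to fetch the model file at build time
}

dependencies {
    // MediaPipe Tasks vision API: hand landmark tracking + gesture recognition
    implementation("com.google.mediapipe:tasks-vision:0.10.14")
}

// Download the gesture recognizer model into assets, following the pattern in the official samples.
tasks.register<de.undercouch.gradle.tasks.download.Download>("downloadGestureRecognizerModel") {
    src("https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/latest/gesture_recognizer.task")
    dest("$projectDir/src/main/assets/gesture_recognizer.task")
    overwrite(false)
}
```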
Integrate MediaPipe for Gesture Recognition and Positional Tracking
In general, these were the steps we needed to complete in order to get hand tracking and gesture recognition; a minimal sketch of the recognizer setup follows the list.
- Integrate MediaPipe into the project by following the official documentation. (https://developers.google.com/mediapipe)
- Configure access to the device's camera for real-time input.
- Set up MediaPipe's gesture recognition module to track the user's hand gestures and position on the screen.
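As referenced above, here is a minimal sketch of configuring MediaPipe's GestureRecognizer for live camera frames. It follows the public Tasks API; the model filename and option values are assumptions consistent with the setup described here.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.vision.core.RunningMode
import com.google.mediapipe.tasks.vision.gesturerecognizer.GestureRecognizer
import com.google.mediapipe.tasks.vision.gesturerecognizer.GestureRecognizerResult

// Minimal GestureRecognizer setup for live-stream camera input (option values are illustrative).
fun createGestureRecognizer(
    context: Context,
    onResult: (GestureRecognizerResult) -> Unit
): GestureRecognizer {
    val baseOptions = BaseOptions.builder()
        .setModelAssetPath("gesture_recognizer.task")   // model downloaded into assets at build time
        .build()

    val options = GestureRecognizer.GestureRecognizerOptions.builder()
        .setBaseOptions(baseOptions)
        .setRunningMode(RunningMode.LIVE_STREAM)        // results are delivered asynchronously per frame
        .setResultListener { result, _ -> onResult(result) }
        .build()

    return GestureRecognizer.createFromOptions(context, options)
}
```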
Essentially, for every camera frame, we trigger a function called handleFrame(). This function runs all the hand tracking via MediaPipe and the drawing of sketches via OpenCV. Regarding MediaPipe, on the right is a snippet of how we recorded the points of the index finger when the gesture was Pointing_Up. The snippet in the bottom left shows how we parsed out the gestures and kept track of them.
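Those snippets appeared as screenshots in the original write-up. Below is a simplified, hedged reconstruction of the same logic: handleFrame() is the real entry point mentioned above, but the stroke list, field names, and the way the frame size is passed in are assumptions.

```kotlin
import com.google.mediapipe.tasks.vision.gesturerecognizer.GestureRecognizerResult
import org.opencv.core.Point

// Simplified sketch of the per-frame handling; field and parameter names are illustrative.
private const val INDEX_FINGER_TIP = 8        // MediaPipe hand-landmark index for the index fingertip
private val currentStroke = mutableListOf<Point>()
private var currentGesture: String = "None"

fun handleFrame(result: GestureRecognizerResult, frameWidth: Int, frameHeight: Int) {
    // Parse out the top gesture category for the first detected hand, if any, and keep track of it.
    currentGesture = result.gestures()
        .firstOrNull()
        ?.firstOrNull()
        ?.categoryName() ?: "None"

    if (currentGesture == "Pointing_Up") {
        // Record the index fingertip, scaled from normalized [0, 1] coordinates to pixels.
        val tip = result.landmarks().firstOrNull()?.get(INDEX_FINGER_TIP) ?: return
        currentStroke.add(Point((tip.x() * frameWidth).toDouble(), (tip.y() * frameHeight).toDouble()))
    }
}
```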
Implement Gesture Modes
Pointing Up
- Detect when the user is pointing their finger to enter draw mode by accessing the current gesture mode from MediaPipe.
- Store the strokes generated by the user with the tip of their index finger during draw mode.
Closed Fist
- Send the recorded strokes to SketchRNN and get an autocompleted sketch back
Thumbs Down
- Clear the sketch
Thumbs Up
- Pre-process the user's sketch
- Send the sketch to the AI image generator and get AI art back (the gesture-to-action dispatch is sketched below)
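Putting the four modes together, here is a hedged sketch of how the recognized gesture category can be dispatched to the corresponding action. The category strings are MediaPipe's canned gesture names; the handler parameters are placeholders, not our actual function names.

```kotlin
// Dispatch a recognized gesture to an action; the handler lambdas are placeholders.
fun onGestureRecognized(
    gesture: String,
    requestAutocompletedSketch: () -> Unit,   // closed fist -> SketchRNN microservice
    clearCanvas: () -> Unit,                  // thumbs down -> wipe the strokes
    generateAiArt: () -> Unit                 // thumbs up -> preprocess the sketch and call ClipDrop
) {
    when (gesture) {
        "Pointing_Up" -> Unit                 // drawing itself is handled per frame in handleFrame()
        "Closed_Fist" -> requestAutocompletedSketch()
        "Thumb_Down" -> clearCanvas()
        "Thumb_Up" -> generateAiArt()
        else -> Unit                          // ignore unsupported gestures
    }
}
```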
To transform user gestures into stunning works of art, the integration of a sketch completion model, such as SketchRNN, is an essential intermediate step. In this section, we delve into the process involved in integrating SketchRNN into a dedicated backend microservice accessible from the Android app. Drawing inspiration from open-source examples, we establish the groundwork for the model's operation. To facilitate communication between the Android app and the SketchRNN microservice, we write HTTP endpoints, creating a bridge for the exchange of gesture strokes and AI-generated sketch strokes. As the strokes are transmitted from the user's gestures to the SketchRNN model, we address the data preprocessing and format conversion requirements to ensure that the model seamlessly interprets and responds to input and that the sketches are properly displayed on the Android device’s screen. Finally, we outline the process of converting these strokes between relative and absolute coordinates, ultimately culminating in the display of the generated sketches within our Android application.
SketchRNN Model Integration
- Integrate the SketchRNN model into a backend microservice. We built our microservice with Express based on the following open-source example (https://github.com/MindExMachina/smartgeometry). We forked this repo and updated it to work, since it had broken dependencies, and added logic for better parsing of the input data and handling of duplicate data points. You can find our version of the SketchRNN server in the code section at the bottom of this article.
- Convert the strokes into the appropriate format required by the SketchRNN model.
- Ensure the model can receive and process the stroke data sent by the app.
- Implement data preprocessing or format conversion to prepare the strokes for input to the model.
- Trigger the SketchRNN model to complete the sketch based on the received strokes.
HTTP Endpoints for the SketchRNN Microservice
- Create a web server for a microservice running the SketchRNN model
- Create an HTTP endpoint to send gesture stroke data to the SketchRNN microservice
- Create an HTTP endpoint to receive sketch stroke data from the SketchRNN microservice
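As a client-side illustration of these endpoints, below is a hedged sketch of posting stroke data to the microservice and reading the autocompleted strokes back. The host, path, JSON field names, and the use of OkHttp are assumptions for illustration only.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONObject

private val httpClient = OkHttpClient()
private val jsonMediaType = "application/json; charset=utf-8".toMediaType()

// Illustrative client call; must be run off the main thread (e.g. from a coroutine on Dispatchers.IO).
fun requestAutocompletedSketch(strokesJson: JSONObject, model: String): JSONObject {
    val body = strokesJson
        .put("model", model)                    // sketch category chosen from the dropdown menu
        .toString()
        .toRequestBody(jsonMediaType)

    val request = Request.Builder()
        .url("http://10.0.2.2:3000/sketch")     // placeholder address for the Express microservice
        .post(body)
        .build()

    httpClient.newCall(request).execute().use { response ->
        // The microservice replies with the AI-generated strokes as JSON.
        return JSONObject(response.body!!.string())
    }
}
```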
Conversion of Strokes between Relative and Absolute Coordinates
- After receiving the generated strokes from the SketchRNN model, convert them between relative and absolute coordinates as needed.
- Ensure that the strokes are properly scaled to be rendered on the screen.
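To make the conversion concrete, here is a simplified sketch of turning relative offsets (the dx/dy per point that SketchRNN works with) into absolute screen coordinates with a uniform scale factor, plus the inverse for sending user strokes to the model. Whether this runs in the microservice or on the device, the arithmetic is the same; the types and scaling approach are illustrative.

```kotlin
import org.opencv.core.Point

// Convert SketchRNN-style relative offsets into absolute screen points (scale is illustrative).
fun relativeToAbsolute(
    offsets: List<Pair<Float, Float>>,   // (dx, dy) pairs, each relative to the previous point
    start: Point,                        // where the generated strokes should continue from
    scale: Double                        // factor so the sketch matches the on-screen stroke size
): List<Point> {
    val absolute = mutableListOf<Point>()
    var x = start.x
    var y = start.y
    for ((dx, dy) in offsets) {
        x += dx * scale
        y += dy * scale
        absolute.add(Point(x, y))
    }
    return absolute
}

// The inverse: convert the user's absolute points into offsets before sending them to the model.
fun absoluteToRelative(points: List<Point>): List<Pair<Float, Float>> =
    points.zipWithNext { prev, next ->
        (next.x - prev.x).toFloat() to (next.y - prev.y).toFloat()
    }
```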
Displaying Generated Sketches within the Android App
- Send the generated sketch data, as absolute coordinates, back to the app through an HTTP endpoint in your backend microservices.
- Implement the necessary logic in the app to display the completed sketch to the user.
OpenCV is an essential technology in our app's functionality, serving as a critical component for various image-processing tasks. In particular, we harness the power of OpenCV v4.8.0 to draw sketches on the screen, including user-drawn and AI-generated strokes, and to preprocess images so that we get high-quality results from SketchRNN and ClipDrop. OpenCV is also instrumental in merging the autocompleted sketch with the original strokes, ensuring a seamless integration of user input and AI-generated art. Its versatility and robustness in image processing contribute significantly to the fluid and interactive user experience that our app provides.
Here is a screenshot of how we set up the OpenCV SDK for use with our Android Studio project.
This is a screenshot of how OpenCV was used to draw the user-drawn points as well as the SketchRNN points onto the screen. We basically take the list of all points and, for each point, draw a line from the previous point to it, which forms the sketch.
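The screenshot is not reproduced here; as a simplified sketch, connecting consecutive points with OpenCV line segments looks roughly like this (the canvas Mat, color, and thickness are illustrative).

```kotlin
import org.opencv.core.Mat
import org.opencv.core.Point
import org.opencv.core.Scalar
import org.opencv.imgproc.Imgproc

// Draw a sketch by connecting each point to the previous one (color and thickness are illustrative).
fun drawStroke(canvas: Mat, points: List<Point>) {
    for (i in 1 until points.size) {
        Imgproc.line(
            canvas,
            points[i - 1],                        // previous point
            points[i],                            // current point
            Scalar(255.0, 255.0, 255.0, 255.0),   // white stroke over the camera frame
            5                                     // stroke thickness in pixels
        )
    }
}
```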
This is a screenshot of how OpenCV was used to implement the following:
- Take a drawn sketch & remove the background
- Merge SketchRNN sketch into drawn sketch
- Convert to grayscale
- Apply an absolute threshold to convert to black and white
- Invert colors (i.e. white -> black, black -> white)
This technique is used to clarify the user-drawn sketch to get better results when we pass it to the AI image generator.
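Here is a hedged sketch of the grayscale, threshold, and inversion steps with the OpenCV Android bindings; the threshold value and the RGBA input assumption are illustrative, and the background removal and merging of the SketchRNN strokes are not shown.

```kotlin
import org.opencv.core.Core
import org.opencv.core.Mat
import org.opencv.imgproc.Imgproc

// Simplified version of the preprocessing described above (threshold value is illustrative).
fun prepareSketchForGeneration(sketchRgba: Mat): Mat {
    // Convert to grayscale so only stroke intensity remains.
    val gray = Mat()
    Imgproc.cvtColor(sketchRgba, gray, Imgproc.COLOR_RGBA2GRAY)

    // Absolute threshold: anything brighter than the cutoff becomes white, everything else black.
    val blackAndWhite = Mat()
    Imgproc.threshold(gray, blackAndWhite, 127.0, 255.0, Imgproc.THRESH_BINARY)

    // Invert the colors (white -> black, black -> white) for the AI image generator.
    val inverted = Mat()
    Core.bitwise_not(blackAndWhite, inverted)
    return inverted
}
```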
We used the ClipDrop API to generate AI images. This API requires an API key, a prompt, and an image in JPG format as parameters. To facilitate API requests, we created a separate utility function; an internet connection is required for it to succeed. Initially, the prompt let users manually describe what they had drawn. In the final version, however, we tied it to the sketch autocompletion feature and configured it to use the model selected from the dropdown menu as additional context. This results in an AI-generated image that better resembles the original sketch.
This is a snippet of code that shows how the sketch JPEG is sent to the ClipDrop API. The image is passed as the file parameter. The prompt includes the selected model plus some additional description to get better results from the AI image generator.
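For reference, here is a hedged sketch of such a multipart request using OkHttp. The endpoint URL and form-field names follow ClipDrop's public sketch-to-image documentation as we understand it; treat them, and the helper itself, as an illustration rather than a copy of our utility function.

```kotlin
import java.io.File
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody

// Illustrative ClipDrop call; must be run off the main thread, and callers should check isSuccessful.
fun generateAiImage(apiKey: String, sketchJpeg: File, prompt: String): ByteArray {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart(
            "sketch_file",                                   // the sketch JPEG
            sketchJpeg.name,
            sketchJpeg.asRequestBody("image/jpeg".toMediaType())
        )
        .addFormDataPart("prompt", prompt)                   // selected model plus extra description
        .build()

    val request = Request.Builder()
        .url("https://clipdrop-api.co/sketch-to-image/v1/sketch-to-image")
        .header("x-api-key", apiKey)                         // ClipDrop API key
        .post(body)
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        // The response body is the generated image.
        return response.body!!.bytes()
    }
}
```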
Once the API request completes successfully, the image file is stored in the phone’s internal storage so that the user can view their image in the gallery.
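A minimal sketch of that last step, assuming the generated image arrives as raw bytes; the filename scheme and use of app-internal storage are illustrative.

```kotlin
import android.content.Context
import java.io.File

// Write the generated image into app-internal storage so the gallery screen can load it later.
fun saveGeneratedImage(context: Context, imageBytes: ByteArray): File {
    val file = File(context.filesDir, "art_${System.currentTimeMillis()}.jpg")  // illustrative name
    file.writeBytes(imageBytes)
    return file
}
```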
Final Product
Setup Instructions
This video provides setup instructions on how to try the application out for yourself!
Final Demo
This video showcases how the final project works.
Ideas and technologies that weren’t used in the final project
In the initial design phase, our plan involved integrating a Hugging Face ML model with image-to-text functionality. The idea was to take the user-drawn sketch and generate a text prediction from it (e.g., an image of a hamburger produces the text "hamburger"). This text would then be passed to a third-party service to generate the AI art. However, we decided to omit these components from the final version because the results would not closely resemble the user's original sketch. To streamline the process, we opted to use an external API, the ClipDrop API, which generates images directly from the submitted sketch. Given the seamless functionality of our current implementation, there is no longer a need to leverage the Hugging Face ML model and image-to-text mechanisms for AI image generation.
We also excluded iSketchNFill. We had assumed that iSketchNFill would support autocompletion of the user's sketch. Unfortunately, it does not perform sketch autocompletion; instead, it morphs a preset image to the boundaries of a drawn sketch, which was not the solution we were looking for. iSketchNFill also does not generate multiple autocompleted sketches, and the program feels more like editing preset sketches than autocompleting them, which is why it was not used.
Conclusion
In summary, we aimed to merge the worlds of technology and art through this project. By utilizing gesture recognition, we enable our users to create artwork through computer vision. In addition, we believe that our sketch autocompletion functionality streamlines the process of art creation and improves the user experience. Consequently, we conclude that our application makes AI art generation more accessible to everyone.