Category 1: Generative AI Applications for the Edge
Our project's goal is to improve the situational awareness of first responders – firefighters, police, and EMTs – by leveraging data from IoT devices, smart buildings, and other surrounding sources flowing through an Edge gateway. Accurate and timely communications are a must in addressing emergency scenarios such as wildfires, earthquakes, floods, and active shooters. Given the increasing availability of IoT devices in such surroundings, the key challenge is to create a common-sense translation of the "JSON-techno-speak" in an IoT message into one or more paragraphs of English. The paragraphs should turn the cryptic JSON keys into meaningful words, and the JSON values should be interpreted to extract the critical information. We are very happy to report that we have achieved that goal with our Small Language Model (SLM) implementation running in real time on a Jetson Orin Nano 8GB Developer Kit.
We had come up with the idea of using LLMs to address that challenge in a previous year-long competition (https://www.us-ignite.org/program/startup-support/nist-iot-competition/) that won us a place in the winner's circle. That solution was created for a powerful gaming laptop with a high-end NVIDIA GPU, where we used PEFT/LoRA to create a trainable LLAMA2 7B model and trained it with outputs from very large language models. We ran it successfully on the target hardware with reasonable inference times. The main objective in this hackathon was to explore whether that model could be executed on a Jetson Orin Nano (8GB) to provide low-latency responses to incoming MQTT messages containing JSON payloads. That goal fits well into our overall mission to create smart, cost-effective edge gateways.
We spent a significant amount of time porting our p-tuned LLAMA2 model from the i-9 CPU/RTX4070 GPU to the ARM/Jetson Orin Nano developer kit. We ported it to both Jetpack 5 and Jetpack 6 (JP6) on two separate boards (we had won a hardware kit for our idea) to see if the software requirements were satisfied. It turned out that Jetpack 6 was the better choice, so we flashed both boards with JP6. We had to make sure that we were able to use pytorch-cuda, which required identifying the correct wheel. In addition, the bitsandbytes package had to be compiled locally. We were able to get our port working on JP6, but the model refused to generate any output. We could see in JTOP that the GPU was pegged at 100% and memory was almost maxed out.
We were undaunted and decided to explore other alternatives. We found jetson-ai-lab.com and spent a significant amount of time going through the extensive information on that site, along with exploring the GitHub repositories referenced there. We explored multiple models and multiple containers discussed on the site. Based on a comparison of the outputs of the different models, along with memory-efficiency and speed considerations, we settled on the stabilityai/stablelm-zephyr-3b Chat model. We have been very happy with the results.
After we selected the model, the next challenge was to feed a prompt such as "What is the important information in the following JSON string?" to the model in an automated fashion by embedding the JSON from an incoming MQTT message into that prompt. Although there are multiple ways of achieving that objective, we settled on a "hack" that fed the prompt to the process inside the Docker container that was running the model. That hack needed minimal modification to our existing MQTT subscriber, and we could focus on prompt engineering to improve the output of the model. We had to throttle the MQTT publisher to publish every 30 seconds so that we could generate clean output from the model with a limit of 2048 tokens for the maximum history length. Those design choices created a flow where we could read the prompt and response very comfortably as the output rolled off the screen. Here is some interesting output for one of the MQTT messages. It clearly turns the cryptic keys into meaningful words, analyzes the values, and guesses the units of those values. All of that in a few seconds on a Jetson Orin Nano was really astonishing!
>> PROMPT: What is the most important information from this JSON message? "{\"dateutc\": 1678488300000, \"tempf\": null, \"humidity\": null, \"windspeedmph\": null, \"windgustmph\": null, \"maxdailygust\": null, \"winddir\": null, \"uv\": null, \"solarradiation\": null, \"hourlyrainin\": null, \"eventrainin\": 258.012, \"dailyrainin\": null, \"weeklyrainin\": null, \"monthlyrainin\": null, \"totalrainin\": null, \"battout\": 1, \"tempinf\": 76.6, \"humidityin\": 48, \"baromrelin\": 30.224, \"baromabsin\": 29.698, \"feelsLike\": null, \"dewPoint\": null, \"feelsLikein\": 76.2, \"dewPointin\": 55.4, \"lastRain\": \"2023-03-02T18:01:00.000Z\"}"
The most important information from this JSON message is the weather data for a specific time and location. In this case, the data includes:
- The date and time in UTC (1678488300000)
- The temperature in Fahrenheit (76.6)
- The humidity level (48%)
- The barometric pressure (in the dry mode)
- The barometric pressure in the unit of measurement (30.224)
- The feeling like (not available in this JSON message)
- The dew point in the unit of measurement (55.4)
- The last rain (2023-03-02T18:01:00.000Z)<|endoftext|>
┌───────────────┬─────────────┐
│ embed_time │ 0.000310857 │
├───────────────┼─────────────┤
│ input_tokens │ 237 │
├───────────────┼─────────────┤
│ output_tokens │ 143 │
├───────────────┼─────────────┤
│ prefill_time │ 0.0113543 │
├───────────────┼─────────────┤
│ prefill_rate │ 20873.1 │
├───────────────┼─────────────┤
│ decode_time │ 6.21285 │
├───────────────┼─────────────┤
│ decode_rate │ 23.0168 │
└───────────────┴─────────────┘
Another sample screenshot is shown in Figure 1. That prompt also asks if there are any anomalies in the data. The answer was "no anomalies."
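The prompt-embedding step described above can be sketched roughly as follows. This is a minimal illustration in Python; the function and constant names (build_prompt, PREAMBLE) are our own, not taken from the project code, and the preamble wording is only an example.

```python
import json

# Preamble that precedes every incoming JSON payload (wording illustrative).
PREAMBLE = "What is the most important information from this JSON message?"

def build_prompt(payload_bytes):
    """Wrap the JSON payload of an incoming MQTT message into a model prompt."""
    text = payload_bytes.decode("utf-8")
    json.loads(text)  # validate: raises an error if the payload is not JSON
    return f'{PREAMBLE} "{text}"'
```

In a real setup, a function like this would sit in the MQTT subscriber's message callback, with the publisher throttled to one message every 30 seconds as described above.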
All the "disaster json data" that was used to test the models is available at GitHub - usnistgov/IoTData_EmergencyScenarios: Emulated IoT data to simulate emergency first response scenarios
It is important to understand how this Jetson Orin Nano Solution (JONS) is different from our previous MSI Gaming Laptop (i-9/RTX4070) Solution (MGLS).
In MGLS, we first found the equivalent of the cryptic JSON key and converted it to a standard key. For example, temp became temperature. Then we turned the JSON into sentences. For example, {"temp": 35} became "the temperature is 35 degrees." Then we used those sentences in training and prediction. In JONS, we just fed the JSON directly with a preamble prompt such as "What is the most important information in the following JSON string? What are the anomalies? {JSON string ...}"
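The MGLS preprocessing described above can be sketched as follows. This is our own minimal illustration, not the project code; the key map shown covers only two example keys, and the real mapping was much larger.

```python
# Map cryptic IoT JSON keys to standard English words (illustrative subset).
KEY_MAP = {"temp": "temperature", "humidityin": "indoor humidity"}

def json_to_sentences(record):
    """Turn a decoded JSON record into plain-English sentences for training."""
    sentences = []
    for key, value in record.items():
        word = KEY_MAP.get(key, key)  # fall back to the raw key if unmapped
        sentences.append(f"The {word} is {value}.")
    return " ".join(sentences)
```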
It was very surprising to find that the stabilityai/stablelm-zephyr-3b (StableLM-3B) model did an excellent job of figuring out that temp is temperature, and so on for each of the keys; it did not need sentences. It understood JSON. So we now have a faster, cheaper, and better model compared to the p-tuned Llama2-7B, which took longer and consumed more compute resources. The ShearedLlama-2.7B did not do as well as the StableLM-3B and hallucinated frequently.
We also explored the equivalents of some of these models on a MacBook and in WSL on Windows 11 (with an NVIDIA GPU). We could not achieve results comparable to what we achieved on the Jetson Orin Nano. However, that is an apples-to-oranges comparison, because the Local_LLM used on the Jetson is different from the GGUF models used on the Mac.
On the Mac, we experimented with various “gguf” models, and tried to run its inference through “llama.cpp” and “llama_cpp_python”.
The models we used include:
1. Llama2-7b with 4-bit and 5-bit quantization
2. StabilityAI’s Stable Code 3b
3. StabilityAI’s Zephyr 3b
When inference was run through “llama.cpp” or “llama_cpp_python”, the output of the models was not deterministic and often digressed a lot. The smaller models ran at speeds of up to 10-20 tokens/sec on the Jetson Orin Nano. The performance of the models was almost acceptable when inference was run through the MLC pipeline. We also compared the performance of our SLM against Anthropic’s Claude-3 Opus model; the Opus model performs exceptionally well, as shown in Figure 2.
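One way to reduce the non-determinism we saw through llama.cpp is to force greedy decoding. Below is a hedged sketch using llama-cpp-python; the model path and prompt are placeholders, not artifacts from our experiments, and parameter support may vary by library version.

```python
def deterministic_params(max_tokens=256):
    """Sampling settings intended to make repeated runs reproducible."""
    # temperature=0.0 makes decoding greedy; top_k=1 guards the same property
    # in pipelines that ignore a zero temperature.
    return {"temperature": 0.0, "top_k": 1, "max_tokens": max_tokens}

if __name__ == "__main__":
    # Requires `pip install llama-cpp-python` and a local GGUF file; the
    # path below is hypothetical.
    from llama_cpp import Llama
    llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", seed=42)
    out = llm("What is the most important information from this JSON message?",
              **deterministic_params())
    print(out["choices"][0]["text"])
```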
We are planning to dig deeper into the Jetson containers project and Local_LLM for our future work. Apart from exploring the Stability AI models further, we also plan to explore any SLMs provided by Anthropic that run well on the Jetson.