My Makerspace has started using a local vision LLM on an Nvidia RTX laptop for a people-counting application at our non-profit, with favorable results.
Initial testing was with cloud-based CV and LLM services, with mixed results. Everybody kept telling us that the only way to do people counting is with YOLO and a purpose-trained CV model.
For the local LLM (an RTX 2070 with 8 GB of VRAM), we tested three different models. Granite Vision provided the most accurate and consistent results, giving the correct count for our workspaces (chaotic, full of machine tools and safety gear, and with highly variable lighting) about 90% of the time, and off by only ±1 on the remaining 10% of the sample images. Speed was acceptable -- qwen was faster, but gave almost random answers.
NAME                       SIZE / LOADED SIZE   ERROR   SECONDS/FRAME
qwen2.5vl:3b               3.2 GB / 6.0 GB      +3      1
gemma3:latest              3.3 GB / 6.2 GB      ±2      9
granite3.2-vision:latest   2.4 GB / 7.5 GB      ±1      3
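
For anyone who wants to run the same sort of comparison on their own hardware, a rough sketch along these lines is enough. This is not our exact harness -- it assumes the Ollama HTTP API on its default port, and the sample file names and ground-truth counts are placeholders:

# model_comparison.py -- rough sketch for comparing vision models on labeled snapshots.
# Not our exact harness; assumes Ollama on localhost:11434 with the models already pulled.
import base64, json, time, urllib.request

MODELS = ["qwen2.5vl:3b", "gemma3:latest", "granite3.2-vision:latest"]
PROMPT = ("Return as an integer the number of people in this image. "
          "The output should be just the numerical value alone")

# Placeholder ground truth: snapshot file -> actual number of people.
SAMPLES = {"floor_cam_01.jpg": 3, "woodshop_02.jpg": 0, "laser_bay_03.jpg": 2}

def count_people(model: str, path: str) -> int:
    with open(path, "rb") as f:
        img = base64.b64encode(f.read()).decode()
    body = json.dumps({"model": model, "prompt": PROMPT,
                       "images": [img], "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return int(json.loads(resp.read())["response"].strip())

for model in MODELS:
    errs, t0 = [], time.monotonic()
    for path, actual in SAMPLES.items():
        try:
            errs.append(abs(count_people(model, path) - actual))
        except ValueError:
            errs.append(None)   # non-numeric reply counts as a miss
    secs = (time.monotonic() - t0) / len(SAMPLES)
    print(f"{model:26s} abs-errors={errs} sec/frame={secs:.1f}")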
A listener service loads Granite Vision into Ollama and supplies a system prompt of "You are a terse digital assistant. You respond with short, simple answers, preferring to return integer numeric responses rather than sentences".
The service ingests near-real-time snapshot-on-motion events (usually 704x480 px JPEGs) from 19 different public workspace cameras, then makes a loopback API call to Ollama with a prompt of "Return as an integer the number of people in this image. The output should be just the numerical value alone". The answer comes back as a bare integer (e.g. "3"), which feeds real-time occupancy trackers as well as a time-series database (InfluxDB).
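
For anyone curious about the plumbing, here is a minimal sketch of that loopback call plus the InfluxDB write. It assumes Ollama on its default port and an InfluxDB 2.x line-protocol endpoint; the measurement name, tag, bucket, org, and token below are placeholders rather than our actual schema:

# occupancy_listener.py -- sketch of the per-snapshot handling, not the exact service.
import base64, json, time, urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
SYSTEM = ("You are a terse digital assistant. You respond with short, simple answers, "
          "preferring to return integer numeric responses rather than sentences")
PROMPT = ("Return as an integer the number of people in this image. "
          "The output should be just the numerical value alone")

# Placeholder InfluxDB 2.x settings -- substitute your own org/bucket/token.
INFLUX_URL = "http://localhost:8086/api/v2/write?org=makerspace&bucket=occupancy&precision=s"
INFLUX_TOKEN = "changeme"

def handle_snapshot(camera_id: str, jpeg_bytes: bytes) -> int:
    """Ask the vision model for a head count and push it to the time-series DB."""
    payload = json.dumps({
        "model": "granite3.2-vision:latest",
        "system": SYSTEM,
        "prompt": PROMPT,
        "images": [base64.b64encode(jpeg_bytes).decode()],
        "stream": False,
    }).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        count = int(json.loads(resp.read())["response"].strip())

    # InfluxDB line protocol: occupancy,camera=<id> people=<n>i <unix-seconds>
    line = f"occupancy,camera={camera_id} people={count}i {int(time.time())}"
    influx_req = urllib.request.Request(
        INFLUX_URL, data=line.encode(),
        headers={"Authorization": f"Token {INFLUX_TOKEN}"})
    urllib.request.urlopen(influx_req)
    return count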
Has anybody done any benchmarking or effectiveness testing with different image resolutions? Some of our cameras naturally provide snapshots at 1280x720, and these take twice as long to upload and analyze (with any of the three models we tried) with only a slight improvement in error rate.
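
In case anyone wants to try this on their own snapshots, here is a hedged sketch of downscaling a 1280x720 frame before the Ollama call and comparing latency; it assumes Pillow is installed, and the file name and the 704x480 target size are just illustrative:

# resolution_test.py -- crude latency comparison of full-size vs downscaled snapshots.
# Assumes Pillow and a local Ollama with granite3.2-vision pulled; file name is illustrative.
import base64, io, json, time, urllib.request
from PIL import Image

PROMPT = ("Return as an integer the number of people in this image. "
          "The output should be just the numerical value alone")

def ask(jpeg_b64: str) -> str:
    body = json.dumps({"model": "granite3.2-vision:latest", "prompt": PROMPT,
                       "images": [jpeg_b64], "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

def downscale(jpeg_bytes: bytes, size=(704, 480)) -> bytes:
    img = Image.open(io.BytesIO(jpeg_bytes))
    img.thumbnail(size)                       # shrinks in place, keeps aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return buf.getvalue()

raw = open("snapshot_1280x720.jpg", "rb").read()
for label, data in [("1280x720", raw), ("~704x480", downscale(raw))]:
    t0 = time.monotonic()
    answer = ask(base64.b64encode(data).decode())
    print(f"{label}: count={answer} latency={time.monotonic() - t0:.1f}s")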
#LLM
Kevin Kadow