At a glance
Overview
- Industry
- PM Kusum Yojana Solar Infrastructure Verification
- Domain
- Field installation verification and disbursement compliance
- Problem
- Human reviewers cannot detect image splicing or duplicate submissions at scale, creating fraud exposure and multi-week approval backlogs
- Solution
- System that runs pixel forensics, OCR, and object detection, producing a structured verdict in under 40 seconds
- Scale
- Thousands of field submissions across multiple state regions with varying regulatory formats
- Result
- Verification cycle reduced from up to 14 business days to under 40 seconds per submission
Executive summary
PM Kusum Yojana has become one of India's largest solar infrastructure initiatives, but large-scale deployment creates a verification challenge. Manual photo review cannot reliably detect image manipulation, duplicate subsidy claims, or location fraud at the volume required by PM Kusum implementation agencies.
This case study documents how we built a system that performs multi-dimensional visual verification on field photographs -- detecting solar asset presence, operator attendance, hardware compliance, and pixel-level tampering. What solved the problem was not faster review.
It was building a system that can perceive things human reviewers cannot.
01 · Context
Industry Context
Large-scale solar infrastructure programs operate at a verification throughput that the existing audit model was never designed to handle.
A state program processing tens of thousands of field submissions annually relies on human reviewers to perform solar installation photo verification one record at a time: read the GPS stamp, confirm the coordinates fall inside the permitted boundary, verify panels are visible, check for a present operator or beneficiary, validate accessory compliance, and cross-reference the application number against the regional format template.
Each check is straightforward in isolation. At program scale -- with daily submission volumes numbering in the hundreds and disbursements legally tied to the completion of every check -- the manual approach creates backlogs measured in weeks, not days, and error rates that climb as reviewer fatigue accumulates.
The real cost of getting it wrong is not just delayed payment to a legitimate installation. It is approved disbursement to a fraudulent one, and no recovery mechanism once funds have cleared.
The standard approach held for a long time because volume was contained. When programs processed hundreds of submissions per quarter, manual review was slow but manageable.
The structural shift that exposed its limits was growth: as deployment targets scaled and submission volumes multiplied, field inspection image fraud detection became the bottleneck that defined the whole program's throughput ceiling.
Program administrators could hire more reviewers, but reviewers cannot see pixel-level modifications and they cannot match image signatures against every prior submission in real time.
The gap between what the industry needs -- deterministic, forensically grounded verification at submission volume -- and what manual review can deliver has become wide enough that it now defines program risk, not just program efficiency.
02 · Challenge
The Problem
Every working day inside these programs, an audit team opens a queue of field photographs and begins a sequence that has not changed in years.
For a single submission, the manual review process takes several minutes: read the GPS coordinates burned into the image, verify they fall within the permitted state boundary, confirm solar panels are visible and appear active, check that an operator or beneficiary is present in the frame, scan for required accessories, compare the application number format against the regional template, and flag the image for duplicate review against every prior submission.
Multiply this by hundreds of daily submissions and the arithmetic is straightforward: the queue grows faster than it clears, and approvals that should take hours take weeks.
The downstream cost was not only operational. Delayed approvals meant delayed disbursements to legitimate installations, which created financial strain for installers operating on thin margins and reputational exposure for the administering program.
At the same time, the manual process was producing a second category of cost that was harder to quantify but more serious: false approvals.
When a reviewer is working through a backlog of several hundred photographs, the cognitive load of cross-referencing application number formats that vary by state, confirming coordinate ranges that differ by region, and detecting subtle inconsistencies in image content is significant.
Errors in that environment are not careless -- they are structural.
The fundamental problem was not that reviewers were slow. It was that the human visual system cannot do what forensic verification requires.
A reviewer staring at a photograph with a cloned solar panel region -- an image where the panel area has been copied from a prior approved submission and composited into a new frame -- sees a photograph. There is no visible seam, no obvious artifact, no signal available to the naked eye.
The same is true of duplicate submissions routed across state boundaries: a photograph submitted in one region and resubmitted in another to claim a second installation looks identical to a legitimate submission because it is the same image. Human review cannot close either of these fraud vectors.
That is not a training problem. It is a physics problem.
What made acting non-negotiable was the realisation that these fraud vectors were not edge cases -- they were vectors that would scale with the program. As submission volumes grew, the financial incentive to exploit undetectable fraud methods grew proportionally.
The moment it became clear that no amount of process improvement or reviewer headcount could address pixel-level splicing or cross-regional duplicate matching, the question shifted from whether to build a detection system to how quickly one could be operational.
How do solar programs detect image fraud and duplicate submissions at scale? The answer is that they cannot do it through human review.
Detection requires forensic analysis at the pixel level -- anomaly scoring against regions of interest, perceptual hash matching across a full submission registry, and exclusion logic sophisticated enough to distinguish benign camera overlays from genuine structural fraud.
These are engineering problems, not staffing ones.
03 · Approach
The Solution
What the system now makes possible is a complete verification cycle that no human process could match in speed or detection coverage. A field inspector submits a photograph to the local interface.
Within forty seconds, the system has confirmed whether active solar panels are present and positioned correctly, whether an operator or beneficiary is physically in the frame, whether required accessories pass the current compliance specification for that state region, whether the GPS coordinates fall within the permitted boundary, whether the application number matches the regional format template, and whether the photograph passes pixel-level forensic review for splicing and cross-application duplicate matching.
The submission either receives an approved verdict with a structured audit trail or a rejection with annotated evidence. What a reviewer previously spent several minutes on per photograph, and what previously accumulated into weeks of backlog, now resolves before the inspector reaches their next site.
Two components required custom engineering that no off-the-shelf system could have provided. The first was the ROI tamper gating pipeline. Standard pixel anomaly detection flags every GPS banner, timestamp overlay, and map inset burned into photographs by mobile camera apps as a manipulation event.
Because nearly every authentic field submission carries these overlays, a naive forensic pass would reject legitimate submissions at a rate that made the system unusable.
We built a multi-step exclusion pipeline that detects metadata banner regions using edge-density heuristics and text bounding box positions, constructs a dynamic exclusion mask, and runs pixel anomaly scoring only across the raw regions of interest after the mask is subtracted.
The second was the dual-pass OCR engine that runs Latin and regional script families in parallel, merges detections spatially using confidence heuristics, and feeds combined output to a deployed language model using runtime-reloaded regional parsing templates.
The system integrates into the existing workflow at the submission point, replacing the moment a reviewer would open a photograph for manual inspection. Nothing upstream of submission changes. Nothing downstream of the verdict changes.
Administrators receive the same structured output they always processed -- approved, rejected, or flagged for escalation -- but the decision is now produced in forty seconds with a complete forensic audit trail rather than in days with hand-annotated notes.
Regional configuration templates, validation patterns, and parsing rules update at runtime without a service restart, so compliance teams can respond to regulatory format changes without involving engineering.
[Field photo submission] → [Normalisation and scaling] → [Parallel visual analysis: panel detection, person detection, clarity scoring, tamper forensics] → [Sequential accessory identification] → [Dual-pass OCR with regional script merge] → [Local LLM field extraction with grounding check] → [ROI anomaly filtering against exclusion mask] → [Structured verdict with annotated proof] → [Disbursement decision and audit trail]
04 · Engineering
Technical Deep Dive
This section explains system design choices, implementation trade-offs, and runtime behavior in a structured format for faster engineering review.
The hardest problem in solar installation photo verification automation is not inference speed or model accuracy on clean inputs. It is the gap between what a forensic system needs to flag and what a production system must silently ignore.
Every authentic field photograph is, by forensic standards, a manipulated image -- camera apps burn GPS banners, timestamps, map insets, and beneficiary watermarks directly into the pixel data. A system that treats any of these as anomalies produces a false positive rate that makes it worthless in production.
Building a system that detects structural fraud while ignoring benign overlays required custom engineering at every layer of the stack, not just model selection.
- 01 Technical Node
Dynamic Exclusion Mask for Tamper Gating
Standard pixel anomaly models compute a modification probability across the full image frame and report any region where the pixel distribution deviates from the expected noise floor. Applied to field photographs, this produces false positives on every GPS banner and timestamp overlay -- which appear on virtually every authentic submission. The exclusion mask pipeline solves this by building a spatial map of benign overlay regions before the forensic pass runs, using edge-density heuristics to detect rectangular metadata banners and text bounding box positions from the OCR stage to identify watermark zones. The anomaly analyser evaluates only the pixel regions that remain after subtracting this mask. The result is high sensitivity to structural fraud -- cloned panel regions, inpainted personnel -- precisely where it matters, with no false positives on camera overlays. This component is the one a naive implementation always skips and the reason naive implementations cannot be deployed in production.
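The gating step described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: it assumes the overlay boxes have already been found by the banner and OCR stages, and that some forensic model has already produced a per-pixel anomaly map. The names `build_exclusion_mask` and `gated_anomaly_score`, the box format, and the toy values are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive; hypothetical format


def build_exclusion_mask(height: int, width: int,
                         overlay_boxes: List[Box], pad: int = 2) -> List[List[bool]]:
    """Return a grid that is True wherever a benign overlay (banner, stamp) sits."""
    mask = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in overlay_boxes:
        for y in range(max(0, y0 - pad), min(height, y1 + pad)):
            for x in range(max(0, x0 - pad), min(width, x1 + pad)):
                mask[y][x] = True
    return mask


def gated_anomaly_score(anomaly_map: List[List[float]],
                        mask: List[List[bool]]) -> float:
    """Highest anomaly value among pixels that are NOT excluded (0.0 if none remain)."""
    kept = [score
            for row, mask_row in zip(anomaly_map, mask)
            for score, excluded in zip(row, mask_row)
            if not excluded]
    return max(kept, default=0.0)


# Toy 6x8 anomaly map: the bottom two rows mimic a burned-in GPS banner.
amap = [[0.0] * 8 for _ in range(6)]
for x in range(8):
    amap[4][x] = amap[5][x] = 0.9

mask = build_exclusion_mask(6, 8, [(0, 4, 8, 6)], pad=0)
full_frame = max(max(row) for row in amap)       # naive pass sees the banner
banner_only = gated_anomaly_score(amap, mask)    # gated pass ignores it

amap[1][2] = 0.8                                 # simulated spliced pixel in the ROI
with_splice = gated_anomaly_score(amap, mask)    # gated pass still sees the splice
```

In this toy case the full-frame score is dominated by the banner, while the gated score stays at zero until a modification appears outside the excluded region.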
- 02 Technical Node
Dual-Pass Multilingual OCR Engine
Field photographs capture text that mixes Latin characters with regional scripts, at varying angles, distances, and lighting conditions. A single-pass OCR engine tuned for one character family produces garbled output on mixed-script frames, and a unified multilingual model cannot be independently tuned per family at the accuracy threshold this application requires. The dual-pass architecture runs separate Latin and regional script recognisers in parallel, merging their outputs spatially using bounding box overlap and per-region confidence scoring. Text regions where both passes return a detection use the higher-confidence reading. Regions where only one pass returns a detection use that reading directly. The merge produces a combined text object that handles mixed-script frames without requiring a single model to handle both families simultaneously.
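The merge rule above (higher confidence wins where both passes overlap, single detections pass through) can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Detection` structure, the intersection-over-union overlap measure, and the 0.5 overlap threshold are hypothetical choices, and the sample strings are placeholders rather than real OCR output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1)
    conf: float
    script: str                     # "latin" or "regional"


def iou(a, b) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def merge_passes(latin: List[Detection], regional: List[Detection],
                 overlap: float = 0.5) -> List[Detection]:
    """Merge two OCR passes: overlapping regions keep the higher-confidence reading."""
    merged, used = [], set()
    for lat in latin:
        best_j, best_score = None, overlap
        for j, reg in enumerate(regional):
            score = iou(lat.box, reg.box)
            if j not in used and score >= best_score:
                best_j, best_score = j, score
        if best_j is None:
            merged.append(lat)                       # only the Latin pass saw this region
        else:
            used.add(best_j)
            reg = regional[best_j]
            merged.append(lat if lat.conf >= reg.conf else reg)
    merged.extend(r for j, r in enumerate(regional) if j not in used)
    return sorted(merged, key=lambda d: (d.box[1], d.box[0]))  # reading order


latin_pass = [
    Detection("Lat 26.9124", (0, 0, 100, 20), 0.95, "latin"),
    Detection("?#@", (0, 30, 100, 50), 0.30, "latin"),        # garbled regional text
]
regional_pass = [
    Detection("<regional line A>", (2, 31, 98, 49), 0.88, "regional"),
    Detection("<regional line B>", (0, 60, 50, 80), 0.80, "regional"),
]
merged_text = [d.text for d in merge_passes(latin_pass, regional_pass)]
```

Greedy one-to-one matching keeps the sketch short; a production merge might need to handle one Latin box overlapping several regional boxes.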
- 03 Technical Node
Language Model for Field Extraction
Raw OCR output from field photographs is noisy, unstructured, and mixed with background text that is not part of the data record. Extracting validated fields -- latitude, longitude, application number, beneficiary name -- requires a model that understands the regional layout and format rules that govern how this data appears in stamps. The language model runs inference locally, using runtime-reloaded regional configuration templates that define the parsing rules for each state's format. This means compliance teams can update parsing logic for new regional formats without a code deployment. A grounding pass verifies every extracted field against the raw OCR blocks before the field is written to the verdict, eliminating hallucinated values that would pass validation but did not appear in the image.
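A grounding pass of this kind can be approximated with a simple containment check. The sketch below is an assumption about one workable form, not the deployed logic: it treats a field as grounded when its whitespace-insensitive, case-insensitive value appears inside at least one raw OCR block. The function name, field names, and sample application number format are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def _normalise(text: str) -> str:
    """Drop whitespace and case so OCR spacing differences do not block a match."""
    return re.sub(r"\s+", "", text).casefold()


def ground_fields(extracted: Dict[str, str],
                  ocr_blocks: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split model output into fields found in the OCR text and fields that are not."""
    haystacks = [_normalise(block) for block in ocr_blocks]
    grounded, rejected = {}, []
    for name, value in extracted.items():
        needle = _normalise(value)
        if needle and any(needle in h for h in haystacks):
            grounded[name] = value
        else:
            rejected.append(name)        # never written to the verdict
    return grounded, rejected


ocr_blocks = ["Lat 26.9124 Long 75.7873", "App No: RJ-2023-00417"]
model_output = {
    "latitude": "26.9124",
    "application_number": "RJ-2023-00417",
    "beneficiary_name": "Ramesh Kumar",   # not present in the OCR text
}
grounded, rejected = ground_fields(model_output, ocr_blocks)
```

Exact containment is deliberately strict: a value the model "corrected" away from what the image shows is rejected rather than trusted.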
- 04 Technical Node
Semantic Segmentation for Asset Presence Verification
Confirming that active solar panels are physically installed and correctly positioned requires isolating the panel region within the image and computing spatial coverage metrics. The segmentation model identifies and outlines solar array components, produces coordinate and area outputs for each detected region, and confirms whether the detected configuration meets the presence threshold for that submission type. Open-vocabulary segmentation was chosen over a fixed-class object detector for accessory identification because the set of required accessories varies across deployment configurations and changes over program lifecycles. A model that accepts text prompts at inference time rather than encoding class lists at training time eliminates retraining when a new accessory category is introduced.
- 05 Technical Node
Biometric Presence Detection
Verifying that an operator or beneficiary is physically at the site requires detecting a person in the frame. The presence detection component confirms site visitation without storing biometric data. The output is a binary presence signal -- a person is or is not in the frame -- with no persistent biometric record created.
- 06 Technical Node
Cross-Application Duplicate Registry
A photograph submitted in one state region and resubmitted in another to claim a second disbursement is indistinguishable from a legitimate submission by visual inspection. Duplicate detection operates against a secure local registry of perceptual image signatures accumulated across all prior submissions. The match runs at submission time, before any downstream processing, and produces a duplicate flag that routes the submission to immediate rejection with evidence. The registry grows with every approved submission.
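To make the registry lookup concrete, here is a minimal sketch using an average hash and Hamming distance. The source does not say which perceptual hash the system uses, so the hash choice, the distance threshold, and the `DuplicateRegistry` class are assumptions for illustration. The input is a small grayscale grid; a real pipeline would first downscale the photograph with an image library.

```python
from typing import List, Optional


def average_hash(gray: List[List[int]]) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the grid mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


class DuplicateRegistry:
    """In-memory stand-in for the local signature registry (hypothetical API)."""

    def __init__(self, max_distance: int = 5):
        self.max_distance = max_distance
        self._entries = []  # list of (hash, submission_id)

    def find_match(self, image_hash: int) -> Optional[str]:
        for known, submission_id in self._entries:
            if hamming(known, image_hash) <= self.max_distance:
                return submission_id
        return None

    def register(self, image_hash: int, submission_id: str) -> None:
        self._entries.append((image_hash, submission_id))


# Toy 8x8 images: a gradient, a slightly brightened copy, and an unrelated image.
original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
brightened = [[min(255, p + 3) for p in row] for row in original]
unrelated = [[252 - p for p in row] for row in original]

registry = DuplicateRegistry()
registry.register(average_hash(original), "APP-001")
resubmission = registry.find_match(average_hash(brightened))
fresh = registry.find_match(average_hash(unrelated))
```

A linear scan is fine for a sketch; a registry holding a full submission history would likely need an index built for nearest-neighbour search over hashes.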
- 07 Technical Node
Hybrid Execution Architecture for Workstation Hardware
Running semantic segmentation, dual-pass OCR, pixel forensics, and language model inference simultaneously on workstation-class hardware creates memory pressure that causes out-of-memory crashes on naive parallel implementations. The hybrid execution architecture separates tasks into two tiers: lightweight, non-interfering tasks run concurrently through a thread-pool executor, while high-memory models -- specifically the open-vocabulary segmentation model -- are queued sequentially with explicit GPU cache reclamation and garbage collection between phases. This keeps the full pipeline stable on standard workstations without dedicated VRAM budgets.
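The two-tier scheduling idea can be sketched with the standard library alone. This is a simplified illustration under assumptions: `run_pipeline` and the task names are hypothetical, and `release_gpu_cache` is a placeholder hook. In a PyTorch-based deployment that hook might be `torch.cuda.empty_cache`, but the source does not name the framework.

```python
import gc
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_pipeline(image: Any,
                 light_tasks: Dict[str, Callable[[Any], Any]],
                 heavy_tasks: Dict[str, Callable[[Any], Any]],
                 release_gpu_cache: Callable[[], None] = lambda: None,
                 max_workers: int = 4) -> Dict[str, Any]:
    """Run light tasks concurrently, then heavy tasks one at a time with cleanup."""
    results: Dict[str, Any] = {}

    # Tier 1: lightweight, non-interfering tasks share a thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn, image) for name, fn in light_tasks.items()}
        for name, future in futures.items():
            results[name] = future.result()

    # Tier 2: high-memory models run sequentially; memory is reclaimed between phases.
    for name, fn in heavy_tasks.items():
        results[name] = fn(image)
        release_gpu_cache()   # placeholder for framework-specific cache release
        gc.collect()
    return results


reclaimed = []
results = run_pipeline(
    "frame",
    light_tasks={"clarity": len, "person_present": lambda img: True},
    heavy_tasks={"segmentation": lambda img: img.upper()},
    release_gpu_cache=lambda: reclaimed.append(1),
)
```

The trade-off is latency for stability: heavy models never overlap in memory, so peak usage is bounded by the largest single model rather than by their sum.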
- 08 Technical Node
Regional Configuration and Runtime Reload
Validation patterns, coordinate range rules, application number templates, and language model parsing prompts vary across state programs and change over program lifecycles. Encoding these as static configuration requires a code deployment every time a format changes. The dynamic configuration layer auto-reloads the regional profile at runtime, allowing administrators to modify any validation rule without restarting the service. This means the system can serve multiple state regions from a single deployment and respond to regulatory format changes on the same day they are announced.
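One simple way to get reload-without-restart behaviour is to re-read the profile whenever the file on disk changes. The sketch below assumes a JSON profile and a change check based on modification time and size; the `RegionalProfile` class, the `application_number_pattern` key, and the sample patterns are hypothetical and not taken from the deployed system.

```python
import json
import os
import re
import tempfile


class RegionalProfile:
    """Validation rules for one region, re-read whenever the file changes."""

    def __init__(self, path: str):
        self.path = path
        self._stamp = None
        self._rules = {}

    def rules(self) -> dict:
        stat = os.stat(self.path)
        stamp = (stat.st_mtime_ns, stat.st_size)
        if stamp != self._stamp:                      # file edited since last read
            with open(self.path, encoding="utf-8") as fh:
                self._rules = json.load(fh)
            self._stamp = stamp
        return self._rules

    def valid_application_number(self, value: str) -> bool:
        pattern = self.rules()["application_number_pattern"]
        return re.fullmatch(pattern, value) is not None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "region.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"application_number_pattern": r"RJ-\d{4}-\d{5}"}, fh)

    profile = RegionalProfile(path)
    before = profile.valid_application_number("RJ-2023-00417")

    # An administrator edits the file; no restart, no code change.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"application_number_pattern": r"RJ/\d{4}/\d{6}",
                   "note": "format changed"}, fh)

    after_old = profile.valid_application_number("RJ-2023-00417")
    after_new = profile.valid_application_number("RJ/2024/000417")
```

Checking the file on each access keeps the sketch free of background threads; a file watcher would be an alternative if per-request stat calls became a concern.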
- 09 Technical Node
Pixel anomaly detection with ROI gating
Forensic modification scoring applied after dynamic exclusion mask subtraction -- chosen over standard full-frame anomaly detection because benign camera overlays on authentic submissions produce false positives that make full-frame scoring unusable in production
- 10 Technical Node
Dual-pass OCR with confidence merge
Parallel Latin and regional script recognisers with spatial overlap merging -- chosen over a unified multilingual model because independent per-family tuning is required to reach the accuracy threshold this application demands on mixed-script field photographs
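- 11 Technical Node
Deployed language model
Local inference for structured field extraction with runtime-reloaded regional templates -- chosen over a cloud LLM because field photographs contain biometric data and precise location telemetry subject to program privacy mandates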
- 12 Technical Node
Open-vocabulary segmentation
Text-prompt-driven accessory identification at inference time -- chosen over fixed-class object detection because required accessory categories change across program versions and cannot be encoded as a static classifier without retraining cycles
- 13 Technical Node
Perceptual hash duplicate registry
Image signature matching against full prior submission history at submission time -- chosen over metadata-based duplicate checks because a resubmitted photograph carries no metadata that identifies it as a duplicate, only its pixel content does
- 14 Technical Node
Hybrid thread-pool and sequential executor
Concurrent lightweight tasks with sequentially queued high-memory models and explicit cache reclamation -- chosen over a fully parallel execution model because simultaneous segmentation and forensic model loads exceed workstation VRAM budgets and cause out-of-memory crashes
- 15 Technical Node
Dual API representation
Separate internal telemetry endpoint and filtered public endpoint -- chosen to prevent forensic evidence detail from leaking through downstream enterprise integrations while providing full audit trail access to compliance teams
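The dual representation can be illustrated as two views over one verdict record. This is a sketch under assumptions: the field names and the allow-list approach are hypothetical, chosen to show how a public view can be derived without ever copying forensic detail.

```python
from typing import Any, Dict

# Only these keys may leave through the public endpoint (hypothetical names).
PUBLIC_FIELDS = {"submission_id", "verdict", "reason_codes"}


def public_view(internal_verdict: Dict[str, Any]) -> Dict[str, Any]:
    """Allow-list filter: anything not explicitly public is dropped."""
    return {k: v for k, v in internal_verdict.items() if k in PUBLIC_FIELDS}


internal = {
    "submission_id": "APP-001",
    "verdict": "rejected",
    "reason_codes": ["DUPLICATE_IMAGE"],
    "anomaly_heatmap_path": "/audit/APP-001/heatmap.png",  # forensic detail
    "matched_submission": "APP-000",                       # forensic detail
    "gps": {"lat": 26.9124, "lon": 75.7873},               # location telemetry
}
external = public_view(internal)
```

An allow-list fails closed: a forensic field added to the internal record later stays internal until someone deliberately publishes it.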
05 · Outcomes
Results
Verification cycle time dropped from up to 14 business days to under 40 seconds -- eliminating backlogs as an operational category
These results shifted what program administrators could promise to their stakeholders. A completion timeline that previously depended on reviewer headcount and queue depth now depends only on submission volume -- a fundamentally different planning conversation.
Compliance reconstructions, which previously required retrieving reviewer notes and interviewing inspectors, are now produced from a complete structured audit trail generated automatically for every record. The fraud vectors that programs had learned to accept as undetectable are no longer in the threat model.
And for the first time, program expansion into new state regions does not require proportional expansion of the audit team -- it requires a configuration file update.
- Verification cycle time: Under 40 seconds -- approval backlogs ceased to exist
- Maximum cycle time reduction: 10,080x faster (derived from 14 business days vs 40 seconds) -- submissions resolve before inspectors leave the site
- External PII exposure: Zero -- covers biometrics and GPS telemetry
- → Verification cycle up to 10,080x faster (derived) -- what previously required up to 14 business days of queue time now resolves in under forty seconds, meaning a field inspector can receive a confirmation verdict before driving to the next site.
- → Pixel-level splicing detection deployed across 100% of submissions -- a fraud vector that was entirely invisible to human review is now screened on every photograph at submission time, with no increase in reviewer workload.
- → Cross-application duplicate detection operational in real time -- photographs submitted across multiple regional applications are matched against the full submission registry at the moment of upload, before any processing occurs, preventing duplicate disbursements regardless of the geographic distance between claims.
- → Regional format validation now deterministic -- application numbers, coordinate ranges, and beneficiary name formats are validated against state-specific templates with no reviewer interpretation, eliminating the class of errors that occur when a reviewer applies the wrong regional format rules.
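The 10,080x figure is labelled as derived, but the derivation is not shown. One reading that reproduces it exactly is to count each business day as 8 working hours; that assumption is ours, not stated in the source. The short calculation below shows that reading alongside the calendar-day alternative.

```python
BUSINESS_DAYS = 14
HOURS_PER_BUSINESS_DAY = 8      # assumption: not stated in the source
AFTER_SECONDS = 40

# Working-hours basis: 14 days x 8 hours x 3600 seconds.
before_working_seconds = BUSINESS_DAYS * HOURS_PER_BUSINESS_DAY * 3600
speedup_working = before_working_seconds / AFTER_SECONDS

# Calendar basis for comparison: 14 days x 24 hours x 3600 seconds.
before_calendar_seconds = BUSINESS_DAYS * 24 * 3600
speedup_calendar = before_calendar_seconds / AFTER_SECONDS

print(f"working-hours basis: {speedup_working:,.0f}x")    # 10,080x
print(f"calendar-day basis:  {speedup_calendar:,.0f}x")   # 30,240x
```

Both are upper bounds, since they compare the worst-case 14-day cycle with the 40-second ceiling.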
06 · Process
How We Worked
Delivery roadmap across discovery, engineering, validation, and rollout.
STEP 1: Field Photo Fraud Pattern Mapping
We began by cataloguing every failure mode that human review was encountering in production: coordinate misreads, application number format errors, missed accessory items, and the two forensic fraud categories that were undetectable by any manual means.
This phase identified that the primary bottleneck was not reviewer throughput but detection physics -- the fraud vectors that mattered most required capability that no human process could develop. That finding determined the architecture from the outset.
STEP 2: Architecture Decision
Before any model selection occurred, we resolved the data residency constraint. Field photographs contain biometric data and precise location telemetry subject to program privacy mandates.
STEP 3: ROI Tamper Gate and Dual-Pass OCR Engineering
The two custom components that determined the system's production viability were built and validated in this phase.
The dynamic exclusion mask pipeline was the critical piece: we built it by cataloguing the specific overlay types produced by mobile camera apps in the target deployment environment and designing the exclusion heuristics around the real distribution of benign modifications on authentic submissions.
The dual-pass OCR merge logic was tuned against real field photographs drawn from the program's existing submission archive.
STEP 4: Regional Format Validation and Adversarial Testing
Every state region's application number format, coordinate range specification, and script mixing pattern was encoded into the regional configuration layer and tested against the full archive of prior submissions -- including known fraudulent ones.
This phase produced the exclusion mask calibration, the OCR confidence thresholds, and the duplicate registry construction. It also validated that the ROI gating pipeline produced no false positives on any authentic submission in the test archive.
STEP 5: Deployment and Compliance Handoff
Deployment on workstation-class hardware at the program's operational sites required final calibration of the hybrid execution architecture -- confirming that the sequential queuing logic for high-memory models produced stable throughput under concurrent submission loads.
The compliance handoff included documentation of the audit trail format, the regional configuration reload procedure, and the escalation routing for submissions flagged as ambiguous. Program administrators took over regional configuration management on day one without engineering support.
07 · Future
What Comes Next
The regional configuration layer positions the system to absorb new state formats and compliance requirements without code changes. As additional regions are onboarded, the forensic evidence archive grows, and that archive is the training substrate for the next generation of detection models.
Duplicate registry coverage, fraud pattern libraries, and regional format corpora all compound with every submission the system processes.
The next engineering phase -- active fraud pattern learning from confirmed rejection cases -- becomes tractable because the current system is already producing labelled forensic evidence at scale.