Detecting Oil Leaks with AI: Overcoming Data Scarcity with Generative Models

Building an AI classifier for oil leak detection using synthetic data generation to overcome the challenge of limited real-world examples, showcasing the power of generative models in solving practical industrial problems.

Lavanya Shirur Sudhakar

April 27, 2025

Artificial Intelligence

Oilfield services companies are some of the largest employers in Houston, and detecting oil leaks is a problem that many of these companies face. However, traditional methods of building AI models for this purpose are limited by the scarcity of real-world data. Oil spills are rare, and images are often proprietary or unavailable in the public domain. Even stock photo websites primarily provide AI-generated content, making it difficult to build robust datasets.

Our goal was to build an AI model that classifies whether an image contains an oil leak or not. Tackling this problem was only made possible through recent advances in generative AI. In this blog, we'll walk through how we:

Used Stable Diffusion to generate onshore oil rig images.
Manually inpainted oil spills to create synthetic datasets.
Evaluated multiple platforms (Roboflow, OpenAI's fine-tuning API, and Azure Custom Vision) to find the best solution for our needs.
Trained and tested a classifier that accurately predicts oil leaks from images.

Generating Oil Rig Images with Stable Diffusion

To start, we needed realistic images of oil rigs, but generating both an oil rig and a visible oil spill in a single pass proved difficult. We therefore focused first on creating base images of onshore oil rigs, using Stable Diffusion through the AUTOMATIC1111 web UI. Crafting the right prompt was critical to ensure the images showed West Texas-style rod pumps rather than offshore rigs. Below is an example of one of our prompts:

Prompt: "onshore oil rig with rod pumps, desert environment, sunrise lighting"
Negative Prompt: "offshore rig, ocean, water"
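
For readers who prefer scripting to clicking through the web UI, the prompt above can also be sent to the web UI's HTTP API. The following is a minimal sketch, not the workflow we used: it assumes the web UI was launched with its `--api` flag and listens on the default local address, and the step count, image size, seeds, and output file names are illustrative assumptions.

```python
import base64
import json
import urllib.request

# Assumption: AUTOMATIC1111 web UI started with --api on the default local port.
API_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"

PROMPT = "onshore oil rig with rod pumps, desert environment, sunrise lighting"
NEGATIVE_PROMPT = "offshore rig, ocean, water"


def build_payload(seed, steps=30, width=768, height=512):
    """Build the txt2img request body. steps/width/height are assumed values."""
    return {
        "prompt": PROMPT,
        "negative_prompt": NEGATIVE_PROMPT,
        "steps": steps,
        "width": width,
        "height": height,
        "seed": seed,
        "batch_size": 1,
    }


def generate_image(seed, out_path):
    """Request one image from the local web UI and write it to out_path."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(seed)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # The API returns generated images as base64-encoded strings.
    with open(out_path, "wb") as image_file:
        image_file.write(base64.b64decode(body["images"][0]))


def generate_base_images(count=20):
    """Generate `count` base images, one per seed (file names are hypothetical)."""
    for seed in range(count):
        generate_image(seed, f"base_{seed:02d}.png")
```

Calling `generate_base_images()` with the web UI running would produce 20 files, matching the number of base images described in this post. Fixing the seed per image keeps each result reproducible.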

After generating 20 base images, we proceeded to manually simulate oil spills on these rigs using inpainting techniques.

Inpainting Oil Leaks for Synthetic Dataset Creation

Since oil spills are not random, we used domain knowledge to guide where the leaks should appear. Typically, oil leaks occur around boreholes, valves, and pipe junctions. We:

Manually painted leak-prone areas on each image (usually around the borehole).
Guessed how the oil would flow and spread from the source.
Generated 20 variations per base image, resulting in a dataset of 400 synthetic images.

Some images intentionally resembled water spills to help train the model to differentiate between oil and other liquids. These non-leak images were kept as negative samples for training purposes.
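
To keep track of a dataset like this, it helps to build a manifest that records each image's base rig and its label. The sketch below shows one way this could look; the file naming scheme and the `label` field are hypothetical, and labels are left empty because they were assigned by hand after inspecting each inpainted result.

```python
from itertools import product

BASE_IMAGES = 20          # base rigs generated with Stable Diffusion
VARIATIONS_PER_BASE = 20  # inpainted variations of each base rig


def build_manifest():
    """Return one row per synthetic image (20 x 20 = 400 rows)."""
    rows = []
    for base, variation in product(range(BASE_IMAGES), range(VARIATIONS_PER_BASE)):
        rows.append({
            # Hypothetical naming scheme, used only for illustration.
            "file": f"base_{base:02d}_var_{variation:02d}.png",
            "base": base,
            # Filled in manually: "OilLeak", or "Negative" for water-like spills.
            "label": None,
        })
    return rows
```

Recording the base image index is a design choice: variations of the same rig look alike, so the index makes it possible to group them when the data is later divided.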

Choosing the Right Platform: Roboflow, OpenAI, or Azure Custom Vision

Different platforms suit different versions of this problem. If you want to identify the specific regions of an image that contain a leak, Roboflow or an object detection model (e.g., YOLO) is a good fit. OpenAI's fine-tuning API is another option for adapting a pre-trained model to this task.

Since our primary goal was to detect the presence of oil leaks (not their exact location), we opted for Azure Custom Vision, which offered a simpler, binary classification solution.

Training the Classifier on Azure Custom Vision

Using the synthetic dataset, we trained our model with Azure Custom Vision. The dataset was split into:

70% for training
15% for validation
15% for testing
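
A 70/15/15 split of this kind can be produced with a seeded shuffle before the images are uploaded. This is a minimal sketch under assumptions: the function name, the seed, and the use of plain file names are illustrative, not a description of our exact tooling.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=42):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # remainder, roughly 15%
    return train_set, val_set, test_set


# Example with 400 placeholder file names.
files = [f"image_{i:03d}.png" for i in range(400)]
train_set, val_set, test_set = split_dataset(files)
```

For 400 images this yields 280 training, 60 validation, and 60 test images.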

The model was trained as a binary classifier to determine whether an oil leak was present in the image. Below are the results from Iteration 3 of our model training:

Precision: 83.1%
Recall: 83.1%
Average Precision (AP): 92.4%

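Precision is the share of images flagged as leaks that really contain one, so a higher value means fewer false positives. Recall is the share of actual leak images that the model catches. At 83.1% each, the model is balanced between the two.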
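
As a reminder of how these two numbers are computed, here is a small worked example. The counts are made up purely for illustration and are not our confusion matrix; they show that precision and recall come out equal whenever the number of false positives equals the number of false negatives.

```python
def precision(true_pos, false_pos):
    """Share of predicted leaks that are real leaks."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Share of real leaks that the model detected."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Illustrative counts only (not results from the article).
tp, fp, fn = 40, 10, 10
p = precision(tp, fp)  # 40 / 50
r = recall(tp, fn)     # 40 / 50
```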

Model Testing: Quick Test Example

To demonstrate the model's capabilities, we tested it on an example image showing a simulated oil spill along a shoreline.

The results were as follows:

Prediction: OilLeak
Probability: 99.3%
Negative Probability: 0.6%

The model identified the oil spill with 99.3% confidence, an encouraging sign that it can detect leaks across different environments.
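
In an application, a result like this has to be turned into a yes/no decision. The sketch below assumes the prediction response is JSON with a `predictions` list of `tagName` and `probability` entries, which is the shape we understand the Custom Vision prediction endpoint to return; the 0.5 threshold and the helper names are assumptions for illustration.

```python
def top_prediction(response):
    """Return (tag_name, probability) for the highest-scoring tag."""
    best = max(response["predictions"], key=lambda p: p["probability"])
    return best["tagName"], best["probability"]


def is_leak(response, threshold=0.5, leak_tag="OilLeak"):
    """True when the top tag is the leak tag and clears the threshold."""
    tag, probability = top_prediction(response)
    return tag == leak_tag and probability >= threshold


# Sample response mirroring the quick test values above.
sample = {
    "predictions": [
        {"tagName": "OilLeak", "probability": 0.993},
        {"tagName": "Negative", "probability": 0.006},
    ]
}
tag, probability = top_prediction(sample)
```

Raising the threshold trades recall for precision, which may be worthwhile if false alarms are costly.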


Conclusion: AI for Real-Time Oil Leak Detection

Our work demonstrates that generative AI can address data scarcity challenges in oil leak detection, opening new possibilities for automated monitoring in oilfield operations. With 83.1% precision and recall, our model is ready to assist companies in real-time leak detection, improving both safety and regulatory compliance.

While the current version detects leaks without pinpointing their exact location, future iterations could use object detection models or Roboflow to locate specific leak areas. Additionally, fine-tuning the model with real-world data, when available, could further enhance performance.

By integrating this solution into monitoring systems and camera feeds, oilfield companies can reduce environmental risks, prevent downtime, and avoid fines from undetected leaks. This is just the beginning: AI and synthetic datasets are paving the way for more efficient and sustainable operations in the oilfield services industry.