

Anomaly Detection with Isolation Forest and Qdrant


This beginner tutorial uses the Isolation Forest algorithm for anomaly detection and Qdrant for storage and visualization.

In this specific demo, we will explore a Stronghold ruin and look for ghosts. Our hero will go through all the Shadows and see if he can come across anomalies, aka Wraiths. Take a look at the YouTube demo:

👉 To just run the code - here is the Python notebook

Introduction

This tutorial will guide you through implementing an anomaly detection system using Qdrant, a vector search database, and the Isolation Forest algorithm from Scikit-Learn. By the end of this tutorial, you will:

  1. Generate synthetic data containing normal and anomalous samples.
  2. Store and retrieve vector embeddings in Qdrant.
  3. Train an Isolation Forest model for anomaly detection.
  4. Update Qdrant with anomaly labels.
  5. Visualize vectors and anomalies using PCA and the Qdrant dashboard.

Prerequisites

  1. Install and run Qdrant locally. Here is the documentation. A quick Docker option is shown right after this list.

  2. Ensure you have the necessary libraries installed. Run the following command:

pip install qdrant-client scikit-learn numpy matplotlib
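
If you don't have a local instance for step 1 yet, the standard way to start Qdrant with Docker (assuming Docker is installed) is:

docker run -p 6333:6333 qdrant/qdrant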

Step 1: Import Dependencies

First, import the required libraries:

import numpy as np
import matplotlib.pyplot as plt
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
from sklearn.ensemble import IsolationForest
  • numpy: Used for numerical operations and data generation.
  • matplotlib.pyplot: Used for visualization.
  • qdrant_client: Qdrant's Python client for vector storage and retrieval.
  • IsolationForest: The algorithm used for anomaly detection.


Step 2: Generate Synthetic Data

To simulate real-world data, we create 490 normal data points that are randomly distributed around a mean value of 0.5 with some variance. Additionally, we generate 10 anomalous data points that are positioned further away, making them easier to detect.

# Set random seed for reproducibility
np.random.seed(42)

# Generate normal embeddings (centered around 0.5)
normal_data = np.random.normal(loc=0.5, scale=0.1, size=(490, 128))

# Generate anomalies (farther from the normal cluster)
anomalies = np.random.normal(loc=1.5, scale=0.3, size=(10, 128))

# Combine normal and anomalous data
data = np.vstack([normal_data, anomalies])

# Print data shape
print(f"Generated {data.shape[0]} vectors of dimension {data.shape[1]}")

By keeping anomalies at a different mean value (1.5), they are positioned distinctly from the normal points, making them detectable through distance-based methods.


Step 3: Connect to Qdrant and Create a Collection

Qdrant is a vector search engine that allows efficient similarity search. In this specific tutorial, we will only use Qdrant to store vectors and to visualize their distribution. The anomaly detection will be handled by the Isolation Forest algorithm.

In another tutorial, we will use Qdrant's API to detect outliers. For now, Qdrant can be used to store and retrieve vectors at great speeds.

We first connect to a locally running instance and create a collection named Stronghold, specifying a 128-dimensional vector space using cosine distance as the similarity measure.

# Connect to Qdrant (Assuming running locally)
client = QdrantClient("http://localhost:6333")

# Recreate the collection (this deletes any existing collection with the same name)
collection_name = "Stronghold"

client.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=128, distance=Distance.COSINE)
)
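
Note that recreate_collection drops any existing collection with the same name before creating a new one. In newer qdrant-client releases it is also deprecated; a minimal sketch of the replacement pattern (assuming qdrant-client 1.7 or later) is:

# Equivalent pattern without the deprecated recreate_collection
if client.collection_exists(collection_name):
    client.delete_collection(collection_name=collection_name)

client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=128, distance=Distance.COSINE)
)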

Step 4: Insert Data into Qdrant

We now insert the generated data points into the Qdrant database. Each data point is assigned an ID and stored with a default label of "unknown".

# Insert data into Qdrant
points = [
    PointStruct(id=i, vector=vector.tolist(), payload={"label": "unknown"})
    for i, vector in enumerate(data)
]
client.upsert(collection_name=collection_name, points=points)

print(f"Inserted {len(points)} vectors into Qdrant.")

Open up the Qdrant Dashboard at http://localhost:6333/dashboard. Your collection should be there, with the appropriate config.


Step 5: Retrieve Stored Vectors

To validate that the data has been successfully stored, we retrieve the stored vectors from Qdrant.

# with_vectors=True is required here - scroll omits vectors by default
retrieved_points = client.scroll(collection_name=collection_name, limit=500, with_vectors=True)[0]

# Extract vectors and IDs
retrieved_vectors = np.array([point.vector for point in retrieved_points if point.vector is not None])
retrieved_ids = [point.id for point in retrieved_points]

print(f"Retrieved {len(retrieved_vectors)} vectors from Qdrant.")

Step 6: Train Isolation Forest for Anomaly Detection


We train an Isolation Forest model to detect anomalies in the dataset. Isolation Forest isolates points through random splits; anomalies typically require fewer splits to isolate than normal points, which is what makes them stand out.

# Train Isolation Forest model
# contamination = expected fraction of anomalies; our data holds 10/500 = 2%,
# so 0.05 is a deliberately loose upper bound
iso_forest = IsolationForest(contamination=0.05, random_state=42)
predictions = iso_forest.fit_predict(data)

# Convert predictions (-1 = anomaly, 1 = normal)
anomaly_labels = ["Wraith" if p == -1 else "Shadows" for p in predictions]

# Count anomalies
print(f"âś… Detected {anomaly_labels.count('Wraith')} anomalies out of {len(data)} vectors.")

Step 7: Update Qdrant with Anomaly Labels

We now update the stored vectors in Qdrant with their anomaly classification.

# Update payloads in Qdrant with anomaly labels and image URLs
for i, point_id in enumerate(retrieved_ids):
    # Define image URL based on label
    image_url = "https://i.ibb.co/Q7z72wq3/shadows.png" if anomaly_labels[i] == "Shadows" else "https://i.ibb.co/NnS6DV5z/wraith.png"

    client.set_payload(
        collection_name="Stronghold",
        points=[point_id],
        payload={
            "anomaly": anomaly_labels[i],
            "image_url": image_url
        }
    )
print("âś… Updated Qdrant with anomaly labels and image URLs.")

Step 8: Discover Anomalies with Qdrant


Qdrant provides a web UI where you can inspect the stored vectors and their associated metadata.

  1. Open your web browser and go to http://localhost:6333/dashboard.

  2. Navigate to the Stronghold collection.

  3. Use the search functionality to inspect vectors labeled as anomalies (Wraith).

  4. You can filter by payload: anomaly: Wraith.
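
The same payload filter can be applied from Python as well; a sketch using scroll with a filter condition:

from qdrant_client.models import Filter, FieldCondition, MatchValue

# Retrieve only the points labeled as Wraith
wraiths, _ = client.scroll(
    collection_name=collection_name,
    scroll_filter=Filter(
        must=[FieldCondition(key="anomaly", match=MatchValue(value="Wraith"))]
    ),
    limit=50,
)
print(f"Found {len(wraiths)} points labeled Wraith.")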


Step 9: Visualize Anomalies with PCA


To visualize the results, we use Principal Component Analysis (PCA) to reduce the vector dimensions to 2D for plotting.

from sklearn.decomposition import PCA

# Reduce dimensions to 2D using PCA
pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)

# Assign colors based on anomaly labels
colors = ["red" if label == "Wraith" else "blue" for label in anomaly_labels]

# Scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(data_2d[:, 0], data_2d[:, 1], c=colors, alpha=0.7)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("Anomaly Detection Visualization")
plt.show()

Step 10: Visualize Anomalies with Qdrant

Use the visualization feature of Qdrant to inspect vectors labeled as anomalies (Wraith).

Here is a sample JSON configuration to generate the graph:

{
  "limit": 500,
  "color_by": {
    "payload": "anomaly"
  }
}

Explore the associated image URLs and metadata to understand how anomalies are distributed.

You can hover over each point and see the content of the metadata.


👉 To just run the code - here is the Python notebook

Conclusion

This tutorial demonstrated how to:

  1. Generate synthetic data.
  2. Store and retrieve vectors in Qdrant.
  3. Train an Isolation Forest model for anomaly detection.
  4. Update Qdrant with anomaly labels.
  5. Visualize anomalies using PCA.

With these techniques, you can apply anomaly detection in real-world scenarios like fraud detection, network intrusion detection, and more!


GPT is Dead: The Rise of DeepSeek


It’s not just GPT, or Llama or Qwen

It’s the entire idea that you have to spend $60-100 million on a frontier model.

DeepSeek-V3 cost $6 million and roughly 2.8M GPU hours. That makes models like GPT-4o or Llama 3.1 at least 10x as expensive. Even Andrej Karpathy had to chime in:

[Screenshot: Andrej Karpathy's post]

DeepSeek-V3 is a mixture-of-experts (MoE) transformer with 671 billion parameters, 37B of which are active for each token. The team trained the model for less than a tenth of what it took to train Llama 3.1.

By launching this model in late December 2024, DeepSeek has redefined the standard for Large Language Models.

Engineering is all about solving big problems with fewer resources. That’s exactly what DeepSeek’s young engineers did…and they wrote a world-class academic report.


The real world is not a Jupyter Notebook

For those of you not living in Production Land - just a reminder: working with LLMs costs a lot of compute. Even a basic hosted GPU can start at $500 on GCP. Experimenting with compute-heavy applications can quickly add up.

In all of this hype, DeepSeek has reminded us what peak performance looks like. In the first 15 days of using V3, here is the compute bill for an enterprise-scale user:

[Screenshot: 15-day compute bill]

Just for reference - a simple request is 500 tokens. A codebase search is around 50,000.

274,000,000 tokens is a lot of LLM - most likely some type of agentic system where a model accesses multiple tools and keeps watch over its own results.
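
As a back-of-envelope sketch of why token pricing dominates here (the unit price below is a placeholder, not an official DeepSeek rate - substitute your provider's actual pricing):

# Placeholder price - NOT an official rate; check your provider's pricing page
price_per_million_tokens = 0.27  # USD per 1M input tokens (hypothetical)

total_tokens = 274_000_000
estimated_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"Estimated bill: ${estimated_cost:,.2f}")  # ~$73.98 at this placeholder rate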

DeepSeek's cost is a far cry from what we’ve seen on the customer side - with tens of thousands of dollars being spent on corporate experiments.

Here is a comparison with GPT-4o on Azure (source):

[Chart: cost comparison with GPT-4o on Azure]

This is what competition looks like.

Let’s look at the benchmarks

A meta-analysis of DeepSeek’s own benchmark results can be compiled from the report’s findings. DeepSeek-V3 is the best across all domains. This sounds too good to be true.

[Chart: DeepSeek-V3 benchmark results from the report]

Users are already stress-testing DeepSeek-V3. Though we are only four weeks in, I found a great hands-on report by Sunil Kumar Dash from Composio.

His conclusions:

Category         | Best Model        | Second Best       | Third Best
---------------- | ----------------- | ----------------- | --------------
Reasoning        | DeepSeek-V3       | Claude 3.5 Sonnet | OpenAI GPT-4o
Math             | DeepSeek-V3       | Claude 3.5 Sonnet | OpenAI GPT-4o
Coding           | Claude 3.5 Sonnet | DeepSeek-V3       | OpenAI GPT-4o
Creative Writing | Claude 3.5 Sonnet | DeepSeek-V3       | OpenAI GPT-4o

Though its reasoning capabilities are not beyond o1, the Chinese model holds its own against GPT-4o and Claude. Moreover, its Chain of Thought reasoning works well.

[Screenshot: DeepSeek-V3 chain-of-thought example]

Yes - it does pass the “strawberry” question.

No - it isn’t as consistent across all domains. In practice - it isn't perfect.

All of this checks out with the average user experience reported on r/LocalLlama, where DeepSeek-V3 is a major source of hype.

Recoil42, a top commenter on this subreddit:

The catch is cost. Deepseek offers maybe 75% of the performance as Sonnet but at a very small fraction of the cost. It was trained at a very small fraction of the cost, and asks users for a small fraction of the cost. That's why it's in a league of its own. I used Cline last night and maybe thirty minutes of casual coding clocked me 1.50 dollars. Two hours of DeepSeek usage clocked me maybe 15 cents. It's not even close.

Sonnet is better. Definitely, concretely better. It solves problems for me that leave DeepSeek spinning in circles. But the cost-efficiency of DeepSeek is a crazy eyebrow-raiser — it is cheap enough to be effectively used unmetered for most people.

These days I default to DeepSeek and only tag Sonnet into the ring when a problem is particularly difficult to solve. For writing boilerplate, doing basic lookup, and writing simple functions — DeepSeek is unmatched.

The general-purpose LLM for enterprises

In my first year of working with enterprise services for vector databases, I learned how valuable open-source is for business.

The use cases are not that complex - but the data must stay private.

Though Database as a Service is well known by this point, many users opt for Bring Your Own Cloud offerings, mostly due to compliance conflicts.

This is why I believe that DeepSeek-V3 is the heavy hitter for 2025. We are in the world of production now, and GenAI systems built with open-source components are about to start scaling. They will cost a lot of money.

All you need is memory

DeepSeek-V3 is rough around the edges - but this does not matter when you use a vector database like Qdrant.


Feeding DeepSeek-V3 relevant context beyond its 128K context window is the ultimate RAG scenario. But you already know about RAG, and you most certainly know about Qdrant.
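
As a rough sketch, the retrieval half of that loop with Qdrant could look like this (the embed function, the "docs" collection, and the payload field "text" are placeholders for your own setup):

# Minimal RAG retrieval sketch - `embed` and the "docs" collection are hypothetical
def retrieve_context(client, question_vector, k=5):
    hits = client.search(
        collection_name="docs",        # placeholder collection of chunked documents
        query_vector=question_vector,
        limit=k,
    )
    return "\n\n".join(hit.payload["text"] for hit in hits)

# prompt = f"Context:\n{retrieve_context(client, embed(question))}\n\nQuestion: {question}"
# ...then send the prompt to a DeepSeek-V3 endpoint of your choice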

Tutorial

Qdrant prepared a minimal code implementation for you to copy and scale as part of your system. Give it a try and report back to me.


I'd like to know about your total cost, response accuracy and system scalability.

Feel free to reach out to me via GitHub or publicly on social LinkedIn.