

Installation

pip install delve-taxonomy

Delve Client

The main class for interacting with Delve programmatically.

Basic Usage

from delve import Delve

# Initialize with defaults
delve = Delve()

# Run taxonomy generation
result = delve.run_sync("data.csv", text_column="text")

# Access results
print(result.taxonomy)
print(result.labeled_documents)

Initialization

from delve.console import Verbosity

delve = Delve(
    model="anthropic/claude-sonnet-4-5-20250929",
    fast_llm="anthropic/claude-haiku-4-5-20251001",
    sample_size=100,
    batch_size=200,
    max_num_clusters=5,
    use_case="Generate taxonomy for my data",
    output_dir="./results",
    output_formats=["json", "csv", "markdown"],
    verbosity=Verbosity.NORMAL,  # SILENT, QUIET, NORMAL, VERBOSE, DEBUG
    predefined_taxonomy=None,  # Use existing taxonomy instead of generating
    embedding_model="text-embedding-3-large",
    classifier_confidence_threshold=0.0
)

Configuration Options

model
string
default:"anthropic/claude-sonnet-4-5-20250929"
Main LLM model for taxonomy generation and reasoning.
delve = Delve(model="anthropic/claude-opus-4")
Supported models:
  • anthropic/claude-sonnet-4-5-20250929 (recommended)
  • anthropic/claude-opus-4 (most capable)
  • anthropic/claude-haiku-4-5-20251001 (fastest/cheapest)
  • Any model supported by LiteLLM
fast_llm
string
default:"anthropic/claude-haiku-4-5-20251001"
Faster model for document summarization to reduce costs.
delve = Delve(fast_llm="anthropic/claude-haiku-4-5-20251001")
Use a faster, cheaper model for the summarization step.
sample_size
integer
default:"100"
Number of documents to sample for taxonomy generation.
delve = Delve(sample_size=200)
Larger samples (200-500) produce more comprehensive taxonomies but cost more and take longer. Start with 100 for quick iterations.
batch_size
integer
default:"200"
Number of documents per minibatch during iterative clustering.
delve = Delve(batch_size=50)
Smaller batches (50-100) produce more refined taxonomies. Larger batches (200-300) are faster but may be less precise.
max_num_clusters
integer
default:"5"
Maximum number of clusters/categories to generate in the taxonomy.
delve = Delve(max_num_clusters=10)
Start with a smaller number (5-10) for focused taxonomies. Increase for more granular categorization of diverse datasets.
use_case
string
Custom description of your taxonomy use case.
delve = Delve(
    use_case="Categorize customer feedback by product feature and sentiment"
)
Providing a specific use case helps guide the model to generate more relevant categories for your domain.
output_dir
string
default:"./results"
Directory for saving output files.
delve = Delve(output_dir="./my-results")
Creates the directory if it doesn’t exist.
output_formats
list
default:"['json', 'csv', 'markdown']"
List of output formats to generate.
delve = Delve(output_formats=["json", "csv"])
Available formats:
  • json - Machine-readable taxonomy and labeled documents
  • csv - Spreadsheet format for analysis
  • markdown - Human-readable reports
verbosity
Verbosity
default:"Verbosity.SILENT"
Output verbosity level. Controls how much progress information is displayed.
from delve.console import Verbosity

delve = Delve(verbosity=Verbosity.SILENT)   # No output (SDK default)
delve = Delve(verbosity=Verbosity.QUIET)    # Errors only
delve = Delve(verbosity=Verbosity.NORMAL)   # Spinners + checkmarks
delve = Delve(verbosity=Verbosity.VERBOSE)  # Progress bars with ETA
delve = Delve(verbosity=Verbosity.DEBUG)    # Everything + internal state
Levels:
  • SILENT (0) - No output, ideal for SDK usage in scripts
  • QUIET (1) - Errors only
  • NORMAL (2) - Spinners and success checkmarks
  • VERBOSE (3) - Progress bars with item counts and ETA
  • DEBUG (4) - All output plus warnings and debug info
predefined_taxonomy
string | list | None
default:"None"
Use an existing taxonomy instead of generating one. Useful when you want to label documents with known categories.
# From a JSON/CSV file
delve = Delve(predefined_taxonomy="categories.json")

# Or as a list of dicts
delve = Delve(predefined_taxonomy=[
    {"id": "1", "name": "Bug", "description": "Bug reports and issues"},
    {"id": "2", "name": "Feature", "description": "Feature requests"},
])
When provided, Delve skips taxonomy discovery and directly labels documents using the given categories.
embedding_model
string
default:"text-embedding-3-large"
OpenAI embedding model for classifier training. Used when sample_size < total documents to train an efficient classifier for labeling remaining documents.
delve = Delve(embedding_model="text-embedding-3-small")  # Cheaper option
classifier_confidence_threshold
float
default:"0.0"
Minimum confidence for classifier predictions. Documents below this threshold fall back to LLM labeling. Set to 0 to use classifier for all documents (no fallback).
delve = Delve(classifier_confidence_threshold=0.8)  # Use LLM for low-confidence docs
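The fallback rule reduces to a simple comparison. This is a sketch of the idea only, not Delve's internal code; the function name is hypothetical:

```python
def needs_llm_fallback(confidence: float, threshold: float) -> bool:
    # Documents scoring below the threshold are routed to LLM labeling.
    # With the default threshold of 0.0, nothing falls below it, so the
    # classifier handles every document.
    return confidence < threshold

print(needs_llm_fallback(0.75, 0.8))  # below threshold -> LLM fallback
print(needs_llm_fallback(0.90, 0.8))  # classifier prediction kept
```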

Methods

run_sync()

Synchronous method for taxonomy generation (recommended for most use cases).
result = delve.run_sync(
    data,
    text_column=None,
    id_column=None,
    source_type=None,
    **adapter_kwargs
)
data
str | Path | DataFrame
required
Data source to process. Can be:
  • Path to CSV file ("data.csv")
  • Path to JSON/JSONL file ("data.json")
  • LangSmith URI ("langsmith://project-name")
  • pandas DataFrame
text_column
string
Column/field name containing text content (required for CSV/DataFrame).
result = delve.run_sync("data.csv", text_column="message")
id_column
string
Column/field name for document IDs (optional).
result = delve.run_sync(
    "data.csv",
    text_column="text",
    id_column="doc_id"
)
source_type
string
Force specific adapter type: csv, json, jsonl, langsmith, dataframe
result = delve.run_sync(
    "data.txt",
    source_type="json",
    text_field="content"
)
**adapter_kwargs
dict
Additional adapter-specific parameters.
For JSON:
  • json_path - JSONPath expression for nested data
  • text_field - Field name containing text
For LangSmith:
  • api_key - LangSmith API key
  • days - Days to look back (default: 7)
  • max_runs - Maximum runs to fetch
  • filter_expr - LangSmith filter expression
Returns: DelveResult object with taxonomy, labeled documents, and metadata. Example:
from delve import Delve

delve = Delve(sample_size=150)
result = delve.run_sync(
    "feedback.csv",
    text_column="comment",
    id_column="ticket_id"
)

print(f"Generated {len(result.taxonomy)} categories")
for category in result.taxonomy:
    print(f"- {category.name}: {category.description}")

run()

Asynchronous version of run_sync(). Use for async applications.
import asyncio
from delve import Delve

async def main():
    delve = Delve()
    result = await delve.run("data.csv", text_column="text")
    print(result.taxonomy)

asyncio.run(main())

run_with_docs() / run_with_docs_sync()

Process pre-created Doc objects directly, useful for programmatic document creation or testing.
from delve import Delve, Doc

docs = [
    Doc(id="1", content="Fix authentication bug"),
    Doc(id="2", content="Add dark mode feature"),
]

delve = Delve(use_case="Categorize software issues")
result = delve.run_with_docs_sync(docs)
# Or async: result = await delve.run_with_docs(docs)

find_matches() / find_matches_async()

Fast, lightweight binary detection for finding documents matching a single category. Uses hybrid semantic + keyword matching without running the full taxonomy pipeline.
from delve import Delve

result = Delve.find_matches(
    "data.csv",
    category={
        "name": "Refund Request",
        "description": "User asking for refund, money back, or cancellation",
        "keywords": ["refund", "money back", "cancel order"],
    },
    text_column="content",
    threshold=0.6,
)

# All documents returned with scores
print(f"Found {result.stats['matches']} matches")

# Access matched documents (category != None)
for doc in result.matched_documents[:5]:
    print(f"{doc.id}: {doc.confidence:.2f}")
category
dict
required
Category definition with name, description, and optional keywords list.
threshold
float
default:"0.5"
Minimum score (0-1) for a document to be considered a match.
semantic_weight
float
default:"0.7"
Weight for semantic (embedding) similarity.
keyword_weight
float
default:"0.3"
Weight for keyword matching. Set to 0 for pure semantic matching.
Returns: MatchResult with all documents scored, plus matched_documents and unmatched_documents properties.
Binary detection is much faster (2-4 min for 30K docs) and cheaper ($1-2) than full taxonomy generation. See Binary Detection for full documentation.
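The hybrid score behind matching can be pictured as a weighted blend of the two signals, using the documented default weights. This is an illustrative sketch, not Delve's actual scoring code; how each component score is computed is internal to the library:

```python
def hybrid_score(semantic: float, keyword: float,
                 semantic_weight: float = 0.7,
                 keyword_weight: float = 0.3) -> float:
    # Blend embedding similarity and keyword overlap; with keyword_weight=0
    # this degenerates to pure semantic matching, as the docs describe.
    return semantic_weight * semantic + keyword_weight * keyword

score = hybrid_score(semantic=0.8, keyword=0.5)
print(score >= 0.6)  # compare against the match threshold
```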

Data Sources

Delve supports multiple input formats. The source_type is auto-detected from file extensions, or you can specify it explicitly.
Format      Extension          Required Parameters
CSV         .csv               text_column
JSON        .json              text_field or json_path
JSONL       .jsonl             (auto-extracts text)
DataFrame   (in-memory)        text_column
LangSmith   langsmith:// URI   api_key
# CSV
result = delve.run_sync("data.csv", text_column="message")

# JSON with JSONPath for nested data
result = delve.run_sync("data.json", json_path="$.messages[*].content")

# pandas DataFrame
result = delve.run_sync(df, text_column="message")

# LangSmith
result = delve.run_sync("langsmith://my-project", api_key="lsv2_...", days=7)

Working with Results

The DelveResult object provides access to all outputs:
result = delve.run_sync("data.csv", text_column="text")

# Taxonomy categories
for cat in result.taxonomy:
    print(f"{cat.name}: {cat.description}")

# Labeled documents
for doc in result.labeled_documents:
    print(f"{doc.id}: {doc.category}")

# Metadata and export paths
print(result.metadata)  # {'num_documents': 100, 'num_categories': 5, ...}
print(result.export_paths)  # {'taxonomy': Path(...), 'csv': Path(...), ...}

TaxonomyCategory

Attribute    Type  Description
id           str   Unique category identifier
name         str   Category name
description  str   Category description

Doc (labeled document)

Attribute    Type        Description
id           str         Document identifier
content      str         Original text content
category     str         Assigned category name
explanation  str | None  Why this category was assigned
summary      str | None  LLM-generated summary
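A labeled document can be inspected like any plain object. The sketch below uses a stand-in dataclass mirroring the attribute table above, since the real Doc class lives in delve:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Doc:  # stand-in with the attributes listed above, not delve's class
    id: str
    content: str
    category: str
    explanation: Optional[str] = None
    summary: Optional[str] = None

d = Doc(id="1", content="App crashes on login", category="Bug",
        explanation="Describes a crash on startup")
print(f"{d.id}: {d.category}")
```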

Metadata

The result.metadata dictionary contains comprehensive run statistics:
# Example contents of result.metadata:
{
    # Basic info
    "num_documents": 5000,          # Total documents processed
    "num_categories": 10,           # Number of taxonomy categories
    "sample_size": 100,             # Configured sample size
    "batch_size": 200,              # Configured batch size
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "fast_llm": "anthropic/claude-haiku-4-5-20251001",

    # Timing
    "run_duration_seconds": 145.32, # Total processing time

    # Category distribution
    "category_counts": {            # Documents per category
        "Bug Fix": 1250,
        "Feature Request": 890,
        "Question": 650,
        ...
    },

    # Labeling breakdown
    "llm_labeled_count": 100,       # Documents labeled by LLM
    "classifier_labeled_count": 4900, # Documents labeled by classifier
    "skipped_document_count": 5,    # Documents that couldn't be categorized

    # Classifier metrics (when classifier is used)
    "classifier_metrics": {
        "train_accuracy": 0.92,
        "test_accuracy": 0.85,
        "train_f1": 0.91,
        "test_f1": 0.847
    },

    # Source information
    "source": {
        "type": "csv",              # csv, json, dataframe, langsmith, docs
        "path": "data.csv",
        "text_column": "content",
        "id_column": None
    },

    # Quality tracking
    "warnings": [],                 # Any processing warnings
    "status_log": [...]             # Step-by-step status messages
}
Use category_counts to quickly see how your documents are distributed across categories:
from collections import Counter

# Get top 5 categories
top_categories = Counter(result.metadata["category_counts"]).most_common(5)
for category, count in top_categories:
    print(f"{category}: {count} documents")
The classifier_metrics key is only present when sample_size < total documents, meaning a classifier was trained to label the remaining documents.
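Since the key may be absent, read it defensively. A minimal sketch, with an example metadata dict shaped like the one above:

```python
# Hedged sketch: metadata is a plain dict, so .get() handles the optional key.
metadata = {
    "num_documents": 5000,
    "sample_size": 100,
    "classifier_metrics": {"test_f1": 0.85},  # only present when a classifier ran
}
metrics = metadata.get("classifier_metrics", {})
if metrics:
    print(f"classifier test F1: {metrics['test_f1']:.2f}")
```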

Error Handling

from delve import Delve

try:
    delve = Delve()
    result = delve.run_sync("data.csv", text_column="text")
except ValueError as e:
    # Missing API key, invalid parameters, etc.
    print(f"Configuration error: {e}")
except FileNotFoundError as e:
    # File doesn't exist
    print(f"File not found: {e}")
except Exception as e:
    # Other errors
    print(f"Error: {e}")

Environment Variables

Set these before running your code:
# Required
export ANTHROPIC_API_KEY="your-anthropic-key"

# Required when sample_size > 0 and docs > sample_size (for classifier embeddings)
export OPENAI_API_KEY="your-openai-key"

# Optional
export LANGSMITH_API_KEY="your-langsmith-key"
The OpenAI API key is required for generating embeddings when training the classifier. If you set sample_size=0, all documents are labeled by the LLM and no OpenAI key is needed.
Or use python-dotenv:
from dotenv import load_dotenv
load_dotenv()

from delve import Delve
# API keys are loaded automatically
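The key requirements above can be checked up front before a long run. This is a sketch of the rules in this section; the function name is hypothetical, not part of Delve:

```python
import os

def missing_api_keys(sample_size: int, num_docs: int, env=None):
    # Per the rules above: Anthropic is always required; OpenAI only when
    # a classifier will be trained (sample_size > 0 and more docs than sampled).
    env = os.environ if env is None else env
    required = ["ANTHROPIC_API_KEY"]
    if sample_size > 0 and num_docs > sample_size:
        required.append("OPENAI_API_KEY")  # classifier embeddings
    return [k for k in required if k not in env]

print(missing_api_keys(100, 5000, env={}))
```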

Next Steps

Examples

See working code examples

CLI Reference

Learn CLI commands