

Installation

pip install delve-taxonomy

Delve Client

The main class for interacting with Delve programmatically.

Basic Usage

from delve import Delve

# Initialize with defaults
delve = Delve()

# Run taxonomy generation
result = delve.run_sync("data.csv", text_column="text")

# Access results
print(result.taxonomy)
print(result.labeled_documents)

Initialization

from delve.console import Verbosity

delve = Delve(
    model="anthropic/claude-sonnet-4-5-20250929",
    fast_llm="anthropic/claude-haiku-4-5-20251001",
    sample_size=100,
    batch_size=200,
    max_num_clusters=5,
    use_case="Generate taxonomy for my data",
    output_dir="./results",
    output_formats=["json", "csv", "markdown"],
    verbosity=Verbosity.NORMAL,  # SILENT, QUIET, NORMAL, VERBOSE, DEBUG
    predefined_taxonomy=None,  # Use existing taxonomy instead of generating
    embedding_model="text-embedding-3-large",
    classifier_confidence_threshold=0.0
)

Configuration Options

model
string
default:"anthropic/claude-sonnet-4-5-20250929"
Main LLM model for taxonomy generation and reasoning.
delve = Delve(model="anthropic/claude-opus-4")
Supported models:
  • anthropic/claude-sonnet-4-5-20250929 (recommended)
  • anthropic/claude-opus-4 (most capable)
  • anthropic/claude-haiku-4-5-20251001 (fastest/cheapest)
  • Any model supported by LiteLLM
fast_llm
string
default:"anthropic/claude-haiku-4-5-20251001"
Faster model for document summarization to reduce costs.
delve = Delve(fast_llm="anthropic/claude-haiku-4-5-20251001")
Use a faster, cheaper model for the summarization step.
sample_size
integer
default:"100"
Number of documents to sample for taxonomy generation.
delve = Delve(sample_size=200)
Larger samples (200-500) produce more comprehensive taxonomies but cost more and take longer. Start with 100 for quick iterations.
batch_size
integer
default:"200"
Number of documents per minibatch during iterative clustering.
delve = Delve(batch_size=50)
Smaller batches (50-100) produce more refined taxonomies. Larger batches (200-300) are faster but may be less precise.
max_num_clusters
integer
default:"5"
Maximum number of clusters/categories to generate in the taxonomy.
delve = Delve(max_num_clusters=10)
Start with a smaller number (5-10) for focused taxonomies. Increase for more granular categorization of diverse datasets.
use_case
string
Custom description of your taxonomy use case.
delve = Delve(
    use_case="Categorize customer feedback by product feature and sentiment"
)
Providing a specific use case helps guide the model to generate more relevant categories for your domain.
output_dir
string
default:"./results"
Directory for saving output files.
delve = Delve(output_dir="./my-results")
Creates the directory if it doesn’t exist.
output_formats
list
default:"['json', 'csv', 'markdown']"
List of output formats to generate.
delve = Delve(output_formats=["json", "csv"])
Available formats:
  • json - Machine-readable taxonomy and labeled documents
  • csv - Spreadsheet format for analysis
  • markdown - Human-readable reports
verbosity
Verbosity
default:"Verbosity.SILENT"
Output verbosity level. Controls how much progress information is displayed.
from delve.console import Verbosity

delve = Delve(verbosity=Verbosity.SILENT)   # No output (SDK default)
delve = Delve(verbosity=Verbosity.QUIET)    # Errors only
delve = Delve(verbosity=Verbosity.NORMAL)   # Spinners + checkmarks
delve = Delve(verbosity=Verbosity.VERBOSE)  # Progress bars with ETA
delve = Delve(verbosity=Verbosity.DEBUG)    # Everything + internal state
Levels:
  • SILENT (0) - No output, ideal for SDK usage in scripts
  • QUIET (1) - Errors only
  • NORMAL (2) - Spinners and success checkmarks
  • VERBOSE (3) - Progress bars with item counts and ETA
  • DEBUG (4) - All output plus warnings and debug info
predefined_taxonomy
string | list | None
default:"None"
Use an existing taxonomy instead of generating one. Useful when you want to label documents with known categories.
# From a JSON/CSV file
delve = Delve(predefined_taxonomy="categories.json")

# Or as a list of dicts
delve = Delve(predefined_taxonomy=[
    {"id": "1", "name": "Bug", "description": "Bug reports and issues"},
    {"id": "2", "name": "Feature", "description": "Feature requests"},
])
When provided, Delve skips taxonomy discovery and directly labels documents using the given categories.
embedding_model
string
default:"text-embedding-3-large"
OpenAI embedding model for classifier training. Used when sample_size < total documents to train an efficient classifier for labeling remaining documents.
delve = Delve(embedding_model="text-embedding-3-small")  # Cheaper option
classifier_confidence_threshold
float
default:"0.0"
Minimum confidence for classifier predictions. Documents below this threshold fall back to LLM labeling. Set to 0 to use classifier for all documents (no fallback).
delve = Delve(classifier_confidence_threshold=0.8)  # Use LLM for low-confidence docs
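The fallback rule reduces to a simple comparison. This is a sketch of the idea only, not Delve's internal code; the function name is hypothetical:

```python
def needs_llm_fallback(confidence: float, threshold: float) -> bool:
    # Documents scoring below the threshold are routed to LLM labeling.
    # With the default threshold of 0.0, nothing falls below it, so the
    # classifier handles every document.
    return confidence < threshold

print(needs_llm_fallback(0.75, 0.8))  # below threshold -> LLM fallback
print(needs_llm_fallback(0.90, 0.8))  # classifier prediction kept
```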

Methods

run_sync()

Synchronous method for taxonomy generation (recommended for most use cases).
result = delve.run_sync(
    data,
    text_column=None,
    id_column=None,
    source_type=None,
    **adapter_kwargs
)
data
str | Path | DataFrame
required
Data source to process. Can be:
  • Path to CSV file ("data.csv")
  • Path to JSON/JSONL file ("data.json")
  • LangSmith URI ("langsmith://project-name")
  • pandas DataFrame
text_column
string
Column/field name containing text content (required for CSV/DataFrame).
result = delve.run_sync("data.csv", text_column="message")
id_column
string
Column/field name for document IDs (optional).
result = delve.run_sync(
    "data.csv",
    text_column="text",
    id_column="doc_id"
)
source_type
string
Force specific adapter type: csv, json, jsonl, langsmith, dataframe
result = delve.run_sync(
    "data.txt",
    source_type="json",
    text_field="content"
)
**adapter_kwargs
dict
Additional adapter-specific parameters.
For JSON:
  • json_path - JSONPath expression for nested data
  • text_field - Field name containing text
For LangSmith:
  • api_key - LangSmith API key
  • days - Days to look back (default: 7)
  • max_runs - Maximum runs to fetch
  • filter_expr - LangSmith filter expression
Returns: DelveResult object with taxonomy, labeled documents, and metadata. Example:
from delve import Delve

delve = Delve(sample_size=150)
result = delve.run_sync(
    "feedback.csv",
    text_column="comment",
    id_column="ticket_id"
)

print(f"Generated {len(result.taxonomy)} categories")
for category in result.taxonomy:
    print(f"- {category.name}: {category.description}")

run()

Asynchronous version of run_sync(). Use for async applications.
import asyncio
from delve import Delve

async def main():
    delve = Delve()
    result = await delve.run("data.csv", text_column="text")
    print(result.taxonomy)

asyncio.run(main())

run_with_docs() / run_with_docs_sync()

Process pre-created Doc objects directly, useful for programmatic document creation or testing.
from delve import Delve, Doc

docs = [
    Doc(id="1", content="Fix authentication bug"),
    Doc(id="2", content="Add dark mode feature"),
]

delve = Delve(use_case="Categorize software issues")
result = delve.run_with_docs_sync(docs)
# Or async: result = await delve.run_with_docs(docs)

find_matches() / find_matches_async()

Fast, lightweight binary detection for finding documents matching a single category. Uses hybrid semantic + keyword matching without running the full taxonomy pipeline.
from delve import Delve

result = Delve.find_matches(
    "data.csv",
    category={
        "name": "Refund Request",
        "description": "User asking for refund, money back, or cancellation",
        "keywords": ["refund", "money back", "cancel order"],
    },
    text_column="content",
    threshold=0.6,
)

# All documents returned with scores
print(f"Found {result.stats['matches']} matches")

# Access matched documents (category != None)
for doc in result.matched_documents[:5]:
    print(f"{doc.id}: {doc.confidence:.2f}")
category
dict
required
Category definition with name, description, and optional keywords list.
threshold
float
default:"0.5"
Minimum score (0-1) for a document to be considered a match.
semantic_weight
float
default:"0.7"
Weight for semantic (embedding) similarity.
keyword_weight
float
default:"0.3"
Weight for keyword matching. Set to 0 for pure semantic matching.
Returns: MatchResult with all documents scored, plus matched_documents and unmatched_documents properties.
Binary detection is much faster (2-4 min for 30K docs) and cheaper ($1-2) than full taxonomy generation. See Binary Detection for full documentation.
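The hybrid score behind matching can be pictured as a weighted blend of the two signals, using the documented default weights. This is an illustrative sketch, not Delve's actual scoring code; how each component score is computed is internal to the library:

```python
def hybrid_score(semantic: float, keyword: float,
                 semantic_weight: float = 0.7,
                 keyword_weight: float = 0.3) -> float:
    # Blend embedding similarity and keyword overlap; with keyword_weight=0
    # this degenerates to pure semantic matching, as the docs describe.
    return semantic_weight * semantic + keyword_weight * keyword

score = hybrid_score(semantic=0.8, keyword=0.5)
print(score >= 0.6)  # compare against the match threshold
```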

Data Sources

Delve supports multiple input formats. The source_type is auto-detected from file extensions, or you can specify it explicitly.
Format      Extension          Required Parameters
CSV         .csv               text_column
JSON        .json              text_field or json_path
JSONL       .jsonl             (auto-extracts text)
DataFrame   (in-memory)        text_column
LangSmith   langsmith:// URI   api_key
# CSV
result = delve.run_sync("data.csv", text_column="message")

# JSON with JSONPath for nested data
result = delve.run_sync("data.json", json_path="$.messages[*].content")

# pandas DataFrame
result = delve.run_sync(df, text_column="message")

# LangSmith
result = delve.run_sync("langsmith://my-project", api_key="lsv2_...", days=7)

Working with Results

The DelveResult object provides access to all outputs:
result = delve.run_sync("data.csv", text_column="text")

# Taxonomy categories
for cat in result.taxonomy:
    print(f"{cat.name}: {cat.description}")

# Labeled documents
for doc in result.labeled_documents:
    print(f"{doc.id}: {doc.category}")

# Metadata and export paths
print(result.metadata)  # {'num_documents': 100, 'num_categories': 5, ...}
print(result.export_paths)  # {'taxonomy': Path(...), 'csv': Path(...), ...}

TaxonomyCategory

Attribute    Type  Description
id           str   Unique category identifier
name         str   Category name
description  str   Category description

Doc (labeled document)

Attribute    Type        Description
id           str         Document identifier
content      str         Original text content
category     str         Assigned category name
explanation  str | None  Why this category was assigned
summary      str | None  LLM-generated summary
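A labeled document can be inspected like any plain object. The sketch below uses a stand-in dataclass mirroring the attribute table above, since the real Doc class lives in delve:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Doc:  # stand-in with the attributes listed above, not delve's class
    id: str
    content: str
    category: str
    explanation: Optional[str] = None
    summary: Optional[str] = None

d = Doc(id="1", content="App crashes on login", category="Bug",
        explanation="Describes a crash on startup")
print(f"{d.id}: {d.category}")
```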

Metadata

The result.metadata dictionary contains comprehensive run statistics:
# Example contents of result.metadata:
{
    # Basic info
    "num_documents": 5000,          # Total documents processed
    "num_categories": 10,           # Number of taxonomy categories
    "sample_size": 100,             # Configured sample size
    "batch_size": 200,              # Configured batch size
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "fast_llm": "anthropic/claude-haiku-4-5-20251001",

    # Timing
    "run_duration_seconds": 145.32, # Total processing time

    # Category distribution
    "category_counts": {            # Documents per category
        "Bug Fix": 1250,
        "Feature Request": 890,
        "Question": 650,
        ...
    },

    # Labeling breakdown
    "llm_labeled_count": 100,       # Documents labeled by LLM
    "classifier_labeled_count": 4900, # Documents labeled by classifier
    "skipped_document_count": 5,    # Documents that couldn't be categorized

    # Classifier metrics (when classifier is used)
    "classifier_metrics": {
        "train_accuracy": 0.92,
        "test_accuracy": 0.85,
        "train_f1": 0.91,
        "test_f1": 0.847
    },

    # Source information
    "source": {
        "type": "csv",              # csv, json, dataframe, langsmith, docs
        "path": "data.csv",
        "text_column": "content",
        "id_column": None
    },

    # Quality tracking
    "warnings": [],                 # Any processing warnings
    "status_log": [...]             # Step-by-step status messages
}
Use category_counts to quickly see how your documents are distributed across categories:
from collections import Counter

# Get top 5 categories
top_categories = Counter(result.metadata["category_counts"]).most_common(5)
for category, count in top_categories:
    print(f"{category}: {count} documents")
The classifier_metrics key is only present when sample_size < total documents, meaning a classifier was trained to label the remaining documents.
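Since the key may be absent, read it defensively. A minimal sketch, with an example metadata dict shaped like the one above:

```python
# Hedged sketch: metadata is a plain dict, so .get() handles the optional key.
metadata = {
    "num_documents": 5000,
    "sample_size": 100,
    "classifier_metrics": {"test_f1": 0.85},  # only present when a classifier ran
}
metrics = metadata.get("classifier_metrics", {})
if metrics:
    print(f"classifier test F1: {metrics['test_f1']:.2f}")
```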

Error Handling

from delve import Delve

try:
    delve = Delve()
    result = delve.run_sync("data.csv", text_column="text")
except ValueError as e:
    # Missing API key, invalid parameters, etc.
    print(f"Configuration error: {e}")
except FileNotFoundError as e:
    # File doesn't exist
    print(f"File not found: {e}")
except Exception as e:
    # Other errors
    print(f"Error: {e}")

Environment Variables

Set these before running your code:
# Required
export ANTHROPIC_API_KEY="your-anthropic-key"

# Required when sample_size > 0 and docs > sample_size (for classifier embeddings)
export OPENAI_API_KEY="your-openai-key"

# Optional
export LANGSMITH_API_KEY="your-langsmith-key"
The OpenAI API key is required for generating embeddings when training the classifier. If you set sample_size=0, all documents are labeled by the LLM and no OpenAI key is needed.
Or use python-dotenv:
from dotenv import load_dotenv
load_dotenv()

from delve import Delve
# API keys are loaded automatically
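The key requirements above can be checked up front before a long run. This is a sketch of the rules in this section; the function name is hypothetical, not part of Delve:

```python
import os

def missing_api_keys(sample_size: int, num_docs: int, env=None):
    # Per the rules above: Anthropic is always required; OpenAI only when
    # a classifier will be trained (sample_size > 0 and more docs than sampled).
    env = os.environ if env is None else env
    required = ["ANTHROPIC_API_KEY"]
    if sample_size > 0 and num_docs > sample_size:
        required.append("OPENAI_API_KEY")  # classifier embeddings
    return [k for k in required if k not in env]

print(missing_api_keys(100, 5000, env={}))
```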

Next Steps

Examples

See working code examples

CLI Reference

Learn CLI commands