This guide explains how class imbalance affects taxonomy classification and how to diagnose and address it using Delve’s built-in tools and configuration options.
Class imbalance occurs when some categories in your taxonomy have significantly more documents than others. This is extremely common in real-world data:
- **Support tickets:** 80% billing issues, 20% technical problems
- **Product reviews:** 90% positive, 10% negative
- **Document types:** 95% standard reports, 5% edge cases
When you use random sampling with imbalanced data, rare categories get underrepresented or completely missed:
| Category | % of Data | Expected in 100 Samples | Expected in 200 Samples |
|----------|-----------|-------------------------|-------------------------|
| Category A | 60% | 60 | 120 |
| Category B | 30% | 30 | 60 |
| Category C | 9% | 9 | 18 |
| Category D | 0.8% | ~1 | ~2 |
| Category E | 0.2% | 0 | 0 |
With 100 random samples from data with this distribution, Category E would have zero training examples. The classifier simply cannot learn to recognize it.
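The expected counts above follow from a simple binomial model: with sample size `n` and category proportion `p`, the expected count is `n * p` and the probability the category gets *zero* samples is `(1 - p) ** n`. A quick standalone sketch (plain Python, not part of Delve; the proportions are the illustrative ones from the table):

```python
# Expected samples per category and the chance a category is missed
# entirely under random sampling, modeled as a binomial draw.
proportions = {
    "Category A": 0.60,
    "Category B": 0.30,
    "Category C": 0.09,
    "Category D": 0.008,
    "Category E": 0.002,
}

n = 100  # sample size
for cat, p in proportions.items():
    expected = n * p            # expected number of samples
    p_zero = (1 - p) ** n       # probability of zero samples
    print(f"{cat}: expect {expected:.1f}, P(zero samples) = {p_zero:.2f}")
```

Note that Category E is not just *expected* to get ~0 samples: with `n = 100` it has roughly an 82% chance of getting none at all, which is why the classifier ends up with a guaranteed blind spot.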
When the classifier’s confidence for a document is below the threshold, Delve handles it according to low_confidence_action:
| Action | Behavior | Cost |
|--------|----------|------|
| `"other"` (default) | Label as "Other" category | Free |
| `"llm"` | Re-label with LLM (max 20 docs) | Medium |
| `"keep"` | Keep classifier prediction | Free |
The default `"other"` action is recommended for most use cases. It's honest about uncertainty (the classifier truly doesn't know) and avoids expensive LLM calls.

**Safeguard for the `"llm"` action:** If more than 20 documents need re-labeling, Delve automatically falls back to `"other"` to prevent excessive LLM costs. For large datasets with significant imbalance, use `min_examples_per_category` instead.
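The safeguard behaves roughly like the following sketch. This is illustrative logic only, mirroring the documented behavior; the function and constant names (`resolve_low_confidence`, `LLM_RELABEL_LIMIT`) are hypothetical and not Delve's actual internals:

```python
LLM_RELABEL_LIMIT = 20  # hypothetical constant mirroring the documented cap

def resolve_low_confidence(low_conf_docs, action="other"):
    """Decide labels for low-confidence documents (illustrative sketch)."""
    if action == "llm" and len(low_conf_docs) > LLM_RELABEL_LIMIT:
        # Documented fallback: too many docs would hit the LLM,
        # so label them "Other" instead to cap costs.
        action = "other"
    if action == "other":
        return ["Other"] * len(low_conf_docs)
    if action == "keep":
        return [doc["prediction"] for doc in low_conf_docs]
    # action == "llm" and under the cap: placeholder for the LLM call
    return ["<llm-label>"] * len(low_conf_docs)
```

With 25 low-confidence documents and `action="llm"`, this returns 25 `"Other"` labels rather than making 25 LLM calls.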
**What it is:** Count of documents per category in the training sample (LLM-labeled documents).

**What to look for:** Categories with very low or zero counts.

**How to act:** If a category has fewer than 3 samples, the classifier will struggle with it.
```python
result = delve.run_sync("data.csv", text_column="text")

# Check sample distribution
sample_dist = result.metadata.get("sample_distribution", {})
for category, count in sorted(sample_dist.items(), key=lambda x: x[1]):
    if count < 3:
        print(f"Warning: '{category}' has only {count} training examples")
```
**What it is:** List of taxonomy categories with no training examples.

**What to look for:** Any non-empty list indicates guaranteed blind spots.

**How to act:** Increase `sample_size` or enable `min_examples_per_category`.
```python
zero_cats = result.metadata.get("zero_sample_categories", [])
if zero_cats:
    print("Categories with ZERO training examples:")
    for cat in zero_cats:
        print(f"  - {cat}")
    print("\nConsider setting min_examples_per_category=5")
```
**What it is:** F1 score for each category on the classifier's test set.

**What to look for:** Scores below 0.5, especially 0.0.

**How to act:** Low F1 for specific categories means the classifier can't reliably predict them.
```python
metrics = result.metadata.get("classifier_metrics", {})
per_class = metrics.get("per_class_f1", {})
print("Per-class F1 scores:")
for cat, f1 in sorted(per_class.items(), key=lambda x: x[1]):
    status = "OK" if f1 >= 0.5 else "POOR"
    print(f"  {cat}: {f1:.2f} [{status}]")
```
When Delve discovers the taxonomy, it creates categories based on what it sees in your sample. This naturally tends toward balance, but edge cases can still be missed.

Recommendations:
Don’t try to infer “Other” from classifier confidence. Always include an “Other” category in your taxonomy if you expect some documents won’t fit your defined categories.
You might think: "If the classifier is uncertain, the document probably doesn't fit any category, so label it as Other."

This doesn't work well in practice. When the classifier has low confidence, it's usually uncertain between valid categories (e.g., "Planning" vs "General Questions"), not because the document doesn't fit any category.

Real-world test results:
If you have 15 categories and 5 of them have F1 = 0.0 (the classifier never predicts them correctly), your macro F1 will be dragged down significantly, even if the major categories perform well.

Example from real data:
- **F1 Weighted:** 88.8% (great overall performance)
- **F1 Macro:** 36.0% (several rare categories have F1 = 0)
A large gap between weighted and macro F1 is a sign of class imbalance. The weighted score is dominated by majority classes, hiding poor performance on rare categories. Use `per_class_f1` to identify which specific categories are struggling.
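To see concretely how support weighting hides rare-category failures, here is a small standalone computation of macro vs. weighted F1 from per-class scores. The class names, F1 values, and supports below are invented for illustration, not taken from the real data above:

```python
# Illustrative per-class F1 scores and class supports (document counts)
per_class_f1 = {"Billing": 0.95, "Technical": 0.90, "Rare A": 0.0, "Rare B": 0.0}
support = {"Billing": 800, "Technical": 150, "Rare A": 30, "Rare B": 20}

# Macro F1: unweighted mean -- every class counts equally
macro = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted F1: mean weighted by class support -- majority classes dominate
total = sum(support.values())
weighted = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

print(f"Macro F1:    {macro:.3f}")    # 0.462 -- dragged down by the two zeros
print(f"Weighted F1: {weighted:.3f}") # 0.895 -- the zeros barely register
```

Two classes the classifier never gets right cut the macro score nearly in half, while the weighted score stays high because those classes hold only 5% of the documents. This is the same pattern as the 88.8% vs 36.0% gap shown above.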