The short answer
Sentiment analysis is the computational task of identifying polarity (positive, negative, neutral) and sometimes intensity in text. It operates at document, sentence, or aspect granularity. Methods range from keyword matching through classical machine learning and BERT-family deep learning to LLM-based classification. The right method depends on domain, latency budget, and whether aspect-level classification is needed. Production systems typically combine methods in a hybrid pipeline.
Definition
Sentiment analysis (sometimes called opinion mining) is the subfield of natural language processing concerned with identifying and extracting subjective information from text — primarily polarity (positive, negative, neutral) and sometimes intensity, emotion category, or subjectivity.
As a discipline, it emerged in the late 1990s from the information-retrieval community and was formalized in the early-to-mid 2000s by academic work from Pang, Lee, and others. The initial test beds were movie and product reviews; most commercial application has stayed close to that territory.
Sentiment analysis is distinct from thematic analysis (what themes appear) and topic modeling (unsupervised theme discovery), though modern feedback platforms run all three together. It is also distinct from emotion detection, which classifies more granular emotional categories (anger, disappointment, delight, relief) beyond simple polarity.
Granularity levels
Three common levels.
Document-level. One score per document. Fast, coarse, loses information when a document expresses mixed sentiment. Useful for aggregate review-score tracking.
Sentence-level. One score per sentence. Captures mixed-sentiment documents better. The default granularity for general-purpose social-listening tools.
Aspect-level. One score per aspect (feature) mentioned in the document. "The camera is great but the battery drains too fast" → (camera, positive) + (battery, negative). The granularity that matters for product feedback because aspects map to product features and route to different teams.
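The output shape that matters here is a list of (aspect, polarity) pairs. The following is a toy sketch only — real aspect-based sentiment uses sequence models, and the word lists and clause-splitting rule are illustrative assumptions — but it shows how the example sentence above decomposes:

```python
import re

# Illustrative word lists, not a real lexicon.
ASPECTS = {"camera", "battery", "screen"}
POSITIVE = {"great", "good", "sharp"}
NEGATIVE = {"drains", "slow", "dim"}

def aspect_sentiment(text):
    """Return (aspect, polarity) pairs, one per clause that names an aspect."""
    pairs = []
    # Split on contrastive "but" and punctuation so each aspect is
    # scored only on its local clause, not the whole sentence.
    for clause in re.split(r"\bbut\b|[.;,]", text.lower()):
        tokens = clause.split()
        aspect = next((t for t in tokens if t in ASPECTS), None)
        if aspect is None:
            continue
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        pairs.append((aspect, polarity))
    return pairs

result = aspect_sentiment("The camera is great but the battery drains too fast")
# → [("camera", "positive"), ("battery", "negative")]
```

The two pairs route independently: the camera finding goes to one team, the battery finding to another, which is the practical argument for this granularity.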
Methods
Five dominant methods, in summary:
- Keyword-based. Positive/negative word lists, arithmetic scoring. Very fast, very brittle, poor on negation and sarcasm.
- Classical ML. Logistic regression, SVM, gradient boosting trained on labeled examples. Reliable baseline, predictable latency (milliseconds), interpretable.
- Deep learning (BERT-family). Fine-tuned transformer models. Meaningfully better than classical ML on held-out test sets for review data. Latency 100–300ms on GPU.
- LLM-based. Zero-shot or few-shot classification with GPT-4-family, Claude-family, or Gemini-family models. Flexible, no training required, handles nuance well with good prompting. Higher cost and variability.
- Hybrid. Classical or BERT for high-volume polarity pass, LLM for hard cases flagged by the first pass. The production-default architecture as of Q1 2026.
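A minimal sketch of the hybrid pattern, under stated assumptions: the word lists, the ambiguity threshold, and `expensive_model` (a stub standing in for a BERT or LLM call) are all hypothetical, chosen only to show the routing logic.

```python
# Illustrative word lists for the cheap first pass.
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "broken", "slow", "drains"}
NEGATORS = {"not", "never", "no"}

def keyword_pass(text):
    """Cheap polarity guess plus a flag for cases the first pass can't trust."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    # Escalate when the lexical signal is weak or negation may flip polarity --
    # exactly the cases where keyword methods are brittle.
    ambiguous = abs(score) < 2 or any(t in NEGATORS for t in tokens)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, ambiguous

def expensive_model(text):
    # Placeholder for a fine-tuned transformer or an LLM API call.
    return "negative"

def classify(text):
    label, escalate = keyword_pass(text)
    return expensive_model(text) if escalate else label
```

The economics follow from the routing: the bulk of clear-cut text never touches the expensive model, so per-record cost stays close to the cheap pass while hard cases still get the better classifier.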
See the sentiment analysis for product reviews guide for implementation detail on each method.
Evaluation
The number that usually gets cited — "our model is 92% accurate" — is meaningless without three specifics. What test set? What metric? What baseline?
Test set. Accuracy on IMDB movie reviews has no bearing on accuracy on camera reviews. Production evaluation must use in-category data. Vendor claims that don't specify the test set should be treated as marketing.
Metric. Accuracy (correct / total) is a poor metric when classes are imbalanced. F1 score, precision/recall per class, and confusion matrices are more informative for production systems where the cost of false positives differs from the cost of false negatives.
Baseline. "92% accurate" relative to what? Random guessing on a 3-class problem is 33% accurate; a majority-class classifier may hit 60%+ if the class distribution is skewed. The lift over baseline is what actually matters.
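The three specifics above reduce to a small amount of arithmetic. A stdlib-only sketch (the label names and test vectors are illustrative) that reports accuracy, the majority-class baseline, and per-class precision/recall/F1:

```python
from collections import Counter

def evaluate(gold, pred, classes=("positive", "negative", "neutral")):
    """Accuracy, majority-class baseline, and per-class precision/recall/F1."""
    assert len(gold) == len(pred) and gold
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    # A classifier that always predicts the most common gold label scores this:
    majority_baseline = Counter(gold).most_common(1)[0][1] / len(gold)
    per_class = {}
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = {"precision": prec, "recall": rec, "f1": f1}
    return {"accuracy": acc, "majority_baseline": majority_baseline,
            "per_class": per_class}
```

Reporting `accuracy - majority_baseline` (the lift) alongside per-class F1 is what turns "92% accurate" into a claim a buyer can interrogate.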
Production teams should run a monthly human audit — a random 200-record sample labeled by a domain expert, compared against the classifier. Drift beyond 5 percentage points usually triggers a retrain or re-evaluation.
Domain adaptation
A general-purpose sentiment model trained on news or movie reviews will underperform on product-review data. Closing that gap — often 15–30 percentage points of accuracy — is the domain-adaptation problem. Three ways to close it.
Full fine-tuning. Label 5,000–50,000 in-domain examples and fine-tune the model. Best accuracy, highest cost. Weeks to months of effort for a single category.
Few-shot with in-domain examples. For LLM-based methods, include 10–30 category-specific examples in the prompt. Lower accuracy than fine-tuning but faster to deploy and flexible across categories.
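As a sketch of the few-shot option: assembling category-specific examples into the prompt is mostly string construction. The example reviews and the prompt wording below are illustrative assumptions, not a tested recipe; a production prompt would be tuned against a held-out set.

```python
# Hypothetical in-domain (camera-category) examples. A real deployment would
# use 10-30 of these, curated from labeled category data.
EXAMPLES = [
    ("Battery died after two weeks.", "negative"),
    ("Crisp photos even in low light.", "positive"),
    ("Arrived on time, haven't tested it yet.", "neutral"),
]

def build_prompt(review, examples=EXAMPLES):
    """Assemble a few-shot classification prompt for an LLM call."""
    lines = ["Classify the sentiment of each review as positive, negative, or neutral.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The model completes the final "Sentiment:" line.
    lines.append(f"Review: {review}")
    lines.append("Sentiment:")
    return "\n".join(lines)
```

Swapping categories means swapping `EXAMPLES`, which is why this approach trades peak accuracy for flexibility across a catalog.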
Domain-aware platforms. Use a VoC platform that has pre-trained on category-specific data. Indellia's sentiment models, built on NEC Labs foundations, are trained on product-feedback corpora rather than generic sentiment datasets.
Multilingual considerations
Brands operating in non-English markets need multilingual sentiment classification. Three practical options.
Language detection plus per-language models. Detect the language on ingestion, route to the appropriate model. Highest accuracy per language, most moving parts.
Multilingual models. XLM-RoBERTa, mBERT, and multilingual LLMs can classify sentiment across 50+ languages. Accuracy is lower than per-language fine-tuned models but significantly higher than translation-then-classify pipelines.
Translation-then-classify. Translate to English first, then classify. Meaningful accuracy loss from translation artifacts; not recommended as a primary approach.
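The first option (detect, then route) can be sketched in a few lines. Everything here is a stub: the naive character-based detector stands in for a real language-ID model, and the lambdas stand in for per-language fine-tuned classifiers.

```python
# Stand-ins for per-language fine-tuned models; a real system would load
# one classifier per supported language.
CLASSIFIERS = {
    "en": lambda text: "positive",
    "de": lambda text: "neutral",
}

def detect_language(text):
    """Naive stub: real detection uses character n-gram or embedding models."""
    return "de" if any(ch in "äöüß" for ch in text.lower()) else "en"

def route_and_classify(text, fallback="en"):
    """Detect language on ingestion, route to that language's model."""
    lang = detect_language(text)
    model = CLASSIFIERS.get(lang, CLASSIFIERS[fallback])
    return lang, model(text)
```

The "most moving parts" caveat shows up even in this toy: every supported language adds a model to deploy, monitor, and retrain, plus a detection failure mode at the routing step.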
Choosing a tool: trade-offs
Five questions to answer before committing.
- Granularity. Document-level is cheap but loses aspect-level signal. For product feedback, insist on aspect-level.
- Domain fit. Has the tool been evaluated on your category's review data? Or only on public benchmarks?
- Latency. Can the tool handle your ingestion volume at your required freshness? A system that takes 2 days to classify new reviews is not a production system.
- Explainability. When a review is classified as negative, can the system show you which text drove the classification? This matters for audit and for user trust.
- Integration. Does the tool output work with your downstream systems — theme detection, anomaly detection, SKU-level rollups, exec reporting?
The Indellia sentiment analysis software landing page covers how Indellia handles each.
Try aspect-based sentiment on your reviews. The free AI Sentiment Analysis Tool runs polarity and aspect-level sentiment on any review set.