Visualize Sentence Embeddings Using PCA and t-SNE in Python
Let's visualize how sentence embeddings group similar sentences together using PCA and t-SNE.
Install Required Libraries
If you haven't installed them yet:
pip install sentence-transformers matplotlib scikit-learn
1. Generate Sentence Embeddings
We'll use Sentence Transformers to generate sentence embeddings.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# List of sentences
sentences = [
    "I love AI.",
    "Artificial Intelligence is fascinating.",
    "The weather is sunny today.",
    "I enjoy machine learning.",
    "It's raining outside.",
    "Deep learning is a subset of machine learning."
]
# Generate embeddings
embeddings = model.encode(sentences)
print("Embedding Shape:", embeddings.shape)  # (6, 384): 6 sentences, 384 dimensions each
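Before plotting, it can help to check numerically which sentences the model considers similar. The sketch below computes pairwise cosine similarity with NumPy. The helper name `cosine_similarity_matrix` and the toy vectors are illustrative assumptions, not part of the original tutorial or real model output.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarity between the rows of `vectors`."""
    # Scale each row to unit length; the clip guards against division by zero.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is their cosine similarity.
    return unit @ unit.T

# Toy stand-in vectors (hypothetical, used only so the sketch runs on its own).
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 2))
```

To apply this to the tutorial data, pass the `embeddings` array from the step above instead of `toy`. Higher values indicate sentences the model places closer together.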
2. PCA Visualization (2D Plot)
Principal Component Analysis (PCA) is a linear method that projects the embeddings onto the two directions of greatest variance, giving a 2D view that can be plotted.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Reduce to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
# Plot the embeddings
plt.figure(figsize=(10, 6))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], color='blue')
# Add labels
for i, sentence in enumerate(sentences):
    plt.text(embeddings_2d[i, 0], embeddings_2d[i, 1], sentence, fontsize=10)
plt.title("Sentence Embeddings Visualized with PCA")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.grid(True)
plt.show()
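A 2D PCA plot only shows part of the picture, so it is worth checking how much of the original variance the two components keep. The sketch below reads `explained_variance_ratio_` from a fitted `PCA` object. The random `fake_embeddings` array is an assumption that stands in for the real embeddings so the snippet runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (assumption): random vectors with the same shape as the
# tutorial's embeddings, 6 sentences x 384 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(6, 384))

pca_check = PCA(n_components=2)
pca_check.fit(fake_embeddings)

# Fraction of total variance captured by each of the two components.
ratios = pca_check.explained_variance_ratio_
print("Variance kept per component:", ratios)
print("Total variance kept in 2D:", ratios.sum())
```

With the real data, replace `fake_embeddings` with `embeddings`. If the total is low, the 2D plot is discarding most of the structure and distances in it should be read with caution.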
3. t-SNE Visualization
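t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear method that tends to preserve local neighborhoods better than PCA, so similar sentences usually land close together. In scikit-learn, perplexity must be smaller than the number of samples, so with only six sentences it has to stay below 6 (here, 5).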
from sklearn.manifold import TSNE
# Reduce to 2 dimensions using t-SNE
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
embeddings_2d_tsne = tsne.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 6))
plt.scatter(embeddings_2d_tsne[:, 0], embeddings_2d_tsne[:, 1], color='green')
# Add labels
for i, sentence in enumerate(sentences):
    plt.text(embeddings_2d_tsne[i, 0], embeddings_2d_tsne[i, 1], sentence, fontsize=10)
plt.title("Sentence Embeddings Visualized with t-SNE")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.grid(True)
plt.show()
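Because t-SNE is sensitive to its hyperparameters, it can be useful to run it with more than one perplexity value and compare the layouts. The sketch below does this for two values. The random `fake_embeddings` array and the chosen perplexities (2 and 5) are assumptions for illustration only.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data (assumption): random vectors shaped like the tutorial's
# embeddings, 6 sentences x 384 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(6, 384)).astype(np.float32)

layouts = {}
for perplexity in (2, 5):  # both must be smaller than the number of samples (6)
    tsne_run = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    layouts[perplexity] = tsne_run.fit_transform(fake_embeddings)
    print(f"perplexity={perplexity}: output shape {layouts[perplexity].shape}")
```

With the real data, replace `fake_embeddings` with `embeddings` and plot each entry of `layouts` as in the code above. Groupings that appear at several perplexity values are more trustworthy than ones that show up only once.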
What's the Difference?
| Method | Best For | Pros | Cons |
|---|---|---|---|
| PCA | Global structure | Fast, linear | Can miss nonlinear local relationships |
| t-SNE | Local structure (clusters) | Captures clusters | Slow on large datasets, sensitive to hyperparameters such as perplexity |
Conclusion
- Use PCA if you want a quick overview of the embeddings.
- Use t-SNE if you're looking for clusters and more detailed semantic grouping.