
LangExtract Guide: Master Google's AI Data Extraction Library

The Unofficial Developer's Handbook. Learn how to use LangExtract to extract structured data from any text using Gemini, OpenAI, and Ollama.

Unofficial Guide

This is a community-maintained guide built by developers. We are not affiliated with Google.

Why Trust This Guide?

Official documentation is great, but real-world projects are messy. As developers using LangExtract in production, we built this guide to bridge the gap between "Hello World" and deployment.

We cover local LLM setups, cost optimization, and handling complex documents — things you won't find in the standard README.


Quick Start

Get up and running in 30 seconds.

1. Install

Install via pip. Requires Python 3.9+.

bash
pip install langextract

TIP

Use a virtual environment to avoid dependency conflicts: python -m venv venv && source venv/bin/activate
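
To confirm the package is importable after installing:

bash
python -c "import langextract"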

2. Configure API Key

By default, LangExtract uses Google Gemini. Get your key from Google AI Studio.

bash
export LANGEXTRACT_API_KEY="your-api-key-here"
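
Prefer a .env file over shell exports? A minimal sketch using python-dotenv (a separate package; LangExtract reads LANGEXTRACT_API_KEY from the environment either way):

python
from dotenv import load_dotenv  # pip install python-dotenv

# Load LANGEXTRACT_API_KEY from a local .env file into the process environment
load_dotenv()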

3. Your First Extraction

Extract characters from a simple text using few-shot examples.

python
import langextract as lx

# Define your extraction prompt
prompt = "Extract characters and their emotions from the text."

# Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
        ]
    )
]

# Input text to process
text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo."

# Run extraction (uses Gemini Flash by default)
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

print(result.extractions)
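
Each extraction records its class, the grounded source text, and any attributes; a quick loop to inspect them:

python
# Print each extraction's class, matched source text, and attributes
for extraction in result.extractions:
    print(extraction.extraction_class, extraction.extraction_text, extraction.attributes)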

4. Visualize the Results 📊

LangExtract's killer feature. Generate an interactive HTML report that highlights each extraction in its source text, making verification easy.

python
# 1. Save as JSONL
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# 2. Generate interactive HTML
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # In notebooks visualize() returns an IPython HTML object; elsewhere, a plain string
    f.write(html_content.data if hasattr(html_content, 'data') else html_content)

View Visualization Example


LLM Configuration Guide

LangExtract supports multiple backends. Here is how to configure them.

Local LLMs with Ollama 🏠

Great for privacy and zero cost.

  1. Install Ollama: ollama.com
  2. Pull a Model: ollama pull gemma2:2b
  3. Run Ollama Server: ollama serve
  4. Configure Code:
python
import langextract as lx

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",  # Automatically selects Ollama provider
    model_url="http://localhost:11434",
    fence_output=False,
    use_schema_constraints=False
)
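
Before extracting, confirm the server is reachable and the model is pulled. A minimal sketch using the requests package (this calls Ollama's own REST API, not LangExtract):

python
import requests

# Ask the local Ollama server which models it has pulled (default port 11434)
tags = requests.get("http://localhost:11434/api/tags").json()
print([m["name"] for m in tags["models"]])  # should include "gemma2:2b"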

OpenAI GPT-4 🧠

Best for complex reasoning tasks. Requires optional dependency: pip install langextract[openai]

bash
export OPENAI_API_KEY="sk-..."
python
import os
import langextract as lx

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",  # Automatically selects OpenAI provider
    api_key=os.environ.get('OPENAI_API_KEY'),
    fence_output=True,
    use_schema_constraints=False
)

Note

OpenAI models require fence_output=True and use_schema_constraints=False because LangExtract doesn't implement schema constraints for OpenAI yet.

OpenAI-Compatible APIs 🔌

LangExtract works with any OpenAI-compatible API, including DeepSeek, Qwen, Doubao, and more.

python
import langextract as lx

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="deepseek-chat",
    api_key="your-api-key",
    language_model_params={
        "base_url": "https://api.deepseek.com/v1"  # Replace with provider's URL
    },
    fence_output=True,
    use_schema_constraints=False
)

Google Vertex AI Batch (Enterprise) 🏢

For large-scale tasks, enable Batch mode to save costs.

python
result = lx.extract(
    ...,
    language_model_params={"vertexai": True, "batch": {"enabled": True}}
)

Advanced Installation: Docker 🐳

Run without polluting your local environment:

bash
docker run --rm -e LANGEXTRACT_API_KEY="your-key" langextract python your_script.py
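
For the container to see your script, mount the working directory. This assumes an image tagged langextract has already been built (the flags are standard Docker; the image name is carried over from the example above):

bash
docker run --rm -v "$PWD":/app -w /app -e LANGEXTRACT_API_KEY="your-key" langextract python your_script.py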

Scaling to Longer Documents 📚

How do you handle books or PDFs larger than the context window? LangExtract has built-in chunking and parallel processing.

No need to split text manually. Just pass the URL or long text:

python
# Example: Process the entire "Romeo and Juliet"
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    # Core parameters
    extraction_passes=3,    # Multiple passes over the text to improve recall
    max_workers=20,         # Parallel workers for faster chunk processing
    max_char_buffer=1000    # Max characters per chunk sent to the model
)
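
The same save-and-visualize flow from the Quick Start works on long-document runs, which makes it easy to spot-check what each pass recovered:

python
# Persist the run and review it with the interactive visualization
lx.io.save_annotated_documents([result], output_name="romeo_juliet.jsonl", output_dir=".")
html_content = lx.visualize("romeo_juliet.jsonl")
with open("romeo_juliet.html", "w") as f:
    f.write(html_content.data if hasattr(html_content, "data") else html_content)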

Real-World Data Extraction Examples

Explore our collection of Google AI data extraction examples for common use cases.

🏥 Medical Report Extraction

Extract medication names, dosages, and frequencies from clinical notes; a minimal sketch follows these examples. View Full Example

📚 Long Text Extraction

Handling PDFs or books that exceed token limits. View Full Example

🇯🇵 Multilingual Extraction

Working with non-English text (Japanese, Chinese). View Full Example
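
As a taste of the medical example, here is a minimal sketch; the classes, attributes, and sample notes are illustrative, not a fixed schema:

python
import langextract as lx

prompt = "Extract medication names, dosages, and frequencies from the clinical note."

examples = [
    lx.data.ExampleData(
        text="Patient was started on Lisinopril 10 mg once daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Lisinopril",
                attributes={"dosage": "10 mg", "frequency": "once daily"}
            ),
        ]
    )
]

result = lx.extract(
    text_or_documents="Continue Metformin 500 mg twice daily with meals.",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)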


FAQ

Q: What is the difference between LangExtract and Docling?
A: Docling specializes in parsing documents (like PDFs) into Markdown, handling layout analysis. LangExtract focuses on extracting structured data (like JSON) from text. They work great together: use Docling to parse, then LangExtract to structure the data, as sketched below.
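
A minimal sketch of that pipeline, assuming docling is installed (pip install docling) and that prompt and examples are defined as in the Quick Start; "report.pdf" is a placeholder path:

python
from docling.document_converter import DocumentConverter
import langextract as lx

# 1. Parse the PDF into Markdown with Docling
markdown = DocumentConverter().convert("report.pdf").document.export_to_markdown()

# 2. Extract structured data from the Markdown with LangExtract
result = lx.extract(
    text_or_documents=markdown,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)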

Q: Can I use DeepSeek, Groq, or other OpenAI-compatible models?
A: Absolutely. LangExtract supports any model with an OpenAI-compatible API. Just set the base_url to your provider's endpoint. It works seamlessly with DeepSeek V3/R1, Groq, local vLLM, etc.

Q: How do I handle documents longer than the context window?
A: LangExtract has built-in chunking mechanisms. Check out our Long Text Extraction Example to see how it automatically splits long texts, processes them (in parallel or in sequence), and merges the results.

Q: Can I run this locally for privacy?
A: Yes. Integrate with Ollama to run models like Llama 3 or Mistral locally. This is free and ensures data never leaves your machine, ideal for medical or legal data.

Q: Is LangExtract free?
A: The library is 100% open-source and free. You only pay for LLM API usage (e.g., Google Gemini, OpenAI). Run it locally with Ollama and it is completely cost-free.


For Chinese Users (中文用户)

Looking for a LangExtract tutorial (教程) or installation guide (安装指南)? This guide covers everything from running LangExtract with local models via Ollama (使用本地模型) to hands-on, tested examples (实测).

Unofficial Guide. Not associated with Google.