Supported Document Formats
PrimeThink supports a wide variety of document formats for upload and processing. When you upload a document, the system automatically extracts the text content and makes it available to AI assistants in your chats.
Document Text Extraction Process
When you upload a document, PrimeThink automatically extracts text content in Markdown format for supported file types. This extraction process is asynchronous and happens in the background.
How Extraction Works
- Upload: When a file is uploaded, it receives an initial status of "Added"
- Processing: The system begins extracting text content asynchronously
- Completion: Once extraction is complete, the status changes (e.g., to "Processed", "Ready", or "Indexed")
- Error Handling: If extraction fails, the status becomes "Error"
Extraction Timing
- Processing Time: Varies from a few seconds to longer, depending on:
  - File size (larger files take longer)
  - File complexity (scanned PDFs with OCR take longer than text-based documents)
  - System load
- Status Indicators:
  - "Added" = Extraction not yet started or in progress
  - "Error" = Extraction failed
  - Any other status ("Processed", "Ready", "Indexed", etc.) = Extraction complete, text available
Extracted Content Format
The extracted text is converted to Markdown format, which includes:
- Document structure (headings, paragraphs)
- Tables (for spreadsheets and documents with tables)
- Basic formatting (bold, italic, lists)
- Text from images (via OCR or AI vision)
- Transcriptions (for audio/video files)
Checking Extraction Status
To check if a document's text has been extracted:
- Check the document's status field in API responses
- Use pt.getDocumentStatus(docId) in live apps
- Any status other than "Added" or "Error" indicates extraction is complete
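The status rule above can be sketched as a small helper. The status strings are the ones documented here; the function name `extractionState`, its return values, and the sample response object are illustrative assumptions, not part of the PrimeThink API.

```javascript
// Sketch only: maps a document status string to one of three states.
// "Added" and "Error" are documented; every other status is treated as complete.
function extractionState(status) {
  if (typeof status !== "string" || status === "") {
    throw new TypeError("status must be a non-empty string");
  }
  if (status === "Added") return "pending"; // not started or still in progress
  if (status === "Error") return "failed"; // extraction failed
  return "complete"; // e.g. "Processed", "Ready", "Indexed"
}

// Hypothetical API response object, used only to show the call.
const doc = { id: "doc-123", status: "Processed" };
console.log(extractionState(doc.status)); // prints "complete"
```

Treating every unknown status as complete mirrors the rule stated above, so the helper keeps working if further completion statuses exist.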
For programmatic workflows in live apps, see Data Management API for details on checking status and retrieving extracted text.
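Because extraction runs in the background, a live app may need to wait for it to finish. The following is a minimal polling sketch, not the documented approach: `waitForExtraction`, `intervalMs`, and `timeoutMs` are hypothetical names, and the status lookup is passed in as a function because the exact return shape of pt.getDocumentStatus is not described here. The sketch assumes the lookup returns, or resolves to, a status string.

```javascript
// Poll until extraction completes, fails, or the timeout elapses.
// ASSUMPTION: getStatus(docId) returns (or resolves to) a status string
// such as "Added", "Error", or "Processed".
async function waitForExtraction(
  getStatus,
  docId,
  { intervalMs = 2000, timeoutMs = 120000 } = {}
) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const status = await getStatus(docId);
    if (status === "Error") {
      throw new Error(`Extraction failed for document ${docId}`);
    }
    if (status !== "Added") {
      return status; // any other status means the text is available
    }
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Timed out waiting for document ${docId}`);
    }
    // Still "Added": wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a live app this might be called as `waitForExtraction((id) => pt.getDocumentStatus(id), docId)`, assuming that call yields the status string. Consult the Data Management API for the actual signature.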
Document Files
| Format | Extensions | Description |
|---|---|---|
| PDF | .pdf | Portable Document Format with OCR support for scanned documents |
| Word | .doc, .docx | Microsoft Word documents (both legacy and modern formats) |
| Plain Text | .txt | Simple text files |
| Markdown | .md | Markdown formatted text |
| HTML | .html, .htm | Web pages and HTML content |
PDF Processing: PrimeThink can extract text from both digital PDFs and scanned documents using optical character recognition (OCR).
Spreadsheets
| Format | Extensions | Description |
|---|---|---|
| Excel | .xls, .xlsx | Microsoft Excel spreadsheets (both legacy and modern formats) |
| CSV | .csv | Comma-separated values |
Presentations
| Format | Extensions | Description |
|---|---|---|
| PowerPoint | .ppt, .pptx | Microsoft PowerPoint presentations (both legacy and modern formats) |
Data Files
| Format | Extensions | Description |
|---|---|---|
| JSON | .json | JavaScript Object Notation |
| XML | .xml | Extensible Markup Language |
Images
| Format | Extensions | Description |
|---|---|---|
| JPEG | .jpg, .jpeg | Standard image format |
| PNG | .png | Portable Network Graphics |
| GIF | .gif | Graphics Interchange Format |
| BMP | .bmp | Bitmap images |
| WebP | .webp | Modern web image format |
Image Processing: PrimeThink uses AI vision to extract text from images and generate descriptions of image content.
Audio and Video
| Format | Extensions | Description |
|---|---|---|
| MP3 | .mp3 | Audio files |
| M4A | .m4a | Apple audio format |
| WAV | .wav | Waveform audio |
| MP4 | .mp4 | Video files |
Audio/Video Processing: Audio and video files are automatically transcribed to text, making spoken content searchable and accessible to AI assistants.
Email Files
| Format | Extensions | Description |
|---|---|---|
| EML | .eml | Standard email message files |
Email Processing: The system extracts email headers (From, To, Subject, Date), body content, and lists any attachments.
Archives
| Format | Extensions | Description |
|---|---|---|
| ZIP | .zip | Compressed archive files |
Archive Processing: ZIP files are automatically extracted, and each file within the archive is processed individually.
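As a convenience, the file extensions listed in the tables above can be collected into a client-side lookup that flags unsupported files before upload. This is a sketch under assumptions: the category keys and the `categoryFor` function are hypothetical, matching is done case-insensitively on the final extension only, and PrimeThink itself may identify file types differently.

```javascript
// Extensions copied from the format tables above; category keys are illustrative.
const SUPPORTED_EXTENSIONS = {
  document: ["pdf", "doc", "docx", "txt", "md", "html", "htm"],
  spreadsheet: ["xls", "xlsx", "csv"],
  presentation: ["ppt", "pptx"],
  data: ["json", "xml"],
  image: ["jpg", "jpeg", "png", "gif", "bmp", "webp"],
  audioVideo: ["mp3", "m4a", "wav", "mp4"],
  email: ["eml"],
  archive: ["zip"],
};

// Returns the category for a filename, or null if the extension is not listed.
// ASSUMPTION: only the last extension matters and case is ignored.
function categoryFor(filename) {
  const dot = filename.lastIndexOf(".");
  if (dot <= 0 || dot === filename.length - 1) return null; // no usable extension
  const ext = filename.slice(dot + 1).toLowerCase();
  for (const [category, extensions] of Object.entries(SUPPORTED_EXTENSIONS)) {
    if (extensions.includes(ext)) return category;
  }
  return null;
}

console.log(categoryFor("Quarterly Report.PDF")); // prints "document"
console.log(categoryFor("backup.tar.gz")); // prints null
```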
Web Content
| Type | Description |
|---|---|
| URLs | Paste any web URL to automatically fetch and extract its content |
| YouTube Videos | YouTube links are processed to extract video metadata, descriptions, and transcripts when available |
File Size Limits
- Maximum file size: 50 MB per file
- ZIP archives: Subject to limits on total uncompressed size and file count
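A client can check the per-file limit before uploading. The sketch below is illustrative only: `checkFileSize` and `MAX_FILE_BYTES` are hypothetical names, and it assumes 1 MB means 1,048,576 bytes, which this page does not specify. ZIP-specific limits are not covered because their values are not stated.

```javascript
// ASSUMPTION: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
const MAX_FILE_BYTES = 50 * 1024 * 1024;

// Reports whether a file of the given size fits within the per-file limit.
function checkFileSize(sizeInBytes) {
  if (!Number.isFinite(sizeInBytes) || sizeInBytes < 0) {
    throw new RangeError("sizeInBytes must be a non-negative number");
  }
  return {
    ok: sizeInBytes <= MAX_FILE_BYTES,
    overBy: Math.max(0, sizeInBytes - MAX_FILE_BYTES), // bytes above the limit
  };
}

console.log(checkFileSize(10 * 1024 * 1024).ok); // prints true
console.log(checkFileSize(MAX_FILE_BYTES + 1).ok); // prints false
```

In a browser, the size would typically come from the `size` property of a `File` object.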
Tips for Best Results
- Use text-based formats when possible - Digital documents (like .docx) typically yield better results than scanned PDFs.
- Ensure good image quality - For images containing text, higher-resolution images produce more accurate text extraction.
- Check audio quality - Clear audio recordings with minimal background noise transcribe more accurately.
- Name files descriptively - Use meaningful filenames to help organize and identify your documents.
Related Topics
- Document Management - Learn how to manage documents in chats
- Managing Document Visibility - Control how AI assistants access your documents
- Collections - Organize documents into collections