Datasets

Existing datasets for text-centric image forensics often suffer from limited scale, lack of multilingual support, or absence of fine-grained reasoning annotations. To address these gaps and support the rigorous requirements of GenText-Forensics, we introduce RealText-V2—a large-scale, multi-dimensional benchmark purpose-built for text-centric image forensics.

RealText-V2 Dataset

RealText-V2 is the first large-scale benchmark dedicated to diverse text scenarios, ranging from sparse text in natural scenes to dense text in complex documents. By combining scale, scenario diversity, and rich semantic annotations, it serves as a comprehensive testbed for forgery analysis and adversarial generation tasks. Its key features are:

20K+: Multi-modal samples with rich annotations

6: Languages spanning diverse script systems

6: Real-world domains

100+: Attack & forgery methods

3-Level: Multi-granularity forgery coverage

Triple: Detection + Grounding + Explanation

Languages & Domains

Multilingual Coverage

RealText-V2 spans 6 languages—English, Chinese, Arabic, Thai, Malay, and Indonesian—covering Latin, logographic, Arabic, and Thai script systems. Each writing system presents unique forensic challenges, from character-level substitution in Latin scripts to stroke-level tampering in Chinese characters.

Multi-Domain Scenarios

The dataset covers 6 critical domains: Finance, Healthcare, Education, Live Streaming, E-commerce, and Natural Scenes. This diversity—from dense structured documents (financial statements, medical records) to sparse scene text (street signs, product labels)—ensures models are tested across the full spectrum of real-world text-rich imagery.

Multi-Granularity Forgery

Forgery operations span three granularity levels: character-level (e.g., visually similar character substitution), word-level (e.g., content replacement with consistent typography), and semantic-level (e.g., logical contradictions, identity fraud). This hierarchy tests models' ability to detect both subtle visual artifacts and higher-level semantic inconsistencies.

Multi-Source Samples

The dataset incorporates both real-world tampered samples and AIGC-synthesized forgeries covering diverse generation pipelines. This dual-source design ensures forensic models generalize across traditional manipulation techniques and emerging AI-powered editing methods.

Pixel-Level Localization

Each forged sample is annotated with precise pixel-level ground truth masks that delineate the exact tampered regions. These masks are critical for training and evaluating spatial grounding models that must identify not just whether an image is forged, but where the manipulation occurred.
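As an illustration of how a predicted tampering mask might be compared with a ground truth mask, the sketch below computes intersection-over-union (IoU) on binary masks. IoU is a common choice for grounding tasks, but it is an assumption here: this page does not state which localization metric the challenge uses, and the nested-list mask representation is chosen only to keep the example self-contained.

```python
from typing import Sequence

# Rows of 0/1 pixels; 1 marks a tampered pixel (illustrative representation).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection-over-union of two binary masks of identical shape."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly (e.g. an authentic image predicted as clean).
    return 1.0 if union == 0 else intersection / union


truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
# 3 pixels are marked in both masks; 5 pixels are marked in at least one.
print(mask_iou(pred, truth))  # 0.6
```

A score of 1.0 means the predicted region matches the ground truth exactly, while a score near 0 means the model flagged the wrong area even if its image-level verdict was correct.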

Expert-Level Explanations

Beyond binary labels and masks, RealText-V2 provides natural language explanations authored by domain experts. These explanations describe the specific visual artifacts (e.g., font inconsistency, noise discontinuity) and semantic contradictions (e.g., logical errors, identity fraud) that underpin each forgery judgment, enabling explainable forensic analysis.

Comparison with Existing Datasets

RealText-V2 is the first large-scale benchmark to support multilingual analysis with comprehensive annotations for detection (Det), grounding (Mask), and explanation (Expl).

Dataset               Total Text Lines   Multi-Lang   Det   Mask   Expl
T-IC13                462
T-SROIE               986
OSFT                  2,938
DocTamper             170K
RealText-V1           5,397
RealText-V2 (Ours)    20K+

Annotation & Report Format

Each forged sample in RealText-V2 is annotated with pixel-level localization masks, tampering type labels, and expert-level natural language explanations. For the challenge, participants generate structured forensic analysis reports in the following format:

Forgery Analysis Report — Sample Output
[Conclusion] FORGED
[RISK_SCORE] 73/100
ANOMALY_001
[GROUNDING] [1081, 933, 1288, 998]
[REASON] A crude, solid black rectangular block has been applied to obscure the phone number. The sharp edges and uniform color create a clear discontinuity with the surrounding texture.
ANOMALY_002
[GROUNDING] [1372, 585, 1630, 655]
[REASON] The document states the meeting is at '5:00 a.m.', which contradicts standard business practice. The font of 'a.m.' shows slight misalignment, suggesting digital alteration.

The document exhibits 2 distinct anomalies. The forgery pattern involves a mix of crude redactions and logical inconsistencies, providing substantial evidence of tampering.
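The sample report above can be read back into a structured form with a small parser. The following is a minimal sketch that assumes the layout is exactly as shown: bracketed tags at the start of a line, `ANOMALY_NNN` header lines, and untagged text as the closing summary. The class and function names (`Anomaly`, `ForensicReport`, `parse_report`) are hypothetical, and the meaning of the four grounding integers is not interpreted because the page does not define it.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Anomaly:
    anomaly_id: str
    grounding: List[int] = field(default_factory=list)  # four integers, as in the sample
    reason: str = ""


@dataclass
class ForensicReport:
    conclusion: str = ""
    risk_score: Optional[int] = None
    anomalies: List[Anomaly] = field(default_factory=list)
    summary: str = ""


TAG_LINE = re.compile(r"^\[(\w+)\]\s*(.*)$")
ANOMALY_LINE = re.compile(r"^ANOMALY_\d+$")


def parse_report(text: str) -> ForensicReport:
    report = ForensicReport()
    current: Optional[Anomaly] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ANOMALY_LINE.match(line):
            current = Anomaly(anomaly_id=line)
            report.anomalies.append(current)
            continue
        tagged = TAG_LINE.match(line)
        if tagged is None:
            # Assumption: any untagged line belongs to the closing summary.
            report.summary = (report.summary + " " + line).strip()
            continue
        tag, value = tagged.group(1).upper(), tagged.group(2)
        if tag == "CONCLUSION":
            report.conclusion = value
        elif tag == "RISK_SCORE":
            report.risk_score = int(value.split("/")[0])  # "73/100" -> 73
        elif tag == "GROUNDING" and current is not None:
            current.grounding = [int(n) for n in re.findall(r"-?\d+", value)]
        elif tag == "REASON" and current is not None:
            current.reason = value
    return report


sample = """\
[Conclusion] FORGED
[RISK_SCORE] 73/100
ANOMALY_001
[GROUNDING] [1081, 933, 1288, 998]
[REASON] A crude, solid black rectangular block has been applied to obscure the phone number.
ANOMALY_002
[GROUNDING] [1372, 585, 1630, 655]
[REASON] The document states the meeting is at '5:00 a.m.', which contradicts standard business practice.

The document exhibits 2 distinct anomalies.
"""

report = parse_report(sample)
print(report.conclusion, report.risk_score, len(report.anomalies))  # FORGED 73 2
print(report.anomalies[0].grounding)  # [1081, 933, 1288, 998]
```

Tags are matched case-insensitively because the sample mixes `[Conclusion]` with `[RISK_SCORE]`. A reason that wraps onto a second line would be folded into the summary by this sketch, so a production parser would need a stricter rule for multi-line fields.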

Data Access

RealText-V2 dataset files and metadata are distributed through Hugging Face. Please check the Schedule page for data release dates.

Training Data

The training set includes forged and authentic images with pixel-level ground truth masks, tampering type labels, and expert-level natural language explanations for "Hard Samples." It is designed for training explainable forensic agents.

Test Data

The test set is used for final leaderboard ranking, and its ground truth labels are withheld. Participants submit structured forensic reports (JSONL format) for automated evaluation via the LLM Judge pipeline.
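JSONL stores one JSON object per line, so a submission file holds one report per test image. The sketch below shows only that mechanic. Every field name in it (`image_id`, `conclusion`, `risk_score`, `anomalies`), the sample identifiers, and the `AUTHENTIC` label are hypothetical placeholders, since the official submission schema is not specified on this page.

```python
import json
from typing import Any, Dict, List


def to_jsonl(records: List[Dict[str, Any]]) -> str:
    """Serialize one JSON object per line, the defining property of JSONL."""
    # ensure_ascii=False keeps non-Latin text (e.g. Chinese, Arabic, Thai) readable.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def from_jsonl(payload: str) -> List[Dict[str, Any]]:
    """Parse JSONL text back into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]


# Hypothetical records; field names are illustrative, not the official schema.
records = [
    {
        "image_id": "sample_0001",
        "conclusion": "FORGED",
        "risk_score": 73,
        "anomalies": [
            {
                "grounding": [1081, 933, 1288, 998],
                "reason": "Solid black block obscures the phone number.",
            }
        ],
    },
    {"image_id": "sample_0002", "conclusion": "AUTHENTIC", "risk_score": 4, "anomalies": []},
]

payload = to_jsonl(records)
print(payload.count("\n"))  # 2, one line per report
print(from_jsonl(payload) == records)  # True, the round trip is lossless
```

Because `json.dumps` escapes newlines inside strings, a multi-sentence reason still occupies a single line, which is what lets an automated evaluator read the file record by record.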