DOCUGRAPH — Graph-Based Document Layout Analysis

Features

Built for structured document understanding

DOCUGRAPH combines computer vision, graph neural networks, and OCR to recover the logical structure of any document — not just its text.

Smart OCR Recognition

Understands complex document layouts beyond linear text — including multi-column research papers, forms and reports.

GNN-Based Analysis

Uses graph neural networks to learn the relational structure between visual regions for context-aware layout understanding.

Structured Segmentation

Detects tables, headers, figures and column flow accurately, preserving document hierarchy for downstream processing.

Multilingual Support

Handles structured multilingual documents through Tesseract integration and language-agnostic graph representations.

Real Table Grids

Automatically detects and reconstructs tables with proper grid formatting, preserving cell alignment and data structure.

Shapes & Flowchart Visualization

Recognizes shapes, arrows, flowchart elements and diagrams, then visualizes them in your exported documents.

Image Inclusion in DOCX

Preserves all images, diagrams, and visual elements directly in exported DOCX files with proper formatting.

Custom Formatting Templates

Export with customizable document styles, fonts, colors, and layouts tailored to your needs.

Advanced Capabilities

Multi-Structured Document Intelligence

Traditional systems process text linearly and struggle with complex layouts. DOCUGRAPH goes further: it models document structure and content continuity automatically.

Paragraph Continuity Across Columns

When a paragraph is cut off in one column and continues in the next, traditional OCR systems fail — scrambling text and losing context.

DOCUGRAPH's solution: Our Graph Neural Network automatically detects column boundaries and intelligently reconstructs paragraphs by:

Tracking spatial relationships between text blocks across columns
Following semantic flow to connect paragraph fragments in the correct order
Preserving context so multi-column research papers and magazines are reconstructed in their original reading order

Result: Your multi-column documents become seamlessly readable, not jumbled.
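
The column-continuity idea above can be illustrated with a small geometric sketch. This is not DOCUGRAPH's Graph Neural Network; it is a simple heuristic stand-in, and the `Block` structure, the overlap rule, and the punctuation test are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with a bounding box in page pixels (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_columns(blocks):
    """Group blocks into columns by horizontal overlap, scanning left to right."""
    columns = []  # each column: {"x0": ..., "x1": ..., "blocks": [...]}
    for block in sorted(blocks, key=lambda b: b.x0):
        for col in columns:
            # A block joins a column when their horizontal extents overlap.
            if block.x0 < col["x1"] and block.x1 > col["x0"]:
                col["blocks"].append(block)
                col["x0"] = min(col["x0"], block.x0)
                col["x1"] = max(col["x1"], block.x1)
                break
        else:
            columns.append({"x0": block.x0, "x1": block.x1, "blocks": [block]})
    return columns


def reading_order(blocks):
    """Return blocks column by column, top to bottom within each column."""
    ordered = []
    for col in sorted(group_into_columns(blocks), key=lambda c: c["x0"]):
        ordered.extend(sorted(col["blocks"], key=lambda b: b.y0))
    return ordered


def merge_fragments(blocks):
    """Join fragments in reading order. Assumption: a fragment that does not
    end in sentence-final punctuation continues in the next block."""
    paragraphs = []
    current = ""
    for block in reading_order(blocks):
        text = block.text.strip()
        current = f"{current} {text}".strip()
        if text.endswith((".", "!", "?")):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


# Blocks arrive in arbitrary order, as they might from a detector.
page = [
    Block("with remarkable accuracy, even when paragraphs span columns.", 320, 40, 580, 120),
    Block("The research demonstrates that neural networks can understand structure", 20, 400, 280, 480),
    Block("Layout analysis is hard.", 20, 40, 280, 100),
]
paragraphs = merge_fragments(page)
```

Here the fragment at the bottom of the left column is joined with the block at the top of the right column, giving two paragraphs. A learned model would replace the punctuation test with relational evidence from the page graph.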

Example: Magazine Article

Column 1 (Top):

The research demonstrates that modern neural networks can understand document structure...

Column 2 (Continuation):

...with remarkable accuracy, even when paragraphs span multiple columns.

Reconstructed as one coherent paragraph

Example: News Article with Sidebar

Main Article:

Lorem ipsum dolor sit amet, consectetur adipiscing elit...

SIDEBAR

Did you know? Related fact...

Main Article (Cont.):

Sed do eiusmod tempor incididunt...

Content properly separated and tagged

Automatic Section & Box Separation

Documents like news articles and office files contain enclosed sections, sidebars, and callout boxes that should stay separate from main content.

DOCUGRAPH's solution: Our system automatically detects and isolates distinct sections by:

Identifying enclosed structures visually distinct from main content
Classifying section types (sidebar, callout, table, caption, footer, etc.)
Preventing content mixing so your sidebars don't contaminate your main text
Preserving hierarchy with metadata about section relationships

Result: Clean, organized output where each content stream is properly identified and separated.
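
As a rough illustration of the separation step, the sketch below assigns each text block either to the main flow or to an enclosing frame (such as a sidebar box) based on how much of the block lies inside that frame. The `Box` type, the `separate_streams` helper, and the 0.9 threshold are hypothetical choices, not DOCUGRAPH's actual method.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page pixels (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self):
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def containment(inner, outer):
    """Fraction of `inner`'s area that lies inside `outer`."""
    overlap = Box(
        max(inner.x0, outer.x0), max(inner.y0, outer.y0),
        min(inner.x1, outer.x1), min(inner.y1, outer.y1),
    )
    return overlap.area() / inner.area() if inner.area() > 0 else 0.0


def separate_streams(text_blocks, frames, threshold=0.9):
    """Split (text, Box) pairs into main content and per-frame enclosed content."""
    main = []
    enclosed = {i: [] for i in range(len(frames))}
    for text, box in text_blocks:
        for i, frame in enumerate(frames):
            if containment(box, frame) >= threshold:
                enclosed[i].append(text)
                break
        else:
            # No frame contains this block, so it belongs to the main flow.
            main.append(text)
    return main, enclosed


frames = [Box(400, 100, 580, 300)]  # one detected sidebar frame
blocks = [
    ("Lorem ipsum dolor sit amet...", Box(20, 100, 380, 200)),
    ("Did you know? Related fact...", Box(410, 120, 570, 180)),
    ("Sed do eiusmod tempor incididunt...", Box(20, 220, 380, 320)),
]
main, enclosed = separate_streams(blocks, frames)
```

With these inputs the two article fragments stay in `main` and the "Did you know?" text is kept apart under frame 0, mirroring the news-article example.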

88%+ multi-column document accuracy
Zero content mixing issues
100% paragraph continuity preserved

How It Works

From scan to structured intelligence

A six-stage pipeline that converts raw documents into machine-readable, hierarchically organized output.

01 Upload Document: PDF · JPG · PNG
02 Preprocessing: Denoise · deskew
03 Graph Construction: Nodes & edges
04 GNN Layout Analysis: Region classification
05 OCR Integration: Tesseract pass
06 Structured Output: JSON · DOCX · PDF
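
To make the graph construction stage more concrete, here is a minimal sketch that turns region bounding boxes into nodes and connects each node to its nearest neighbours. The node features and the k-nearest-neighbour edge rule are assumptions for illustration; the page does not specify how DOCUGRAPH actually builds its graph.

```python
import math


def center(box):
    """Centre point of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_knn_graph(boxes, k=2):
    """Return (nodes, edges): one node per region and directed edges
    from each region to its k nearest regions by centre distance."""
    centers = [center(b) for b in boxes]
    nodes = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        cx, cy = centers[i]
        nodes.append({"id": i, "cx": cx, "cy": cy,
                      "width": x1 - x0, "height": y1 - y0})
    edges = []
    for i, ci in enumerate(centers):
        distances = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(centers) if j != i
        )
        edges.extend((i, j) for _, j in distances[:k])
    return nodes, edges


boxes = [
    (20, 20, 400, 60),    # a heading region
    (20, 80, 290, 400),   # left column
    (310, 80, 580, 400),  # right column
]
nodes, edges = build_knn_graph(boxes, k=1)
```

Each edge `(i, j)` says that region `j` is among the closest regions to `i`. A layout model can then pass information along these edges before classifying regions.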

About the Research

A thesis on document intelligence

DOCUGRAPH is an undergraduate research project investigating whether Graph Neural Networks can outperform traditional CNN-only pipelines for document layout analysis.

About DOCUGRAPH

The study introduces a hybrid pipeline that represents document pages as graphs of visual regions and uses GNNs to classify those regions into headers, paragraphs, tables, and figures — improving downstream OCR accuracy and document understanding.
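
The core idea behind classifying regions with a GNN is that each region's representation is updated using its neighbours. The sketch below shows one untrained mean-aggregation step in plain Python. It is a simplified illustration under assumed inputs: a real model would add learned weights and nonlinearities, typically via a library such as PyTorch Geometric.

```python
def mean_aggregate(features, edges):
    """One message-passing step: replace each node's feature vector with
    the mean of its own vector and those of its incoming neighbours."""
    neighbours = {i: [i] for i in range(len(features))}  # include the node itself
    for src, dst in edges:
        neighbours[dst].append(src)
    updated = []
    for i in range(len(features)):
        group = [features[j] for j in neighbours[i]]
        updated.append([sum(vals) / len(group) for vals in zip(*group)])
    return updated


# Hypothetical two-dimensional features for three regions.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Directed edges (source, destination): 0 <-> 1 and 1 <-> 2.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
updated = mean_aggregate(features, edges)
```

After one step, region 0 has absorbed part of region 1's features, which is how context from nearby regions can inform each region's eventual label.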

The model is trained and evaluated on the PubLayNet benchmark and compared against CNN baselines such as Faster R-CNN.

Technologies Used

Tesseract OCR
Graph Neural Networks
CNN Comparison
PubLayNet Dataset
PyTorch Geometric
Firebase + Vercel

Research Objectives

Build a graph-based representation of document pages preserving spatial & relational context.
Train a GNN to classify visual regions into headers, paragraphs, tables, and figures.
Evaluate accuracy and structural fidelity against CNN-only baselines on PubLayNet.
Integrate the GNN with Tesseract OCR for end-to-end structured text extraction.
Deliver a deployable web prototype usable by researchers and institutions.

UN SDG 9 — Industry, Innovation & Infrastructure

Advancing accessible AI infrastructure for document digitization and research workflows.

Team

The researchers behind DOCUGRAPH

An undergraduate research team building the future of document layout analysis.

Joaquin Centeno

Project Manager

jtcenteno@fit.edu.ph

Nicole Oliva

Researcher/Documenter

ntoliva@fit.edu.ph

Lenj Magsino

Fullstack Developer

lrmagsino@fit.edu.ph

James Ocasiones

Researcher/Quality Assurance

jaocasiones@fit.edu.ph

Contact

Get in touch

Questions about the research, collaboration, or thesis defense? Reach out below.

We'd love to hear from you

Whether you're a fellow researcher, an institution exploring document intelligence, or a panel reviewer — drop us a note and we'll get back within a few days.

hello@docugraph.dev
Quezon City, Philippines
docugraph.research
