Undergraduate Thesis Research · 2026

DOCUGRAPH
Graph-Based Document Layout Analysis

Enhancing OCR for researchers and institutions using Graph Neural Networks. Transforming complex documents into structured digital intelligence.

97.4% · Layout F1-score
6-stage · Processing pipeline
SDG 9 · UN aligned

Features

Built for structured document understanding

DOCUGRAPH combines computer vision, graph neural networks, and OCR to recover a document's logical structure, not just its text.

Smart OCR Recognition

Understands complex document layouts beyond linear text, including multi-column research papers, forms, and reports.

GNN-Based Analysis

Uses graph neural networks to learn the relational structure between visual regions for context-aware layout understanding.

Structured Segmentation

Detects tables, headers, figures, and column flow, preserving document hierarchy for downstream processing.

Multilingual Support

Handles structured multilingual documents through Tesseract integration and language-agnostic graph representations.

Advanced Capabilities

Multi-Structured Document Intelligence

Traditional systems process text linearly and struggle with complex layouts. DOCUGRAPH goes further by automatically modeling document structure and content continuity.

Paragraph Continuity Across Columns

When a paragraph is cut off in one column and continues in the next, traditional OCR systems often read straight across the page, scrambling the text and losing context.

DOCUGRAPH's solution: Our Graph Neural Network automatically detects column boundaries and intelligently reconstructs paragraphs by:

  • Tracking spatial relationships between text blocks across columns
  • Following semantic flow to connect paragraph fragments in the correct order
  • Preserving context so multi-column research papers and magazines are reconstructed in their original reading order

Result: Your multi-column documents become seamlessly readable, not jumbled.
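The steps above can be approximated with a purely geometric stand-in. The sketch below is not DOCUGRAPH's Graph Neural Network: it groups text blocks into columns by horizontal overlap, reads each column top to bottom, and joins a fragment to the previous one when that text does not end a sentence. The `Block` class, the function names, and the punctuation rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_columns(blocks):
    """Group blocks whose x-ranges overlap, then sort each column top to bottom."""
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and block.x0 < columns[-1]["right"]:
            # Overlaps the current column horizontally: same column.
            columns[-1]["blocks"].append(block)
            columns[-1]["right"] = max(columns[-1]["right"], block.x1)
        else:
            # No overlap with the current column: start a new one.
            columns.append({"right": block.x1, "blocks": [block]})
    return [sorted(c["blocks"], key=lambda b: b.y0) for c in columns]


def reading_order(blocks):
    """Left column first, then the next column, each read top to bottom."""
    return [b for column in group_into_columns(blocks) for b in column]


def merge_fragments(ordered_blocks):
    """Join a block to the previous paragraph if that paragraph looks unfinished."""
    paragraphs = []
    for block in ordered_blocks:
        text = block.text.strip()
        if paragraphs and not paragraphs[-1].endswith((".", "!", "?", ":")):
            paragraphs[-1] = paragraphs[-1] + " " + text
        else:
            paragraphs.append(text)
    return paragraphs


page = [
    Block("structure with high accuracy.", 120, 0, 220, 40),  # column 2, top
    Block("Graphs capture layout.", 0, 0, 100, 40),           # column 1, top
    Block("Neural networks can recover", 0, 50, 100, 90),     # column 1, cut off
]
paragraphs = merge_fragments(reading_order(page))
```

A heuristic like this breaks down when a heading or figure spans several columns, which is the kind of case a learned, graph-based model is meant to handle.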

Example: Magazine Article

  • Column 1 (Top): The research demonstrates that modern neural networks can understand document structure...
  • Column 2 (Continuation): ...with remarkable accuracy, even when paragraphs span multiple columns.
  • Result: Reconstructed as one coherent paragraph

Example: News Article with Sidebar

  • Main Article: Lorem ipsum dolor sit amet, consectetur adipiscing elit...
  • Sidebar: Did you know? Related fact...
  • Main Article (Cont.): Sed do eiusmod tempor incididunt...
  • Result: Content properly separated and tagged

Automatic Section & Box Separation

Documents like news articles and office files contain enclosed sections, sidebars, and callout boxes that should stay separate from main content.

DOCUGRAPH's solution: Our system automatically detects and isolates distinct sections by:

  • Identifying enclosed structures visually distinct from main content
  • Classifying section types (sidebar, callout, table, caption, footer, etc.)
  • Preventing content mixing so your sidebars don't contaminate your main text
  • Preserving hierarchy with metadata about section relationships

Result: Clean, organized output where each content stream is properly identified and separated.
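As a rough illustration of the separation step, the sketch below assigns each text block to the enclosing box that contains it, and to the main stream otherwise. It assumes the boxes have already been detected and named; the `Rect` class, `contains`, `separate_streams`, and the pixel tolerance are hypothetical and do not describe DOCUGRAPH's actual detector or classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box in page coordinates (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float


def contains(outer, inner, tolerance=2.0):
    """True if `inner` lies inside `outer`, allowing a small pixel tolerance."""
    return (
        inner.x0 >= outer.x0 - tolerance
        and inner.y0 >= outer.y0 - tolerance
        and inner.x1 <= outer.x1 + tolerance
        and inner.y1 <= outer.y1 + tolerance
    )


def separate_streams(blocks, boxes):
    """Route each (text, Rect) block to the first box containing it, else to 'main'."""
    streams = {"main": []}
    for name in boxes:
        streams[name] = []
    for text, rect in blocks:
        owner = next(
            (name for name, box in boxes.items() if contains(box, rect)),
            "main",
        )
        streams[owner].append(text)
    return streams


blocks = [
    ("Lorem ipsum dolor sit amet", Rect(0, 0, 300, 100)),
    ("Did you know? Related fact", Rect(320, 20, 420, 80)),
    ("Sed do eiusmod tempor", Rect(0, 110, 300, 200)),
]
boxes = {"sidebar": Rect(310, 10, 430, 90)}
streams = separate_streams(blocks, boxes)
```

Because blocks keep their input order within each stream, the main article stays contiguous even though the sidebar sits between its two parts on the page.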

88%+ · Multi-column document accuracy
Zero · Content mixing issues
100% · Paragraph continuity preserved

How It Works

From scan to structured intelligence

A six-stage pipeline that converts raw documents into machine-readable, hierarchically organized output.

  • 01 Upload Document: PDF · JPG · PNG
  • 02 Preprocessing: Denoise · deskew
  • 03 Graph Construction: Nodes & edges
  • 04 GNN Layout Analysis: Region classification
  • 05 OCR Integration: Tesseract pass
  • 06 Structured Output: JSON · DOCX · PDF
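To make stage 03 more concrete, here is a minimal sketch of one common way to turn detected regions into a graph: each region becomes a node carrying simple geometric features, and each node is linked to its k nearest neighbours by centre distance. The page does not say how DOCUGRAPH builds its edges, so the k-nearest-neighbour rule, the feature set, and the names `center` and `build_graph` are assumptions.

```python
import math


def center(box):
    """Centre point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_graph(boxes, k=2):
    """Return (nodes, edges): per-region features and directed k-NN edges."""
    nodes = [
        {"id": i, "center": center(b), "width": b[2] - b[0], "height": b[3] - b[1]}
        for i, b in enumerate(boxes)
    ]
    edges = []
    for i, node in enumerate(nodes):
        # Rank every other node by Euclidean distance between centres.
        others = sorted(
            (j for j in range(len(nodes)) if j != i),
            key=lambda j: math.dist(node["center"], nodes[j]["center"]),
        )
        edges.extend((i, j) for j in others[:k])
    return nodes, edges


regions = [
    (0, 0, 100, 20),      # a short, wide region near the top
    (0, 30, 100, 130),    # a tall region below it
    (0, 140, 100, 240),   # another tall region in the same column
    (120, 30, 220, 130),  # a region in a second column
]
nodes, edges = build_graph(regions, k=1)
```

The resulting node features and edge list are the inputs a graph neural network would consume in stage 04.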

About the Research

A thesis on document intelligence

DOCUGRAPH is an undergraduate research project investigating how Graph Neural Networks can outperform traditional CNN-only pipelines for document layout analysis.

About DOCUGRAPH

The study introduces a hybrid pipeline that represents document pages as graphs of visual regions and uses GNNs to classify those regions into headers, paragraphs, tables, and figures — improving downstream OCR accuracy and document understanding.

The model is trained and evaluated on the PubLayNet dataset and compared against CNN baselines such as Faster R-CNN.
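The idea of classifying regions from their graph context can be shown with a toy example. The sketch below performs one round of mean-aggregation message passing followed by a linear scoring step, in plain Python rather than PyTorch Geometric. The two-dimensional features, the hand-set weights, and the label names are illustrative assumptions; nothing here is trained, and it is not the thesis model.

```python
def mean_aggregate(features, edges):
    """One message-passing round: average each node's features with its in-neighbours'."""
    neighbours = {i: [i] for i in range(len(features))}  # include a self-loop
    for src, dst in edges:
        neighbours[dst].append(src)
    aggregated = []
    for i in range(len(features)):
        group = [features[j] for j in neighbours[i]]
        aggregated.append([sum(column) / len(group) for column in zip(*group)])
    return aggregated


def classify(features, weights, labels):
    """Score each node with a linear layer and pick the highest-scoring label."""
    predictions = []
    for feature in features:
        scores = [sum(w * x for w, x in zip(row, feature)) for row in weights]
        predictions.append(labels[scores.index(max(scores))])
    return predictions


# Hypothetical features per region: [heading-like score, body-text-like score].
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2), (2, 1)]  # directed (source, destination) pairs
weights = [[1.0, 0.0], [0.0, 1.0]]  # hand-set, untrained
labels = ["header", "paragraph"]

context = mean_aggregate(features, edges)
predicted = classify(context, weights, labels)
```

In a real pipeline the aggregation and weights would be learned layers, and several rounds would be stacked so that each region sees more of its surroundings.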

Technologies Used

  • Tesseract OCR
  • Graph Neural Networks
  • CNN Comparison
  • PubLayNet Dataset
  • PyTorch Geometric
  • Firebase + Vercel

Research Objectives

  • Build a graph-based representation of document pages preserving spatial & relational context.
  • Train a GNN to classify visual regions into headers, paragraphs, tables, and figures.
  • Evaluate accuracy and structural fidelity against CNN-only baselines on PubLayNet.
  • Integrate the GNN with Tesseract OCR for end-to-end structured text extraction.
  • Deliver a deployable web prototype usable by researchers and institutions.
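For the evaluation objective, an F1-score over region labels can be computed as sketched below. This page does not state how the reported F1 figure was calculated, so the per-region label comparison and macro averaging used here are assumptions, as are the function names and the sample labels.

```python
def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Made-up labels for four regions, for illustration only.
truth = ["header", "paragraph", "paragraph", "table"]
guess = ["header", "paragraph", "table", "table"]
per_class = f1_per_class(truth, guess)
overall = macro_f1(truth, guess)
```

Running the same function on the GNN's predictions and on a CNN baseline's predictions gives directly comparable numbers, provided both are matched to the same ground-truth regions.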
UN SDG 9 — Industry, Innovation & Infrastructure

Advancing accessible AI infrastructure for document digitization and research workflows.

Team

The researchers behind DOCUGRAPH

An undergraduate research team building the future of document layout analysis.

  • Joaquin Centeno · Project Manager
  • Nicole Oliva · Researcher/Documenter
  • Lenj Magsino · Fullstack Developer
  • James Ocasiones · Researcher/Quality Assurance

Contact

Get in touch

Questions about the research, collaboration, or thesis defense? Reach out below.

We'd love to hear from you

Whether you're a fellow researcher, an institution exploring document intelligence, or a panel reviewer — drop us a note and we'll get back within a few days.

  • hello@docugraph.dev
  • Quezon City, Philippines
  • docugraph.research