From Image to Summary: Using Tesseract and BART for Text Extraction and Summarization
In an increasingly data-driven world, the ability to extract and summarize information quickly and efficiently is crucial. This article explores the use of two powerful technologies: Tesseract for text extraction from images, and Facebook’s BART (Bidirectional and Auto-Regressive Transformers) for text synthesis and summarization. By integrating these technologies into a Gradio interface, we will demonstrate how to simplify and automate the process of text extraction and summarization.
Installing the Libraries
To set up the project, follow these steps:
- 1 Create a virtual environment
Creating a virtual environment helps in managing dependencies and maintaining a clean workspace.
python -m venv env
source env/bin/activate # On Windows, use `env\Scripts\activate`
- 2 Create a requirements.txt file
List all the necessary libraries in a requirements.txt file:
transformers==4.9.2
torch==2.3.1
gradio==4.37.2
numpy==1.23.5
pillow==10.3.0
pytesseract==0.3.10
protobuf
- 3 Install dependencies
Install the libraries listed in requirements.txt:
pip install -r requirements.txt
Note that the Tesseract OCR engine itself is a system program rather than a pip package: install it separately (for example through your system package manager or the Windows installer) so that pytesseract, its Python wrapper, can find it.
Now let’s set up the Python application file app.py with the necessary imports and functions:
import pytesseract
from transformers import pipeline, BartForConditionalGeneration, BartTokenizer
import gradio as gr
from PIL import Image, UnidentifiedImageError
import re
import logging
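If the Tesseract binary is not on your PATH (a common situation on Windows), pytesseract lets you point to the executable explicitly. The path below is only an example for a default Windows install, not part of the original project; adjust it to your system or omit it entirely if Tesseract is already on the PATH.
# Optional: tell pytesseract where the Tesseract binary lives if it is not on PATH.
# The path below is an example for a typical Windows install; adjust it to your system.
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"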
Initializing the Model
We initialize the BART summarization model using the Hugging Face pipeline:
summarize = pipeline('summarization', model="facebook/bart-large-cnn")
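The BartForConditionalGeneration and BartTokenizer imports allow an alternative to the pipeline: loading the model and tokenizer directly and calling generate yourself. Here is a minimal sketch of that approach; the summarize_direct name and the generation parameters are our own choices, not part of the original code.
# Alternative to the pipeline: direct model/tokenizer interaction.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def summarize_direct(text):
    # Tokenize the input, truncating to the model's 1024-token input limit
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    # Generate a summary with beam search
    summary_ids = model.generate(
        inputs["input_ids"], num_beams=4, min_length=50, max_length=512
    )
    # Decode the generated token ids back to text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)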
Function to Extract Text from an Image
Now create a function that extracts text from an image and summarizes it:
def image_extract(image):
    try:
        if image is None:
            return "No image found"
        # Extract raw text from the image with Tesseract
        raw_text = pytesseract.image_to_string(image)
        # Preprocess the raw text: collapse whitespace
        preprocessed = re.sub(r'\s+', ' ', raw_text).strip()
        # Summarize the preprocessed text
        text_summary = summarize(preprocessed, do_sample=False, min_length=50, max_length=512)
        summary_text_from_image = text_summary[0].get('summary_text')
        return summary_text_from_image
    except UnidentifiedImageError:
        return "Unable to load the image"
    except Exception as e:
        return str(e)
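To sanity-check the function before wiring up the interface, you can call it directly on a local image. The file name sample.png below is a hypothetical placeholder, not a file from the original project.
# Quick local test (optional): "sample.png" is a hypothetical example path
test_image = Image.open("sample.png")
print(image_extract(test_image))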
Gradio Interface
- The Gradio interface takes an image input, runs the OCR and summarization function on it, and displays the resulting summary.
- It can optionally be extended with a checkbox to choose between the pipeline and direct model/tokenizer interaction (see the sketch after the launch block below).
Create a Gradio interface to interact with the function:
interface = gr.Interface(
    fn=image_extract,
    inputs=gr.Image(label="Upload a file to extract text", type='pil'),
    outputs=gr.Textbox(label="Summary"),
    title="From image to summary: Using BART and Tesseract"
)
if __name__ == "__main__":
    interface.launch()
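As mentioned above, the interface can optionally be extended with a checkbox that switches between the pipeline and the direct model/tokenizer path. This is only a sketch; it assumes the summarize_direct helper from the earlier sketch and replaces the interface defined above.
# Optional extension: let the user choose between the pipeline and
# the direct model/tokenizer path (summarize_direct from the sketch above).
def image_extract_choice(image, use_pipeline):
    if image is None:
        return "No image found"
    raw_text = pytesseract.image_to_string(image)
    text = re.sub(r'\s+', ' ', raw_text).strip()
    if use_pipeline:
        return summarize(text, do_sample=False, min_length=50, max_length=512)[0]['summary_text']
    return summarize_direct(text)

interface = gr.Interface(
    fn=image_extract_choice,
    inputs=[
        gr.Image(label="Upload a file to extract text", type='pil'),
        gr.Checkbox(label="Use the summarization pipeline", value=True),
    ],
    outputs=gr.Textbox(label="Summary"),
    title="From image to summary: Using BART and Tesseract"
)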
Run the Application:
python app.py
Then open your browser and go to http://127.0.0.1:7860
Conclusion
This project showcases the potential of combining OCR technology with advanced NLP models to handle large volumes of textual data efficiently. By providing accessible tools like Gradio, this application demonstrates how such technologies can be harnessed to streamline and enhance data-driven workflows in various domains.
Demonstration: Link to Hugging Face
Link to my GitHub: