PYOSTIE (Python Open Source Text Information Extractor)
PYOSTIE is short for Python Open Source Text Information Extractor.
A simple, elegant library for extracting text from many file formats.
This module can extract text from PDFs, Office files, text files, and image files. It also generates an Excel file that gives you deeper insights into the text. Insights are currently extracted only for image and PDF formats. (More to come soon.)
git clone https://github.com/anirudhpnbb/Pyostie.git
pip3 install Pyostie
(or)
pip install Pyostie
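After installing, it can help to confirm that the package is importable before running anything else. The following is a minimal sketch using only the standard library. It assumes the import name is `pyostie` (lowercase), as used in the usage section, while the pip package is named `Pyostie`. The helper `is_installed` is hypothetical and not part of Pyostie.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the module is imported as "pyostie" (lowercase).
if not is_installed("pyostie"):
    print("pyostie not found; try: pip install Pyostie")
```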
<!-- USAGE EXAMPLES -->
## Usage
```python
import pyostie

# Images: extension can be "jpg", "tif", or "png".
# With insights=True, start() returns the insights (df) along with the text.
output = pyostie.extract(filename, insights=True, extension="jpg")
df, text = output.start()

# Images without insights: only the text is returned.
output = pyostie.extract(filename, insights=False, extension="jpg")
text = output.start()

# PDF
output = pyostie.extract(filename, extension="pdf")
text = output.start()

# PDF with insights
output = pyostie.extract(filename, insights=True, extension="pdf")
text = output.start()

# Excel
output = pyostie.extract(filename, extension="xlsx")
text = output.start()

# Word: image_folder (optional) is the path where the image needs to be written.
output = pyostie.extract(filename, image_folder, extension="docx")
text = output.start()

# Audio
output = pyostie.extract(filename, extension="mp3")
text = output.start()
```
In this version, we can only extract text from PDFs, Excel, TXT, CSV and MP3 formats. Soon, we will be adding doc, ppt, pptx, and many more. Watch this space for more updates.
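Every call in the usage section passes the file type explicitly through `extension`, and the image example returns a `(df, text)` pair when `insights=True` but only the text otherwise. The sketch below wraps both details. It assumes that `extension` is simply the file suffix without the dot, and that a 2-tuple result always means `(df, text)`; neither is confirmed for all formats. The names `infer_extension`, `split_result`, and `extract_any` are hypothetical helpers, not part of Pyostie.

```python
from pathlib import Path


def infer_extension(filename):
    """Return the lowercase file suffix without the dot, e.g. 'pdf'."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    if not suffix:
        raise ValueError(f"cannot infer an extension from {filename!r}")
    return suffix


def split_result(result):
    """Normalize the value returned by start() to an (insights, text) pair.

    Assumption: a 2-tuple means (df, text); anything else is the text alone.
    """
    if isinstance(result, tuple) and len(result) == 2:
        return result[0], result[1]
    return None, result


def extract_any(filename, insights=False):
    """Hypothetical wrapper around pyostie.extract for a single file."""
    import pyostie  # imported lazily so the helpers above work without it

    # Only pass insights when requested, mirroring the calls in the usage section.
    kwargs = {"extension": infer_extension(filename)}
    if insights:
        kwargs["insights"] = True
    output = pyostie.extract(filename, **kwargs)
    return split_result(output.start())
```

With this in place, `insights, text = extract_any("scan.jpg", insights=True)` always unpacks to two values, and `insights` is `None` when no insights were produced. The docx call, which also takes `image_folder`, is not covered by this wrapper.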
Anirudh Palaparthi - @anirudh8889 - pnbbanirudh - aniruddhapnbb@gmail.com
Balaram Guddanti - Balaram Guddanti - balaram.guddanti6@gmail.com
Project Link: PYOSTIE