Java

  • PowerPoint extractor

    8 answers - 871 bytes

    Hi!
    I just want to extract the text so that I can index it in Lucene. I am
    using the poi-3.0 jar files, with the hslf package from the scratchpad jar.
    My code is as follows:
    PowerPointExtractor ppExtractor =
        new PowerPointExtractor(new FileInputStream("filename.ppt"));
    String text = ppExtractor.getText();
    but I am getting the following exceptions. What am I doing wrong?
    Exception in thread "main" java.lang.NullPointerException
    at (SlideShow.java:211)
    at <init>(SlideShow.java:83)
    at <init>(PowerPointExtractor.java:85)
    at (PowerPointHandler.java:22)
    at indexing.IndexFiles.indexFile(IndexFiles.java:132)
    thanks,
    suba suresh
    To unsubscribe, e-mail: poi-user-unsubscribe (AT) jakarta (DOT) apache.org
    Mailing List: #poi
    The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/
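
While a parser bug like the one in the stack trace above is being tracked down, an indexing run can be kept alive by isolating each extraction call. The following is only a minimal sketch under that assumption: `SafeExtract` and `tryExtract` are hypothetical names, and the `Callable` stands in for whatever extractor call is in use (for example the `getText()` call from the post). It does not fix the underlying exception, it only logs it and skips the file.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

public class SafeExtract {

    /**
     * Runs one extraction and returns its text, or an empty Optional if the
     * extractor threw anything (including a NullPointerException).
     */
    public static Optional<String> tryExtract(String fileName, Callable<String> extractor) {
        try {
            return Optional.ofNullable(extractor.call());
        } catch (Exception e) {
            // Log and skip so that one bad file does not abort the whole index run.
            System.err.println("Skipping " + fileName + ": " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-in extractor that fails the way the post describes.
        Optional<String> bad = tryExtract("broken.ppt", () -> {
            throw new NullPointerException("simulated parser failure");
        });
        // Stand-in extractor that succeeds.
        Optional<String> good = tryExtract("fine.ppt", () -> "slide text");

        System.out.println("broken.ppt extracted: " + bad.isPresent());
        System.out.println("fine.ppt extracted: " + good.isPresent());
    }
}
```

Catching `Exception` this broadly is a deliberate trade-off for batch indexing of third-party files; the skipped file names should still be reported so the problem documents can be attached to a bug report.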
  • No.1 | | 528 bytes | |

    On Mon, 26 Jun 2006, Suba Suresh wrote:
    > but I am getting the following exceptions. What am I doing wrong?

    Can you upload your problem powerpoint file to bugzilla? We're shortly
    going to be changing the block of code this broke in, and that way we can
    be sure we've fixed this bug (along with a couple of others)

    Nick

  • No.2 | | 1139 bytes | |

    I can go to the link and download the file to bugzilla. Is there any
    procedure I have to follow? What is the link to bugzilla?

    As an aside, I am trying to do the same with Word document files using
    the POI hdf library. I just want to extract the text. How can I do that,
    and how can I extract metadata from all the Microsoft format files?

    thanks,
    suba suresh

    Nick Burch wrote:
    > On Mon, 26 Jun 2006, Suba Suresh wrote:
    >> but I am getting the following exceptions. What am I doing wrong?
    >
    > Can you upload your problem powerpoint file to bugzilla? We're shortly
    > going to be changing the block of code this broke in, and that way we
    > can be sure we've fixed this bug (along with a couple of others)
    >
    > Nick

  • No.3 | | 936 bytes | |

    On Mon, 26 Jun 2006, Suba Suresh wrote:
    > I can go to the link and download the file to bugzilla. Is there any
    > procedure I have to follow? What is the link to bugzilla?

    Just follow the "Bug Database" link from the sidebar when at
    That said, I've updated the slide building
    code today, so your problem might now be fixed. Try a new SVN build, and
    report back :)

    > As an aside, I am trying to do the same with Word document files using
    > the POI hdf library. I just want to extract the text. How can I do that,

    You'll be better off with hwpf. See another post to the list today for a
    guide.

    > and how can I extract metadata from all the Microsoft format files?

    For that, you'll want hpsf:

    Nick

  • No.4 | | 2371 bytes | |

    Thank you for all the pointers. It is a great help. I used today's
    build. It worked fine for WordDocument. I have not tried the metadata yet.
    For PowerPoint, the extractor prints the following for just one file.
    Am I doing anything wrong? I didn't change my code.

    No core record found with ID 3 based on PersistPtr lookup
    No core record found with ID 10 based on PersistPtr lookup
    No core record found with ID 12 based on PersistPtr lookup
    No core record found with ID 13 based on PersistPtr lookup
    No core record found with ID 16 based on PersistPtr lookup

    No core record found with ID 246 based on PersistPtr lookup

    PowerPointExtractor ppExtractor =
        new PowerPointExtractor(new FileInputStream("filename.ppt"));
    String text = ppExtractor.getText();

    Also, since some of the Excel files were not in 97-2002 format, I used
    POIFSFileSystem, read the file as a byte stream, and stored it as a text
    string. I hope that is fine.

    thanks,
    suba suresh.
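
For files that are "not 97-2002 format", one quick pre-check is whether the file is an OLE2 container at all before it is handed to POIFS. The sketch below compares the first eight bytes against the OLE2 signature. `Ole2Check` and `looksLikeOle2` are names made up for illustration. A match only says the file is an OLE2 container; as far as I know Excel 95 workbooks are OLE2 as well, so this check does not by itself separate Excel 95 from Excel 97.

```java
import java.nio.charset.StandardCharsets;

public class Ole2Check {

    // Signature found in the first eight bytes of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the given leading bytes start with the OLE2 signature. */
    public static boolean looksLikeOle2(byte[] firstBytes) {
        if (firstBytes == null || firstBytes.length < OLE2_MAGIC.length) {
            return false; // too short to hold the signature
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (firstBytes[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A comma-separated file saved with an .xls name is not an OLE2 container.
        byte[] csvNamedXls = "a,b,c\n1,2,3\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("CSV looks like OLE2: " + looksLikeOle2(csvNamedXls));
    }
}
```

The caller is expected to read the first eight bytes of the file separately and then reopen the stream (or use mark/reset) before passing it on to the real parser.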

  • No.5 | | 1555 bytes | |

    I have given the link to the file and opened a bug report. Yesterday's
    build was giving me lots of "No core record found with ID
    235 with PersistPtr lookup" warnings.

    thanks,
    suba suresh.

  • No.6 | | 403 bytes | |

    On Wed, 28 Jun 2006, Suba Suresh wrote:
    > I have given the link to the file and opened a bug report. Yesterday's
    > build was giving me lots of "No core record found with ID
    > 235 with PersistPtr lookup" warnings.

    OK, I'll take a look when I'm back from ApacheCon.

    Nick

  • No.7 | | 1024 bytes | |

    On Tue, 27 Jun 2006, Suba Suresh wrote:
    > Thank you for all the pointers. It is a great help. I used today's
    > build. It worked fine for WordDocument. I have not tried the metadata yet.
    > For PowerPoint, the extractor prints the following for just one file.
    > Am I doing anything wrong? I didn't change my code.

    These errors should now have gone. Can you try a new svn checkout /
    tomorrow's SVN build?

    > Also, since some of the Excel files were not in 97-2002 format, I used
    > POIFSFileSystem, read the file as a byte stream, and stored it as a text
    > string. I hope that is fine.

    If you have some code for getting some basic text out of Excel 95 files,
    we'd be interested in hosting it. I'm sure that something that outputs
    text that can be fed to lucene would be useful for a lot of people, even
    if that's all the excel 95 support we have.

    Nick

  • No.8 | | 2906 bytes | |

    I tried the July 4th build. The warnings are gone. Thank you.

    I used the following code for a couple of small Excel files to index
    with Lucene. I don't know how effective the search is going to be, since
    it is still in the implementation stage. If there are any errors, please
    let me know.

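    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.poi.poifs.filesystem.DocumentInputStream;
    import org.apache.poi.poifs.filesystem.POIFSDocument;

    // DocumentHandler and DocumentHandlerException are not part of POI or
    // Lucene core; they are assumed to be on the classpath.
    public class ExcelHandler implements DocumentHandler {

        private final String fileName;

        public ExcelHandler(String name) {
            fileName = name;
        }

        public Document getDocument(InputStream is)
                throws DocumentHandlerException {
            Document doc = new Document();
            try {
                // Wrap the raw stream as a POIFS document and read all its bytes.
                POIFSDocument pdoc = new POIFSDocument(fileName, is);
                DocumentInputStream docis = new DocumentInputStream(pdoc);
                byte[] content = new byte[docis.available()];
                docis.read(content);
                docis.close();

                // Append the decimal value of each byte to the text buffer.
                StringBuffer textBuf = new StringBuffer();
                for (int i = 0; i < content.length; i++) {
                    textBuf.append(Byte.toString(content[i]));
                }
                String text = textBuf.toString();
                if (!text.equals("")) {
                    // Stored in the index, but not indexed for searching.
                    doc.add(new Field("body", text, Field.Store.YES, Field.Index.NO));
                }
            } catch (IOException io) {
                throw new DocumentHandlerException("Cannot parse Excel Document", io);
            }
            return doc;
        }
    }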

    Separately, in another file, I am indexing the filename, filepath, and
    date as keywords. Hope it helps.

    thanks,
    suba suresh.
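
One thing worth noting about the handler above: its loop appends the decimal value of every byte, so the stored "text" is a long run of digits rather than words a search could match. A different technique, shown here only as a rough sketch, is to keep runs of printable ASCII bytes, in the spirit of the Unix `strings` tool. `PrintableRuns` and `extract` are hypothetical names, the sample bytes are invented, and the approach assumes the strings of interest are stored as single-byte characters, which is not guaranteed for every Excel version.

```java
import java.util.ArrayList;
import java.util.List;

public class PrintableRuns {

    /**
     * Collects runs of at least minLength consecutive printable ASCII bytes
     * (0x20 to 0x7E) from a binary buffer.
     */
    public static List<String> extract(byte[] content, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : content) {
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                // A non-printable byte ends the current run.
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Invented sample: binary bytes mixed with two short labels.
        byte[] sample = {0x09, 0x08, 'T', 'o', 't', 'a', 'l', 0x00,
                         'Q', '1', 0x01, 'S', 'a', 'l', 'e', 's'};

        // Decimal concatenation, as in the handler above.
        StringBuilder decimal = new StringBuilder();
        for (byte b : sample) {
            decimal.append(Byte.toString(b));
        }
        System.out.println(decimal);

        // Printable runs of four or more characters.
        System.out.println(String.join(" ", extract(sample, 4)));
    }
}
```

The joined runs could then be passed to Lucene as the body text with a tokenized field instead of an unindexed one. The minimum run length is a tuning knob: a larger value drops more binary noise but also drops short cell values.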

