Read Content from Files which are inside Zip file

Java, Zip, Extract, Apache Tika

Java Problem Overview


I am trying to create a simple Java program that reads and extracts the content of the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all these files, and I am using Apache Tika for this purpose.

Can somebody help me achieve this? I have tried the following so far, without success:

Code Snippet

public class SampleZipExtract {

	
	public static void main(String[] args) {
		
		List<String> tempString = new ArrayList<String>();
		StringBuffer sbf = new StringBuffer();
		
		File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");
		InputStream input;
		try {
		
		  input = new FileInputStream(file);
		  ZipInputStream zip = new ZipInputStream(input);
		  ZipEntry entry = zip.getNextEntry();
		  
		  BodyContentHandler textHandler = new BodyContentHandler();
		  Metadata metadata = new Metadata();

	      Parser parser = new AutoDetectParser();
	       
	      while (entry!= null){
	    	
	            if(entry.getName().endsWith(".txt") || 
	            		   entry.getName().endsWith(".pdf")||
	            		   entry.getName().endsWith(".docx")){
	    	  System.out.println("entry=" + entry.getName() + " " + entry.getSize());
	            	 parser.parse(input, textHandler, metadata, new ParseContext());
	            	 tempString.add(textHandler.toString());
	            }
	       }
		   zip.close();
	       input.close();
		   
	       for (String text : tempString) {
	       System.out.println("Apache Tika - Converted input string : " + text);
	       sbf.append(text);
	       }
		   System.out.println("Final text from all the three files " + sbf.toString());
		} catch (FileNotFoundException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (SAXException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (TikaException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
}

Java Solutions


Solution 1 - Java

If you're wondering how to get the file content from each ZipEntry, it's actually quite simple. Here's some sample code:

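import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public static void main(String[] args) throws IOException {
	// try-with-resources closes the ZipFile, and with it any entry streams still open
	try (ZipFile zipFile = new ZipFile("C:/test.zip")) {
		Enumeration<? extends ZipEntry> entries = zipFile.entries();

		while(entries.hasMoreElements()){
			ZipEntry entry = entries.nextElement();
			// Stream over the uncompressed bytes of this entry only
			InputStream stream = zipFile.getInputStream(entry);
		}
	}
}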

Once you have the InputStream you can read it however you want.
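
As one illustration of "reading it however you want", here is a minimal sketch that collects every entry into a map of entry name to text. The class and method names (ZipEntryReader, readTextEntries) are made up for this example, the content is assumed to be UTF-8 text, and InputStream.readAllBytes() needs Java 9 or later. For pdf or docx entries you would hand the stream to a parser instead of decoding it as text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the original answer.
final class ZipEntryReader {

    /** Returns entry name -> entry content decoded as UTF-8, skipping directory entries. */
    static Map<String, String> readTextEntries(Path zipPath) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipFile zipFile = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no content
                }
                // Each entry gets its own stream; closing it leaves the ZipFile open
                try (InputStream stream = zipFile.getInputStream(entry)) {
                    byte[] bytes = stream.readAllBytes(); // Java 9+
                    contents.put(entry.getName(), new String(bytes, StandardCharsets.UTF_8));
                }
            }
        }
        return contents;
    }
}
```

Calling ZipEntryReader.readTextEntries(Paths.get("C:/test.zip")) would then give you the text of each entry keyed by its name inside the archive.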

Solution 2 - Java

As of Java 7, the NIO API provides a better and more generic way of accessing the contents of Zip or Jar files. Actually, it is now a unified API which allows you to treat Zip files exactly like normal files.

In order to extract all of the files contained inside of a zip file in this API, you'd do this:

In Java 8:

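private void extractAll(URI fromZip, Path toDirectory) throws IOException{
    // fromZip must use the "jar:" scheme, e.g. jar:file:/path/to/archive.zip
    try (FileSystem zipFs = FileSystems.newFileSystem(fromZip, Collections.emptyMap())) {
        for (Path root : zipFs.getRootDirectories()) {
            try (Stream<Path> paths = Files.walk(root)) {
                paths.filter(Files::isRegularFile).forEach(path -> {
                    try {
                        // in a full implementation, you'd have to handle
                        // directories; here every file lands directly in toDirectory
                        Files.copy(path, toDirectory.resolve(path.getFileName().toString()));
                    } catch (IOException e) {
                        // lambdas cannot throw checked exceptions
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }
}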

In Java 7:

private void extractAll(URI fromZip, final Path toDirectory) throws IOException{
    // Explicit type arguments: Java 7 does not infer them from the parameter type
    try (FileSystem zipFs = FileSystems.newFileSystem(fromZip, Collections.<String, Object>emptyMap())) {

        for(Path root : zipFs.getRootDirectories()) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) 
                        throws IOException {
                    // You can do anything you want with the path here.
                    // Resolve the file name so the copy lands inside toDirectory
                    Files.copy(file, toDirectory.resolve(file.getFileName().toString()));
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) 
                        throws IOException {
                    // In a full implementation, you'd need to create each 
                    // sub-directory of the destination directory before 
                    // copying files into it
                    return super.preVisitDirectory(dir, attrs);
                }
            });
        }
    }
}
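
Both versions above leave directory handling as an exercise. The following is a rough sketch of one way to fill that gap, not the answer author's code: it opens the archive from a Path instead of a URI, recreates the sub-directories under the destination, and rejects any entry that would resolve outside it. The names ZipNioExtractor and extractAll are hypothetical, and overwriting existing files (REPLACE_EXISTING) is an assumption of this sketch.

```java
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original answer.
final class ZipNioExtractor {

    static void extractAll(Path zipPath, Path toDirectory) throws IOException {
        Path base = toDirectory.toAbsolutePath().normalize();
        // The (ClassLoader) cast picks the Path-based overload available since Java 7
        try (FileSystem zipFs = FileSystems.newFileSystem(zipPath, (ClassLoader) null)) {
            for (Path root : zipFs.getRootDirectories()) {
                try (Stream<Path> paths = Files.walk(root)) {
                    for (Path source : (Iterable<Path>) paths::iterator) {
                        // Go through a String: a zip path cannot be resolved
                        // directly against a path of the default file system
                        String relative = root.relativize(source).toString();
                        Path target = base.resolve(relative).normalize();
                        if (!target.startsWith(base)) {
                            throw new IOException("Entry escapes target directory: " + source);
                        }
                        if (Files.isDirectory(source)) {
                            Files.createDirectories(target);
                        } else {
                            Files.createDirectories(target.getParent());
                            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                        }
                    }
                }
            }
        }
    }
}
```

A plain for loop over the stream's iterator is used instead of forEach so that the IOException from Files.copy can propagate without being wrapped.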

Solution 3 - Java

Because of the condition in while, the loop might never break:

while (entry != null) {
  // If entry never becomes null here, loop will never break.
}

Instead of the null check there, you can try this:

ZipEntry entry = null;
while ((entry = zip.getNextEntry()) != null) {
  // Rest of your code
}
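
To show the corrected loop in context, here is a small self-contained sketch that reads every entry through the ZipInputStream itself. The class and method names (ZipStreamReader, readEntries) are invented for this example, and the entries are assumed to hold UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the original answer.
final class ZipStreamReader {

    /** Reads every non-directory entry of the zip stream as UTF-8 text. */
    static Map<String, String> readEntries(InputStream rawZip) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(rawZip)) {
            ZipEntry entry;
            // getNextEntry() is called on every iteration, so the loop ends
            // when the archive runs out of entries
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                // read() returns -1 at the end of the current entry, not of the whole zip
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                contents.put(entry.getName(),
                        new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return contents;
    }
}
```

Note that the data is read from the ZipInputStream, not from the underlying FileInputStream: the zip stream is what yields the uncompressed bytes of the current entry.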

Solution 4 - Java

Here is sample code you can use to let Tika take care of container files for you: http://wiki.apache.org/tika/RecursiveMetadata

From what I can tell, the accepted solution will not work for cases where there are nested zip files. Tika, however, will take care of such situations as well.

Solution 5 - Java

My way of achieving this is to create a ZipInputStream wrapper class that exposes only the stream of the current entry:

The wrapper class:

public class ZippedFileInputStream extends InputStream {

    private ZipInputStream is;
    
    public ZippedFileInputStream(ZipInputStream is){
        this.is = is;
    }

    @Override
    public int read() throws IOException {
        return is.read();
    }
    
    @Override
    public void close() throws IOException {
        is.closeEntry();
    }
    

}

The use of it:

    ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream("SomeFile.zip"));
    ZipEntry entry;
    
    while((entry = zipInputStream.getNextEntry())!= null) {
     
     ZippedFileInputStream archivedFileInputStream = new ZippedFileInputStream(zipInputStream);
     
     //... perform whatever logic you want here with ZippedFileInputStream 
    
     // note that this will only close the current entry stream and not the ZipInputStream
     archivedFileInputStream.close();
    
    }
    zipInputStream.close();

One advantage of this approach: InputStreams are often passed as arguments to methods that process them, and those methods tend to close the input stream as soon as they are done with it. With the wrapper, such a close only ends the current entry and leaves the ZipInputStream open.
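
To make that advantage concrete, here is a hedged sketch of a variant of the wrapper. It extends FilterInputStream (a swap from the plain InputStream above, so that bulk reads are delegated instead of going byte by byte) and feeds each entry to a consumer that closes its argument. EntryShield, ShieldDemo and consumeAndClose are hypothetical names. consumeAndClose merely stands in for any library method that closes the stream it receives.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;

// Hypothetical variant of the wrapper: close() ends the entry, not the zip stream.
final class EntryShield extends FilterInputStream {
    private final ZipInputStream zip;

    EntryShield(ZipInputStream zip) {
        super(zip);
        this.zip = zip;
    }

    @Override
    public void close() throws IOException {
        zip.closeEntry(); // the underlying ZipInputStream stays open
    }
}

final class ShieldDemo {

    /** Stand-in for a consumer that closes whatever stream it is given. */
    static String consumeAndClose(InputStream in) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
            return text.toString();
        }
    }

    /** Reads every entry even though the consumer closes its stream each time. */
    static List<String> readAll(InputStream rawZip) throws IOException {
        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(rawZip)) {
            while (zip.getNextEntry() != null) {
                texts.add(consumeAndClose(new EntryShield(zip)));
            }
        }
        return texts;
    }
}
```

If the ZipInputStream were passed to consumeAndClose directly, the first call would close it and the next getNextEntry() would fail with an IOException.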

Solution 6 - Java

I did mine like this on JDK 15. Remember to change the URL and the zip file name.

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Scanner;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.io.*;
import java.util.*;
import java.nio.file.Paths;

class Main {
  public static void main(String[] args) throws MalformedURLException,FileNotFoundException,IOException{
    String url,kfile;
    Scanner getkw = new Scanner(System.in);
    System.out.println(" Please Paste Url ::");
    url = getkw.nextLine();
    System.out.println("Please enter name of file you want to save as :: ");
    kfile = getkw.nextLine();
    getkw.close();
    Main Dinit = new Main();
    System.out.println(Dinit.dloader(url, kfile));
    ZipFile Vanilla = new ZipFile(new File("Vanilla.zip"));
    Enumeration<? extends ZipEntry> entries = Vanilla.entries();

    while(entries.hasMoreElements()){
        ZipEntry entry = entries.nextElement();
        if (entry.isDirectory()) {
            continue; // nothing to copy for a directory entry
        }
        // Copy this entry's bytes (not the whole archive) into a file named after
        // the entry; entries inside sub-directories need those directories to exist
        try (InputStream stream = Vanilla.getInputStream(entry);
             FileOutputStream outter = new FileOutputStream(new File(entry.getName()))) {
            outter.write(stream.readAllBytes());
        }
    }

  }
  private String dloader(String kurl, String fname)throws IOException{
    String status;
    try {
      URL url = new URL(kurl);   // use the URL passed in by the caller
      try (InputStream in = url.openStream();
           FileOutputStream out = new FileOutputStream(new File("Vanilla.zip"))) {   // Output File
        out.write(in.readAllBytes());
      }
      // Only reached when the download succeeded; a finally block here
      // would overwrite the error statuses below
      status = "Status: Good";
    } catch (MalformedURLException e) {
      status = "Status: MalformedURLException Occurred";
    } catch (IOException e) {
      status = "Status: IOException Occurred";
    }
    String path="\\tkwgter5834\\";
    extractor(fname,"tkwgter5834",path);
    

    return status;
  }
  private String extractor(String fname,String dir,String path){
    File folder = new File(dir);
    if(!folder.exists()){
      folder.mkdir();
    }
    return "";
  }
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | S Jagdeesh | View Question on Stackoverflow
Solution 1 - Java | Rodrigo Sasaki | View Answer on Stackoverflow
Solution 2 - Java | LordOfThePigs | View Answer on Stackoverflow
Solution 3 - Java | user2030471 | View Answer on Stackoverflow
Solution 4 - Java | Harinder | View Answer on Stackoverflow
Solution 5 - Java | Vilius | View Answer on Stackoverflow
Solution 6 - Java | vaimalaviya | View Answer on Stackoverflow