After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!

I hope you are using Linux;

`pdftotext -layout NAME_OF_PDF.pdf`
 
AMAZING!!

Now you have a nice text file with all the information lined up in neat columns, which makes it trivial to convert to CSV or a similar format.
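As a rough sketch of that last step (not part of the original answer), the column-aligned output can be split on runs of two or more spaces and written out with the `csv` module. The helper names `layout_text_to_rows` and `pdf_to_csv` are made up for illustration, and the two-space rule is an assumption that breaks down when a cell itself contains wide gaps or when columns are separated by a single space.

```python
import csv
import re
import subprocess


def layout_text_to_rows(text):
    """Split each non-empty line on runs of two or more whitespace characters."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def pdf_to_csv(pdf_path, csv_path):
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    with open(csv_path, "w", newline="") as handle:
        csv.writer(handle).writerows(layout_text_to_rows(result.stdout))
```

Calling `pdf_to_csv("NAME_OF_PDF.pdf", "out.csv")` needs `pdftotext` on the `PATH`. Rows with empty cells will come out with fewer fields than the header, so the result usually still needs a manual check.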

It is at times like this that I love Linux: these guys came up with AMAZING solutions to everything, and put them out there for FREE!


You should definitely have a look at this answer of mine:

* **[Extracting table contents from a collection of PDF files](https://stackoverflow.com/a/26110587/359307)**

and also have a look at all the links included therein.

[Tabula/TabulaPDF](https://github.com/tabulapdf/tabula) is currently the best table extraction tool that is available for PDF scraping.

I'd just like to add to the very helpful answer from Kurt Pfeifle: there is now a Python wrapper for Tabula, and it seems to work very well so far: https://github.com/chezou/tabula-py

This will convert your PDF table to a pandas DataFrame. You can also set the area in x,y coordinates, which is very handy for irregular data.
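A minimal sketch of what that might look like, assuming a recent tabula-py with a Java runtime available. As far as I know, `tabula.read_pdf` takes its `area` as `(top, left, bottom, right)` in points, so a box written as x,y corners has to be reordered first. `area_from_xy` and `read_tables` are hypothetical helper names, not part of tabula-py.

```python
def area_from_xy(x1, y1, x2, y2):
    """Reorder an (x1, y1, x2, y2) box into (top, left, bottom, right)."""
    return [min(y1, y2), min(x1, x2), max(y1, y2), max(x1, x2)]


def read_tables(pdf_path, xy_box=None, pages="all"):
    # Imported lazily: needs `pip install tabula-py` and Java installed.
    import tabula

    kwargs = {"pages": pages}
    if xy_box is not None:
        kwargs["area"] = area_from_xy(*xy_box)
    # Recent tabula-py versions return a list of pandas DataFrames.
    return tabula.read_pdf(pdf_path, **kwargs)
```

For example, `read_tables("NAME_OF_PDF.pdf", xy_box=(50, 100, 300, 400), pages=1)` would restrict extraction to that box on the first page; the coordinates here are placeholders.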

My application uses a GridView to list all the information, with pagination. When the user clicks a pagination number, then clicks edit and then save, they are redirected to the view page. What I want is to redirect the user to the previous page (the URL with the pagination number).

Yii2 redirect to previous page

I set up sliding tabs with two `Fragment`s. Each `Fragment` has a `Button` which opens a `WebView`. The problem is that when the `Button` is clicked, the sliding tabs are still active, so when a user tries to navigate within the `WebView` they end up swiping to the other tab. Is there a way, in an on-click method, to disable the swiping ability of the tabs? Any help would be hugely appreciated!

Here is the code:

    public class MyWebViewClass extends Fragment {

    private WebView mWebView;
    private Button mButton;

    public MyWebViewClass() {
        // Required empty public constructor
    }


    @Override
    public View onCreateView(LayoutInflater inflater, ViewGroup container,
                             Bundle savedInstanceState) {
        // Inflate the layout for this fragment
        View view = inflater.inflate(R.layout.fragment_webview, container, false);

        mWebView = (WebView) view.findViewById(R.id.WebView);

        mButton = (Button) view.findViewById(R.id.Button1);
        mButton.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                mWebView.setVisibility(View.VISIBLE);
                mButton.setVisibility(View.GONE);
                mWebView.getSettings().setJavaScriptEnabled(true);
                mWebView.loadUrl("https://www.google.com");
            }
        });

        return view;
    }




Disable swiping between tabs

Are there any open source libraries that support table identification & extraction?

By this I mean: 

 1. Identify a table structure exists
 2. Classify the table from its contents
 3. Extract data from the table in a useful output format e.g. JSON / CSV etc.

I have looked through similar questions on this topic and found the following:

 - [PDFMiner][1] which addresses problem 3, but it seems the user is required to specify to PDFMiner where a table structure exists for each table (correct me if I'm wrong)
 - [pdf-table-extract][2] which attempts to address problem 1 but according to the [To-Do][3] list, cannot currently identify tables that are separated by whitespace. This is a problem as all tables in my PDFs are separated by whitespace!

Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!


  [1]: https://pypi.python.org/pypi/pdfminer/
  [2]: https://github.com/ashima/pdf-table-extract
  [3]: https://github.com/ashima/pdf-table-extract/blob/master/TODO.md

Extract / Identify Tables from PDF python


NetworkX is powerful but I was trying to plot a graph which shows node labels by default and I was surprised how tedious this seemingly simple task could be for someone new to Networkx. There is an example which shows how to add labels to the plot.

https://networkx.github.io/documentation/latest/examples/drawing/labels_and_colors.html

The problem with this example is that it uses too many steps and methods when all I want to do is just show labels which are same as the node name while drawing the graph.

    import networkx as nx

    G = nx.Graph()
    # Add nodes and edges
    G.add_node("Node1")
    G.add_node("Node2")
    G.add_edge("Node1", "Node2")
    nx.draw(G)    # Doesn't draw labels. How to make it show the labels Node1, Node2 as well?

Is there a way to make `nx.draw(G)` show the default labels (Node1, Node2 in this case) inline in the graph?

Plotting networkx graph with node labels defaulting to node name

I have some data and when I import it, I get the following unneeded columns. I'm looking for an easy way to delete all of these.

    'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
    'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30', 'Unnamed: 31',
    'Unnamed: 32', 'Unnamed: 33', 'Unnamed: 34', 'Unnamed: 35',
    'Unnamed: 36', 'Unnamed: 37', 'Unnamed: 38', 'Unnamed: 39',
    'Unnamed: 40', 'Unnamed: 41', 'Unnamed: 42', 'Unnamed: 43',
    'Unnamed: 44', 'Unnamed: 45', 'Unnamed: 46', 'Unnamed: 47',
    'Unnamed: 48', 'Unnamed: 49', 'Unnamed: 50', 'Unnamed: 51',
    'Unnamed: 52', 'Unnamed: 53', 'Unnamed: 54', 'Unnamed: 55',
    'Unnamed: 56', 'Unnamed: 57', 'Unnamed: 58', 'Unnamed: 59',
    'Unnamed: 60'

They are 0-indexed, so I tried something like

    df.drop(df.columns[[22, 23, 24, 25,
    26, 27, 28, 29, 30, 31, 32, 55]], axis=1, inplace=True)

But this isn't very efficient. I tried writing some for loops, but this struck me as bad Pandas behaviour. Hence I ask the question here.

I've seen some examples which are similar (https://stackoverflow.com/questions/26347412/drop-multiple-columns-pandas) but this doesn't answer my question.


Deleting multiple columns based on column names in Pandas

How can I run multiple python scripts? At the moment I run one like so `python script1.py`.

I've tried `python script1.py script2.py` and that doesn't work: only the first script is run. Also, I've tried using a single file like this:

    import script1
    import script2
    
    python script1.py
    python script2.py

However, this doesn't work either.
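One possible approach, sketched here under the assumption that both scripts are independent programs, is to start each one as a child process with `subprocess.Popen` from a single Python file and then wait for all of them. `run_concurrently` and `run_scripts` are hypothetical helper names used only for this illustration.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, then wait for all of them to exit."""
    # Popen returns immediately, so the processes run side by side.
    processes = [subprocess.Popen(cmd) for cmd in commands]
    # wait() blocks until that process ends and returns its exit code.
    return [proc.wait() for proc in processes]


def run_scripts(paths):
    # sys.executable is the interpreter that is running this file.
    return run_concurrently([[sys.executable, path] for path in paths])
```

With this, `run_scripts(["script1.py", "script2.py"])` would launch both scripts and return their exit codes in the same order.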

Run multiple python scripts concurrently

I would like to install `scipy-0.15.1-cp33-none-win_amd64.whl` that I have saved to the local drive. I am using:

```lang-none
pip 6.0.8 from C:\Python27\Lib\site-packages
python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)]
```

When I run:

```lang-none
pip install scipy-0.15.1-cp33-none-win_amd64.whl
```

I get the following error:

> scipy-0.15.1-cp33-none-win_amd64.whl is not a supported wheel on this platform

What is the problem?
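A hint is in the file name itself: by the wheel naming convention, the last three dash-separated fields are the Python tag, the ABI tag and the platform tag, so `cp33` refers to CPython 3.3 while the interpreter shown above is 2.7. The sketch below only splits the name and builds the tag of the running interpreter for comparison; `wheel_tags` and `running_cpython_tag` are hypothetical helpers, and the `cpXY` form assumes CPython.

```python
import sys


def wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) from a wheel file name."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel file name: %r" % filename)
    # The tags are always the last three dash-separated fields.
    return tuple(parts[-3:])


def running_cpython_tag():
    # Assumes CPython; e.g. "cp27" for 2.7 or "cp33" for 3.3.
    return "cp%d%d" % sys.version_info[:2]
```

Comparing `wheel_tags("scipy-0.15.1-cp33-none-win_amd64.whl")[0]` with `running_cpython_tag()` shows whether the wheel was built for the interpreter that pip is running under.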


Error "filename.whl is not a supported wheel on this platform"

I am trying to understand the motivation behind using Python's library functions for executing OS-specific tasks such as creating files/directories, changing file attributes, etc., instead of just executing those commands via `os.system()` or `subprocess.call()`.

For example, why would I want to use `os.chmod` instead of doing `os.system("chmod...")`?

I understand that it is more "pythonic" to use Python's available library methods as much as possible instead of just executing shell commands directly. But is there any other motivation behind doing this from a functionality point of view?

I am only talking about executing simple one-line shell commands here. When we need more control over the execution of the task, I understand that using `subprocess` module makes more sense, for example.

Why use Python's os module methods instead of executing shell commands directly?

So, I've been trying to save a Jupyter notebook as PDF, but I just can't figure out how to do this. The first thing I tried was to download as PDF from the File menu, but that results in:

    nbconvert failed: PDF creating failed

the next thing I try is try to do the conversion from the Command Prompt like this

    $ ipython nbconvert --to latex --post PDF MyNotebook.ipynb 

but again, this results in an error message

    ImportError: No module named 'PDF'

and if I try

    $ ipython nbconvert --to latex MyNotebook.ipynb 

this results in 

    IPython.nbconvert.utils.pandoc.PandocMissing: Pandoc wasn't found:
    Please check that pandoc is installed

if I try to install pandoc (`pip install pandoc`), this gives me

    ImportError: No module named 'ConfigParser'

and this is where I get stuck because I just don't know what else to do.
Does anyone have an idea how to fix whatever is wrong?


IPython/Jupyter Problems saving notebook as PDF

I have used the Swagger UI to display my REST webservices and hosted it on a server.

However, this Swagger service can only be accessed on a particular server. If I want to work offline, does anybody know how I can create a static PDF using the Swagger UI and work with it? Additionally, a PDF is easy to share with people who don't have access to the server.

Many thanks!

Generate PDF from Swagger API documentation

I'm using the Python [requests lib][1] to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click `download` it already has a filename defined to save the PDF. How do I get this filename?

For example:

    import requests
    r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
    print(r.headers['content-type'])  # prints 'application/pdf'

I checked the `r.headers` for anything interesting, but there's no filename in there. I was actually hoping for something like `r.filename`.

Does anybody know how I can get the filename of a downloaded PDF file with requests library?
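When a server does supply a download name, it normally arrives in the `Content-Disposition` response header; when that header is absent, as seems to be the case here, the last segment of the URL path is a common fallback. The sketch below uses only the standard library to parse a headers mapping such as `r.headers`; `filename_from_headers` is a hypothetical helper name, and the fallback rule is an assumption rather than anything requests does for you.

```python
import os
from email.message import Message
from urllib.parse import unquote, urlsplit


def filename_from_headers(headers, url):
    """Prefer the Content-Disposition filename; fall back to the URL path."""
    disposition = headers.get("Content-Disposition")
    if disposition:
        # Let the stdlib parse the header parameters.
        msg = Message()
        msg["Content-Disposition"] = disposition
        name = msg.get_filename()
        if name:
            # basename() guards against path components in the header.
            return os.path.basename(name)
    # No usable header: take the last segment of the URL path.
    return os.path.basename(unquote(urlsplit(url).path))
```

With requests this would be called as `filename_from_headers(r.headers, r.url)`.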


  [1]: http://docs.python-requests.org/en/latest/

How to get pdf filename with Python requests?

Microsoft Windows 10 comes with a Microsoft Print To PDF printer which can print something to a PDF file.  It prompts for the filename to download.  

How can I programmatically control this from C# to not prompt for the PDF filename but save to a specific filename in some folder that I provide?

This is for batch processing of printing a lot of documents or other types of files to a PDF programmatically.

How to programmatically print to PDF file without prompting for filename in C# using the Microsoft Print To PDF printer that comes with Windows 10

I've been using `PDFTK` Server on `OSX pre 10.11` for over a year without any issues running commands on the command line.

After installing OSX 10.11 beta, I can no longer run any `PDFTK` Server commands on the command line. It does not throw any error; all the commands I try to execute just hang indefinitely.

I installed from a pkg I downloaded from the `PDFTK` Server website as always: 

https://www.pdflabs.com/tools/pdftk-server/

I've also tried installing from source with Homebrew. The installation works, but I get the same results: the commands just hang in the terminal:

[Homebrew tap][1]

**Two months ago I sent a complaint to Apple via the Feedback Assistant application that gets installed with the beta, without a response.**

I've been told regarding the Apple Feedback Assistant:

> You likely won't ever receive a response. Apple only replies through
> Feedback Assistant for major bugs of the operating system where they
> need additional information. It is not a way to obtain support, even
> less so for a third-party application.


<strike>**I've also reached out to [PDF Labs][2], the makers of the package, without response.**</strike>



**On [MacPorts][3] they're having an issue with the build on OS X 10.11. Does this mean that there is a compatibility issue between PDFtk and 10.11 at the software level?**

I've searched the web for a clue as to what might be causing the issue and how to fix it, but have found nothing tangible.

On OS X, I don't know how to go about figuring out if there is now a permission or path or even a Java issue after the upgrade.

Any help, either troubleshooting the root cause or offering a fix, is appreciated.


----------


**UPDATE 1:**

I heard back from [Sid Steward at PDF Labs][2]:

> We have been wrestling with this almost as long as the beta has been
> out. We are still working on it. There appears to be an
> incompatibility with one of the (non-Apple) libraries that pdftk uses
> and OS X 10.11. Presently I am installing yet another update to
> Apple's developer tools with the hope that it will solve the problem.
> I will update you with our progress.


----------

**UPDATE 2:** 

[Sid Steward at PDF Labs][2] again:

> It looks like there are two threads running under pdftk, and that they
> are deadlocked. That means that each thread is waiting for the other
> to finish. I'm not an expert here, but that's my impression. Here is a
> screenshot from Mac's Activity Monitor to illustrate:

[![enter image description here][4]][4]

> The above snapshot is from trying to run the pdftk binary currently on
> our site on OS X 10.11. The libgcj library noted above comes with
> pdftk, where the others are OS X libraries.
>
> As I say, I just installed Xcode 7.0.1, which was released yesterday
> on the App store. I will now attempt to use these tools to build
> pdftk.


----------

**UPDATE 3:**

[MacPorts][3] is working on the build issue with PDFtk; this is an [update on that thread][3] (note: this is unrelated to PDF Labs):


> This is due to the recompilation of libunwind in 10.11 using Apple
> Clang 7 producing new valid optimizations (according to Apple) that
> tickle an unknown bug in FSF boehm-gc.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66848

> Don't expect any fixes from Apple as they can't touch the GPLv3 code
> to look at the FSF boehm-gc problem (unless they used the gcc43
> package which should still be GPLv2). That this issue is triggered by
> the recompilation of libunwind is demonstrated by that fact that
> substituting the libunwind.dylib from 10.10 eliminates both the
> boehm-gc and gcj failures.
>
> Note that is was filed as radr://21372179, "the FSF boehm-gc library
> built on 10.10 fails to pass its tests on 10.11" but closed as being
> an FSF boehm-gc bug.

----------

**UPDATE 4:**

[MacPorts][3] found a way to solve the build issue, this is an [update on that thread][3]

> The attached Portfile.diff (when used with the proposed gcc5 update on
> [#49227][5] which fixes gcj) solves the build issue with pdftk.

----------

**UPDATE 5:** 

[Sid Steward at PDF Labs][2] has a successful build, his feedback:

> A fix for MacPorts gcc5 allowed me to build a working pdftk that
> merges PDFs on El Capitan. The fix was added to the ticket you had
> posted to:

[MacPorts][3]

> I will proceed to fully test this pdftk before packing it up into an
> installer. This process could take a couple days.


----------


  [1]: https://github.com/docmunch/homebrew-pdftk
  [2]: https://www.pdflabs.com/company/contact/
  [3]: https://trac.macports.org/ticket/48528
  [4]: http://i.stack.imgur.com/3fa38.png
  [5]: https://trac.macports.org/ticket/49227

| Content Type | Original Author | Original Content on Stackoverflow |
|---|---|---|
| Question | Alexander McFarlane | View Question on Stackoverflow |
| Solution 1 - Python | Ike | View Answer on Stackoverflow |
| Solution 2 - Python | Kurt Pfeifle | View Answer on Stackoverflow |
| Solution 3 - Python | Ricky McMaster | View Answer on Stackoverflow |
