This is actually on the [main page of nltk.org][1]:


    >>> import nltk
    >>> sentence = """At eight o'clock on Thursday morning
    ... Arthur didn't feel very good."""
    >>> tokens = nltk.word_tokenize(sentence)
    >>> tokens
    ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
    'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']


  [1]: http://nltk.org/

As @PavelAnossov answered, the canonical approach is to use the `word_tokenize` function in nltk:

    from nltk import word_tokenize
    sent = "This is my text, this is a nice way to input text."
    word_tokenize(sent)

----

**If your sentence is truly simple enough:**

Using the `string.punctuation` set, remove punctuation then split using the whitespace delimiter:

    import string
    x = "This is my text, this is a nice way to input text."
    # Drop every punctuation character, then split on single spaces
    y = "".join([i for i in x if i not in string.punctuation]).split(" ")
    print(y)
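For input that falls between these two cases, a regular expression can do the splitting with only the standard library. The following is a minimal sketch, not NLTK's algorithm: the helper name `simple_tokenize` and its pattern are illustrative assumptions, and unlike `word_tokenize` it keeps contractions such as "didn't" as one token.

```python
import re

def simple_tokenize(text):
    """Return the word tokens of text, dropping punctuation.

    Hypothetical helper: a word is a run of ASCII letters/digits,
    optionally joined by internal apostrophes ("didn't", "o'clock").
    """
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

print(simple_tokenize("This is my text, this is a nice way to input text."))
```

Because the pattern only matches ASCII letters and digits, text in other scripts would need a different character class.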




I am new to Git and have a fairly large project that I want to push to a remote repo (Repo B) on Github. The original project was on Github as well, but in a different repo (Repo A). I have to make some changes to files from Repo A before I can set the project up on Repo B. I have set up the remotes, ssh keys etc., and I run into an issue when pushing the codebase to Repo B.


I get the following error all the time:

    $ git push <remote_repo_name> master
    Enter passphrase for key '/c/ssh/.ssh/id_rsa':
    Counting objects: 146106, done.
    Delta compression using up to 4 threads.
    Compressing objects: 100% (35519/35519), done.
    fatal: pack exceeds maximum allowed size00 GiB | 154 KiB/s
    fatal: sha1 file '<stdout>' write error: Invalid arguments
    error: failed to push some refs to 'git@github.com:<repo>.git

I changed the following settings in my local gitconfig

    git config pack.packSizeLimit 1g
    git config pack.windowMemory 1g

... and ran git gc (which I see reorganized the packs so that each pack stayed within the packsize of 1GB). This did not work and I get the error seen above.

I tried to lower the size of each pack as well ....

    git config pack.packSizeLimit 500m
    git config pack.windowMemory 500m

... and ran git gc (which I see reorganized the packs so that each pack stayed within the packsize of 500MB). This did not work either and I ran into the same error.

I am not sure what Github's default packsize limits are (if any). The account is a micro account, if that matters.

Github remote push pack size exceeded

I have a Maven web project in my repo.  

I am a Maven beginner, but I understand that a plugin normally has to be configured before its plugin-specific commands can be run.

**Facts:**

I have a sonar server running on my local machine at port 9000.

I have not added any sonar specific plugin in my POM.xml

**Reference:**

http://www.sonarsource.org/we-had-a-dream-mvn-sonarsonar/

**Observation:**

But still when I run `mvn sonar:sonar` in my project from command line it works fine.

The fact of the matter is that **I have NOT configured the sonar plugin in my POM.xml. So where is Maven picking up the "sonar:sonar" goal/command from, and how does it understand it?**

**Question / curiosity:**

I don't want working knowledge of sonar itself. I want to know why `mvn sonar:sonar` works without a sonar plugin configured in my pom.xml.

WHY and how?

Why does the Maven command "mvn sonar:sonar" work without any plugin configuration in my "pom.xml"?

I am using nltk, and I want to create my own custom texts just like the default ones in nltk.book. However, so far I have only got as far as a method like this:

    my_text = ['This', 'is', 'my', 'text']

I'd like to find a way to input my "text" as:

    my_text = "This is my text, this is a nice way to input text."

Which method, Python's or NLTK's, allows me to do this? And more importantly, how can I discard punctuation symbols?

How do I tokenize a string sentence in NLTK?

How can I save all cookies in Python's Selenium WebDriver to a .txt file, and then load them later?

The documentation doesn't say much of anything about the getCookies function.


How to save and load cookies using Python + Selenium WebDriver

I have a script that reads in a csv file with very large fields:

    # example from http://docs.python.org/3.3/library/csv.html?highlight=csv%20dictreader#examples
    import csv
    with open('some.csv', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)

However, this throws the following error on some csv files:

    _csv.Error: field larger than field limit (131072)
    
How can I analyze csv files with huge fields? Skipping the lines with huge fields is not an option as the data needs to be analyzed in subsequent steps.
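One commonly used approach is to raise the parser's limit with `csv.field_size_limit` before reading. The sketch below is a hedged example rather than a definitive fix: the helper name `raise_csv_field_limit` is hypothetical, and the halving loop is there because `sys.maxsize` may not fit in the underlying C long on some platforms, in which case the call raises `OverflowError`.

```python
import csv
import io
import sys

def raise_csv_field_limit():
    """Set the csv field size limit as high as the platform accepts."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit in a C long here; try a smaller one.
            limit //= 2

raise_csv_field_limit()

# Demonstration with an in-memory file holding one 200,000-character field.
big_file = io.StringIO("id,payload\n1," + "x" * 200_000 + "\n")
for row in csv.reader(big_file):
    print(row[0], len(row[1]))
```

The limit is a module-wide setting, so it only has to be raised once before any reader is used.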

_csv.Error: field larger than field limit (131072)

I need to make a candlestick chart (something like this) using some stock data. For this I want to use the function [matplotlib.finance.candlestick()](https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/finance.py). To this function I need to supply quotes and "*an Axes instance to plot to*". I created some sample quotes as follows:

    quotes = [(1, 5, 6, 7, 4), (2, 6, 9, 9, 6), (3, 9, 8, 10, 8), (4, 8, 8, 9, 8), (5, 8, 11, 13, 7)]

I now also need an Axes instance, though, and this is where I am a bit lost. I have created plots before using matplotlib.pyplot. I think I now need to do something with [matplotlib.axes](http://matplotlib.org/api/axes_api.html), but I am unsure what exactly.

Could anybody help me out a little bit here? All tips are welcome!


  [1]: https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/finance.py
  [2]: http://matplotlib.org/api/axes_api.html

How to get a matplotlib Axes instance to plot to?

I have an image and I want to extract a region from it. I have coordinates of left upper corner and right lower corner of this region. In gray scale I do it like this:

    # cv2.imread returns channels in BGR order, so convert from BGR
    I = cv2.imread("lena.png")
    I = cv2.cvtColor(I, cv2.COLOR_BGR2GRAY)
    region = I[248:280,245:288]
    tools.show_1_image_pylab(region)

I can't figure out how to do it in color. I thought of extracting each of the R, G, and B channels, slicing the region from each one, and merging them back together, but there has got to be a shorter way.



Extracting a region from an image using slicing in Python, OpenCV

I'm checking to see if a directory exists, but I noticed I'm using `os.path.exists` instead of `os.path.isdir`. Both work just fine, but I'm curious as to what the advantages are of using `isdir` instead of `exists`.
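The difference shows up when the path names a regular file: `os.path.exists` is true for any existing filesystem entry, while `os.path.isdir` is true only for directories. A small sketch using a temporary directory (the file name `notes.txt` is just an illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "notes.txt")
    with open(file_path, "w") as handle:
        handle.write("hello")

    # A directory satisfies both checks.
    dir_checks = (os.path.exists(tmp_dir), os.path.isdir(tmp_dir))
    # A regular file exists, but it is not a directory.
    file_checks = (os.path.exists(file_path), os.path.isdir(file_path))
    # A missing path fails both checks.
    missing_path = os.path.join(tmp_dir, "missing")
    missing_checks = (os.path.exists(missing_path), os.path.isdir(missing_path))

print(dir_checks, file_checks, missing_checks)
```

So `isdir` is the stricter check when the code goes on to treat the path as a directory.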

pros and cons between os.path.exists vs os.path.isdir

From https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity , it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are there any ways to calculate cosine similarity between 2 strings?


    s1 = "This is a foo bar sentence ."
    s2 = "This sentence is similar to a foo bar sentence ."
    s3 = "What is this string ? Totally not related to the other two lines ."
    
    cosine_sim(s1, s2) # Should give high cosine similarity
    cosine_sim(s1, s3) # Shouldn't give high cosine similarity value
    cosine_sim(s2, s3) # Shouldn't give high cosine similarity value
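A standard-library-only version of `cosine_sim` could look like the sketch below. It is an assumption-laden illustration: it compares raw term counts (a bag of words) rather than tf-idf weights, and the helper `text_to_vector` with its `\w+` tokenization is a hypothetical choice.

```python
import math
import re
from collections import Counter

def text_to_vector(text):
    """Map a string to a bag-of-words Counter of lowercased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_sim(a, b):
    """Cosine similarity between the raw term-count vectors of two strings."""
    vec_a, vec_b = text_to_vector(a), text_to_vector(b)
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[word] * vec_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)

s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

print(cosine_sim(s1, s2), cosine_sim(s1, s3), cosine_sim(s2, s3))
```

With raw counts, common words such as "this" and "is" still contribute to the score, which is the effect tf-idf weighting is meant to reduce.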





Calculate cosine similarity given 2 sentence strings

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use `nltk.word_tokenize()`, I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also, `word_tokenize` doesn't work with multiple sentences: dots are added to the last word.

How to get rid of punctuation using NLTK tokenizer?

The title pretty much sums up the question. I've noticed that in some papers people refer to a BILOU encoding scheme for NER as opposed to the typical BIO tagging scheme (such as this paper by Ratinov and Roth in 2009: <http://cogcomp.cs.illinois.edu/page/publication_view/199>).

From working with the 2003 CoNLL data I know that
   
    B stands for 'beginning' (signifies beginning of an NE)
    I stands for 'inside' (signifies that the word is inside an NE)
    O stands for 'outside' (signifies that the word is just a regular word outside of an NE)

While I've been told that the letters in BILOU stand for

    B - 'beginning'
    I - 'inside'
    L - 'last'
    O - 'outside'
    U - 'unit'

I've also seen people reference other tags:

    E - 'end', use it concurrently with the 'last' tag
    S - 'singleton', use it concurrently with the 'unit' tag

I'm pretty new to the NER literature, but I've been unable to find something clearly explaining these tags. My question in particular is what the difference between the 'last' and 'end' tags is, and what the 'unit' tag stands for.
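To make the tag definitions above concrete, here is a minimal sketch that converts a BIO sequence to BILOU, assuming the usual reading in which 'L' marks the final token of a multi-token entity and 'U' marks a single-token entity. The function name `bio_to_bilou` and the example sentence are illustrative assumptions, not taken from any particular library or paper.

```python
def bio_to_bilou(tags):
    """Convert a list of BIO tags (e.g. 'B-PER', 'I-PER', 'O') to BILOU."""
    bilou = []
    for index, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[index + 1] if index + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            bilou.append(("I-" if continues else "L-") + label)
    return bilou

tokens = ["John", "Smith", "visited", "Paris", "."]
bio = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(list(zip(tokens, bio_to_bilou(bio))))
```

Under the same assumption, a scheme written with 'E' and 'S' would be the same conversion with 'L' renamed to 'E' and 'U' renamed to 'S'.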

What do the BILOU tags mean in Named Entity Recognition?

Natural Language Processing (NLP), especially for English, has evolved to the stage where stemming would become an archaic technology if "perfect" lemmatizers existed. This is because stemmers change the surface form of a word/token into meaningless stems.

Then again, the definition of the "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization. E.g. https://stackoverflow.com/questions/14489309/convert-words-between-verb-noun-adjective-forms.

**Stemmers** 

    [in]: having
    [out]: hav

**Lemmatizers**

    [in]: having
    [out]: have

 - So the question is: are English stemmers of any use at all today, given that we have a plethora of lemmatization tools for English?
   
 - If not, then how should we move on to build robust lemmatizers that
   can take on `nounify`, `verbify`, `adjectify` and `adverbify`
   preprocesses?
   
 - How could the lemmatization task be easily scaled to other languages
   that have similar morphological structures as English?

Stemmers vs Lemmatizers

I am trying to process user-entered text by removing stopwords using the nltk toolkit, but stopword removal also removes words like 'and', 'or', and 'not'. I want these words to remain after the stopword removal process, as they are operators that are required for later processing the text as a query. I don't know which words can be operators in a text query, and I also want to remove unnecessary words from my text.
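One way to keep operator words is to subtract them from the stopword set before filtering. The sketch below uses a tiny hand-written stopword set as a stand-in for NLTK's English list so that it runs without downloading any corpus; `STOPWORDS`, `OPERATORS`, and `remove_stopwords` are all illustrative assumptions.

```python
# Stand-in for a real stopword list; illustrative only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in",
             "for", "i", "and", "or", "not"}
# Words treated as query operators and therefore kept.
OPERATORS = {"and", "or", "not"}

def remove_stopwords(text, stopwords=STOPWORDS, keep=OPERATORS):
    """Drop stopwords from text but keep the words listed in keep."""
    effective = set(stopwords) - set(keep)
    return [word for word in text.lower().split() if word not in effective]

print(remove_stopwords("cats and dogs or not the birds"))
```

Which words count as operators depends on the query syntax being targeted, so the `keep` set is left as a parameter.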

Stopword removal with NLTK

In my JavaEE application, I'm using the Atom-based [Google Sites API](https://developers.google.com/google-apps/sites/) to retrieve content from a non-public Google Site. In essence, we're using the Google Site as a lightweight CMS, and from within the application I use the API to retrieve the site contents to feed my online help system. I've had this setup for a while and it's working without a hitch.

The issue
---------

In my application, I need to add full-text search functionality to the online help system. I knew this feature request would come along at some point, so when deciding on Google Sites to host my content, I checked whether the Sites API supports full-text search. [It does](https://developers.google.com/google-apps/sites/docs/1.0/developers_guide_protocol#ContentFeedQueriesFullText). For example, the following URL will search the entire site `my-site` for pages containing the keyword `user`.

    https://sites.google.com/feeds/content/my.doma.in/my-site?q=user

This works, and gives me the expected result pages. But it does so **only for content written in Western languages**, or, more specifically, languages in which tokens/words are separated by whitespace and punctuation. When I run a similar search on my Japanese content, searching for the keyword `ユーザー`:

    https://sites.google.com/feeds/content/my.doma.in/my-site?q=%E3%83%A6%E3%83%BC%E3%82%B6%E3%83%BC

I will only get result pages in which the search term appears as a bare string, i.e. delimited by either white-space or punctuation. Since Japanese is a language written in [scriptio continua](http://en.wikipedia.org/wiki/Scriptio_continua), this is not sufficient. Pages that contain, for example:

> ご自身の**ユーザー**基本情報の確認

will not show up in the results. So it seems that the search index that is used behind the scenes is created based on "Western" lexical rules, and that Japanese content is not correctly tokenized. However, when I search for the same keyword from the Google Site's *Search this site* field, I do get the correct results. I conclude that **a correctly tokenized index exists, but it seems to be impossible to use it for an API-based search**.

What I've tried so far
----------------------

To remedy this situation, these are the avenues that I've explored so far:

 - I've tried looking for language settings in Google Sites itself. There's a general UI language setting which was already set to Japanese and has no impact on the API query results. There are no per-page or per-template language settings to force the indexer/tokenizer's hand.
 - I've tried quoting the search string with double quotes (`"ユーザー"`).
 - I've tried including wildcards (`*ユーザー*`).
 - I've tried using additional language parameters to the URL that are common in other Google APIs: `lang`, `hl` (interface language), `rl` (results language), ...
 - I've tried creating a Google [Custom Search Engine](https://www.google.jp/cse/), but it seems impossible to get it to work on a non-public Google Site.

So...
-----

I'm quickly running out of ideas here. In a worst case scenario, I will end up having to retrieve, tokenize, and index all of the content myself and make it searchable that way. Since this will require a substantial effort, I would like to know if anyone has encountered the same issue and has found an acceptable workaround or solution.

----------

Update 1
--------

I have yet to find an elegant solution for this issue, so I raised a defect on the Google Apps APIs issue tracker: https://code.google.com/a/google.com/p/apps-api-issues/issues/detail?id=3780

Update 2
--------

After some going back and forth, Google's engineers have acknowledged that the problem indeed exists as described, and have *"filed the issue internally"*. The defect ticket has been stuck in **triaged** state ever since. If you, like me, are interested in seeing this issue resolved, please take a moment to star/vote for it on [Google's issue tracker](https://code.google.com/a/google.com/p/apps-api-issues/issues/detail?id=3780).

Google Sites API full-text search does not work for non-Western languages

From something like this:

    print(get_indentation_level())

        print(get_indentation_level())

            print(get_indentation_level())

I would like to get something like this:

    1
    2
    3
Can the code read itself in this way?

All I want is the output from the more nested parts of the code to be more nested. In the same way that this makes code easier to read, it would make the output easier to read. 

Of course I could implement this manually, using e.g. `.format()`, but what I had in mind was a custom print function which would `print(i*' ' + string)` where `i` is the indentation level. This would be a quick way to make readable output on my terminal.

Is there a better way to do this which avoids painstaking manual formatting?
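One alternative that avoids manual formatting is to track the nesting level explicitly instead of reading it from the source code. The sketch below swaps in that different technique: a hypothetical `IndentedPrinter` class whose `nested()` context manager raises the level inside a `with` block. It does not inspect the code's own indentation; the class and method names are assumptions made for illustration.

```python
import contextlib

class IndentedPrinter:
    """Print lines prefixed according to the current nesting level."""

    def __init__(self, width=4):
        self.width = width
        self.level = 0
        self.lines = []  # kept so the output can be inspected later

    @contextlib.contextmanager
    def nested(self):
        """Increase the indentation level inside a with-block."""
        self.level += 1
        try:
            yield self
        finally:
            self.level -= 1

    def print(self, text):
        line = " " * (self.width * self.level) + str(text)
        self.lines.append(line)
        print(line)

out = IndentedPrinter()
out.print("level 0")
with out.nested():
    out.print("level 1")
    with out.nested():
        out.print("level 2")
out.print("back to level 0")
```

The output nesting then follows the `with` blocks rather than the literal indentation of the source lines.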

Can a line of Python code know its indentation nesting level?

How do I find a list with all possible pos tags used by the Natural Language Toolkit (nltk)?

What are all possible pos tags of NLTK?

I wanted to use wordnet lemmatizer in python and I have learnt that the default pos tag is NOUN and that it does not output the correct lemma for a verb, unless the pos tag is explicitly specified as VERB.

My question is: what is the best way to perform the above lemmatization accurately?


I did the POS tagging using `nltk.pos_tag`, and I am lost on how to convert the Treebank POS tags to wordnet-compatible POS tags. Please help.

    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer
    lmtzr = WordNetLemmatizer()
    # tokens: the list of word tokens produced earlier
    tagged = nltk.pos_tag(tokens)

I get output tags such as NN, JJ, VB, and RB. How do I change these to wordnet-compatible tags?

Also do I have to train `nltk.pos_tag()` with a tagged corpus or can I use it directly on my data to evaluate?
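A common way to bridge the two tag sets is a small mapping on the tag prefix. The following is a sketch under stated assumptions: `treebank_to_wordnet` is a hypothetical helper, it returns the single-letter WordNet POS codes ('a', 'v', 'n', 'r') directly so that it runs without NLTK installed, and it falls back to noun for every other tag. The returned letter is meant to be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    # Nouns and all remaining tags fall back to noun.
    return "n"

tagged = [("dogs", "NNS"), ("were", "VBD"), ("running", "VBG"), ("quickly", "RB")]
print([(word, treebank_to_wordnet(tag)) for word, tag in tagged])
```

Falling back to noun mirrors the lemmatizer's default POS mentioned above, so untagged or unusual tokens behave as they would without a tag.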


wordnet lemmatization and pos tagging in python

I have a difficult time using pip to install almost anything. I'm new to coding, so I thought maybe this is something I've been doing wrong, and I have opted for easy_install instead to get most of what I needed done, which has generally worked. However, now I'm trying to download the nltk library, and neither is getting the job done.

I tried entering

    sudo pip install nltk

but got the following response:

    /Library/Frameworks/Python.framework/Versions/2.7/bin/pip run on Sat May  4 00:15:38 2013
    Downloading/unpacking nltk
    
      Getting page https://pypi.python.org/simple/nltk/
      Could not fetch URL [need more reputation to post link]: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>
    
      Will skip URL [need more reputation to post link]/simple/nltk/ when looking for download links for nltk
    
      Getting page [need more reputation to post link]/simple/
      Could not fetch URL https://pypi.python. org/simple/: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>
    
      Will skip URL [need more reputation to post link] when looking for download links for nltk
    
      Cannot fetch index base URL [need more reputation to post link]
    
      URLs to search for versions for nltk:
      * [need more reputation to post link]
      Getting page [need more reputation to post link]
      Could not fetch URL [need more reputation to post link]: There was a problem confirming the ssl certificate: <urlopen error [Errno 1] _ssl.c:504: error:0D0890A1:asn1 encoding routines:ASN1_verify:unknown message digest algorithm>
    
      Will skip URL [need more reputation to post link] when looking for download links for nltk
    
      Could not find any downloads that satisfy the requirement nltk
    
    No distributions at all found for nltk
    
    Exception information:
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/basecommand.py", line 139, in main
        status = self.run(options, args)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/commands/install.py", line 266, in run
        requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/req.py", line 1026, in prepare_files
        url = finder.find_requirement(req_to_install, upgrade=self.upgrade)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg/pip/index.py", line 171, in find_requirement
        raise DistributionNotFound('No distributions at all found for %s' % req)
    DistributionNotFound: No distributions at all found for nltk
    
    --easy_install installed fragments of the library and the code ran into trouble very quickly upon trying to run it.

Any thoughts on this issue? I'd really appreciate some feedback on how I can either get pip working or get around the issue in the meantime.

pip issue installing almost any library

I'm looking for a way to split a text into n-grams.
Normally I would do something like:

    import nltk
    from nltk import bigrams
    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print string_bigrams
 
I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams?
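For arbitrary n, the n-grams of a token list can be built with plain Python by zipping the list against shifted copies of itself. This is a minimal sketch, not NLTK's implementation; the function name `ngrams` and the whitespace tokenization are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a sequence of tokens."""
    # zip stops at the shortest shifted copy, so only full n-grams are kept.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "I really like python, it's pretty awesome.".split()
print(ngrams(words, 4))
```

If the sequence has fewer than n tokens, the result is simply an empty list.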

Thanks!




| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | diegoaguilar | View Question on Stackoverflow |
| Solution 1 - Python | Pavel Anossov | View Answer on Stackoverflow |
| Solution 2 - Python | alvas | View Answer on Stackoverflow |
