If you want to first take the mean of each `['cluster', 'org']` combination and then take the mean of those values within each `cluster` group, you can use:

    In [59]: (df.groupby(['cluster', 'org'], as_index=False).mean()
                .groupby('cluster')['time'].mean())
    Out[59]:
    cluster
    1          15
    2          54
    3           6
    Name: time, dtype: int64

If you want the mean of `cluster` groups only, then you can use:

    In [58]: df.groupby(['cluster']).mean()
    Out[58]:
                  time
    cluster
    1        12.333333
    2        54.000000
    3         6.000000

You can also use `groupby` on `['cluster', 'org']` and then use `mean()`:

    In [57]: df.groupby(['cluster', 'org']).mean()
    Out[57]:
                   time
    cluster org
    1       a         7
            c        23
    2       d        74
            h        34
    3       w         6

I would simply do this, which follows your desired logic literally:

    df.groupby(['org']).mean().groupby(['cluster']).mean()
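For reference, here is a minimal self-contained sketch of the two-step approach, using the sample data from the question. The names `per_org` and `per_cluster` are only illustrative, and the `time` column is selected explicitly so that only the numeric column is averaged.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 1, 2, 3],
    "org": ["a", "a", "h", "c", "d", "w"],
    "time": [8, 6, 34, 23, 74, 6],
})

# Step 1: mean time for each (cluster, org) pair.
per_org = df.groupby(["cluster", "org"], as_index=False)["time"].mean()

# Step 2: mean of those per-org means within each cluster.
per_cluster = per_org.groupby("cluster")["time"].mean()

print(per_cluster)
```

For cluster 1 this averages the per-org means 7 and 23, which is the `((8+6)/2+23)/2` from the expected result rather than the plain mean of 8, 6 and 23.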

Say I have an `Observable`, like so:

    var one = someObservable.take(1);

    one.subscribe(function(){ /* do something */ });

Then, I have a second `Observable`:

    var two = someOtherObservable.take(1);

Now, I want to `subscribe()` to `two`, but I want to make sure that `one` has completed before the `two` subscriber is fired.

What kind of buffering method can I use on `two` to make the second one wait for the first one to be completed?

I suppose I am looking to pause `two` until `one` is complete.

How to make one Observable sequence wait for another to complete before emitting?

Is it possible to add breakpoints to Mocha tests using Visual Studio Code? 

Normally when debugging code, one needs to configure the launch.json, setting the program attribute to the Javascript file to execute. I am not sure how to do this for Mocha though.

Mocha breakpoints using Visual Studio Code

I have a dataframe like this:

    cluster  org      time
       1      a       8
       1      a       6
       2      h       34
       1      c       23
       2      d       74
       3      w       6 

I would like to calculate the average of time per org per cluster.

Expected result:

    cluster mean(time)
    1       15 ((8+6)/2+23)/2
    2       54   (74+34)/2
    3       6

 I do not know how to do it in Pandas, can anybody help?


Python Pandas : group by in group by and average?

I have a script that generates two-dimensional `numpy` arrays with `dtype=float` and shape on the order of `(1e3, 1e6)`.  Right now I'm using `np.save` and `np.load` to perform IO operations with the arrays.  However, these functions take several seconds for each array.  Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)?  I'm open to converting the arrays to another type before saving as long as the data are retained exactly.

Fastest save and load options for a numpy array

I am trying to get a grouped boxplot working using Seaborn as per the [example][1]

I can get the above example working, however the line:

    tips = sns.load_dataset("tips")

is not explained at all.  I have located the tips.csv file, but I can't seem to find adequate documentation on what `load_dataset` specifically does.  I tried to create my own csv and load this, but to no avail.  I also renamed the tips file and it still worked...

My question is thus:

Where is `load_dataset` actually looking for files? Can I actually use this for my own boxplots? 

EDIT: I managed to get my own boxplots working using my own `DataFrame`, but I am still wondering whether `load_dataset` is used for anything more than mysterious tutorial examples.

  [1]: http://stanford.edu/~mwaskom/software/seaborn/examples/grouped_boxplot.html

Seaborn load_dataset

I am new to Python, and I came across a code snippet somewhere. It's an implementation of counting sort.

The code is as below:

    from collections import defaultdict
    def sort_colors(A):
        ht = {}                        # a hash map
        ht = defaultdict(lambda:0, ht) # with default value 1
        for i in A:
             ht[i] += 1
        ret = []
        for k in [0, 1, 2]:
            ret.extend([k]*ht[k])
        return ret

The first two lines of the function are:

    ht = {}
    ht = defaultdict(lambda:0, ht)

I am not quite clear about this initialization. Could you help me figure it out? Also, can we just replace these two lines with the following?

    ht = defaultdict(int) # default value 0
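As a quick check of the proposed replacement, here is a small sketch comparing the two initializations. The variable names `a`, `b` and `sorted_colors` are only for illustration. Both factories return 0 for a missing key, so the `# with default value 1` comment in the snippet appears to be inaccurate.

```python
from collections import defaultdict

def sort_colors(values):
    # int() returns 0, so every missing key starts counting from 0.
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    result = []
    for k in [0, 1, 2]:
        result.extend([k] * counts[k])
    return result

# The original two-step form: a lambda factory plus an (empty) initial dict.
a = defaultdict(lambda: 0, {})
# The shorter form from the question.
b = defaultdict(int)

sorted_colors = sort_colors([2, 0, 1, 2, 0])
print(a["missing"], b["missing"], sorted_colors)
```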

defaultdict with default value 1?

I would like to fill missing values in one column with values from another column, using `fillna` method. 

(I read that looping through each row would be very bad practice and that it would be better to do everything in one go but I could not find out how to do it with `fillna`.)

Data before:

    Day  Cat1  Cat2
    1    cat   mouse
    2    dog   elephant
    3    cat   giraf
    4    NaN   ant

Data after:

    Day  Cat1  Cat2
    1    cat   mouse
    2    dog   elephant
    3    cat   giraf
    4    ant   ant
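One way this could be done, sketched on the sample data above: `fillna` accepts a Series as its fill value, and pandas aligns it with the column on the index. This is a sketch under the assumption that both columns share the same index, which is the case when they come from the same DataFrame.

```python
import numpy as np
import pandas as pd

# Sample data copied from the "Data before" table.
df = pd.DataFrame({
    "Day": [1, 2, 3, 4],
    "Cat1": ["cat", "dog", "cat", np.nan],
    "Cat2": ["mouse", "elephant", "giraf", "ant"],
})

# Each missing Cat1 entry takes the Cat2 value from the same row;
# non-missing entries are left untouched.
df["Cat1"] = df["Cat1"].fillna(df["Cat2"])

print(df)
```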

How to pass another entire column as argument to pandas fillna()

I have the following code using `asyncio` and `aiohttp` to make asynchronous HTTP requests.

    import sys
    import asyncio
    import aiohttp

    @asyncio.coroutine
    def get(url):
        try:
            print('GET %s' % url)
            resp = yield from aiohttp.request('GET', url)
        except Exception as e:
            raise Exception("%s has error '%s'" % (url, e))
        else:
            if resp.status >= 400:
                raise Exception("%s has error '%s: %s'" % (url, resp.status, resp.reason))

        return (yield from resp.text())

    @asyncio.coroutine
    def fill_data(run):
        url = 'http://www.google.com/%s' % run['name']
        run['data'] = yield from get(url)

    def get_runs():
        runs = [ {'name': 'one'}, {'name': 'two'} ]
        loop = asyncio.get_event_loop()
        task = asyncio.wait([fill_data(r) for r in runs])
        loop.run_until_complete(task)   
        return runs

    try:
        get_runs()
    except Exception as e:
        print(repr(e))
        sys.exit(1)

For some reason, exceptions raised inside the `get` function are not caught:

    Future/Task exception was never retrieved
    Traceback (most recent call last):
      File "site-packages/asyncio/tasks.py", line 236, in _step
        result = coro.send(value)
      File "mwe.py", line 25, in fill_data
        run['data'] = yield from get(url)
      File "mwe.py", line 17, in get
        raise Exception("%s has error '%s: %s'" % (url, resp.status, resp.reason))
    Exception: http://www.google.com/two has error '404: Not Found'

So, what is the correct way to handle exceptions raised by coroutines?
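A possible explanation, sketched below without any network access: `asyncio.wait` only returns sets of finished and pending tasks and leaves each exception stored on its task, whereas `asyncio.gather` re-raises the first exception in the caller. The sketch uses the newer `async`/`await` syntax and `asyncio.run`, and the `get` coroutine here is a hypothetical stand-in for the real HTTP call, not the `aiohttp` API.

```python
import asyncio

async def get(name):
    # Hypothetical stand-in for the HTTP request: fails for one input.
    await asyncio.sleep(0)
    if name == "two":
        raise RuntimeError("%s has error '404: Not Found'" % name)
    return "data for %s" % name

async def fill_data(run):
    run["data"] = await get(run["name"])

async def get_runs():
    runs = [{"name": "one"}, {"name": "two"}]
    # gather() propagates the first exception raised by any coroutine.
    await asyncio.gather(*(fill_data(r) for r in runs))
    return runs

def main():
    try:
        asyncio.run(get_runs())
    except RuntimeError as exc:
        return repr(exc)
    return None

caught = main()
print(caught)
```

If `asyncio.wait` is preferred, the exceptions can still be retrieved by calling `result()` or `exception()` on each task in the returned `done` set.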

Asynchronous exception handling in Python

I import a dataframe via `read_csv`, but for some reason can't extract the year or month from the series `df['date']`; trying that gives `AttributeError: 'Series' object has no attribute 'year'`:

    date	Count
    6/30/2010	525
    7/30/2010	136
    8/31/2010	125
    9/30/2010	84
    10/29/2010	4469

    df = pd.read_csv('sample_data.csv', parse_dates=True)

    df['date'] = pd.to_datetime(df['date'])

    df['year'] = df['date'].year
    df['month'] = df['date'].month

UPDATE:
When I try solutions that use `df['date'].dt` on my pandas version 0.14.1, I get `AttributeError: 'Series' object has no attribute 'dt'`:

    df = pd.read_csv('sample_data.csv',parse_dates=True)

    df['date'] = pd.to_datetime(df['date'])

    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month

Sorry for this question that seems repetitive - I expect the answer will make me feel like a bonehead... but I have not had any luck using answers to the similar questions on SO.

---

FOLLOWUP: I can't seem to update my pandas 0.14.1 to a newer release in my Anaconda environment; each of the attempts below generates an invalid syntax error. I'm using Python 3.4.1 64-bit.

    conda update pandas

    conda install pandas==0.15.2

    conda install -f pandas


Any ideas?

python pandas extract year from datetime: df['year'] = df['date'].year is not working

While working in Pandas in Python...

I'm working with a dataset that contains some missing values, and I'd like to return a dataframe which contains only those rows which have missing data.  Is there a nice way to do this?

(My current method to do this is an inefficient "look to see what index isn't in the dataframe without the missing values, then make a df out of those indices.")
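One approach that may avoid the index comparison is a boolean mask built from `isna()`. The sketch below uses a small made-up DataFrame (the column names `a` and `b` are hypothetical) and assumes a pandas version recent enough to provide `isna`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in rows 1 and 2.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "w"],
})

# isna() marks missing cells; any(axis=1) flags rows with at least one.
rows_with_missing = df[df.isna().any(axis=1)]

print(rows_with_missing)
```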

Python, Pandas : Return only those rows which have missing values

I'm attempting to read a CSV file into a DataFrame in pandas. When I try to do that, I get the following error:

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 55: invalid start byte

This is from code:

    import pandas as pd

    location = r"C:\Users\khtad\Documents\test.csv"

    df = pd.read_csv(location, header=0, quotechar='"')

This is on a Windows 7 Enterprise Service Pack 1 machine and it seems to apply to every CSV file I create. In this particular case the binary from location 55 is 00101001 and location 54 is 01110011, if that matters. 

Saving the file as UTF-8 with a text editor doesn't seem to help, either. Similarly, adding the parameter `encoding='utf-8'` doesn't work either; it returns the same error.

What is the most likely cause of this error and are there any workarounds other than abandoning the DataFrame construct for the moment and using the csv module to read in the CSV line-by-line?
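A plausible cause, given the Windows machine, is that the file is saved in Windows-1252 rather than UTF-8: byte 0x96 is an en dash in that code page but is not a valid start byte in UTF-8. The sketch below demonstrates this on a hypothetical byte string rather than the actual file, so treat the encoding as an assumption to verify.

```python
# Hypothetical cell content containing the offending byte 0x96.
raw = b"2010 \x96 2011"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x96 is a continuation byte, so it cannot start a UTF-8 character.
    utf8_ok = False

# Windows-1252 maps 0x96 to U+2013 (en dash).
text = raw.decode("cp1252")

print(utf8_ok, repr(text))
```

If that assumption holds for the real file, passing `encoding='cp1252'` to `pd.read_csv` would be the corresponding thing to try.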



Encoding Error in Panda read_csv

For the following dataframe:

    StationID  HoursAhead    BiasTemp  
    SS0279           0          10
    SS0279           1          20
    KEOPS            0          0
    KEOPS            1          5
    BB               0          5
    BB               1          5

I'd like to get something like:

    StationID  BiasTemp  
    SS0279     15
    KEOPS      2.5
    BB         5

I know I can script something like this to get the desired result:

    def transform_DF(old_df,col):
        list_stations = list(set(old_df['StationID'].values.tolist()))
        header = list(old_df.columns.values)
        header.remove(col)
        header_new = header
        new_df = pandas.DataFrame(columns = header_new)
        for i,station in enumerate(list_stations):
            general_results = old_df[(old_df['StationID'] == station)].describe()
            new_row = []
            for column in header_new:
                if column in ['StationID']:
                    new_row.append(station)
                    continue
                new_row.append(general_results[column]['mean'])
            new_df.loc[i] = new_row
        return new_df

But I wonder if there is something more straightforward in pandas.
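For comparison, a `groupby` followed by `mean` seems to cover this in one line. The sketch below rebuilds the sample table and uses `sort=False` only to keep the stations in their original order; the name `result` is illustrative.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "StationID": ["SS0279", "SS0279", "KEOPS", "KEOPS", "BB", "BB"],
    "HoursAhead": [0, 1, 0, 1, 0, 1],
    "BiasTemp": [10, 20, 0, 5, 5, 5],
})

# Mean BiasTemp per station; reset_index turns StationID back into a column.
result = df.groupby("StationID", sort=False)["BiasTemp"].mean().reset_index()

print(result)
```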

How to calculate mean values grouped on another column in Pandas

Given the below pandas DataFrame:

    In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00','2014-08-25 21:04:00',
                                                '2014-08-25 22:07:00','2014-08-25 22:09:00']))
              locations = ['HK', 'LDN', 'LDN', 'LDN']
              event = ['foo', 'bar', 'baz', 'qux']
              df = pd.DataFrame({'Location': locations,
                                 'Event': event}, index=times)
              df
    Out[115]:
                                   Event Location
              2014-08-25 21:00:00  foo	 HK
              2014-08-25 21:04:00  bar   LDN
              2014-08-25 22:07:00  baz   LDN
              2014-08-25 22:09:00  qux   LDN

I would like to resample the data to aggregate it hourly by count while grouping by location, to produce a data frame that looks like this:

    Out[115]:
                                   HK    LDN
              2014-08-25 21:00:00  1	 1
              2014-08-25 22:00:00  0     2

I've tried various combinations of `resample()` and `groupby()` but with no luck. How would I go about this?

Pandas: resample timeseries with groupby

What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:
```none
Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          Wheat   5312        Ha      10       20      30
2      Afghanistan  25          Maize   5312        Ha      10       20      30
4      Angola       15          Wheat   7312        Ha      30       40      50
4      Angola       25          Maize   7312        Ha      30       40      50
```
I want to groupby the column Country and Item_Code and only compute the sum of the rows falling under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:
```none
Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          C3      5312        Ha      20       40       60
4      Angola       25          C4      7312        Ha      60       80      100
```
Right now I am doing this:

    df.groupby('Country').sum()

However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the `sum()` operation and which ones to exclude?
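One option is to select the columns to aggregate before calling `sum()`. The sketch below uses a trimmed copy of the sample data and groups by `Country` only, since that is what the expected totals (20, 40, 60 for Afghanistan) appear to correspond to; `year_cols` and `totals` are illustrative names.

```python
import pandas as pd

# Trimmed copy of the sample data (non-numeric columns omitted).
df = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

# Only the listed columns are summed; Item_Code is left out entirely.
year_cols = ["Y1961", "Y1962", "Y1963"]
totals = df.groupby("Country")[year_cols].sum()

print(totals)
```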

Pandas sum by groupby, but exclude certain columns

Hello I have the following dataframe.   

        Group           Size
        
        Short          Small
        Short          Small
        Moderate       Medium
        Moderate       Small
        Tall           Large

I want to count how many times each distinct row appears in the dataframe.

        Group           Size      Time
        
        Short          Small        2
        Moderate       Medium       1 
        Moderate       Small        1
        Tall           Large        1
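A sketch of one way to get that table: group on both columns and take the size of each group. The `Time` column name follows the expected output above, and `sort=False` is used only to keep the rows in order of first appearance.

```python
import pandas as pd

# Sample data copied from the first table.
df = pd.DataFrame({
    "Group": ["Short", "Short", "Moderate", "Moderate", "Tall"],
    "Size": ["Small", "Small", "Medium", "Small", "Large"],
})

# size() counts the rows in each (Group, Size) combination.
counts = (
    df.groupby(["Group", "Size"], sort=False)
      .size()
      .reset_index(name="Time")
)

print(counts)
```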

Python: get a frequency count based on two columns (variables) in pandas dataframe some row appers

I currently have a pandas `Series` with dtype `Timestamp`, and I want to group it by date (and have many rows with different times in each group).

The seemingly obvious way of doing this would be something similar to

    grouped = s.groupby(lambda x: x.date())

However, pandas' `groupby` groups Series by its index. How can I make it group by value instead?
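One possibility is to pass a Series of keys to `groupby` instead of a function, since a function is applied to the index while a Series is used as the grouping values directly. The sketch below uses made-up timestamps and assumes a pandas version that provides the `.dt` accessor.

```python
import pandas as pd

# Hypothetical timestamps: two on one day, one on the next.
s = pd.Series(pd.to_datetime([
    "2014-08-25 21:00:00",
    "2014-08-25 21:04:00",
    "2014-08-26 22:07:00",
]))

# s.dt.date is the calendar date of each value, aligned with s on the index.
grouped = s.groupby(s.dt.date)
sizes = grouped.size()

print(sizes)
```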

How to group a Series by values in pandas?

I have upgraded my system and have installed MySQL 5.7.9 with PHP for a web application I am working on. I have a query that is dynamically created, and when run in older versions of MySQL it works fine.  Since upgrading to 5.7 I get this error:

> Expression #1 of SELECT list is not in GROUP BY clause and contains
> nonaggregated column 'support_desk.mod_users_groups.group_id' which is
> not functionally dependent on columns in GROUP BY clause; this is
> incompatible with sql_mode=only_full_group_by

Note the Manual page for Mysql 5.7 on the topic of [Server SQL Modes][1].

This is the query that is giving me trouble:

<!-- language: lang-sql -->

    SELECT mod_users_groups.group_id AS 'value',
           group_name AS 'text'
    FROM mod_users_groups
    LEFT JOIN mod_users_data ON mod_users_groups.group_id = mod_users_data.group_id
    WHERE  mod_users_groups.active = 1
      AND mod_users_groups.department_id = 1
      AND mod_users_groups.manage_work_orders = 1
      AND group_name != 'root'
      AND group_name != 'superuser'
    GROUP BY group_name
    HAVING COUNT(`user_id`) > 0
    ORDER BY group_name

I did some googling on the issue, but I don't understand `only_full_group_by` enough to figure out what I need to do to fix the query. Can I just turn off the `only_full_group_by` option, or is there something else I need to do?

Let me know if you need more information.

[1]: http://dev.mysql.com/doc/refman/5.7/en/sql-mode.html

Error related to only_full_group_by when executing a query in MySql

I'm very new to this stuff, and I'm trying to make an Express app:

    var express = require('express');
    var app = express();

    app.listen(3000, function(err) {
        if(err){
           console.log(err);
           } else {
           console.log("listen:3000");
        }
    });

    //something useful
    app.get('*', function(req, res) {
      res.status(200).send('ok')
    });

When I start the server with the command: 

    node server.js 

everything goes fine.

I see on the console
    
    listen:3000

and when I try

    curl http://localhost:3000

I see 'ok'.

When I try 

    telnet localhost

I see

    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'

but when I try 

    netstat -na | grep :3000

I see

    tcp  0  0 0.0.0.0:3000   0.0.0.0:*  LISTEN


The question is: why does it listen on all interfaces instead of only localhost?

The OS is Linux Mint 17 without any bells and whistles.


express app server . listen all interfaces instead of localhost only

In the [MNIST beginner tutorial][1], there is the statement 

    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

`tf.cast` basically changes the type of tensor the object is, but what is the difference between [`tf.reduce_mean`][2] and [`np.mean`][3]? 

Here is the doc on [`tf.reduce_mean`][2]:

> `reduce_mean(input_tensor, reduction_indices=None, keep_dims=False, name=None)`

> `input_tensor`: The tensor to reduce. Should have numeric type.

> `reduction_indices`: The dimensions to reduce. If `None` (the default), reduces all dimensions.

>     # 'x' is [[1., 1.],
>     #         [2., 2.]]
>     tf.reduce_mean(x) ==> 1.5
>     tf.reduce_mean(x, 0) ==> [1.5, 1.5]
>     tf.reduce_mean(x, 1) ==> [1.,  2.]

For a 1D vector, it looks like `np.mean == tf.reduce_mean`, but I don't understand what's happening in `tf.reduce_mean(x, 1) ==> [1.,  2.]`. `tf.reduce_mean(x, 0) ==> [1.5, 1.5]` kind of makes sense, since the mean of `[1, 2]` and `[1, 2]` is `[1.5, 1.5]`, but what's going on with `tf.reduce_mean(x, 1)`?
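The same three reductions can be reproduced with NumPy alone, which may make the axis argument easier to see. This is only a NumPy illustration of the quoted doc example, not a statement about how TensorFlow implements it; the variable names are illustrative.

```python
import numpy as np

# Same 'x' as in the quoted documentation.
x = np.array([[1.0, 1.0],
              [2.0, 2.0]])

# No axis: average over all four elements.
overall = np.mean(x)

# axis=0 collapses the rows: one mean per column, (1 + 2) / 2 each.
per_column = np.mean(x, axis=0)

# axis=1 collapses the columns: one mean per row, (1 + 1) / 2 and (2 + 2) / 2.
per_row = np.mean(x, axis=1)

print(overall, per_column, per_row)
```

So the second argument names the dimension that disappears: reducing along axis 1 averages within each row, which gives `[1., 2.]`.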

[1]: https://www.tensorflow.org/versions/master/tutorials/mnist/beginners/index.html
[2]: https://www.tensorflow.org/api_docs/python/tf/reduce_mean
[3]: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.mean.html


What is the difference between np.mean and tf.reduce_mean?

I compared the performance of the `mean` function of the `statistics` module with the simple `sum(l)/len(l)` method and found the `mean` function to be very slow for some reason. I used `timeit` with the two code snippets below to compare them. Does anyone know what causes the massive difference in execution speed? I'm using Python 3.5.

    from timeit import repeat
    print(min(repeat('mean(l)',
                     '''from random import randint; from statistics import mean; \
                     l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above executes in about 0.043 seconds on my machine.

    from timeit import repeat
    print(min(repeat('sum(l)/len(l)',
                     '''from random import randint; from statistics import mean; \
                     l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above executes in about 0.000565 seconds on my machine.
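One likely factor is that `statistics.mean` favours accuracy over speed: as far as I understand it, the module sums the values exactly instead of accumulating floating-point rounding error. The sketch below does not measure timing; it only shows a case where the two approaches give different answers, using made-up data.

```python
from statistics import mean

# The small values are lost when added to 1e30 in plain float arithmetic.
data = [1e30, 1.0, 3.0, -1e30]

# Naive approach: left-to-right float addition, then divide.
naive = sum(data) / len(data)

# statistics.mean keeps the small terms and divides the exact total.
exact = mean(data)

print(naive, exact)
```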

Why is statistics.mean() so slow?

I have two lists as follows.

    mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
    mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

I want to get the mean of the common elements in the two lists as follows.

    myoutput = [["chocolate", 0.5], ["egg", 0.45]]

My current code is as follows

    for item1 in mylist1:
        for item2 in mylist2:
            if item1[0] == item2[0]:
                 print(np.mean([item1[1], item2[1]]))

However, since there are two `for` loops (`O(n^2)` complexity) this is very inefficient for very long lists. I am wondering if there is more standard/efficient way of doing this in Python.
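One common way to avoid the nested loops is to turn the first list into a dictionary, so that each lookup is constant time on average and the whole pass is roughly linear. The sketch below assumes names are unique within each list; `lookup` is an illustrative name.

```python
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

# Map each name in the first list to its value.
lookup = dict(mylist1)

# Walk the second list once, keeping only names present in both lists.
myoutput = [
    [name, (lookup[name] + value) / 2]
    for name, value in mylist2
    if name in lookup
]

print(myoutput)
```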


Content Type	Original Author	Original Content on Stackoverflow
Question	UserYmY	View Question on Stackoverflow
Solution 1 - Python	Zero	View Answer on Stackoverflow
Solution 2 - Python	Vince Payandeh	View Answer on Stackoverflow
