Try calling `read_csv` with `encoding='latin1'`, `encoding='iso-8859-1'`, or `encoding='cp1252'` (these are some of the encodings commonly found on Windows).



This works on a Mac as well. You can use:

```python
df = pd.read_csv('Region_count.csv', encoding='latin1')
```
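
To see why these encodings help, it can be useful to look at the offending byte directly. The sketch below takes the byte `0x96` named in the `UnicodeDecodeError` for this question; the sample bytes themselves are made up for illustration.

```python
# Hypothetical sample: "2010-2012" with an en dash, as saved by a cp1252 tool.
raw = b"2010\x962012"

# 0x96 has the bit pattern of a UTF-8 continuation byte, so it cannot
# start a character and strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# cp1252 maps 0x96 to an en dash (U+2013); latin1 maps it to the control
# character U+0096. Both decode without error but give different text.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")

print(utf8_ok)            # False
print(ascii(as_cp1252))   # '2010\u20132012'
print(ascii(as_latin1))   # '2010\x962012'
```

Since `latin1` accepts every byte value, it never raises, but for files produced by Western-European Windows programs `cp1252` is often the closer match.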

I've just updated Android Studio and I can't sync my project anymore.

The event log reports:

    Gradle sync failed: /Applications/Android Studio.app/Contents/gradle/gradle-X.X.X/lib/plugins/gradle-diagnostics-X.X.X.jar (No such file or directory)

Android Studio - Gradle sync error on gradle-diagnostics-X.X.X.jar

The following code compiles (with Java 8):

    Integer i1 = 1000;
    int i2 = 1000;
    boolean compared = (i1 == i2);
   
But what does it do?

Unbox `i1`:

    boolean compared = (i1.intValue() == i2);

or box `i2`:

    boolean compared = (i1 == new Integer(i2));

So does it compare two `Integer` objects (by reference) or two `int` variables by value?

Note that for some numbers the reference comparison will yield the correct result because the `Integer` class maintains an internal cache of values from `-128` to `127` (see also the comment by TheLostMind). This is why I used `1000` in my example and why I specifically ask about the unboxing/boxing and not about the result of the comparison.

When using == for a primitive and a boxed value, is autoboxing done, or is unboxing done

I'm attempting to read a CSV file into a DataFrame in pandas. When I try to do that, I get the following error:

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 55: invalid start byte

This is from code:

    import pandas as pd

    location = r"C:\Users\khtad\Documents\test.csv"

    df = pd.read_csv(location, header=0, quotechar='"')

This is on a Windows 7 Enterprise Service Pack 1 machine and it seems to apply to every CSV file I create. In this particular case the binary from location 55 is 00101001 and location 54 is 01110011, if that matters.

Saving the file as UTF-8 with a text editor doesn't seem to help, either. Similarly, adding the parameter `encoding='utf-8'` doesn't work, either; it returns the same error.

What is the most likely cause of this error and are there any workarounds other than abandoning the DataFrame construct for the moment and using the csv module to read in the CSV line-by-line?



Encoding Error in Panda read_csv


I would like to use `df.to_csv` to write "filename" (with headers) if "filename" doesn't exist, otherwise to append to "filename" if it exists. If I simply use the command:

    df.to_csv('filename.csv', mode='a', header='column_names')

The write or append succeeds, but it seems like the header is written every time an append takes place.

How can I only add the header if the file doesn't exist, and append without header if the file does exist?



Pandas Write CSV - Append vs. Write
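
One common way to get this behaviour is to decide on the header from whether the file already exists. The following is a minimal sketch, not the only approach: the helper name `append_csv`, the temporary file path, and the choice of `index=False` are assumptions made for the demo.

```python
import os
import tempfile

import pandas as pd


def append_csv(df, path):
    """Append df to path, writing the header only when the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False)


# Demo in a temporary directory (the file name is hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename.csv")
    append_csv(pd.DataFrame({"a": [1], "b": [2]}), path)
    append_csv(pd.DataFrame({"a": [3], "b": [4]}), path)
    with open(path) as fh:
        lines = fh.read().splitlines()

print(lines)  # ['a,b', '1,2', '3,4']
```

The existence check and the write are two separate steps, so this sketch assumes only one process writes to the file at a time.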

I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created rather than writing the whole file locally, and then uploading the file at the end.

Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:

    import csv
    import io
    import boto
    from boto.s3.key import Key


    conn = boto.connect_s3()
    bucket = conn.get_bucket('dev-vs')
    k = Key(bucket)
    k.key = 'foo/foobar'

    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
    k.set_contents_from_stream(writer.writeheader())

I received this error: BotoClientError: s3 does not support chunked transfer

**UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:**

    conn = boto.connect_s3()
    bucket = conn.get_bucket('dev-vs')
    k = Key(bucket)
    k.key = 'foo/foobar'

    testDict = [{
        "fieldA": "8",
        "fieldB": None,
        "fieldC": "888888888888"},
        {
        "fieldA": "9",
        "fieldB": None,
        "fieldC": "99999999999"}]

    f = io.StringIO()
    fieldnames = ['fieldA', 'fieldB', 'fieldC']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    k.set_contents_from_string(f.getvalue())

    for row in testDict:
        writer.writerow(row)
        k.set_contents_from_string(f.getvalue())

    f.close()

This writes 3 lines to the file; however, I'm unable to release memory to write a big file. If I add:

    f.seek(0)
    f.truncate(0)

to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?

Can you upload to S3 using a stream rather than a local file?
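
Each `set_contents_from_string` call replaces the whole object, which is why clearing the buffer leaves only the last row. One way around this is to hand the buffer off in parts and clear it only after each hand-off. The sketch below shows just that buffering logic: `stream_csv` and `upload_part` are hypothetical names, with `upload_part` standing in for whatever part-upload call your S3 client provides. S3 multipart uploads require every part except the last to be at least 5 MiB, so the tiny `min_part_size` here is only for the demo.

```python
import csv
import io


def stream_csv(rows, fieldnames, upload_part, min_part_size):
    """Write rows as CSV, flushing the buffer to upload_part in chunks."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
        if buf.tell() >= min_part_size:
            upload_part(buf.getvalue())
            # Safe to clear: this chunk has already been handed off.
            buf.seek(0)
            buf.truncate(0)
    if buf.tell():
        upload_part(buf.getvalue())  # final, possibly short, part


test_rows = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
parts = []  # stand-in for S3: collect the parts in memory
stream_csv(test_rows, ["fieldA", "fieldB", "fieldC"], parts.append, min_part_size=20)
print(len(parts))
print("".join(parts))
```

Memory use stays bounded by the part size rather than the file size, because the buffer never holds more than one part.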

I am trying to read a CSV file using readr::read_csv in R. The CSV file that I am importing has about 150 columns; I am including only the first few columns for the example. I am looking to override the second column from the default type (which is date when I do read_csv) to character, or another date format.
    
    GIS Join Match Code	Data File Year	State Name	State Code	County Name	County   Code	Area Name	Persons: Total
    G0100010	2008-2012	Alabama	1	Autauga County	1	Autauga County, Alabama	54590
 
    df <- data.frame("GIS Join Match Code"="G0100010", "Data File" = "2008-2012", "State" = "Alabama", "County" = "Autauga County", "Population" = 54590)

The issue is that when I use readr::read_csv, it seems I may have to list all variables when overriding in col_types (see error below). That is, I would need to specify overrides for all 150 columns individually. Is there a way to override the col_type of just specific columns, or to pass a named list? In my case, it would be just overriding the column "Data File Year".

I understand that any omitted columns will be automatically parsed, which is fine for my analysis. I think it gets more complex because the column names in the file I downloaded have spaces in them (e.g., "Data File Year", "State Code").

    tempdata <- read_csv(df, col_types = "cc")
    Error: You have 135 column names, but 2 columns

The other option, I guess, if possible, is to just skip reading the second column altogether?


Override column types when importing data using readr::read_csv() when there are many columns

I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but I am not able to; it creates a folder instead.

I need a Scala function that takes parameters such as a path and a file name and writes that CSV file.

Write single CSV file using spark-csv

I am importing a CSV file like the one below, using `pandas.read_csv`:

    df = pd.read_csv(Input, delimiter=";")

Example of CSV file:

    10;01.02.2015 16:58;01.02.2015 16:58;-0.59;0.1;-4.39;NotApplicable;0.79;0.2
    11;01.02.2015 16:58;01.02.2015 16:58;-0.57;0.2;-2.87;NotApplicable;0.79;0.21

The problem is that when I later on in my code try to use these values I get this error: `TypeError: can't multiply sequence by non-int of type 'float'`

**The error is because the number I'm trying to use is not written with a dot (`.`) as a decimal separator but a comma (`,`)**. After manually changing the commas to dots my program works.

I can't change the format of my input, and thus have to replace the commas in my DataFrame in order for my code to work, and I want Python to do this without the need of doing it manually. Do you have any suggestions?

Convert commas decimal separators to dots within a Dataframe
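
`read_csv` has a `decimal` parameter that may cover this case at load time, so no manual replacement is needed. The sketch below uses a made-up two-row sample with comma decimals (the sample shown in the question uses dots), and reads it from an in-memory string instead of a file.

```python
import io

import pandas as pd

# Hypothetical sample with comma decimal separators.
sample = (
    "10;01.02.2015 16:58;-0,59;0,1;NotApplicable\n"
    "11;01.02.2015 16:58;-0,57;0,2;NotApplicable\n"
)

# decimal="," tells the parser to treat the comma as the decimal mark.
df = pd.read_csv(io.StringIO(sample), delimiter=";", header=None, decimal=",")

print(df[2].dtype)  # float64
print(df[[2, 3]])
```

For a DataFrame that is already loaded with string columns, something like `df[col].str.replace(",", ".").astype(float)` is a possible alternative.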

For the following dataframe:

    StationID  HoursAhead    BiasTemp  
    SS0279           0          10
    SS0279           1          20
    KEOPS            0          0
    KEOPS            1          5
    BB               0          5
    BB               1          5

I'd like to get something like:

    StationID  BiasTemp  
    SS0279     15
    KEOPS      2.5
    BB         5

I know I can script something like this to get the desired result:

    def transform_DF(old_df,col):
        list_stations = list(set(old_df['StationID'].values.tolist()))
        header = list(old_df.columns.values)
        header.remove(col)
        header_new = header
        new_df = pandas.DataFrame(columns = header_new)
        for i,station in enumerate(list_stations):
            general_results = old_df[(old_df['StationID'] == station)].describe()
            new_row = []
            for column in header_new:
                if column in ['StationID']: 
                    new_row.append(station)
                    continue
                new_row.append(general_results[column]['mean'])
            new_df.loc[i] = new_row
        return new_df

But I wonder if there is something more straightforward in pandas.

How to calculate mean values grouped on another column in Pandas
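
A `groupby` followed by `mean` is the usual pandas idiom for this kind of aggregation. The sketch below rebuilds the sample table from the question; `sort=False` and `as_index=False` are optional choices made here so the output matches the desired layout.

```python
import pandas as pd

df = pd.DataFrame({
    "StationID": ["SS0279", "SS0279", "KEOPS", "KEOPS", "BB", "BB"],
    "HoursAhead": [0, 1, 0, 1, 0, 1],
    "BiasTemp": [10, 20, 0, 5, 5, 5],
})

# sort=False keeps stations in order of first appearance;
# as_index=False returns StationID as a regular column.
result = df.groupby("StationID", as_index=False, sort=False)["BiasTemp"].mean()
print(result)
```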

I'm looking for a simple way to sort a pandas dataframe by the absolute value of a particular column, but without actually changing the values within the dataframe. Something similar to `sorted(df, key=abs)`. So if I had a dataframe like:

        a	b
    0	1	-3
    1	2	5 
    2	3	-1
    3	4	2
    4	5	-9

The resultant sorted data when sorting on 'b' would look like:

        a	b
    2	3	-1
    3	4	2
    0	1	-3
    1	2	5 
    4	5	-9


Sorting by absolute value without changing the data
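
One approach that leaves the data untouched is to sort a temporary series of absolute values and reuse its index order. This is a minimal sketch using the sample frame from the question; the names `order` and `sorted_df` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [-3, 5, -1, 2, -9]})

# Sort the absolute values, then use the resulting index order to
# reorder the original rows; df itself is left unchanged.
order = df["b"].abs().sort_values().index
sorted_df = df.reindex(order)

print(sorted_df)
```

Newer pandas releases (1.1 and later) also accept a `key` argument to `sort_values`, so `df.sort_values("b", key=abs)` should give the same ordering there.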

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

    import pandas as pd
    import numpy as np
    
    # Generate data.
    NROW = 10000
    NCOL = 100
    df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                      columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements for each column, like this:

    col0    9538
    col1    9505
    col2    9524

What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?


----------

Based upon the answers, `df.apply(lambda x: len(x.unique()))` is the fastest ([notebook](https://colab.research.google.com/drive/1xsrG3iUR5fkouRtBTWA2oO8YJWYqNk4G)).

```
%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop
%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop
%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop
%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
```

Finding count of distinct elements in DataFrame in each column

How do I take multiple lists and put them as different columns in a python dataframe? I tried [this solution][1] but had some trouble.

Attempt 1:

- Have three lists, and zip them together and use that `res = zip(lst1,lst2,lst3)`
- Yields just one column

Attempt 2:


    percentile_list = pd.DataFrame({'lst1Title' : [lst1],
                                    'lst2Title' : [lst2],
                                    'lst3Title' : [lst3] }, 
                                    columns=['lst1Title','lst1Title', 'lst1Title'])

- yields either one row by 3 columns (the way above) or if I transpose it is 3 rows and 1 column

How do I get a 100 row (length of each independent list) by 3 column (three lists) pandas dataframe? 


  [1]: https://stackoverflow.com/questions/29014618/read-lists-into-columns-of-pandas-dataframe

Take multiple lists into dataframe
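
Wrapping each list in another list (`[lst1]`) makes it a single cell, which would explain the one-row result. Passing the lists directly should give one row per element. The lists below are hypothetical stand-ins for the three 100-element lists in the question.

```python
import pandas as pd

# Hypothetical stand-ins for the three 100-element lists.
lst1 = list(range(100))
lst2 = [x * 2 for x in lst1]
lst3 = [x * 3 for x in lst1]

# Pass each list directly (not wrapped in another list) so that every
# element becomes its own row.
percentile_list = pd.DataFrame({
    "lst1Title": lst1,
    "lst2Title": lst2,
    "lst3Title": lst3,
})

# The zip-based variant: materialize the rows and name the columns.
by_rows = pd.DataFrame(
    list(zip(lst1, lst2, lst3)),
    columns=["lst1Title", "lst2Title", "lst3Title"],
)

print(percentile_list.shape)  # (100, 3)
```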

I'm looking for a way to do the equivalent to the SQL

    SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas SQL comparison doesn't have anything about `distinct`.

`.unique()` only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.  

Am I missing something obvious, or is there no way to do this?

How to "select distinct" across multiple data frame columns in pandas?
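
Selecting the columns of interest and calling `drop_duplicates` is a close pandas counterpart to `SELECT DISTINCT`. The data below is hypothetical; `col3` is included only to show that it plays no part in the selection.

```python
import pandas as pd

# Hypothetical data; col3 is ignored by the distinct selection.
df = pd.DataFrame({
    "col1": [1, 1, 2, 2],
    "col2": ["a", "a", "b", "c"],
    "col3": [10, 20, 30, 40],
})

# Roughly: SELECT DISTINCT col1, col2 FROM dataframe_table
distinct = df[["col1", "col2"]].drop_duplicates()

print(distinct)
```

The result keeps the index labels of the first occurrence of each combination; `reset_index(drop=True)` can renumber the rows if that matters.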

I'm using Laravel (a PHP framework) to write a service for mobile and have the data returned in `JSON` format. In the data result there are some fields encoded in `UTF-8`.

The following statement

    return JsonResponse::create($data); 

returns the error below

    InvalidArgumentException
    HELP
    Malformed UTF-8 characters, possibly incorrectly encoded

    Open: /var/www/html/vendor/symfony/http-foundation/Symfony/Component/HttpFoundation/JsonResponse.php
            } catch (\Exception $exception) {
                restore_error_handler();
     
                throw $exception;
            }
     
            if (JSON_ERROR_NONE !== json_last_error()) {
                throw new \InvalidArgumentException($this->transformJsonError());
            }


I've changed:

    return JsonResponse::create($data);

to

    return JsonResponse::create($data, 200, array('Content-Type' => 'application/json; charset=utf-8'));

but it still isn't working.

How can I fix it?

'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel

I keep getting UnicodeEncodeError when trying to print a 'Á' that I get from a website requested using Selenium in Python 3.4.

I already defined at the top of my .py file

`#  -*- coding: utf-8 -*-`

the def is something like this:

    from selenium import webdriver

    b = webdriver.Firefox()
    b.get('http://fisica.uniandes.edu.co/personal/profesores-de-planta')
    dataProf = b.find_elements_by_css_selector('td[width="508"]')
    for datos in dataProf:
        print(datos.text)

and the exception:

    Traceback (most recent call last):
      File "C:/Users/Andres/Desktop/scrap/scrap.py", line 444, in <module>
        dar_p_fisica()
      File "C:/Users/Andres/Desktop/scrap/scrap.py", line 390, in dar_p_fisica
        print(datos.text) #.encode().decode('ascii', 'ignore')
      File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position 173: character maps to <undefined>

thanks in advance

UnicodeEncodeError: 'charmap' codec can't encode character '\u2010': character maps to <undefined>
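
The traceback shows `print` encoding the text with cp1252, which has no mapping for U+2010 (HYPHEN). The sketch below reproduces that failure without Selenium and shows two possible workarounds; the sample string is made up.

```python
text = "profesores\u2010de\u2010planta"  # hypothetical text containing U+2010

# cp1252 has no mapping for U+2010, so strict encoding fails. This is
# the same failure print() hits on a cp1252 console.
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Option 1: substitute unencodable characters instead of raising.
replaced = text.encode("cp1252", errors="replace").decode("cp1252")

# Option 2: encode as UTF-8, which can represent every code point.
utf8_bytes = text.encode("utf-8")

print(strict_ok)   # False
print(replaced)    # profesores?de?planta
print(utf8_bytes)  # b'profesores\xe2\x80\x90de\xe2\x80\x90planta'
```

Setting the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python is another way to change what `print` uses, though whether the console then displays the characters correctly depends on the terminal.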

I recently switched to PHP 7 on my development server, which has worked just fine - until now.

Since I updated to `PHP 7.0.3-10+deb.sury.org~trusty+1` (earlier today), the `utf8_decode` and `utf8_encode` functions are no longer accessible. They were, however, in previous versions of PHP7. When called, a fatal error is raised.

I read that these functions are provided by the `mbstring` extension, which I checked with `var_dump(extension_loaded('mbstring'));` is loaded.

How can I get the above functions to work again?

utf8_(en|de)code removed from php7?

I tried to use UTF-8 and ran into trouble.

I have tried so many things; here are the results I have gotten:

* `????` instead of Asian characters. Even for European text, I got `Se?or` for `Señor`.
* Strange gibberish (Mojibake?) such as `SeÃ±or` or `æ–°æµªæ–°é—»` for `新浪新闻`.
* Black diamonds, such as `Se�or`.
* Finally, I got into a situation where the data was lost, or at least truncated: `Se` for `Señor`.
* Even when I got text to _look_ right, it did not _sort_ correctly.

What am I doing wrong? How can I fix the _code_? Can I recover the _data_, if so, how?


Trouble with UTF-8 characters; what I see is not what I stored
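
The first three symptoms in the list can be reproduced in a few lines, which may help identify which kind of mismatch is at play. This is a sketch of the general encoding mechanics only, not a diagnosis of any particular database or framework setup; the variable names are illustrative.

```python
original = "Se\u00f1or"  # "Señor"

# Question marks: the target encoding cannot represent the character.
question_marks = original.encode("ascii", errors="replace").decode("ascii")

# Mojibake: UTF-8 bytes misread as latin1 (each byte becomes one character).
mojibake = original.encode("utf-8").decode("latin1")

# Black diamond: latin1 bytes misread as UTF-8; U+FFFD marks the bad byte.
black_diamond = original.encode("latin1").decode("utf-8", errors="replace")

# Mojibake is often reversible by undoing the wrong decode step.
recovered = mojibake.encode("latin1").decode("utf-8")

print(question_marks)         # Se?or
print(ascii(mojibake))        # 'Se\xc3\xb1or', displayed as SeÃ±or
print(ascii(black_diamond))   # 'Se\ufffdor'
print(recovered == original)  # True
```

Question marks and black diamonds discard the original bytes, so only the mojibake case can be repaired after the fact.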

By default, when you redirect the output of a command to a file or pipe it into something else in PowerShell, the encoding is UTF-16, which isn't useful. I'm looking to change it to UTF-8.

It can be done on a case-by-case basis by replacing the `>foo.txt` syntax with `| out-file foo.txt -encoding utf8` but this is awkward to have to repeat every time.

The persistent way to set things in PowerShell is to put them in `\Users\me\Documents\WindowsPowerShell\profile.ps1`; I've verified that this file is indeed executed on startup.

It has been said that the output encoding can be set with `$PSDefaultParameterValues = @{'Out-File:Encoding' = 'utf8'}` but I've tried this and it had no effect.

https://blogs.msdn.microsoft.com/powershell/2006/12/11/outputencoding-to-the-rescue/ which talks about `$OutputEncoding` looks at first glance as though it should be relevant, but then it talks about output being encoded in ASCII, which is not what's actually happening.

How do you set PowerShell to use UTF-8?

| Content Type | Original Author | Original Content on Stackoverflow |
|---|---|---|
| Question | khtad | View Question on Stackoverflow |
| Solution 1 - Csv | maxymoo | View Answer on Stackoverflow |
| Solution 2 - Csv | sushmit | View Answer on Stackoverflow |
