How do you search an amazon s3 bucket?

Amazon Web-Services, Amazon S3

Amazon Web-Services Problem Overview


I have a bucket with thousands of files in it. How can I search the bucket?

Amazon Web-Services Solutions


Solution 1 - Amazon Web-Services

Just a note to add on here: it's now 3 years later, yet this post is top in Google when you type in "How to search an S3 Bucket."

Perhaps you're looking for something more complex, but if you landed here trying to figure out how to simply find an object (file) by its title, it's crazy simple:

Open the bucket, select "none" on the right-hand side, and start typing the file name.

http://docs.aws.amazon.com/AmazonS3/latest/UG/ListingObjectsinaBucket.html

Solution 2 - Amazon Web-Services

Here's a short and ugly way to search file names using the AWS CLI (the `cut -c 32-` trims off the date, time, and size columns, leaving only the key):

aws s3 ls s3://your-bucket --recursive | grep your-search | cut -c 32-

Solution 3 - Amazon Web-Services

S3 doesn't have a native "search this bucket" operation, since the actual content is unknown. Also, because S3 is key/value based, there is no native way to access many nodes at once, à la more traditional datastores that offer a `SELECT * FROM ... WHERE ...` (in the SQL model).

What you will need to do is perform ListBucket to get a listing of objects in the bucket, and then iterate over every item, performing a custom operation that you implement - which is your searching.
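
For example, a minimal sketch of that approach in Python with boto3 (the bucket name and the match predicate are placeholders):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# ListBucket returns at most 1000 keys per call; the paginator
# follows the continuation tokens for you.
for page in paginator.paginate(Bucket="your-bucket"):
    for obj in page.get("Contents", []):
        # your "search" is whatever custom predicate you apply per item
        if "your-search" in obj["Key"]:
            print(obj["Key"], obj["Size"])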

Solution 4 - Amazon Web-Services

AWS released a new service to query S3 buckets with SQL: Amazon Athena (https://aws.amazon.com/athena/).
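
Athena queries a table defined over the data in your bucket and writes results to another S3 location. A rough boto3 sketch (the database, table, and output location are placeholder assumptions, and the table must already exist):

import time
import boto3

athena = boto3.client("athena")

# assumes a table "my_table" has already been defined over the bucket's data
query = athena.start_query_execution(
    QueryString="SELECT * FROM my_table WHERE name LIKE '%needle%'",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
qid = query["QueryExecutionId"]

# poll until the query finishes, then fetch the result rows
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)
for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print(row)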

Solution 5 - Amazon Web-Services

There are (at least) two different use cases which could be described as "search the bucket":

  1. Search for something inside every object stored in the bucket; this assumes a common format for all the objects in that bucket (say, text files), etc. For something like this, you're forced to do what Cody Caughlan just answered. The AWS S3 docs have example code showing how to do this with the AWS SDK for Java: Listing Keys Using the AWS SDK for Java, http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?ListingObjectKeysUsingJava.html (there you'll also find PHP and C# examples).

  2. Search for something in the object keys contained in that bucket; S3 does have partial support for this, in the form of allowing prefix exact matches plus collapsing matches after a delimiter. This is explained in more detail in the AWS S3 Developer Guide, http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?ListingKeysHierarchy.html. This allows you, for example, to implement "folders" by using object keys like the following (see the sketch after this list):

    folder/subfolder/file.txt
    If you follow this convention, most of the S3 GUIs (such as the AWS Console) will show you a folder view of your bucket.
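
As a sketch of the second case, boto3 exposes the prefix and delimiter parameters directly (bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")

# exact prefix match, collapsing everything after the "/" delimiter;
# CommonPrefixes then behaves like a list of sub-"folders"
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="folder/", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print("folder:", cp["Prefix"])
for obj in resp.get("Contents", []):
    print("file:", obj["Key"])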

Solution 6 - Amazon Web-Services

There are multiple options, none of them being a simple "one shot" full-text solution:

  1. Key name pattern search: searching for keys starting with some string. If you design key names carefully, you may have a rather quick solution.

  2. Search metadata attached to keys: when posting a file to AWS S3, you may process the content, extract some meta information, and attach it in the form of custom headers to the key. This allows you to fetch key names and headers without needing to fetch the complete content. The search has to be done sequentially; there is no "SQL-like" search option for this. With large files this can save a lot of network traffic and time (a sketch follows this list).

  3. Store metadata in SimpleDB: as in the previous point, but storing the metadata in SimpleDB. Here you have SQL-like select statements. With large data sets you may hit SimpleDB limits, which can be overcome (partition metadata across multiple SimpleDB domains), but if you go really far, you may need to use another type of metadata database.

  4. Sequential full-text search of the content: processing all the keys one by one. Very slow if you have too many keys to process.
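
For option 2, a minimal sketch of attaching and later reading custom metadata with boto3 (all names are placeholders):

import boto3

s3 = boto3.client("s3")

# attach custom metadata at upload time (stored as x-amz-meta-* headers)
with open("report.pdf", "rb") as f:
    s3.put_object(
        Bucket="your-bucket",
        Key="docs/report.pdf",
        Body=f,
        Metadata={"author": "alice", "topic": "s3-search"},
    )

# later: read the headers without fetching the content
head = s3.head_object(Bucket="your-bucket", Key="docs/report.pdf")
if head["Metadata"].get("topic") == "s3-search":
    print("match:", head["ContentLength"], "bytes")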

We are storing 1440 versions of a file a day (one per minute) for a couple of years, using a versioned bucket; it is easily possible. But getting some older version takes time, as one has to go sequentially version by version. Sometimes I use a simple CSV index of the records, showing publication time plus version id; having this, I can jump to an older version rather quickly.
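
With a version id at hand (for example from such a CSV index), jumping to an older version is a single call; a boto3 sketch (bucket, key, and version id are placeholders):

import boto3

s3 = boto3.client("s3")

# list the versions of a single key (newest first, paginated like ListBucket)
for v in s3.list_object_versions(Bucket="your-bucket", Prefix="data/feed.csv").get("Versions", []):
    print(v["VersionId"], v["LastModified"])

# fetch a specific older version directly by its id
obj = s3.get_object(Bucket="your-bucket", Key="data/feed.csv", VersionId="SOME_VERSION_ID")
print(obj["Body"].read()[:100])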

As you can see, AWS S3 is not on its own designed for full-text searches; it is a simple storage service.

Solution 7 - Amazon Web-Services

If you're on Windows and have no time to find a nice grep alternative, a quick and dirty way would be:

aws s3 ls s3://your-bucket/folder/ --recursive > myfile.txt

and then do a quick search in myfile.txt.

The "folder" bit is optional.

P.S. if you don't have the AWS CLI installed, here's a one-liner using the Chocolatey package manager:

choco install awscli

P.P.S. If you don't have the Chocolatey package manager - get it! Your life on Windows will get 10x better. (I'm not affiliated with Chocolatey in any way, but hey, it's a must-have, really).

Solution 8 - Amazon Web-Services

Try this command:

aws s3api list-objects --bucket your-bucket --prefix sub-dir-path --output text --query 'Contents[].{Key: Key}'

Then you can pipe this into grep to pull out specific file types (e.g. `| grep '\.csv$'`) and do whatever you want with them.

Solution 9 - Amazon Web-Services

Search by Prefix in S3 Console

You can filter by prefix directly in the AWS Console bucket view.


Copy wanted files using s3-dist-cp

When you have thousands or millions of files, another way to get at the wanted files is to copy them to another location using distributed copy. You run this on EMR in a Hadoop job. The cool thing about AWS is that they provide a custom S3 version, s3-dist-cp. It allows you to group wanted files using a regular expression in the groupBy field. You can use this, for example, in a custom step on EMR:

[
    {
        "ActionOnFailure": "CONTINUE",
        "Args": [
            "s3-dist-cp",
            "--s3Endpoint=s3.amazonaws.com",
            "--src=s3://mybucket/",
            "--dest=s3://mytarget-bucket/",
            "--groupBy=MY_PATTERN",
            "--targetSize=1000"
        ],
        "Jar": "command-runner.jar",
        "Name": "S3DistCp Step Aggregate Results",
        "Type": "CUSTOM_JAR"
    }
]

Solution 10 - Amazon Web-Services

I searched in the following way:

aws s3 ls s3://Bucket1/folder1/2019/ --recursive | grep filename.csv

This outputs the actual path where the file exists:

2019-04-05 01:18:35     111111 folder1/2019/03/20/filename.csv

Solution 11 - Amazon Web-Services

Use Amazon Athena to query the S3 bucket. Alternatively, load the data into Amazon Elasticsearch. Hope this helps.

Solution 12 - Amazon Web-Services

Another option is to mirror the S3 bucket on your web server and traverse it locally. The trick is that the local files are empty and only used as a skeleton. Alternatively, the local files could hold useful metadata that you would normally need to get from S3 (e.g. file size, mimetype, author, timestamp, uuid). When you provide a URL to download the file, search locally but provide a link to the S3 address.

Local file traversal is easy, and this approach to S3 management is language agnostic. Local file traversal also avoids maintaining and querying a database of files, and avoids the delays of making a series of remote API calls to authenticate and get the bucket contents.

You could allow users to upload files directly to your server via FTP or HTTP, and then transfer a batch of new and updated files to Amazon at off-peak times by just recursing over the directories for files of any size. On completion of a file transfer to Amazon, replace the web server file with an empty one of the same name. If a local file has any file size, serve it directly, because it is awaiting batch transfer.
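
A sketch of the search side of this setup (the paths and URL format are assumptions, not part of the original answer):

import os

LOCAL_ROOT = "/var/www/s3-mirror"  # local skeleton of the bucket
BUCKET_URL = "https://your-bucket.s3.amazonaws.com"

def search(term):
    # traverse the local skeleton instead of calling the S3 API
    for dirpath, _, filenames in os.walk(LOCAL_ROOT):
        for name in filenames:
            if term in name:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, LOCAL_ROOT)
                if os.path.getsize(path) > 0:
                    yield path  # non-empty: still local, awaiting batch transfer
                else:
                    yield BUCKET_URL + "/" + rel  # empty skeleton: serve from S3

for hit in search("report"):
    print(hit)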

Solution 13 - Amazon Web-Services

Given that you are in AWS... I would think you would want to use their CloudSearch tools. Put the data you want to search in their service, and have it point to the S3 keys.

http://aws.amazon.com/cloudsearch/

Solution 14 - Amazon Web-Services

The way I did it: I have thousands of files in S3. I opened the properties panel of one file in the list. You can see the URI of that file, and I copy-pasted it into the browser - it was a text file and it rendered nicely. Now I replaced the uuid in the URL with the uuid that I had at hand and, boom, there the file is.

I wish AWS had a better way to search a file, but this worked for me.

Solution 15 - Amazon Web-Services

This is a slightly old thread, but maybe it will help someone who is still searching - I'm the one who searched for this for a year.

The solution may be "AWS Athena" (or the closely related Amazon S3 Select, which the linked post describes), where you can search over data like this:

'SELECT user_name FROM S3Object WHERE cast(age as int) > 20'

https://aws.amazon.com/blogs/developer/introducing-support-for-amazon-s3-select-in-the-aws-sdk-for-javascript/

Current pricing is $5 per TB of data scanned - so, for example, if your query scans a single 1 TB file three times, your cost is $15. But if there is only one column you need to read and the data is in a "converted columnar format", you'll pay one third of the price, i.e. $1.67/TB.
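
For the S3 Select variant, a boto3 sketch against a single CSV object (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# run the query server-side against one CSV object
resp = s3.select_object_content(
    Bucket="your-bucket",
    Key="users.csv",
    ExpressionType="SQL",
    Expression="SELECT s.user_name FROM S3Object s WHERE cast(s.age as int) > 20",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# the response is an event stream; records arrive in chunks
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")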

Solution 16 - Amazon Web-Services

Although it is not an AWS-native service, there is Mixpeek, which runs text extraction such as Tika, Tesseract, and ImageAI on your S3 files and then places them in a Lucene index to make them searchable.

Review the docs here

You integrate it as follows:

  1. Download the module: https://github.com/mixpeek/mixpeek-python

  2. Import the module and your API keys:

     from mixpeek import Mixpeek, S3
     from config import mixpeek_api_key, aws
    
  3. Instantiate the S3 class (which uses boto3 and requests):

     s3 = S3(
         aws_access_key_id=aws['aws_access_key_id'],
         aws_secret_access_key=aws['aws_secret_access_key'],
         region_name='us-east-2',
         mixpeek_api_key=mixpeek_api_key
     )
    
  4. Upload one or more existing S3 files:

         # upload all S3 files in bucket "demo"            
         s3.upload_all(bucket_name="demo")
    
         # upload one single file called "prescription.pdf" in bucket "demo"
         s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
    
  5. Now simply search using the Mixpeek module:

         # mixpeek api direct
         mix = Mixpeek(
             api_key=mixpeek_api_key
         )
         # search
         result = mix.search(query="Heartgard")
         print(result)
    
  6. Where result can be:

     [
         {
             "_id": "REDACTED",
             "api_key": "REDACTED",
             "highlights": [
                 {
                     "path": "document_str",
                     "score": 0.8759502172470093,
                     "texts": [
                         {
                             "type": "text",
                             "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\nā„ž  "
                         },
                         {
                             "type": "hit",
                             "value": "Heartgard"
                         },
                         {
                             "type": "text",
                             "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
                         }
                     ]
                 }
             ],
             "metadata": {
                 "date_inserted": "2021-10-07 03:19:23.632000",
                 "filename": "prescription.pdf"
             },
             "score": 0.13313256204128265
         }
     ] 
    

Then you can parse the results.

Solution 17 - Amazon Web-Services

Take a look at this documentation: http://docs.aws.amazon.com/AWSSDKforPHP/latest/index.html#m=amazons3/get_object_list

You can use a Perl-Compatible Regular Expression (PCRE) to filter the names.
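
If you're not on PHP, the same idea (list the keys, filter them by a regular expression client-side) looks roughly like this in Python with boto3 (bucket and pattern are placeholders):

import re
import boto3

s3 = boto3.client("s3")
pattern = re.compile(r"^logs/2019/.*\.csv$")  # any regular expression over the key

# list all keys and keep only those matching the pattern
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="your-bucket"):
    for obj in page.get("Contents", []):
        if pattern.search(obj["Key"]):
            print(obj["Key"])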

Solution 18 - Amazon Web-Services

I did something like the below to find patterns in my bucket:

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import scala.collection.JavaConverters._

def getListOfPrefixesFromS3(dataPath: String, prefix: String, delimiter: String, batchSize: Integer): List[String] = {
  val s3Client = new AmazonS3Client()
  val listObjectsRequest = new ListObjectsRequest()
    .withBucketName(dataPath)
    .withMaxKeys(batchSize)
    .withPrefix(prefix)
    .withDelimiter(delimiter)
  var objectListing: ObjectListing = null
  var res: List[String] = List()

  do {
    // each call returns one page of results plus a marker for the next page
    objectListing = s3Client.listObjects(listObjectsRequest)
    // getCommonPrefixes returns a java.util.List, so convert it to Scala
    res = res ++ objectListing.getCommonPrefixes.asScala
    listObjectsRequest.setMarker(objectListing.getNextMarker)
  } while (objectListing.isTruncated)
  res
}

For larger buckets this consumes too much time, since all the object summaries are returned by AWS, not only the ones that match the prefix and delimiter. I am looking for ways to improve performance, and so far I've only found that I should name the keys carefully and organise them into buckets properly.

Solution 19 - Amazon Web-Services

Status 2018-07: Amazon does have native SQL-like search for CSV and JSON files!

https://aws.amazon.com/blogs/developer/introducing-support-for-amazon-s3-select-in-the-aws-sdk-for-javascript/

Solution 20 - Amazon Web-Services

I faced the same problem. Searching in S3 should be much easier than it currently is. That's why I implemented this open-source tool for searching in S3.

SSEARCH is a fully open-source S3 search tool. It has been implemented with performance as the critical factor, and according to the benchmarks it searches a bucket containing ~1000 files within seconds.

Installation is simple. You only need to download the docker-compose file and run it with:

docker-compose up

SSEARCH will start, and you can search for anything in any bucket you have.

Solution 21 - Amazon Web-Services

Fast-forward to 2020: using aws-okta for our 2FA, the following command, while slow as hell to iterate through all of the objects and folders in this particular bucket (270,000+), worked fine.

aws-okta exec dev -- aws s3 ls my-cool-bucket --recursive | grep needle-in-haystax.txt

Solution 22 - Amazon Web-Services

Not a technical answer, but I have built an application which allows for wildcard search: https://bucketsearch.net/

It will asynchronously index your bucket and then allow you to search the results.

It's free to use (donationware).

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: vinhboy (View Question on Stackoverflow)
Solution 1: rhonda bradley (View Answer on Stackoverflow)
Solution 2: Abe Voelker (View Answer on Stackoverflow)
Solution 3: Cody Caughlan (View Answer on Stackoverflow)
Solution 4: hellomichibye (View Answer on Stackoverflow)
Solution 5: Eduardo Pareja Tobes (View Answer on Stackoverflow)
Solution 6: Jan Vlcinsky (View Answer on Stackoverflow)
Solution 7: Alex from Jitbit (View Answer on Stackoverflow)
Solution 8: Robert Evans (View Answer on Stackoverflow)
Solution 9: H6. (View Answer on Stackoverflow)
Solution 10: Dheeraj (View Answer on Stackoverflow)
Solution 11: AskMe (View Answer on Stackoverflow)
Solution 12: Dylan Valade (View Answer on Stackoverflow)
Solution 13: Andrew Siemer (View Answer on Stackoverflow)
Solution 14: Rose (View Answer on Stackoverflow)
Solution 15: BGBRUNO (View Answer on Stackoverflow)
Solution 16: danywigglebutt (View Answer on Stackoverflow)
Solution 17: Ragnar (View Answer on Stackoverflow)
Solution 18: Raghvendra Singh (View Answer on Stackoverflow)
Solution 19: JSi (View Answer on Stackoverflow)
Solution 20: Arda Güçlü (View Answer on Stackoverflow)
Solution 21: jamescampbell (View Answer on Stackoverflow)
Solution 22: Jon M (View Answer on Stackoverflow)