How do I check an HDFS directory's size?


Hadoop Problem Overview


I know du -sh on common Linux filesystems. How do I do the same with HDFS?

Hadoop Solutions


Solution 1 - Hadoop

Prior to 0.20.203, and officially deprecated in 2.6.0:

hadoop fs -dus [directory]

Since 0.20.203 / 1.0.4, and still compatible through 2.6.0:

hdfs dfs -du [-s] [-h] URI [URI …]

You can also run hadoop fs -help for more info and specifics.

Solution 2 - Hadoop

hadoop fs -du -s -h /path/to/dir displays a directory's size in readable form.

Solution 3 - Hadoop

Extending Matt D's and others' answers: up to Apache Hadoop 3.0.0, the command is

> # hadoop fs -du [-s] [-h] [-v] [-x] URI [URI ...]

> It displays sizes of files and directories contained in the given directory or the length of a file in case it's just a file.

> ## Options:

> - The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, the calculation is done by going 1-level deep from the given path.
> - The -h option will format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).
> - The -v option will display the names of columns as a header line.
> - The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.

du returns three columns with the following format:
 +-------------------------------------------------------------------+ 
 | size  |  disk_space_consumed_with_all_replicas  |  full_path_name | 
 +-------------------------------------------------------------------+ 

Example command:

hadoop fs -du /user/hadoop/dir1 \
    /user/hadoop/file1 \
    hdfs://nn.example.com/user/hadoop/dir1

Exit Code: Returns 0 on success and -1 on error.

source: Apache doc
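
The three-column layout described above can be labeled with a small filter. This is only a sketch: label_du is a hypothetical helper name, the sample line uses made-up values (a replication factor of 3 is assumed), and it expects plain byte counts, so it should be fed output produced without -h.

```shell
# label_du: hypothetical helper that reads `hadoop fs -du` output on stdin
# and labels the three columns. Assumes the three-column layout and no -h.
label_du() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 3 {
    # $1 = logical size, $2 = size across all replicas; the rest is the path
    path = $3
    for (i = 4; i <= NF; i++) path = path " " $i
    printf "size=%s replicated=%s path=%s\n", $1, $2, path
  }'
}

# Sample line standing in for real command output (assumed values).
printf '1024  3072  /user/hadoop/dir1\n' | label_du
```

On a cluster the intended use would be something like hadoop fs -du /user/hadoop | label_du. The numeric check on the first column also skips the header line that -v prints.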

Solution 4 - Hadoop

This reports the size in GB:

hdfs dfs -du PATHTODIRECTORY | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
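
The one-liner above prints the second column as the path, which fits the two-column output of older releases. A variant that takes the last field as the path should cope with both the two-column and the three-column layout. This is a sketch under assumptions: du_to_gb is a hypothetical helper name, the sample lines are made up, and paths are assumed to contain no spaces.

```shell
# du_to_gb: hypothetical helper that reads plain `hdfs dfs -du` output
# (no -h) on stdin and prints each size in GB followed by the path.
du_to_gb() {
  awk '$1 ~ /^[0-9]+$/ {
    # The path is taken as the last field, so the two-column and the
    # three-column layouts are both handled (paths without spaces assumed).
    printf "%.2f GB\t%s\n", $1 / (1024 ^ 3), $NF
  }'
}

# Sample lines standing in for real output: one per layout.
printf '%s\n' '2147483648 /data/a' '1073741824 3221225472 /data/b' | du_to_gb
```

Intended use on a cluster: hdfs dfs -du PATHTODIRECTORY | du_to_gb. The ^ operator is used for exponentiation because it is the portable awk spelling.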

Solution 5 - Hadoop

When trying to calculate the total of a particular group of files within a directory the -s option does not work (in Hadoop 2.7.1). For example:

Directory structure:

some_dir
├abc.txt    
├count1.txt 
├count2.txt 
└def.txt    

Assume each file is 1 KB in size. You can summarize the entire directory with:

hdfs dfs -du -s some_dir
4096 some_dir

However, if I want the sum of all files containing "count" the command falls short.

hdfs dfs -du -s some_dir/count*
1024 some_dir/count1.txt
1024 some_dir/count2.txt

To get around this I usually pass the output through awk.

hdfs dfs -du some_dir/count* | awk '{ total+=$1 } END { print total }'
2048 
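
The same workaround can be wrapped in a reusable function. A minimal sketch, assuming plain byte counts in the first column; sum_du is a hypothetical helper name, and the sample lines mirror the example above.

```shell
# sum_du: hypothetical helper that sums the first column of
# `hdfs dfs -du` output read from stdin and prints the total in bytes.
sum_du() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { printf "%.0f\n", total }'
}

# Sample lines standing in for real output.
printf '%s\n' '1024 some_dir/count1.txt' '1024 some_dir/count2.txt' | sum_du
```

Intended use on a cluster: hdfs dfs -du 'some_dir/count*' | sum_du. Quoting the pattern keeps the local shell from expanding it against local files before HDFS sees it, and printing with %.0f avoids exponent notation for large totals in some awk implementations.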

Solution 6 - Hadoop

To get the size of a directory, use hdfs dfs -du -s -h /$yourDirectoryName. For a quick cluster-level storage report, use hdfs dfsadmin -report.

Solution 7 - Hadoop

The easiest way to get the folder size in a human readable format is

hdfs dfs -du -h /folderpath

Add -s to get a single total for the folder.

Solution 8 - Hadoop

hadoop version 2.3.33:

hadoop fs -dus  /path/to/dir  |   awk '{print $2/1024**3 " G"}' 


Solution 9 - Hadoop

Percentage of used space on the Hadoop cluster:
sudo -u hdfs hadoop fs -df

Space used under a specific folder:
sudo -u hdfs hadoop fs -du -h /user

Solution 10 - Hadoop

hdfs dfs -count <dir>

info from man page:

-count [-q] [-h] [-v] [-t [<storage type>]] [-u] <path> ... :
  Count the number of directories, files and bytes under the paths
  that match the specified file pattern.  The output columns are:
  DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  or, with the -q option:
  QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA
        DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
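
To pull just the byte count out of that output, the CONTENT_SIZE column can be selected with awk. A sketch under assumptions: count_bytes is a hypothetical helper name, the sample line is made up, and the four-column layout (no -q, no -h) is assumed.

```shell
# count_bytes: hypothetical helper that reads `hdfs dfs -count` output
# on stdin and prints CONTENT_SIZE and PATHNAME.
# Assumes the default four-column layout (no -q, no -h).
count_bytes() {
  awk '$3 ~ /^[0-9]+$/ && NF >= 4 { print $3, $4 }'
}

# Sample line standing in for real output: 3 dirs, 12 files, 4096 bytes.
printf '%s\n' '           3           12            4096 /user/hadoop/dir1' | count_bytes
```

Intended use on a cluster: hdfs dfs -count <dir> | count_bytes.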

Solution 11 - Hadoop

The command should be hadoop fs -du -s -h /dirPath

  • -du [-s] [-h] ... : Show the amount of space, in bytes, used by the files that match the specified file pattern.

  • -s : Rather than showing the size of each individual file that matches the
    pattern, shows the total (summary) size.

  • -h : Formats the sizes of files in a human-readable fashion rather than as a number of bytes (e.g. MB, GB, TB).

Note that, even without the -s option, this only shows size summaries one level deep into a directory.

The output has the form: size name (full path)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Cheng | View Question on Stackoverflow
Solution 1 - Hadoop | Matt D | View Answer on Stackoverflow
Solution 2 - Hadoop | Marius Soutier | View Answer on Stackoverflow
Solution 3 - Hadoop | mrsrinivas | View Answer on Stackoverflow
Solution 4 - Hadoop | dilshad | View Answer on Stackoverflow
Solution 5 - Hadoop | Grr | View Answer on Stackoverflow
Solution 6 - Hadoop | Harikrishnan Ck | View Answer on Stackoverflow
Solution 7 - Hadoop | Galuoises | View Answer on Stackoverflow
Solution 8 - Hadoop | LuciferJack | View Answer on Stackoverflow
Solution 9 - Hadoop | Oren Efron | View Answer on Stackoverflow
Solution 10 - Hadoop | J.Doe | View Answer on Stackoverflow
Solution 11 - Hadoop | vijayraj34 | View Answer on Stackoverflow