Build the find tool
Mihalis Tsoukalos explains how the find command line utility is incredibly useful for quickly finding files and directories, then shows you how to code your own version of it.
This tutorial will teach you how to implement the basic functionality of the find command line utility in Python 3. It’s a very popular and handy Unix tool that helps you find files and directories wherever they are on your Linux machine.
This tutorial will implement a small part of find. However, once you understand the Python 3 code presented, you can add features on your own.
The basics of find
This section will illustrate the use of the find utility using easy to understand examples.
For instance, this example shows simple uses of find:

    $ find /home/mtsouk -name myFind.py
    /home/mtsouk/docs/myFind.py
    $ find . -name python
    ./python
    ./python/python
    $ find . -type d -name python
    ./python
    $ find . -type -name python
    find: Arguments to -type should contain only one letter
The first command searches all directories starting from /home/mtsouk, which is an absolute path because it starts from the root directory ('/'), in order to find one or more files or directories with the exact name of myFind.py. If there is a match, the path of the file will be printed on the screen. Otherwise, according to the Unix philosophy, find will print no output at all. If multiple matches are found, then all matches will be printed.
The second find command searches the directory tree that starts from the current directory for files or directories named “python”, whereas the third command searches for directories only. The last example presents an error message from find because the -type option is missing its argument (such as d for directories or f for files), so find tries to use -name as the type.
Please note that on Unix systems the single dot ('.') points to the current working directory whereas two dots ('..') point to the parent directory. This is pretty handy when you do not want to type full paths.
The find utility can do many more things, as it is a very powerful command, but we can’t cover them all here. The good thing is that once you have developed a Python 3 version, it is relatively easy to extend it in order to support more options. The most difficult thing is developing a working version in the first place.
Knowing what you are going to implement is extremely important, so always try to make clear what features your programs will have as soon as possible!
The image on the left shows a small part of the man page of the find utility.
__main__ and __name__
Although it is not absolutely necessary, it is considered good practice for a Python 3 script to begin its execution by testing the value of the __name__ variable. Look at the following code saved as b.py:

    if __name__ == '__main__':
        print('This is b.py executed as ', __name__)
    else:
        print('This is b.py executed from: ', __name__)

Testing __name__ tells you whether the file is being run directly or imported from another module. When a Python 3 script runs as a standalone program, the value of the __name__ variable is __main__; when the file is imported, the value is the name of the module. Python 3 sets this automatically. The following output showcases the previous code:

    $ cat a.py
    import b
    print("This is module a!")
    $ python3 b.py
    This is b.py executed as __main__
    $ python3 a.py
    This is b.py executed from: b
    This is module a!
    $ ls -l __pycache__/
    total 4
    -rw-r--r-- 1 mtsouk mtsouk 267 Oct 15 15:40 b.cpython-34.pyc
    $ file __pycache__/b.cpython-34.pyc
    __pycache__/b.cpython-34.pyc: python 3.4 byte-compiled

After using b.py as a module, you might find a new directory named __pycache__ in your current directory. The __pycache__ directory contains Python 3 bytecode, which is used for speeding up the execution of Python 3 programs. You can delete it if you want to, but Python 3 will generate it again the next time you use the b.py module, or any other module for that matter.
The file command is a Linux command line utility that can determine the type of a file.
So, the purpose of the __name__ variable is to let your code know whether the Python code runs as a standalone program or not in order to act appropriately. Due to the limited space we have, only the final version of the code will use this technique.
Take an os.walk()
The os.walk() function is extremely handy for recursively visiting and processing all files and directories in a directory tree, starting from a given root directory. It is very interesting to note that os.walk() needs just one parameter, the name of the directory you want to visit. Next, a for loop does the rest of the job by iterating over all the subdirectories and files of the root directory that was given as the parameter to os.walk(). The following code, saved as learnWalk.py, illustrates the use of os.walk():

    #!/usr/bin/env python3

    import os
    import sys

    if len(sys.argv) >= 2:
        directory = str(sys.argv[1])
    else:
        print('Not enough arguments!')
        sys.exit(0)

    for root, dirs, files in os.walk(directory):
        print('**', root)
        for file in files:
            pathname = os.path.join(root, file)
            if os.path.exists(pathname):
                print(pathname)

Executing learnWalk.py generates the following kind of output, snipped to save space:

    $ ./learnWalk.py ~/code
    ** /home/mtsouk/code
    /home/mtsouk/code/a.out
    ** /home/mtsouk/code/Haskell
    ...
    /home/mtsouk/code/C/sysProg/sparse.c
    /home/mtsouk/code/C/sysProg/filetype.c

As you can see, learnWalk.py visits everything starting from a root directory and prints any file or directory it finds in the process, which is simple yet very effective. The os.path.exists() function makes sure that a file actually exists before printing it; for example, it returns False for a symbolic link whose target is missing.
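To make the three values that os.walk() yields more concrete, here is a minimal sketch that builds a small throwaway tree in a temporary directory and records what each iteration produces. The helper names walk_summary() and build_demo_tree(), as well as the names sub, a.txt and b.txt, are made up for illustration and are not part of learnWalk.py.

```python
import os
import tempfile


def walk_summary(directory):
    """Return a list of (root, dirs, files) tuples, with root relative to directory."""
    summary = []
    for root, dirs, files in os.walk(directory):
        # Sorting dirs in place also fixes the order in which os.walk() descends,
        # so the result does not depend on filesystem ordering.
        dirs.sort()
        rel_root = os.path.relpath(root, directory)
        summary.append((rel_root, list(dirs), sorted(files)))
    return summary


def build_demo_tree(base):
    """Create base/a.txt and base/sub/b.txt (hypothetical names)."""
    os.mkdir(os.path.join(base, 'sub'))
    for relative in ('a.txt', os.path.join('sub', 'b.txt')):
        with open(os.path.join(base, relative), 'w') as handle:
            handle.write('demo\n')


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as base:
        build_demo_tree(base)
        for entry in walk_summary(base):
            print(entry)
```

Each tuple describes one visited directory: its path, the subdirectories it contains and the files it contains. The start directory comes first, followed by the sub directory.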
Please make sure that you completely understand this technique before continuing with the rest of the tutorial.
It is now time to present a first version of the Python 3 implementation of the find utility.
Finding Python 3
Now that you have all the necessary bits and pieces, it is time to combine them to create a first version of the find utility, which will be called firstFind.py. This version accepts two command line arguments: the first is the name of the directory that the search will start from and the second is the name of the file to search for.
The contents of firstFind.py are:

    #!/usr/bin/env python3

    import os
    import sys

    if len(sys.argv) >= 3:
        directory = str(sys.argv[1])
        filename = str(sys.argv[2])
    else:
        print('Not enough arguments!')
        sys.exit(0)

    for root, dirs, files in os.walk(directory):
        print('**', root)
        for file in files:
            pathname = os.path.join(root, file)
            if os.path.exists(pathname):
                if file == filename:
                    print(pathname)

Executing firstFind.py generates the following kind of output:

    $ ./firstFind.py . a.out
    ** .
    ./a.out
    ** ./Haskell
    ** ./python
    ** ./perl
    ** ./C
    ./C/a.out
    ** ./C/system
    ** ./C/example
    ** ./C/cUNL
    ** ./C/sysProg
    $ ./firstFind.py
    Not enough arguments!

The previous output is a very practical way to test that firstFind.py actually works. The entries that begin with ** are directories; this kind of output is used for making sure that firstFind.py visits all the desired directories, and the final version of our system tool will not generate it. The remaining lines show that firstFind.py can also find the desired files.
So far the firstFind.py script looks as if it is working as expected, so we can now continue with its development and implement the missing functionality a basic find command should have. The most important part of that missing functionality is the support for command line options, which can be very tricky.
The final form
The final version of find is called myFind.py – its main difference from firstFind.py is that myFind.py accepts two command line options (also known as switches). The -d option tells myFind.py to search for directories only, whereas the -f option tells myFind.py to search for files only. If you use both switches or none of them, myFind.py will search for both types!
The first command line argument must be the directory name and the second must be the name of the file or directory you want to search for. Therefore, the two switches come at the end of the command, in no particular order.
The code of myFind.py is as follows:

    #!/usr/bin/env python3

    import os
    import sys

    def find(directory, filename, dirOnly, fileOnly):
        if (dirOnly == 0 and fileOnly == 0):
            dirOnly = 1
            fileOnly = 1

        for root, dirs, files in os.walk(directory):
            if dirOnly == 1:
                if os.path.basename(os.path.normpath(root)) == filename:
                    print(root)
            for file in files:
                pathname = os.path.join(root, file)
                if os.path.exists(pathname):
                    if fileOnly == 1:
                        if file == filename:
                            print(pathname)

    def main():
        dirOnly = 0
        fileOnly = 0

        if len(sys.argv) == 3:
            directory = str(sys.argv[1])
            filename = str(sys.argv[2])
        elif len(sys.argv) == 4:
            directory = str(sys.argv[1])
            filename = str(sys.argv[2])
            option1 = str(sys.argv[3])
            if option1 == "-d":
                dirOnly = 1
            if option1 == "-f":
                fileOnly = 1
        elif len(sys.argv) >= 5:
            directory = str(sys.argv[1])
            filename = str(sys.argv[2])
            option1 = str(sys.argv[3])
            option2 = str(sys.argv[4])
            if (option1 == "-d" or option2 == "-d"):
                dirOnly = 1
            if (option1 == "-f" or option2 == "-f"):
                fileOnly = 1
        else:
            print('Usage: ', sys.argv[0], 'directory filename [-df]')
            sys.exit(0)

        # If the given path exists do your job
        if os.path.isdir(directory):
            find(directory, filename, dirOnly, fileOnly)
        else:
            print('Directory ', directory, 'does not exist!')

    if __name__ == '__main__':
        main()
    else:
        print("This is a standalone program not a module!")

As you can see, the __name__ variable is usually combined with a main() function; should you decide to use myFind.py as a module, the presence of the main() function allows you to use the capabilities of myFind.py from other programs.
A large part of main() deals with the command line options. Although there are modules, such as argparse in the standard library, that help you deal with arguments and switches using less code, this way is the easiest to understand. The core functionality of the program is implemented inside the find() function, which is more or less similar to the code in firstFind.py.
The find() function also makes sure that only the desired kind of files are displayed. The calls to the os.path.basename() and os.path.normpath() functions are needed in order to extract the last part of a path, which is the name of the directory, and compare it to the filename you want to search for.
This is necessary because a plain string comparison cannot match a name such as “code” against a full path such as /home/mtsouk/code.
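The effect of combining the two calls is easiest to see on a few sample paths. The following sketch uses a hypothetical helper, last_component(), and illustrative paths. It suggests why os.path.normpath() is applied first: a path that ends with a slash, which os.walk() can yield when the start directory is typed that way, has an empty base name.

```python
import os.path


def last_component(path):
    """Return the final component of path, ignoring any trailing slash."""
    return os.path.basename(os.path.normpath(path))


if __name__ == '__main__':
    for path in ('/home/mtsouk/code', '/home/mtsouk/code/', './python'):
        # Show the plain base name next to the normalised one.
        print(path, '->', repr(os.path.basename(path)), 'vs', repr(last_component(path)))
```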
Another tutorial in this series will talk about how to process command line arguments and options in Python 3 in more detail.
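As one illustration of the modules mentioned earlier that handle arguments and switches, the standard library's argparse can express a similar interface. This is only a sketch and not the code used by myFind.py: the parse_arguments() name is invented here, and unlike myFind.py, argparse accepts the switches anywhere on the command line and also in the combined form -df.

```python
import argparse


def parse_arguments(argv=None):
    """Parse 'directory filename [-d] [-f]'.

    Returns a (directory, filename, want_dirs, want_files) tuple.
    """
    parser = argparse.ArgumentParser(description='Search a directory tree by name.')
    parser.add_argument('directory', help='directory to start searching from')
    parser.add_argument('filename', help='exact name to look for')
    parser.add_argument('-d', dest='want_dirs', action='store_true',
                        help='report directories')
    parser.add_argument('-f', dest='want_files', action='store_true',
                        help='report files')
    args = parser.parse_args(argv)

    want_dirs, want_files = args.want_dirs, args.want_files
    # Same rule as myFind.py: no switch at all means "search for both".
    if not want_dirs and not want_files:
        want_dirs = want_files = True
    return args.directory, args.filename, want_dirs, want_files


if __name__ == '__main__':
    # An explicit argument list is used here so the demo does not depend on sys.argv.
    print(parse_arguments(['.', 'python', '-d']))
```

Passing argv=None makes argparse read sys.argv, so the same function could serve a real command line tool. It also generates the usage message and the error handling that main() writes by hand.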
Testing your code
Testing is very important for any form of coding, especially when your script deals with system files and directories. So, you will have to run some tests to verify that myFind.py works as expected. Without testing, no tool should be deployed on a production system, as it might compromise the system's security and stability.
Five tests are needed. The first tries to find a file that does not exist. The second looks for a file or directory that exists exactly once. The third looks for a filename that exists multiple times in the directory structure you are searching; because a directory cannot contain the same filename twice, the matches must be in different directories. The last two tests verify that myFind.py is able to differentiate between directories and files when used with the appropriate switches. The image at the top of the previous page shows myFind.py performing the tests we’ve just described.
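The checks described above can also be automated. The sketch below is one possible way to do that, under some assumptions: find_matches() is a hypothetical variant of the find() function that returns a sorted list of matches instead of printing them, and the test tree (the names docs, code, once.txt and notes.txt) is invented and created in a temporary directory.

```python
import os
import tempfile


def find_matches(directory, name, want_dirs=False, want_files=False):
    """Return the sorted paths under directory whose last component equals name."""
    # As in myFind.py, asking for neither kind means asking for both.
    if not want_dirs and not want_files:
        want_dirs = want_files = True
    matches = []
    for root, dirs, files in os.walk(directory):
        if want_dirs and os.path.basename(os.path.normpath(root)) == name:
            matches.append(root)
        if want_files and name in files:
            matches.append(os.path.join(root, name))
    return sorted(matches)


def build_test_tree(base):
    """Create a tree where 'docs' is a directory and, elsewhere, a file."""
    os.makedirs(os.path.join(base, 'docs'))
    os.makedirs(os.path.join(base, 'code'))
    for relative in ('once.txt',
                     os.path.join('docs', 'notes.txt'),
                     os.path.join('code', 'notes.txt'),
                     os.path.join('code', 'docs')):
        with open(os.path.join(base, relative), 'w') as handle:
            handle.write('test\n')


def run_tests():
    join = os.path.join
    with tempfile.TemporaryDirectory() as base:
        build_test_tree(base)
        # 1. A name that does not exist gives no matches.
        assert find_matches(base, 'missing') == []
        # 2. A name that exists once gives exactly one match.
        assert find_matches(base, 'once.txt') == [join(base, 'once.txt')]
        # 3. A name that exists in several directories gives all of them.
        assert find_matches(base, 'notes.txt') == [
            join(base, 'code', 'notes.txt'), join(base, 'docs', 'notes.txt')]
        # 4. Directories only.
        assert find_matches(base, 'docs', want_dirs=True) == [join(base, 'docs')]
        # 5. Files only.
        assert find_matches(base, 'docs', want_files=True) == [join(base, 'code', 'docs')]
    return True


if __name__ == '__main__':
    run_tests()
    print('All five checks passed')
```

Returning a list rather than printing is a design choice made purely for this sketch, because it lets the results be compared with plain assert statements.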
The good thing with testing is that you are also learning how to use the code while you are testing it. Running myFind.py without any command line arguments reveals the following helpful message:

    $ ./myFind.py
    Usage: ./myFind.py directory filename [-df]

os.walk() vs os.scandir()
If you are using Python version 3.5 or newer, you have access to the os.scandir() function, which lists a directory faster than the older os.listdir() approach because it returns file type information along with each name and so avoids most calls to os.stat(). Unlike os.walk(), os.scandir() does not descend into subdirectories on its own. The good news is that from Python 3.5 onwards os.walk() uses os.scandir() in its implementation, so you do not need to do any extra work to your code.
This final part of the tutorial will test the speed differences between os.walk() and os.scandir() using the relatively simple time command. The tests will use Python version 3.5.2, which uses os.scandir(), and Python version 2.7.10, which does not use os.scandir(). In order to make the tests as reliable as possible, you will need to search a big directory tree such as /usr or /var. This section will use a modified version of the learnWalk.py script saved as testSpeed.py, implemented in the following fashion:

    import os
    import sys

    if len(sys.argv) >= 2:
        directory = str(sys.argv[1])
    else:
        print('Not enough arguments!')
        sys.exit(0)

    total = 0
    for root, dirs, files in os.walk(directory):
        for file in files:
            total = total + 1

    print('Visited', total, 'files!')

The results show the timing of various executions of the two slightly different implementations, telling us that the Python 3.5 version is significantly faster!
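For comparison, here is a rough sketch of how the same count could be written directly on top of os.scandir() (Python 3.5 or newer). It is not part of the original timing tests. The function names count_files() and count_files_walk() are made up, and the sketch assumes you want the default behaviour of os.walk(): symbolic links to directories are not followed and unreadable directories are silently skipped.

```python
import os
import tempfile


def count_files(directory):
    """Count non-directory entries below directory using os.scandir()."""
    total = 0
    pending = [directory]
    while pending:
        current = pending.pop()
        try:
            entries = list(os.scandir(current))
        except OSError:
            # Unreadable directory: skip it, as os.walk() does by default.
            continue
        for entry in entries:
            try:
                # is_dir() can usually answer without an extra os.stat() call.
                is_dir = entry.is_dir()
            except OSError:
                is_dir = False
            if not is_dir:
                total += 1
            elif not entry.is_symlink():
                # Descend into real directories only, not into symbolic links.
                pending.append(entry.path)
    return total


def count_files_walk(directory):
    """The same count using os.walk(), as in testSpeed.py."""
    return sum(len(files) for _root, _dirs, files in os.walk(directory))


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as base:
        os.makedirs(os.path.join(base, 'a', 'b'))
        for relative in ('one', os.path.join('a', 'two'),
                         os.path.join('a', 'b', 'three')):
            with open(os.path.join(base, relative), 'w') as handle:
                handle.write('x')
        print(count_files(base), count_files_walk(base))
```

On Python 3.5 or newer the two functions should perform similarly, since os.walk() already relies on os.scandir(); the explicit version is mainly useful for seeing what os.walk() does on your behalf.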