Linux Format

Build the find tool

Mihalis Tsoukalos explains why the find command-line utility is so useful for quickly locating files and directories, then shows you how to code your own version of it.

This tutorial will teach you how to implement the basic functionality of the find command line utility in Python 3. It’s a very popular and handy Unix tool that helps you find files and directories wherever they are on your Linux machine.

This tutorial will implement a small part of find. However, once you understand the Python 3 code presented, you can add features on your own.

The basics of find

This section will illustrate the use of the find utility using easy-to-understand examples.

The following session shows some simple uses of find:
$ find /home/mtsouk -name myFind.py
/home/mtsouk/docs/myFind.py
$ find . -name python
./python
./python/python
$ find . -type d -name python
./python
$ find . -type -name python
find: Arguments to -type should contain only one letter

The first command searches all directories starting from /home/mtsouk, which is an absolute path because it starts from the root directory ('/'), in order to find one or more files or directories with the exact name of myFind.py. If there is a match, the path of the file will be printed on the screen. Otherwise, according to the Unix philosophy, find will print no output at all. If multiple matches are found, then all matches will be printed.

The second find command searches the directory tree that starts from the current directory for files or directories named “python”, whereas the third command searches for directories only. The last example presents an error message from find because the -type option is missing its argument.

Please note that on Unix systems the single dot ('.') refers to the current working directory, whereas two dots ('..') refer to the parent directory. This is pretty handy when you do not want to type full paths.
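As a small illustration, Python itself can show what the two entries resolve to. This is only a sketch using os.path.abspath() from the standard library; the variable names current and parent are ours, not part of the tutorial's scripts.

```python
import os

# '.' is the current working directory, '..' is its parent
current = os.path.abspath('.')
parent = os.path.abspath('..')

print('Current directory:', current)
print('Parent directory: ', parent)
```

The printed paths depend on where you run the snippet from, which is exactly the point: relative paths are interpreted against the current working directory.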

The find utility can do many more things, as it is a very powerful command, but we can’t cover them all here. The good thing is that once you have developed a Python 3 version, it is relatively easy to extend it in order to support more options. The most difficult thing is developing a working version in the first place.

Knowing what you are going to implement is extremely important, so always try to make clear what features your programs will have as soon as possible!

The image on the left shows a small part of the man page of the find utility.

__main__ and __name__

Although it is not strictly necessary, it is considered good practice for a Python 3 script to begin its execution by testing the value of the __name__ variable. Look at the following code saved as b.py:
if __name__ == '__main__':
    print('This is b.py executed as', __name__)
else:
    print('This is b.py executed from:', __name__)
Testing __name__ tells you whether the file is being run directly or imported by another module. When a Python 3 script runs as a standalone program, the value of the __name__ variable is __main__. This is automatically defined by Python 3. The following output showcases the previous code:

$ cat a.py
import b
print("This is module a!")
$ python3 b.py
This is b.py executed as __main__
$ python3 a.py
This is b.py executed from: b
This is module a!
$ ls -l __pycache__/
total 4
-rw-r--r-- 1 mtsouk mtsouk 267 Oct 15 15:40 b.cpython-34.pyc
$ file __pycache__/b.cpython-34.pyc
__pycache__/b.cpython-34.pyc: python 3.4 byte-compiled

After using b.py as a module, you might find a new directory named __pycache__ in your current directory. The __pycache__ directory contains Python 3 bytecode, which speeds up the loading of modules on later runs. You can delete it if you want to, but Python 3 will generate it again the next time you import the b.py module, or any other module for that matter.

The file command is a Linux command line utility that can determine the type of a file.

So, the purpose of the __name__ variable is to let your code know whether the Python code runs as a standalone program or not in order to act appropriately. Due to the limited space we have, only the final version of the code will use this technique.
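A common way to apply this test is to keep the reusable logic in functions and call a main() function only when the file is run directly. The following is a minimal sketch of that pattern; the greet() function is a hypothetical placeholder and does not appear in the tutorial's scripts.

```python
def greet(name):
    """Return a greeting; safe to import because it has no side effects."""
    return 'Hello, ' + name + '!'


def main():
    # Reached only when the file is executed directly, not when imported
    print(greet('world'))


if __name__ == '__main__':
    main()
```

Another script can then import the file and call greet() without triggering the output that main() produces.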

Take an os.walk()

The os.walk() function is extremely handy for recursively visiting and processing all files and directories in a directory tree, starting from a given root directory. Note that os.walk() needs just one required parameter, the name of the directory you want to visit. Next, a for loop does the rest of the job by iterating over all the subdirectories and files of the root directory that was given as the parameter to os.walk(). The following code, saved as learnWalk.py, illustrates the use of os.walk():
#!/usr/bin/env python3
import os
import sys

if len(sys.argv) >= 2:
    directory = str(sys.argv[1])
else:
    print('Not enough arguments!')
    # A non-zero exit status signals the error to the shell
    sys.exit(1)

for root, dirs, files in os.walk(directory):
    # Mark each directory that is visited
    print('**', root)
    for file in files:
        pathname = os.path.join(root, file)
        if os.path.exists(pathname):
            print(pathname)

Executing learnWalk.py generates the following kind of output, snipped to save space:
$ ./learnWalk.py ~/code
** /home/mtsouk/code
/home/mtsouk/code/a.out
** /home/mtsouk/code/Haskell
...
/home/mtsouk/code/C/sysProg/sparse.c
/home/mtsouk/code/C/sysProg/filetype.c

As you can see, learnWalk.py visits everything starting from a root directory and prints every directory and file it finds in the process, which is simple yet very effective. The os.path.exists() function makes sure that a file actually exists before printing it; a broken symbolic link, for example, is listed by os.walk() but fails this test.
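If you would rather collect the results than print them, the same loop can be wrapped in a function. This is a sketch under our own naming: list_tree() is a hypothetical helper, not part of learnWalk.py, and it skips the os.path.exists() check for brevity.

```python
import os


def list_tree(directory):
    """Return two sorted lists: every directory and every file below directory."""
    found_dirs = []
    found_files = []
    for root, dirs, files in os.walk(directory):
        # root is the directory currently being visited
        found_dirs.append(root)
        for name in files:
            found_files.append(os.path.join(root, name))
    return sorted(found_dirs), sorted(found_files)
```

Returning lists makes the traversal easy to check from other code, at the cost of holding every path in memory for very large trees.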

Please make sure that you completely understand this technique before continuing with the rest of the tutorial.

It is now time to present a first version of the Python 3 implementation of the find utility.

Finding Python 3

Now that you have all the necessary bits and pieces, it is time to combine them to create a first version of the find utility, which will be called firstFind.py. This version accepts two command line arguments: the first is the name of the directory that the search will start from and the second is the name of the file to search for.

The contents of firstFind.py are:
#!/usr/bin/env python3
import os
import sys

if len(sys.argv) >= 3:
    directory = str(sys.argv[1])
    filename = str(sys.argv[2])
else:
    print('Not enough arguments!')
    # A non-zero exit status signals the error to the shell
    sys.exit(1)

for root, dirs, files in os.walk(directory):
    # Mark each directory that is visited
    print('**', root)
    for file in files:
        pathname = os.path.join(root, file)
        if os.path.exists(pathname):
            # Print only the files whose name matches exactly
            if file == filename:
                print(pathname)

Executing firstFind.py generates the following kind of output:
$ ./firstFind.py . a.out
** .
./a.out
** ./Haskell
** ./python
** ./perl
** ./C
./C/a.out
** ./C/system
** ./C/example
** ./C/cUNL
** ./C/sysProg
$ ./firstFind.py
Not enough arguments!

The previous output is a very practical way to test that firstFind.py actually works. The entries that begin with ** are directories; this kind of output is used for making sure that firstFind.py visits all the desired directories. The final version of our system tool will not generate such output. firstFind.py can also find the desired files.

So far the firstFind.py script looks as if it is working as expected, so we can now continue with its development and implement the missing functionality a basic find command should have. The most important part of that missing functionality is the support for command line options, which can be very tricky.

The final form

The final version of find is called myFind.py. Its main difference from firstFind.py is that myFind.py accepts two command line options (also known as switches). The -d option tells myFind.py to search for directories only, whereas the -f option tells myFind.py to search for files only. If you use both switches or none of them, myFind.py will search for both types!

The first command line argument must be the directory name and the second must be the name of the file or directory you want to search for. Therefore, the two switches come at the end of the command, in no particular order.

The code of myFind.py is as follows:
#!/usr/bin/env python3
import os
import sys

def find(directory, filename, dirOnly, fileOnly):
    # With no switch given, search for both files and directories
    if dirOnly == 0 and fileOnly == 0:
        dirOnly = 1
        fileOnly = 1
    for root, dirs, files in os.walk(directory):
        if dirOnly == 1:
            # Compare only the last component of the directory path
            if os.path.basename(os.path.normpath(root)) == filename:
                print(root)
        for file in files:
            pathname = os.path.join(root, file)
            if os.path.exists(pathname):
                if fileOnly == 1:
                    if file == filename:
                        print(pathname)

def main():
    dirOnly = 0
    fileOnly = 0
    if len(sys.argv) == 3:
        directory = str(sys.argv[1])
        filename = str(sys.argv[2])
    elif len(sys.argv) == 4:
        directory = str(sys.argv[1])
        filename = str(sys.argv[2])
        option1 = str(sys.argv[3])
        if option1 == "-d":
            dirOnly = 1
        if option1 == "-f":
            fileOnly = 1
    elif len(sys.argv) >= 5:
        directory = str(sys.argv[1])
        filename = str(sys.argv[2])
        option1 = str(sys.argv[3])
        option2 = str(sys.argv[4])
        if option1 == "-d" or option2 == "-d":
            dirOnly = 1
        if option1 == "-f" or option2 == "-f":
            fileOnly = 1
    else:
        print('Usage:', sys.argv[0], 'directory filename [-df]')
        # A non-zero exit status signals the error to the shell
        sys.exit(1)
    # If the given path exists do your job
    if os.path.isdir(directory):
        find(directory, filename, dirOnly, fileOnly)
    else:
        print('Directory', directory, 'does not exist!')

if __name__ == '__main__':
    main()
else:
    print("This is a standalone program not a module!")

As you can see, the __name__ variable is usually combined with a main() function; should you decide to use myFind.py as a module, the presence of the main() function allows you to use the capabilities of myFind.py from other programs.

A large part of main() deals with the command line options. Although there are modules that help you deal with arguments and switches using less code, this way is the easiest to understand. The core functionality of the program is implemented inside the find() function, which is very similar to the code in firstFind.py.
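For comparison, here is a sketch of how the same command line could be handled with argparse from the standard library. It is not the author's code: parse_args() and the attribute names dir_only and file_only are our own choices, and the rule that neither switch means "search for both" is copied from the description of myFind.py.

```python
import argparse


def parse_args(argv):
    """Parse 'directory filename [-d] [-f]' from a list of argument strings."""
    parser = argparse.ArgumentParser(description='A tiny find-like tool')
    parser.add_argument('directory', help='where the search starts')
    parser.add_argument('filename', help='exact name to look for')
    parser.add_argument('-d', dest='dir_only', action='store_true',
                        help='match directories only')
    parser.add_argument('-f', dest='file_only', action='store_true',
                        help='match files only')
    args = parser.parse_args(argv)
    # As in myFind.py, neither switch (or both) means "search for both"
    if not args.dir_only and not args.file_only:
        args.dir_only = True
        args.file_only = True
    return args
```

Calling parse_args(sys.argv[1:]) would replace the chain of len(sys.argv) tests. argparse also generates the usage message and accepts the switches in any position, including the combined form -df.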

The find() function also makes sure that only entries of the desired kind are displayed. The calls to the os.path.basename() and os.path.normpath() functions are needed in order to extract the last part of a path, which is the name of the directory, and compare it to the name you are searching for.

This is necessary because myFind.py cannot directly match a string such as “code” against a full path such as /home/mtsouk/code.
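The following sketch shows what the two calls do on POSIX-style paths and why os.path.normpath() is applied first. The helper name last_component() is ours and is used only for illustration.

```python
import os


def last_component(path):
    """Return the final part of a path, ignoring any trailing slash."""
    return os.path.basename(os.path.normpath(path))


print(last_component('/home/mtsouk/code'))     # code
print(last_component('/home/mtsouk/code/'))    # code
# Without normpath(), a trailing slash leaves basename() with nothing to return
print(repr(os.path.basename('/home/mtsouk/code/')))   # ''
```

Normalising first means that a start directory typed with a trailing slash is still compared by its real name.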

Another tutorial in this series will talk about how to process command line arguments and options in Python 3 in more detail.

Testing your code

Testing is very important for any form of coding, especially when your script deals with system files and directories. So, you will have to run some tests to verify that myFind.py works as expected; no tool should be deployed on a production system without testing, as it might compromise the system's security and stability.

The first test tries to find a file that does not exist. The second looks for a file or directory that exists exactly once, and the third looks for a filename that exists multiple times in the directory structure you are searching. Because you cannot have the same filename in the same directory twice, in the third case the filename should exist in different directories. The last two tests verify that myFind.py is able to differentiate between directories and files when used with the appropriate switches. The image at the top of the previous page shows myFind.py performing the tests we’ve just described.
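These five tests can also be automated. The sketch below is an assumption-laden illustration rather than the author's test procedure: find_matches() is a hypothetical variant of find() that returns its matches instead of printing them, and the directory tree is created in a temporary directory so that nothing on your system is touched.

```python
import os
import tempfile


def find_matches(directory, name, want_dirs=True, want_files=True):
    """Return sorted paths under directory whose last component equals name."""
    matches = []
    for root, dirs, files in os.walk(directory):
        if want_dirs and os.path.basename(os.path.normpath(root)) == name:
            matches.append(root)
        if want_files:
            matches.extend(os.path.join(root, f) for f in files if f == name)
    return sorted(matches)


def run_tests():
    with tempfile.TemporaryDirectory() as top:
        # Build a small tree: python/ (directory), python/python (file),
        # plus a.out and C/a.out (same file name in two directories)
        os.makedirs(os.path.join(top, 'python'))
        os.makedirs(os.path.join(top, 'C'))
        for parts in (('python', 'python'), ('a.out',), ('C', 'a.out')):
            open(os.path.join(top, *parts), 'w').close()

        # Test 1: a name that does not exist
        assert find_matches(top, 'missing') == []
        # Test 2: a name that exists exactly once
        assert find_matches(top, 'C') == [os.path.join(top, 'C')]
        # Test 3: a file name that exists in several directories
        assert len(find_matches(top, 'a.out')) == 2
        # Tests 4 and 5: directories only, then files only
        assert find_matches(top, 'python', want_files=False) == [
            os.path.join(top, 'python')]
        assert find_matches(top, 'python', want_dirs=False) == [
            os.path.join(top, 'python', 'python')]
    return True
```

Calling run_tests() raises an AssertionError at the first failing check and returns True when all five pass.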

The good thing with testing is that you are also learning how to use the code while you are testing it. Running myFind.py without any command line arguments reveals the following helpful message:
$ ./myFind.py
Usage: ./myFind.py directory filename [-df]

os.walk() vs os.scandir()

If you are using Python version 3.5 or newer, directory traversal is much faster thanks to the os.scandir() function, which avoids most calls to os.stat(). The good news is that from Python 3.5 onwards os.walk() automatically uses os.scandir() in its implementation, so you get the speed-up without doing any extra work to your code.
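To see what os.scandir() looks like when used directly, here is a sketch of a recursive file counter. count_files() is a hypothetical helper of ours; the with form of os.scandir() needs Python 3.6 or newer, and unlike os.walk() this simplified version does not ignore directories it is not allowed to read.

```python
import os


def count_files(directory):
    """Count the non-directory entries below directory with os.scandir()."""
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # A real subdirectory: descend into it
                total += count_files(entry.path)
            elif not entry.is_dir():
                # Anything that is not a directory (or a link to one) counts
                total += 1
    return total
```

Each entry already carries its type information in most cases, which is where the saving over a separate os.stat() call per file comes from.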

This final part of the tutorial will test the speed differences between os.walk() and os.scandir() using the relatively simple time command. The tests will use Python version 3.5.2, which uses os.scandir(), and Python version 2.7.10, which does not use os.scandir(). In order to make the tests as reliable as possible, you will need to search a big directory tree such as /usr or /var. This section will use a modified version of the learnWalk.py script saved as testSpeed.py, implemented in the following fashion:
import os
import sys

if len(sys.argv) >= 2:
    directory = str(sys.argv[1])
else:
    print('Not enough arguments!')
    # A non-zero exit status signals the error to the shell
    sys.exit(1)

total = 0
for root, dirs, files in os.walk(directory):
    # Count the files without printing them
    for file in files:
        total = total + 1
print('Visited', total, 'files!')

The results show the timing of various executions under the two Python versions, telling us that the Python 3.5 version is significantly faster!
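The time command measures the whole process, interpreter start-up included. If you want to time only the traversal, a sketch along the following lines may help. timed_walk() is a hypothetical helper of ours, and time.perf_counter() exists only in Python 3.3 and newer, so this variant cannot be run under Python 2.7.

```python
import os
import time


def timed_walk(directory):
    """Return (number of files, elapsed seconds) for one os.walk() pass."""
    start = time.perf_counter()
    total = 0
    for root, dirs, files in os.walk(directory):
        total += len(files)
    elapsed = time.perf_counter() - start
    return total, elapsed
```

It is worth calling the function several times on the same tree, because the first pass is usually slower while the operating system fills its file system cache.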

This image shows sample executions of the myFind.py system tool, which are mainly used for testing the script.
This image shows a small part of the man page of the find utility, which you can see by typing “man find”.
This screenshot shows a big part of the online documentation of the os.walk() function, which implements the core functionality of the myFind.py script.
