OpenSource For You

Processing Big Data Using MongoDB-GridFS

MongoDB becomes much more capable when combined with GridFS. The latter divides a file into parts, or chunks, which are then stored as separate documents. In a nutshell, GridFS is a kind of file system used to store large files.

- By: Dr Gaurav Kumar. The author is the MD of Magma Research and Consultancy Pvt Ltd, Ambala City. He is associated with various academic and research institutes, where he delivers lectures and conducts technical workshops. He can be contacted at kumargaura

The rapidly increasing volume and variety of unstructured data generates enormous datasets that are difficult to analyse and extract knowledge from. Big Data analytics addresses the storage, processing and knowledge discovery of such huge amounts of unstructured data. For many applications, the volume, velocity and variety of data grow continuously, and a significant amount of research is under way on how Big Data analytics can help in processing and understanding such data.

The following list gives a rough idea of why Big Data analytics has gained so much importance today:

Since the inception of the Indian Railway Catering and Tourism Corporation (IRCTC) in 2002, online ticket bookings have increased from 29 tickets to 1.3 million tickets per day.

As per recent research at Harvard, the information stored in 1 gram of DNA is equivalent to 700 terabytes of data in digital format.

According to InternetLiveStats.com, around 10,000 tweets are processed per second on Twitter; Internet traffic totals more than 42TB per second; and on YouTube, more than 68,000 videos are viewed every second. As per a report on the popular statistics portal Statista.com, Facebook had more than 1,000 million users per day as of December 2016.

MongoDB: A prominent NoSQL database for Big Data processing

MongoDB is one of the most widely used NoSQL databases under free and open source distribution. It is a document-oriented database written in C++, and a leader in the database segment for large scale as well as performance-aware Big Data based applications. The implementation and storage of Aadhaar cards in India is being done effectively with the integration of MongoDB.

MongoDB provides a number of modules and specifications to support scalable and high performance Big Data based applications. These features and modules include GridFS, sharding, capped collections, MapReduce and many others. MongoDB can insert more than 200 million rows of data in a few seconds.

GridFS—for the storage and retrieval of large objects

GridFS is one of the powerful specifications of MongoDB that helps to store and retrieve large scale files. These files can be structured or unstructured, and include documents, audio files, images, recorded video clips, binary files, etc. GridFS is similar to a file system for the storage of files. MongoDB collections are used for the storage of data and files with GridFS, which can store files of any format, including files that are more than 16MB in size. Classical MongoDB documents have a storage limit of 16MB, but MongoDB-GridFS can store and retrieve files beyond this limit too.

Figure 2: Number of active users per day on Facebook

Using GridFS, a file is divided into a number of chunks. Each chunk is stored as a separate but logically connected document, with a maximum size of 255KB.
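As a quick sanity check on the chunk arithmetic, the number of chunk documents needed for a file of a given size is a simple ceiling division. The sketch below assumes the 255KB default chunk size mentioned above; `chunk_count` is a hypothetical helper written for this illustration:

```python
# Estimate how many chunk documents GridFS creates for a file.
# 255KB is the default GridFS chunk size; it is configurable per file.
CHUNK_SIZE = 255 * 1024  # bytes

def chunk_count(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Ceiling division: every started chunk counts as a full document."""
    return -(-file_size // chunk_size)

# A 16MB file -- the classical single-document limit -- needs 65 chunks.
print(chunk_count(16 * 1024 * 1024))  # 65
```

Note that the last chunk is usually smaller than 255KB; GridFS does not pad it.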

GridFS works on two collections, called fs.files and fs.chunks, which store the file metadata and the file chunks, respectively. Each chunk is uniquely associated with an ID. A document in the fs.files collection acts as the parent of its chunks. The files_id field in the fs.chunks collection links each chunk's content with its parent.

The format of the fs.files collection is:

"filename": "********",            // Name of the file
"chunkSize": ********,             // Size of each chunk
"uploadDate": ISODate("********"), // Timestamp
"md5": "********",                 // MD5 (Message Digest) hash of the file
"length": ********,                // Size of the document in bytes
"contentType": "********",         // File type in MIME format
"metadata": ********               // Additional information

The format of the fs.chunks collection is:

"_id": ObjectId("********"),      // Unique ID of the chunk
"files_id": ObjectId("********"), // ID of the parent document in fs.files
"n": ********,                    // Sequence number of the chunk
"data": "********"                // Data in the chunk
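To make the parent-child linkage between the two collections concrete, the sketch below models both collections as plain Python dictionaries. This is an illustrative model, not actual MongoDB output; the payload and parent ID are made up. A payload is split into fs.chunks-style documents carrying files_id and n, then reassembled and checked against the fs.files-style metadata:

```python
import hashlib

CHUNK_SIZE = 255 * 1024  # default GridFS chunk size in bytes

data = b"x" * (CHUNK_SIZE * 2 + 100)    # sample payload spanning three chunks
parent_id = "58bc0b91bf9bde12940a1640"  # stands in for an ObjectId

# fs.files: one metadata document per stored file
fs_files = {
    "_id": parent_id,
    "length": len(data),
    "chunkSize": CHUNK_SIZE,
    "md5": hashlib.md5(data).hexdigest(),
}

# fs.chunks: one document per chunk, linked to its parent via files_id
fs_chunks = [
    {"files_id": parent_id, "n": n, "data": data[i:i + CHUNK_SIZE]}
    for n, i in enumerate(range(0, len(data), CHUNK_SIZE))
]

# Reading the file back means sorting the chunks by n and concatenating
restored = b"".join(c["data"] for c in sorted(fs_chunks, key=lambda c: c["n"]))
assert restored == data
assert hashlib.md5(restored).hexdigest() == fs_files["md5"]
print(len(fs_chunks))  # 3
```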

Adding and retrieving large binary files using GridFS

In the following example, the storage of a video file is implemented using GridFS. The put command is used for this, via the mongofiles.exe utility in the bin folder.

First, open the command prompt. Next, change the directory to the bin folder of MongoDB. Now start the MongoDB server by executing mongod.exe.

Next, execute the following command:

C:\MongoDBDirectory\bin\mongofiles.exe -d gridfs put MyVideo.avi

In the instruction above, mongofiles.exe is the utility for executing different commands.

The keyword ‘gridfs’ represents the name of the database to be used for the storage and retrieval of files. If the database does not already exist, MongoDB creates a new database with that name dynamically. MyVideo.avi is the video file to be uploaded using GridFS.

To search and view the document in the database, execute the following command on a MongoDB prompt:

MongoDB Prompt> db.fs.files.find()

To display all the chunks created, the following instructio­n is executed:

MongoDB Prompt> db.fs.chunks.find({files_id: ObjectId('58bc0b91bf9bde12940a1640')})

Interfacin­g MongoDB-GridFS with PHP

The integration of PHP with MongoDB can be done using the MongoDB driver available at https://s3.amazonaws.com/drivers.mongodb.org/php/index.html.

After downloadin­g php_mongo.dll, the following line is inserted in the php.ini file:

extension = php_mongo.dll

Once the PHP-MongoDB driver is ready, GridFS can be used.

Use the following script to upload and store a large file using PHP-MongoDB-GridFS:

<?php
// Uses the legacy PHP Mongo driver (php_mongo.dll), as set up above
$BigDataConnection = new Mongo("127.0.0.1:27017");
$db = $BigDataConnection->BigDatabase;
$db->authenticate("databaseusername", "database-password");
$biggrid = $db->getGridFS();
$name = $_FILES['BigFile']['name'];
$type = $_FILES['BigFile']['type'];
$id = $biggrid->storeUpload('BigFile', $name);
$files = $db->fs->files;
$files->update(array("filename" => $name),
    array('$set' => array("contentType" => $type,
                          "aliases" => null,
                          "metadata" => null)));
$BigDataConnection->close();
exit(0);
?>

To display all files using PHP-MongoDB-GridFS, use the following script:

<?php
$BigDataConnection = new Mongo("127.0.0.1:27017");
$db = $BigDataConnection->BigDatabase;
$db->authenticate("databaseusername", "database-password");
$biggrid = $db->getGridFS();
$mycursor = $biggrid->find();
foreach ($mycursor as $myobj) {
    echo 'Filename: ' . $myobj->getFilename() . ' Size: ' . $myobj->getSize() . '<br/>';
}
$BigDataConnection->close();
exit(0);
?>

To delete files using PHP-MongoDB-GridFS, use the script given below:

<?php
$BigDataConnection = new Mongo("127.0.0.1:27017");
$mydb = $BigDataConnection->BigDatabase;
$mydb->authenticate("databaseusername", "database-password");
$biggrid = $mydb->getGridFS();
$myfilename = $_REQUEST["myfile"];
$myfile = $biggrid->findOne($myfilename);
$myid = $myfile->file['_id'];
$biggrid->delete($myid);
$BigDataConnection->close();
exit(0);
?>

Interfacin­g MongoDB-GridFS with Python

To integrate Python with MongoDB-GridFS, install the pymongo library into the existing Python set-up. After this step, GridFS can be used.

On the Python prompt, the following instructions can be executed for Python-MongoDB-GridFS interfacing:

>>> from pymongo import MongoClient
>>> import gridfs

To select the database and file system, execute the following instructions:

>>> mydb = MongoClient().MyGridFS
>>> filesystem = gridfs.GridFS(mydb)

To insert a new file, put() is used:

>>> myfile = filesystem.put(b"My New File")

To read the file contents, get() is used:

>>> filesystem.get(myfile).read()
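The put()/get() calls above assume a running MongoDB server. As a server-free illustration of the same round-trip contract, the toy class below mimics the put()/get()/delete() shape of the gridfs API; InMemoryGridFS is invented for this sketch and is not part of pymongo:

```python
import io
import itertools

class InMemoryGridFS:
    """Toy stand-in for gridfs.GridFS, illustrating put()/get()/delete()."""

    def __init__(self):
        self._files = {}                 # file_id -> raw bytes
        self._ids = itertools.count(1)   # real gridfs returns ObjectIds

    def put(self, data: bytes):
        file_id = next(self._ids)
        self._files[file_id] = data
        return file_id                   # caller keeps this ID to fetch the file

    def get(self, file_id):
        # gridfs.GridFS.get() returns a GridOut object exposing read()
        return io.BytesIO(self._files[file_id])

    def delete(self, file_id):
        self._files.pop(file_id, None)

filesystem = InMemoryGridFS()
myfile = filesystem.put(b"My New File")
print(filesystem.get(myfile).read())  # b'My New File'
```

Against a real server, only the construction line changes: the GridFS object is built from a database handle, exactly as shown on the Python prompt above.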

These techniques can be used for assorted applications involving Big Data processing. MongoDB-GridFS interfacing can also be done with many other programming languages, including Java, C++, JavaScript, Ruby, Scala and Haskell.

Figure 1: Real-time Big Data cases on InternetLiveStats.com
Figure 3: Portal of the MongoDB NoSQL database
Figure 4: Utilities in the BIN directory of the MongoDB server
Figure 5: Starting the MongoDB server from the BIN directory
Figure 6: Output from the GridFS database using find()
Figure 7: Inserting a line of PHP-MongoDB driver in php.ini
