
Using MongoDB to Improve the IT Performance of an Enterprise

This article targets developers and architects who are looking to adopt open source software in their IT ecosystems. The authors describe an actual enterprise situation in which they adopted MongoDB in their workflow to speed up processes.

By Raj Thilak and Gautham D.N. Raj Thilak is a lead developer at Dell Technologies; he can be reached at Raj_Thilak_R@Dell.com. Gautham D.N. is an IT manager at Dell Technologies; he can be contacted at Gautham.D.N@dell.com.

In the last decade or so, the amount of data generated has grown exponentially, and the ways to store, manage and visualise data have shifted from old legacy methods to new ones. There has been an explosion in the number and variety of open source databases. Many are designed to provide high scalability and fault tolerance while retaining core ACID database features. Each open source database has some special features, so it is very important for a developer or an enterprise to choose with care and analyse each specific problem statement or use case independently. In this article, let us look at one of the open source databases that we evaluated and adopted in our enterprise ecosystem to suit our use cases.

MongoDB, as defined in its documentation, is an open source, cross-platform, document-oriented database that provides high performance, high availability and easy scalability.

MongoDB works with the concept of collections, which you can compare to tables in an RDBMS such as MySQL or Oracle. Each collection is made up of documents (similar to XML, HTML or JSON documents), which are the core entity in MongoDB and can be compared to a logical row in an Oracle database.
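
As a minimal sketch of these ideas, the following Python snippet uses the PyMongo driver to insert a document into a collection and read it back; the database, collection and field names are purely illustrative, not taken from our system:

# Illustrative only: names are not from our actual system.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["orderdb"]      # database
orders = db["orders"]       # collection, comparable to a table

# A document, comparable to a logical row, expressed as a JSON-like dictionary
orders.insert_one({
    "orderNumber": "SO-1001",
    "status": "CREATED",
    "items": [{"sku": "ABC-1", "qty": 2}],   # nested data inside the document
})

print(orders.find_one({"orderNumber": "SO-1001"}))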

MongoDB has a flexible schema compared to a typical Oracle database. In the latter, we need a definite table with well-defined columns, and all the data needs to fit the table's row type. MongoDB, however, lets you store data in the form of documents, in JSON format and in a non-relational way. Each document can have its own format and structure, independent of the others. The trade-off is the inability to perform joins on the data. One of the major shifts we had to go through as developers and architects while adopting MongoDB was one of mindset: moving away from storing normalised data and eliminating redundancy, towards a world where we store all the possible data in the form of documents and handle the resulting problems of concurrency.
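
A small illustration of this flexibility, again with purely hypothetical field names: two documents of different shapes can live in the same collection, with related data embedded rather than joined:

# Illustrative only: two documents in the same collection with different
# structures, and related data embedded (denormalised) instead of joined.
from pymongo import MongoClient

orders = MongoClient()["orderdb"]["orders"]

orders.insert_many([
    # customer and line items embedded directly in the order document
    {"orderNumber": "SO-1001",
     "customer": {"name": "ACME Corp", "segment": "enterprise"},
     "lines": [{"sku": "ABC-1", "qty": 2}]},
    # a second document with a completely different shape
    {"orderNumber": "SO-1002",
     "channel": "web",
     "notes": "expedite"},
])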

The horizontal scalability factor is fulfilled by the ‘sharding’ concept, where the data is split across different machines and partitions called shards, which allow further scaling. Fault tolerance is enabled by replicating data on different machines or data centres, making the data available in case of server failures. In addition, an automatic leader election process provides high availability across the cluster of servers.
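
As a rough sketch of how sharding is switched on, the commands below would be run through a mongos router of an already configured sharded cluster; the database name, collection name and shard key are assumptions for illustration only (replication and leader election are configured separately, at the replica set level):

# A sketch only: run against a mongos router of a sharded cluster.
# Database, collection and shard-key names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

# Allow the database to be sharded, then shard the collection on a
# hashed order number so writes are spread across the shards.
client.admin.command("enableSharding", "orderdb")
client.admin.command("shardCollection", "orderdb.events",
                     key={"orderNumber": "hashed"})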

Traditionally, databases have supported a single data model, like key-value pairs, graphs, relational, hierarchical or text search; however, many databases coming out today can support more than one model. MongoDB is one such database with multi-model capabilities. Even though MongoDB provides geospatial and text search capabilities, it is not on a par with Solr or Elasticsearch, which make better dedicated search engines.
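
The snippet below sketches these text and geospatial capabilities with PyMongo; the collection and field names are assumptions:

# Illustrative only: text and geospatial indexes on a hypothetical collection.
from pymongo import MongoClient, TEXT, GEOSPHERE

products = MongoClient()["catalogue"]["products"]

products.create_index([("description", TEXT)])     # full-text index
products.create_index([("location", GEOSPHERE)])   # 2dsphere geospatial index

# A simple text search; for heavier search workloads, a dedicated engine
# such as Solr or Elasticsearch is the better fit, as noted above.
for doc in products.find({"$text": {"$search": "laptop"}}):
    print(doc)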

An analysis of our use case

We currently work in the order management space, where order data is communicated to more than 120 applications through 270+ integrations via our middleware.

One of the main components we have implemented in-house is our custom logger, a service that logs transaction events, enabling message tracking and error tracking for our system. Most of the messaging is asynchronous. The integrations are heterogeneous: we connect to Oracle and Microsoft SQL Server relational databases, IBM MQ, Service Bus, Web services and some file-based integrations.

Our middleware processes generate a large number of events along the path the order travels within the IT system. These events usually contain order metadata; a few order attributes required for searching; a status indicating success, errors or warnings; and, in some cases, the whole payload for debugging.
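
As an illustration only (these field names are not our actual schema), one such event might be modelled as a document like this:

# A hypothetical logger event document; field names are assumptions,
# not the actual schema used by our custom logger.
from datetime import datetime, timezone

event = {
    "orderNumber": "SO-1001",
    "application": "order-fulfilment",
    "integration": "oracle-ebs",
    "status": "ERROR",                 # SUCCESS / WARNING / ERROR
    "message": "Timeout calling downstream service",
    "timestamp": datetime.now(timezone.utc),
    "payload": "<order>...</order>",   # optionally, the full payload for debugging
}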

Our custom logger framework traditionally stored these events in plain text log files on each server's local file system, with a background Python job reading these log files and shredding them into relational database tables. The logging is fast; however, tracking a message across multiple servers and getting a real-time view of the order is still not possible. Then there are problems around the scheduler and the background jobs, which need to be monitored. In a cluster with both Prod and DR running in active mode on 16 physical servers, we have to run 16 scheduler jobs and monitor them to ensure that they run all the time. We can increase the speed of data-fetching using multiple threads or by scheduling at smaller intervals; however, managing them across multiple domains as we scale our clusters is a maintenance headache.

To get a real-time view, we rewrote our logging framework as a lightweight Web service that could write directly to RDBMS database tables. This brought down the performance of the system. When we were writing to a file on the local file system, the processing speed was around 90-100k messages per minute; with the new design of writing to a database table, it was only 4-5k messages per minute. This was a big trade-off in performance, which we couldn't afford.

We then rewrote the framework with Oracle AQs in between: the Web service writes data into Oracle AQs, and a scheduler job on the database dequeues messages from the AQ and inserts the data into the tables. This improved the performance to 10k messages per minute, but we then hit a dead end with the Oracle database and the surrounding systems. To get a real-time view of the order without losing much of the performance, we started looking at the open source ecosystem and hit upon MongoDB.

It fitted our use case well. Our need was a database that could take high-performance writes, with multiple processes logging events in parallel, while the query rate on this logging data was substantially lower. We quickly modelled the document based on our previous experience, and were able to swiftly roll out the custom logger with a MongoDB backend. The performance improved dramatically, to around 70k messages per minute.
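
As a hedged sketch of the kind of write pattern that suits such a workload, the snippet below uses unordered bulk inserts with a relaxed write concern; the exact settings and names are assumptions, not necessarily what our logger framework uses in production:

# A sketch of a write-heavy logging pattern; settings and names are
# assumptions, not our production configuration.
from pymongo import MongoClient, WriteConcern

db = MongoClient("mongodb://mongos-host:27017")["orderdb"]

# w=1, j=False waits for the primary only and skips the journal wait,
# trading a little durability for throughput.
events = db.get_collection(
    "events", write_concern=WriteConcern(w=1, j=False))

def log_events(batch):
    # ordered=False lets the server keep inserting even if one document fails
    events.insert_many(batch, ordered=False)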

This enabled us to have a near real-time view of the order across multiple processes and systems on a need basis, without compromising on performance. It eliminated the need for multiple scheduler processes across a cluster of servers, and the need to manage each of them. Also, irrespective of how many processes or servers our host application scales to, our logger framework, hosted on separate infrastructure, is able to cater to all the needs in a service-oriented fashion.

Currently, we are learning through experience. Some of the challenges we faced while adopting MongoDB involved managing data growth and the need for a purge mechanism for the data. This is something that is not explicitly available, and needs to be planned and managed when we create the shards. Shard management needs to be improved to provide optimal usage of storage. Also, the replicas and their locations define how good our disaster recovery will be. We have been able to maintain the infrastructure without much hassle, and are looking at the opportunity to roll out this logging framework into other areas, like the product master or customer master integration space in our IT. This should be possible without much rework or change, because of MongoDB's flexible JSON document model.

