MongoDB - 4. Aggregation Framework


Aggregation framework mainly used for ETL majorly aggregation pipeline. Aggregations operations process data records and return computed results. There are 3 types for doing aggregation in mongoDB.

  • Single Purpose Aggregation

  • Aggregation Pipeline

  • Map Reduce Functions

Single Purpose Aggregation-

There is very simple function like getting no of records from collection or getting distinct records by removing duplicates.

db.mycollection.count()

db.mycollection.distinct()

Aggregation Pipeline

I aggregation pipelining there will be phases. Series of actions will be takes place on input and output will be produced. It is very similar to Map- Reduce concept in Hadoop. There will be different ETL phases as below –

MATCH

GROUP

UNWIND

PROJECT

SKIP

LIMIT

Most of the time when we have to fetch the data we compare syntax with the SQL.

So, similar commands are available in MongoDB.

  • WHERE - $match

  • GROUP BY- $group

  • HAVING - $match

  • SELECT - $project

  • ORDER BY - $sort

  • LIMIT - $limit

Map Reduce Functions

MongoDB also provides map-reduce operations to perform aggregation. Sometime Finalize function also used in map –reduce. I am not going into details of this type as we can do all the things by aggregation pipeline which we can achieve with these functions.

Let us take few examples to get into little deeper to understand these concepts.

  1. Get the project total by year

db.project.aggregate([{$group:{_id:"$year","total by year":{$sum:1}}}])

- aggregate by year

2. db.project.aggregate([{$group:{_id:"$company","total by company":{$sum:1}}}])

- aggregate by year

3. db.project.aggregate([{$group:{"_id": {"firm": "$company","year":"$year"}, "total by company":{$sum:1}}}])

- aggregate by year/company

4. db.project.aggregate([

{ "$match": {"year": 2010, "company": "ABC"} },

{ "$unwind": "$Project" },

{ "$match": { "$or": [{"Project.Domain": /Telecom/},{"Project.Domain": /Retail/}] } },

{ "$unwind": "$Project.Client" },

{

"$group": {

"_id": {

"company": "$company",

"year": "$year",

"domain": "$Project.Domain"

},

"headcount": { "$sum": "$Project.Client.size" }

}

}

])

-- aggregate by year and company all the project sizes

5. db.trial_test.aggregate([{$group:{_id:"$status","count":{$sum:1}}}])

- aggregate by status

6. db.flights.aggregate([{$group:{_id:"$Year","total airtime":{$sum:"$AirTime"}}}])

- the ID has been described

7. db.flights.aggregate([{$group:{_id:{"Carrier":"$UniqueCarrier"},"total airtime":{$sum:"$AirTime"}}}])

Match

=====

8. db.flights.aggregate([{$group:{_id:{"Carrier":"$UniqueCarrier"},"total airtime":{$sum:"$AirTime"}}},{$match: {"_id.Carrier":"DL"}}])

Projection

==========

9. db.flights.aggregate([{$group:{_id:{"Carrier":"$UniqueCarrier"},"total airtime":{$sum:"$AirTime"}}},{$project:{_id:0, "Flight":

{$toLower:"$_id.Carrier"},"Air Time":{$multiply:["$total airtime",60]}}},{$out:"FlightPlan"}])


About Author

Dattatray Shinde have over 12+ years of experience in Software Design, Development & Maintenance of Web Based Applications; worked on Healthcare, Insurance, E-commerce and Learning Management System domains. Over 6 + years as Data Scientist worked mainly in predictive analytics, survey analytics, risk analytics platforms.

Featured Posts
Recent Posts