User Defined Aggregate Functions (UDAF)

UDAF mean example

Create a Java class that extends org.apache.hadoop.hive.ql.exec.UDAF

Create a static nested class that implements org.apache.hadoop.hive.ql.exec.UDAFEvaluator

Implement five methods
- init() – This method initializes the evaluator and resets its internal state. We are using new Column() in the code below to indicate that no values have been aggregated yet.
- iterate() – This method is called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of the aggregation (for the mean, we add the value to a running sum and increment a count). We return true to indicate that the input was valid.
- terminatePartial() – This method is called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation.
- merge() – This method is called when Hive decides to combine one partial aggregation with another. Its argument has the same type as the object returned by terminatePartial().
- terminate() – This method is called when the final result of the aggregation is needed.
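
To see how the five methods fit together without a Hive installation, the call sequence can be imitated in plain Java. The sketch below does not use Hive's API: MeanLifecycleSketch, Evaluator, and meanOverSplits are hypothetical names, and the "two map tasks feeding one reducer" flow is an assumed, simplified picture of the order in which Hive might call the methods.

```java
/** Plain-Java stand-in for the evaluator lifecycle; no Hive classes are used. */
final class MeanLifecycleSketch {

    /** Partial aggregation state: a running sum and a count. */
    static final class Column {
        double sum = 0;
        long count = 0;
    }

    /** Hypothetical evaluator exposing the same five methods described above. */
    static final class Evaluator {
        private Column col;

        Evaluator() {
            init();
        }

        // Reset the state: nothing has been aggregated yet.
        void init() {
            col = new Column();
        }

        // Fold one input value into the state.
        boolean iterate(double value) {
            col.sum += value;
            col.count++;
            return true;
        }

        // Hand the partial state to whoever performs the merge.
        Column terminatePartial() {
            return col;
        }

        // Combine another partial state into this one.
        boolean merge(Column other) {
            if (other != null) {
                col.sum += other.sum;
                col.count += other.count;
            }
            return true;
        }

        // Final result; null for an empty group rather than dividing by zero.
        Double terminate() {
            return col.count == 0 ? null : Double.valueOf(col.sum / col.count);
        }
    }

    /** Simulates two map-side evaluators whose partial results meet in one reducer. */
    static Double meanOverSplits(double[] splitA, double[] splitB) {
        Evaluator mapperA = new Evaluator();
        for (double v : splitA) {
            mapperA.iterate(v);
        }
        Evaluator mapperB = new Evaluator();
        for (double v : splitB) {
            mapperB.iterate(v);
        }
        Evaluator reducer = new Evaluator();
        reducer.merge(mapperA.terminatePartial());
        reducer.merge(mapperB.terminatePartial());
        return reducer.terminate();
    }

    public static void main(String[] args) {
        // Values 1, 2, 3 land in one split and 4 in another.
        System.out.println(meanOverSplits(new double[] {1.0, 2.0, 3.0}, new double[] {4.0}));
    }
}
```

Running main prints 2.5. The point of the sketch is that the partial state carries both the sum and the count: averaging the two partial means (2.0 and 4.0) would give 3.0, which is wrong.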
  
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  import org.apache.hadoop.hive.ql.exec.UDAF;
  import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
  import org.apache.hadoop.hive.ql.metadata.HiveException;

  public class MeanUDAF extends UDAF {

    // Define logging
    static final Log LOG = LogFactory.getLog(MeanUDAF.class.getName());

    public static class MeanUDAFEvaluator implements UDAFEvaluator {

      /**
       * Use the Column class to serialize the intermediate computation.
       * This is our groupByColumn.
       */
      public static class Column {
        double sum = 0;
        int count = 0;
      }

      private Column col = null;

      public MeanUDAFEvaluator() {
        super();
        init();
      }

      // A - Initialize the evaluator, indicating that no values have been
      // aggregated yet.
      public void init() {
        LOG.debug("Initialize evaluator");
        col = new Column();
      }

      // B - Iterate every time there is a new value to be aggregated.
      public boolean iterate(double value) throws HiveException {
        LOG.debug("Iterating over each value for aggregation");
        if (col == null)
          throw new HiveException("Item is not initialized");
        col.sum = col.sum + value;
        col.count = col.count + 1;
        return true;
      }

      // C - Called when Hive wants partially aggregated results.
      public Column terminatePartial() {
        LOG.debug("Return partially aggregated results");
        return col;
      }

      // D - Called when Hive decides to combine one partial aggregation with another.
      public boolean merge(Column other) {
        LOG.debug("Merging by combining partial aggregation");
        if (other == null) {
          return true;
        }
        col.sum += other.sum;
        col.count += other.count;
        return true;
      }

      // E - Called when the final result of the aggregation is needed.
      // Returns null for an empty group instead of dividing by zero.
      public Double terminate() {
        LOG.debug("At the end of last record of the group - returning final result");
        if (col.count == 0) {
          return null;
        }
        return col.sum / col.count;
      }
    }
  }
  
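  Add the JAR to the Hive session, register the function under the name used in the query, then call it:

  hive> ADD JAR <JAR PATH>.jar;
  hive> CREATE TEMPORARY FUNCTION mean_udf AS '<fully qualified class name of MeanUDAF>';
  hive> select id, mean_udf(amount) from table group by id;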