Snowplow: evolve your analytics stack with your business

  • Published on 03-Mar-2017


Transcript

  • Snowplow: evolve your analytics stack with your business

    Snowplow Meetup San Francisco, Feb 2017

  • Our businesses are constantly evolving

    Our digital products (apps and platforms) are constantly developing

    The questions we ask of our data are constantly changing

    It is critical that our analytics stack can evolve with our business

  • Self-describing data + event data modeling = an analytics stack that evolves with your business

    How Snowplow users evolve their analytics stacks with their business

  • Self-describing data: overview

  • Event data varies widely by company

  • As a Snowplow user, you can define your own events and entities

    Events                                        Entities (contexts)
    Build castle, Form alliance, Declare war      Player, Game, Level, Currency
    View product, Buy product, Deliver product    Product, Customer, Basket, Delivery van

  • You then define a schema for each event and entity

    { "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#", "description": "Schema for a fighter context", "self": { "vendor": "com.ufc", "name": "fighter_context", "format": "jsonschema", "version": "1-0-1" },

    "type": "object", "properties": { "FirstName": { "type": "string" }, "LastName": { "type": "string" }, "Nickname": { "type": "string" }, "FacebookProfile": { "type": "string" }, "TwitterName": { "type": "string" }, "GooglePlusProfile": { "type": "string" },

    "HeightFormat": { "type": "string" }, "HeightCm": { "type": ["integer", "null"] }, "Weight": { "type": ["integer", "null"] }, "WeightKg": { "type": ["integer", "null"] }, "Record": { "type": "string", "pattern": "^[0-9]+-[0-9]+-[0-9]+$" }, "Striking": { "type": ["number", "null"], "maxdecimal": 15 }, "Takedowns": { "type": ["number", "null"], "maxdecimal": 15 }, "Submissions": { "type": ["number", "null"], "maxdecimal": 15 }, "LastFightUrl": { "type": "string" },

    "LastFightEventText": { "type": "string" }, "NextFightUrl": { "type": "string" }, "NextFightEventText": { "type": "string" }, "LastFightDate": { "type": "string", "format": "timestamp" } }, "additionalProperties": false }

    Upload the schema to Iglu
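    Before uploading, a schema like this can be sanity-checked with any JSON Schema library. A minimal sketch using Python's third-party jsonschema package (the file path follows the Iglu repository layout but is hypothetical here; Snowplow's igluctl tool is the usual way to lint and publish schemas):

    # Minimal sketch: sanity-check a schema before uploading it to Iglu.
    # Assumes the third-party "jsonschema" package; the file path is hypothetical.
    import json
    from jsonschema import Draft4Validator

    with open("schemas/com.ufc/fighter_context/jsonschema/1-0-1") as f:
        schema = json.load(f)

    # Raises jsonschema.SchemaError if the schema itself is malformed
    Draft4Validator.check_schema(schema)

    # The "self" block is what Iglu uses to address the schema
    s = schema["self"]
    print(f"iglu:{s['vendor']}/{s['name']}/{s['format']}/{s['version']}")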

  • Then send data into Snowplow as self-describing JSONs

    The pipeline steps: 1. Validation → 2. Dimension widening → 3. Data modeling

    {
      "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
      "data": {
        "timestamp": "2016-11-16 19:53:21",
        "location": "Berlin",
        "temperature": 3,
        "units": "Centigrade"
      }
    }

    { "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#", "description": "Schema for an ad impression event", "self": { "vendor": com.israel365", "name": temperature_measure", "format": "jsonschema", "version": "1-0-0" }, "type": "object",

    "properties": { "timestamp": { "type": "string" }, "location": { "type": "string" }, }, }

    The first JSON is the event, its "schema" field is the schema reference, and the second JSON is the schema it points to.
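    To make the validation step concrete, here is a minimal Python sketch using the jsonschema package. In a real pipeline the schema would be resolved from an Iglu registry via the event's schema reference; it is inlined here to keep the sketch self-contained:

    # Minimal sketch of step 1 (validation). In a real pipeline the schema is
    # resolved from Iglu using the event's "schema" reference; here it is inlined.
    from jsonschema import ValidationError, validate

    event = {
        "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
        "data": {
            "timestamp": "2016-11-16 19:53:21",
            "location": "Berlin",
            "temperature": 3,
            "units": "Centigrade",
        },
    }

    schema = {
        "type": "object",
        "properties": {
            "timestamp": {"type": "string"},
            "location": {"type": "string"},
        },
    }

    try:
        validate(instance=event["data"], schema=schema)
        print("event is valid")
    except ValidationError as err:
        print(f"bad event: {err.message}")  # Snowplow routes these to bad rows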

  • The schemas can then be used in a number of ways

    Validate the data (important for data quality)

    Load the data into tidy tables in your data warehouse (sketched below)

    Make it easy and safe to write downstream data processing applications (e.g. for real-time consumers)
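    To illustrate the tidy-tables point: a self-describing JSON can be flattened mechanically into one row of a table dedicated to that schema. This is a sketch of the general idea only; the column naming is an assumption, not Snowplow's exact shredding logic:

    # Sketch: flatten a self-describing JSON into one warehouse row keyed by
    # its schema. Column naming is an assumption, not Snowplow's shredder.
    def to_row(event):
        vendor, name, fmt, version = event["schema"][len("iglu:"):].split("/")
        row = {"schema_vendor": vendor, "schema_name": name, "schema_version": version}
        row.update(event["data"])  # each property becomes a column
        return row

    event = {
        "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
        "data": {"timestamp": "2016-11-16 19:53:21", "location": "Berlin"},
    }
    print(to_row(event))  # one tidy row, destined for a schema-specific table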

  • Event data modeling: overview

  • What is event data modeling?

    The pipeline steps: 1. Validation → 2. Dimension widening → 3. Data modeling

    Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is simpler to query.
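    Sessionization is a classic example. A minimal Python sketch, assuming a 30-minute inactivity timeout (the timeout and the event shape are illustrative):

    # Sketch of event data modeling: business logic (an illustrative 30-minute
    # inactivity timeout) aggregates event-level rows into a sessions table.
    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)

    def sessionize(events):
        sessions = {}   # user -> list of session dicts
        last_seen = {}  # user -> timestamp of that user's previous event
        for user, ts in sorted(events, key=lambda e: e[1]):
            if user not in last_seen or ts - last_seen[user] > TIMEOUT:
                sessions.setdefault(user, []).append({"start": ts, "events": 0})
            sessions[user][-1]["events"] += 1
            sessions[user][-1]["end"] = ts
            last_seen[user] = ts
        return sessions

    events = [
        ("alice", datetime(2017, 2, 1, 9, 0)),
        ("alice", datetime(2017, 2, 1, 9, 10)),  # within 30 min: same session
        ("alice", datetime(2017, 2, 1, 11, 0)),  # long gap: new session
    ]
    print(sessionize(events))  # 'modeled' data: one dict per session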

  • Unmodeled data (event 1 … event n) is aggregated into modeled data (users, sessions, funnels)

    Unmodeled data: immutable, unopinionated, hard to consume, not contentious

    Modeled data: mutable and opinionated, easy to consume, may be contentious

  • In general, event data modeling is performed on the complete event stream

    Late arriving events can change the way you understand earlier arriving events

    If we change our data models, this gives us the flexibility to recompute historical data based on the new model, as sketched below
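    A sketch of what that recomputation looks like, reusing the sessionization idea from above but with the timeout as a parameter (both timeout values are illustrative):

    # Because the raw event stream is immutable, a change to modeling logic is
    # applied by regenerating the derived data from all of history, not by
    # patching tables in place. Timeout values are illustrative.
    from datetime import datetime, timedelta

    def sessionize(events, timeout):
        sessions, last_seen = {}, {}
        for user, ts in sorted(events, key=lambda e: e[1]):
            if user not in last_seen or ts - last_seen[user] > timeout:
                sessions.setdefault(user, []).append({"start": ts, "events": 0})
            sessions[user][-1]["events"] += 1
            last_seen[user] = ts
        return sessions

    all_events = [("alice", datetime(2017, 2, 1, 9, 0)),
                  ("alice", datetime(2017, 2, 1, 9, 20))]

    old_model = sessionize(all_events, timeout=timedelta(minutes=30))  # 1 session
    new_model = sessionize(all_events, timeout=timedelta(minutes=15))  # 2 sessions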

  • The evolving event data pipeline

  • How do we handle pipeline evolution?

    PUSH FACTORS: what is being tracked will change over time

    PULL FACTORS: what questions are being asked of the data will change over time

    Businesses are not static, so event pipelines should not be either

    (Diagram: event sources (web, apps, servers, comms channels) push events into collection and processing; data flows into a data warehouse for data exploration, predictive modeling, and real-time dashboards, and into real-time, data-driven applications such as a bidder, vouchers, personalization, and smart car / home)

  • Push example: new source of event data

    If data is self-describing, it is easy to add additional sources

    Self-describing data is good for managing bad data and pipeline evolution

    "I'm an email send event, and I have information about the recipient (email address, customer ID) and the email (id, tags, variation)" (a sketch of such a schema follows)
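    A sketch of what that schema might look like (the vendor, field names, and types are illustrative assumptions, not a real Iglu schema):

    # Sketch: a new event source only needs a new schema. Vendor and field
    # names below are illustrative assumptions.
    email_send_schema = {
        "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
        "description": "Schema for an email send event",
        "self": {
            "vendor": "com.acme",  # hypothetical vendor
            "name": "email_send",
            "format": "jsonschema",
            "version": "1-0-0",
        },
        "type": "object",
        "properties": {
            "recipientEmail": {"type": "string"},
            "customerId": {"type": "string"},
            "emailId": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
            "variation": {"type": "string"},
        },
        "additionalProperties": False,
    }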

  • Pull example: new business question

    (Diagram: a loop of question → answer → insight, with each insight prompting the next question)

  • Answering the question: 3 possibilities

    1. Existing data model supports the answer: the question can be answered with existing modeled data

    2. Need to update the data model: the data already collected supports the answer, but additional computation (additional logic) is required in the data modeling step (see the sketch below)

    3. Need to update the data model and data collection: event tracking must be extended, and the data models updated to incorporate the additional data (and potentially additional logic)
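    A sketch of possibility 2: the events already collected contain everything needed, so only the data modeling step gains logic. Here a bounce flag, defined for illustration as a one-event session, is added to an existing sessions table:

    # Possibility 2 sketch: no new tracking is needed; only the modeling step
    # changes. The bounce definition (a one-event session) is an assumption.
    def add_bounce_flag(sessions):
        for user_sessions in sessions.values():
            for session in user_sessions:
                session["bounced"] = session["events"] == 1
        return sessions

    sessions = {"alice": [{"events": 1}, {"events": 5}]}
    print(add_bounce_flag(sessions))
    # {'alice': [{'events': 1, 'bounced': True}, {'events': 5, 'bounced': False}]}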

  • Self-describing data and the ability to recompute data models are essential to enable pipeline evolution

    Self-describing data enables:

    Updating existing events and entities in a backwards-compatible way, e.g. adding optional new fields

    Updating existing events and entities in a backwards-incompatible way, e.g. changing field types, removing fields, adding compulsory fields

    Adding new event and entity types

    Recomputing data models on the entire data set enables:

    Adding new columns to existing derived tables, e.g. a new audience segmentation

    Changing the way existing derived tables are generated, e.g. changing sessionization logic

    Creating new derived tables
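    Schema versions encode this distinction. Snowplow's SchemaVer convention numbers schemas MODEL-REVISION-ADDITION; the helper below is an illustrative sketch of the two bump rules above, not part of any Snowplow tool:

    # Sketch of SchemaVer (MODEL-REVISION-ADDITION) version bumps. bump() is
    # illustrative, not part of any Snowplow tool.
    def bump(version, change):
        model, revision, addition = (int(p) for p in version.split("-"))
        if change == "breaking":    # type changed, field removed, compulsory field added
            return f"{model + 1}-0-0"
        if change == "compatible":  # e.g. an optional field added
            return f"{model}-{revision}-{addition + 1}"
        raise ValueError(change)

    print(bump("1-0-0", "compatible"))  # 1-0-1: existing data still validates
    print(bump("1-0-1", "breaking"))    # 2-0-0: typically a new table downstream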

  • Questions?