Planning Your Aggregate Deployment

ODK Aggregate can be deployed to:

  • Google App Engine
  • Amazon Web Services (EC2)
  • your own Tomcat server

There is also a fully set-up virtual machine that can be run in nearly any environment.

We recommend using Google App Engine or the ODK Aggregate VM before attempting an EC2 or Tomcat deployment. Once you have tried Aggregate together with ODK Collect and familiarized yourself with their use, you can consider alternative hosting platforms.

You can also go without Aggregate altogether and use ODK Briefcase.

This document provides general advice for thinking through your deployment decisions.

Things to Consider

Internet access

Google App Engine and Amazon Web Services both require internet access. If you don't have consistent internet access, Briefcase may be more appropriate.

Tomcat deployments can operate without internet access. In such an environment, Collect would only be able to upload finalized forms after it connects to the network containing the Tomcat deployment.

Computer skills

Tomcat deployments (including deployment to Amazon Web Services) have a steep learning curve and require technical aptitude. At a minimum you will be:

  • changing network configuration
  • selecting and using a website hosting service or specifying and configuring your own server and network router(s)
  • installing software
  • ensuring that your site has proper power-failure and data-backup systems in place

If this level of systems administration skill is not available, you will have more success using Google App Engine.

Ongoing support

Tomcat deployments require periodic backups of your data. If data security is a concern, you should have a system administrator or database administrator periodically review logs and look for malicious activity.

Availability

Google App Engine provides highly available servers and data storage. Tomcat deployments with similar availability will be expensive to operate unless your organization already has its own information technology department. The less downtime you can tolerate, the more expensive a Tomcat deployment will be.

On the other hand, high availability is not an issue for many deployments. Most users of Collect download blank forms once and rarely update those forms over the course of a study. Surveyors upload finalized forms to Aggregate infrequently and opportunistically. If that is your situation, you likely do not need a server that is as highly available as Google App Engine provides.

Dataset size

Google App Engine can store a virtually unlimited amount of data — well in excess of a million submissions.

However, once a dataset exceeds 7,000 submissions, the data export feature stops working. To correct this, you will need a custom deployment with a larger virtual machine. This problem affects both Google App Engine and Tomcat deployments.

On Google App Engine, a larger server will incur higher billing costs. Additionally, for datasets of over 100,000 records, it is likely that performance will be better when using MySQL or PostgreSQL, rather than Google App Engine's data store. You also have more optimization opportunities when running your own database servers than are available through Google's cloud services.
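The two thresholds in this section can be turned into a quick planning check. The following is only a sketch: the function name `dataset_size_advice` and the wording of the notes are made up for illustration, and the cut-offs are simply the figures quoted above, not limits enforced by Aggregate itself.

```python
EXPORT_LIMIT = 7_000          # above this, data export needs a larger virtual machine
DATASTORE_COMFORT = 100_000   # above this, MySQL or PostgreSQL likely performs better


def dataset_size_advice(submissions: int) -> list[str]:
    """Return planning notes for an expected number of submissions."""
    notes = []
    if submissions > EXPORT_LIMIT:
        notes.append("export: plan a custom deployment with a larger virtual machine")
    if submissions > DATASTORE_COMFORT:
        notes.append("storage: consider MySQL or PostgreSQL over the App Engine datastore")
    if not notes:
        notes.append("no dataset-size concerns from this guide")
    return notes


for expected in (2_000, 50_000, 250_000):
    print(expected, dataset_size_advice(expected))
```

Running it with your own expected submission count gives a rough indication of which of the concerns above apply to your study.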

Note

Individual text database fields are capped at a length of 255 by default for performance reasons. If you intend to collect text data longer than 255 characters (including using types geotrace, geoshape or select multiple), your forms should specify database field lengths greater than 255.
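One way to find out which fields need a longer length is to measure the answers in a few pilot submissions before the study starts. The sketch below assumes the pilot data is already available as Python dictionaries mapping field names to answers. The helper name `fields_needing_longer_length` and the sample values are hypothetical, and lengths are counted in characters.

```python
DEFAULT_TEXT_LIMIT = 255  # default cap on text fields described in the note above


def fields_needing_longer_length(sample_submissions, limit=DEFAULT_TEXT_LIMIT):
    """Map each over-long text field to the longest value seen for it."""
    longest = {}
    for submission in sample_submissions:
        for field, value in submission.items():
            # Only text answers count toward the cap; skip numbers and other types.
            if isinstance(value, str):
                longest[field] = max(longest.get(field, 0), len(value))
    return {field: size for field, size in longest.items() if size > limit}


# Made-up pilot data: a long free-text note and a long trace-like string.
samples = [
    {"name": "Site A", "notes": "x" * 300},
    {"name": "Site B", "route": "0.0 0.0 0 0;" * 40},
]
print(fields_needing_longer_length(samples))
```

Any field reported here is a candidate for a database field length greater than 255 in the form definition.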

Data locality and security

Google App Engine servers may be located anywhere in the world.

Depending on the sensitivity of the data and the specific storage rules, regulations, or restrictions of your country or organization, the server infrastructure may not have all necessary locality guarantees or security precautions.

In some circumstances, you might be able to use encrypted forms to achieve compliance. You should research and comply with applicable laws and regulations before storing data on Google App Engine.

Billing

For identity verification purposes, Google requires a credit card or banking details to use the Google Cloud Platform that Google App Engine runs on. Accounts that meet this requirement receive a recurring $200 monthly credit per billing account.

Independent of Cloud Platform credits, App Engine allows a certain amount of free activity. These free quotas reset every 24 hours and are high enough to enable free use of ODK Aggregate during evaluation and small pilot studies.

You may be able to run a full deployment within these activity thresholds provided you:

  • collect fewer than 2000 responses
  • access the site a limited number of times a day
  • can be flexible about when you upload and access data

Deployments with more activity that do not wish to wait 24 hours for quotas to reset can enable billing on their App Engine project.

Once billing is enabled, ODK Aggregate will start using the monthly credit that comes from the Cloud Platform. Once that credit is used up, the credit card or bank account on file will be charged. Billing account owners can set spending limits to manage application costs.

Most ODK deployments will not surpass the $200/month credit, and non-profits using more than that can apply for more credits through Google for Nonprofits.

Open source

The ODK software is free, open source, and available for use without charge.

It is important to recognize that the open source software model does place additional responsibilities on the users of that software.

Unless you pay for assistance when technical support is needed, you will be required to take the initiative to research and find answers, and to perform technical support tasks yourself.

Finally, unless you and others contribute back to Open Data Kit through involvement in the community and contributions to the project, this software will become irrelevant and obsolete.

Minimizing App Engine fees

On App Engine, the major driver of cost is Datastore Reads. These add up quickly:

  • Viewing a page of form submissions incurs at least one Read for each submission.
  • Each multiple-choice question in a form incurs an additional Read on every displayed submission.
  • An additional read is incurred for every 200 questions in your survey.
  • Each image incurs at least 10 reads.
  • The default view shows 100 submissions.
  • The form submissions display refreshes every six seconds.

For example, suppose your survey has 500 questions (q) plus a repeat group containing an additional 300 questions, a typical submission has 4 filled-in repeats (rpt), and 100 submissions (s) are shown on each page load (pl). Because Reads are charged per block of 200 questions, rounded up, displaying the Submissions tab costs a minimum of 1100 Reads (R) with each refresh.

\[100 \ s/pl \times \left( \left\lceil \frac{500 \ q/s}{200 \ q/R} \right\rceil + 4 \ rpt/s \times \left\lceil \frac{300 \ q/rpt}{200 \ q/R} \right\rceil \right) = 100 \times (3 + 4 \times 2) = 1100 \ R/pl\]

With the display refreshing every six seconds, the free quota would be exceeded within 5 minutes!

And this hypothetical survey did not contain any select-one or select-multiple questions, or any audio, video or image captures, all of which would require more Reads.
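The estimate above can be reproduced for your own forms with a few lines of Python. This is a minimal sketch of the arithmetic in this section, not code from Aggregate: the function names are made up, Reads are rounded up per block of 200 questions, and the daily quota of 50,000 Reads is an assumption chosen for illustration, so check your App Engine console for the actual figure.

```python
import math

QUESTIONS_PER_READ = 200  # one additional Read for every 200 questions
REFRESH_SECONDS = 6       # the submissions display refreshes every six seconds


def reads_per_page_load(questions, repeat_questions=0, repeats=0, submissions_shown=100):
    """Lower bound on Reads for one refresh of the Submissions tab."""
    per_submission = math.ceil(questions / QUESTIONS_PER_READ)
    # Each filled-in repeat is charged separately, again rounded up.
    per_submission += repeats * math.ceil(repeat_questions / QUESTIONS_PER_READ)
    return submissions_shown * per_submission


def minutes_until_quota_exhausted(reads_per_refresh, daily_read_quota):
    """Minutes of continuous viewing before the daily Read quota runs out."""
    refreshes = daily_read_quota / reads_per_refresh
    return refreshes * REFRESH_SECONDS / 60


ASSUMED_DAILY_READ_QUOTA = 50_000  # assumption for illustration only

reads = reads_per_page_load(500, repeat_questions=300, repeats=4)
print(reads)
print(minutes_until_quota_exhausted(reads, ASSUMED_DAILY_READ_QUOTA))
```

Select questions and media captures are left out here, so real costs will be higher than this lower bound.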

Therefore, to reduce datastore reads:

It is generally more efficient to use Briefcase to generate CSV files than to use Aggregate, as Briefcase will use the locally cached data to generate the CSV files.

With larger datasets, there are two modes of operation:

  • Aggregate retains the full dataset.

    In this mode, it is slightly more efficient to Pull data to your local computer and then immediately Push it back up. The Push only verifies that what you have locally matches the content on Aggregate, rather than re-uploading all of it, and it sets some internal tracking logic within Briefcase so that the next Pull is somewhat more efficient.

  • Aggregate retains only a portion of the dataset.

    In this mode, you periodically purge older data collection records and never Push data up to Aggregate, as that would restore the purged data.