The ultimate guide to syncing data from Shopify

Learn how to build a robust, reliable data syncing pipeline out of Shopify for meeting the demanding needs of e-commerce merchants using your app.

Harry Brundage
July 8, 2022

Shopify has a rich and powerful set of APIs for managing e-commerce operations. Despite having a big ecosystem of apps that just about every merchant uses, the Shopify API has some of the most remarkably low rate limits on the internet at 2 requests per second. Yep, 2. Even the smallest apps in the Shopify ecosystem must be painfully aware of this rate limit whenever they call the Shopify API. If an app is implemented in the straightforward way, where every page a user loads triggers an API request to Shopify, you can serve a maximum of 2 pages per second. A single user could easily exhaust that rate limit just by clicking through links quickly!

Instead, most developers sync data into their own systems, so it’s available for them to query at will, and only reach out to Shopify for writes. This is hard to get right, so we’ve put this guide together to document the best practices we’ve seen for managing this unfortunate aspect of the Shopify APIs.

That Syncing Feeling

Before building a whole data sync, there are a few techniques you can use to limit the number of API calls you make:

  • only make requests when truly necessary (duh)
  • batch up calls to read or write in bulk, e.g., update all variants of a product at once instead of issuing one call per variant
  • implement some kind of retry or queuing strategy to deal with the rate limit inevitably being hit

Retrying is the main solution developers use to make their apps resilient to the rate limits. We recommend every call to Shopify, read or write, be wrapped in code that watches for the 429 errors that Shopify returns when the limit is hit, and retries the call until it succeeds. But retries don’t really solve the problem of building interactive applications on Shopify: you don’t want to make your user wait 10 seconds for the page to load while you’re frantically retrying in the background.
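
As a rough illustration of that wrapper, here is a minimal Python sketch. It assumes a hypothetical `make_request` callable that returns a `(status, headers, body)` tuple, so it is not tied to any particular HTTP client. It waits for the number of seconds in the `Retry-After` header when the response includes one, and otherwise falls back to exponential backoff; the attempt count and delays are arbitrary choices for the example.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# A response is modelled as (status_code, headers, body) so the sketch does not
# depend on a specific HTTP client. This shape is an assumption of the example.
Response = Tuple[int, Mapping[str, str], str]


def call_with_retries(
    make_request: Callable[[], Response],
    max_attempts: int = 8,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call make_request, retrying for as long as the answer is HTTP 429."""
    for attempt in range(max_attempts):
        status, headers, body = make_request()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # out of attempts, do not sleep again
        # Prefer the server's Retry-After hint (seconds) when it is present;
        # otherwise back off exponentially with a little random jitter.
        # Note: a plain dict lookup is case-sensitive; real clients usually
        # expose case-insensitive headers.
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

A wrapper like this is a good fit for background work. For requests a user is waiting on, a much smaller `max_attempts` keeps the worst-case wait bounded.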

If you want your app to be snappy, you need to take control over the data and store it in a place where you’re able to access it frequently. Most folks implement this as a webhook-based data pipeline: they register webhooks with Shopify which notify the app’s backend any time data changes, and then the app stores that changed data in its own database, which is used to serve the app when needed.

Figure: Data flow when using webhooks for communicating with Shopify, compared to continuous polling

Getting this right is annoyingly complicated, but also, annoyingly necessary for a lot of apps. There are a few key elements of this data syncing strategy that make life hard for Shopify app developers.

First, webhook deliverability from Shopify is not perfect. Shopify has occasional bugs where webhooks aren’t sent, apps have bugs where they accidentally drop webhooks, and some resources don’t have any webhooks at all. These issues mean that a purely webhook-based data sync will “drift” from the source data stored in Shopify: one missed request here, one timeout there adds up over time to cause confusion and bugs where the app’s copy of the data is different from Shopify’s.

Second, webhooks only report on new data, not old, already existing data. If you want to show a product picker in an app, you don’t want to show only the products created or updated since your app was installed. You want to show all the products in the store, and keep that list of products up to date over time. This means that in addition to registering webhooks for capturing changes in the data, you need to do an initial retrieval of all the existing, historical data to populate your app’s copy.

And third, storing the data from Shopify in your own database is not altogether simple. Shopify data is varied and strange in places, with lots of parent-child relationships in the object structure, nested JSON fields, decorations like metafields, and myriad other specifics that make converting a JSON blob into rows in a table an annoying process. Choosing how to model Shopify’s data within your app can be a whole chunk of development you must complete before you even get started on the bit that makes your app special or unique.

Gadget’s Recommendations

  • Only make API calls when you need to
  • Try to cram as much work into each API call as you can, especially when working with parent and child objects at the same time
  • Ensure you deal with the inevitability of hitting the rate limit by retrying
  • Sync data to your own database with webhooks and an initial historical sync

Receiving Webhooks

Merchants using Shopify expect their applications to reflect up-to-date data. Merchants don’t care that syncing is hard. They just want their numbers to match and their customers to be happy. So, most applications build a real-time data sync with Shopify using webhooks (or one of Shopify’s real-time event streaming solutions).

Shopify has great documentation on setting up and securely receiving webhooks. Like Shopify, we think it’s important that developers ensure they follow the correct security practices around HMAC verification to ensure that it really is only Shopify sending them data.
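
For reference, the check itself is small. The sketch below assumes the usual Shopify scheme, where the `X-Shopify-Hmac-Sha256` header carries the base64-encoded HMAC-SHA256 of the raw request body, keyed with your app’s secret. The function name and its parameters are made up for this example, and how you get at the raw body depends on your web framework.

```python
import base64
import hashlib
import hmac


def is_valid_shopify_webhook(raw_body: bytes, hmac_header: str, shared_secret: str) -> bool:
    """Return True if hmac_header matches the HMAC-SHA256 of the raw request body."""
    digest = hmac.new(shared_secret.encode("utf-8"), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    # compare_digest runs in constant time, so the comparison does not leak
    # how many leading characters of the signature were correct.
    return hmac.compare_digest(expected, hmac_header.encode("utf-8"))
```

The important detail is to hash the exact bytes Shopify sent. Parsing the JSON and re-serializing it before hashing will usually change the bytes and make valid webhooks fail verification.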

Similar to most webhook systems, Shopify will retry delivering a failed webhook several times. Notably, if Shopify is unable to successfully deliver a webhook after a certain number of attempts, it will stop any further attempts for that store: enough failed deliveries cause the webhook to be disabled altogether. This is generally quite annoying for developers. If they have one tiny bug that causes a small percentage of webhook deliveries to fail, the whole thing can get turned off on them and lead to a major interruption to a merchant’s service. If possible with your tech stack, it’s wise to always accept Shopify’s webhook data, enqueue the processing of it, and return an HTTP 200 status code. This allows you to implement your own retry logic for webhook processing within your system using background jobs or similar, and not risk Shopify disabling your webhook.
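
The accept-then-enqueue pattern can be sketched in a few lines of Python. Everything here is illustrative: `webhook_jobs` is an in-memory stand-in for whatever durable background job system you use, and `receive_webhook` and `work_one` are hypothetical names. Signature verification would happen before the enqueue and is left out to keep the sketch short.

```python
import json
import queue
from typing import Callable

# In-memory stand-in for a durable job queue. Assumption: a real app would use
# a persistent background job system so queued work survives restarts.
webhook_jobs: "queue.Queue[dict]" = queue.Queue()


def receive_webhook(topic: str, raw_body: bytes) -> int:
    """Store the delivery for later processing and acknowledge it right away."""
    webhook_jobs.put({"topic": topic, "body": raw_body.decode("utf-8"), "attempts": 0})
    return 200  # acknowledge before doing any real work


def work_one(process: Callable[[str, dict], None], max_attempts: int = 5) -> None:
    """Process one queued delivery; on failure, re-enqueue it for our own retry."""
    job = webhook_jobs.get()
    try:
        process(job["topic"], json.loads(job["body"]))
    except Exception:
        job["attempts"] += 1
        if job["attempts"] < max_attempts:
            webhook_jobs.put(job)
        # Otherwise give up. A real system would record the job somewhere
        # (a dead-letter table, an alert) instead of silently dropping it.
```

Because Shopify only ever sees the fast 200 response, a bug in `process` costs you a retry inside your own system rather than a failed delivery that counts against the webhook.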

It’s also wise to pay attention to the scalability of your webhook receiving system. The Shopify App Store is a developer’s best friend because it shows off your amazing app to all merchants, but also a developer’s worst enemy because it shows off your amazing app to giant merchants who will push it hard. Shopify apps are expected to scale to whatever size merchant happens to install them that day, regardless of whether that merchant is a teensy shop with one order a day or a Shopify Plus merchant doing giant hourly flash sales for all of BFCM.

Webhook delivery counts across Gadget apps: they come in waves

To receive webhooks scalably, we recommend two key strategies. First, serverless architectures match the burstiness of webhooks really well. You don’t need to pay for server capacity if nothing much is happening with your merchants, and you can rely on the cloud platforms to scale up quickly to meet whatever torrent of webhooks might come your way.

Second, when processing many webhooks sent all at the same time, it’s easy to create accidental race conditions in your database logic. Shopify can often send webhooks for the same resource in quick succession, and due to queue delays, network delays, or Shopify’s platform health on any given day, they may not be processed in exactly the correct order within your system. Instead of assuming you’ll always get a products/create webhook before a products/update webhook and writing corresponding INSERT and UPDATE SQL statements, Gadget recommends using upserts when adding webhook data to your database. This makes your processing code much more resilient to whatever webhook order gets thrown your way.
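
Here is a minimal sketch of such an upsert, using SQLite from Python’s standard library so it runs anywhere. The `products` table and its three columns are invented for the example. The `WHERE` guard on `updated_at` is an extra on top of the plain upsert: it skips payloads older than the stored row, and it assumes timestamps have been normalized to one format and timezone so that string comparison matches time order.

```python
import sqlite3

# Assumption: SQLite 3.24 or newer, which is when ON CONFLICT ... DO UPDATE arrived.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, updated_at TEXT)"
)

UPSERT_SQL = """
INSERT INTO products (id, title, updated_at)
VALUES (:id, :title, :updated_at)
ON CONFLICT(id) DO UPDATE SET
    title = excluded.title,
    updated_at = excluded.updated_at
WHERE excluded.updated_at >= products.updated_at
"""


def upsert_product(payload: dict) -> None:
    """Insert the product, or update it if the payload is at least as new as the stored row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT_SQL, {
            "id": payload["id"],
            "title": payload["title"],
            "updated_at": payload["updated_at"],
        })
```

With this in place, the create and update handlers can share one code path, and an update that arrives before its create simply inserts the row.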

Gadget’s recommendations

  • Try not to rely on Shopify’s webhook retry logic, because they’ll turn the webhook off if it fails too much
  • Ensure webhook registration is monitored and that you can re-register webhooks if/when Shopify disables them
  • Ensure you validate Shopify’s webhook HMAC headers for security
  • Prepare for flash-sale scale with serverless
  • Use database upserts to add data to your database

Students of History

Gathering and storing the existing data for a Shopify shop that your app needs is an annoyingly different process from webhook receiving. Shopify won’t send you webhooks for all the stuff already in a merchant’s shop, so you need to make read requests to paginate through each resource and store the records in your database.

Each resource within Shopify has idiosyncrasies for access: some things are available in the REST API only, others in the GraphQL API, and yet more are available in both but have different fields in each. Webhooks deliver REST-style payloads for resources, but the GraphQL APIs often have a different name or shape for the same data, which can be confusing.

The general pattern is this:

  • Identify which API to use to fetch the data you need, with REST being the easiest to use, but GraphQL tending to get the latest and greatest features
  • Add background code to paginate through a shop’s data using that API, making sure to use Shopify’s cursor-based pagination, and implementing retries for rate-limited requests
  • For each record of retrieved data, upsert a record/group of records to the database if it’s relevant to your application
  • Run this sync when a shop is first installed, and optionally every so often later to correct drift
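
The pagination step of this pattern might look like the following Python sketch for the REST API, where the URL of the next page arrives in the response’s `Link` header with `rel="next"`. The `fetch_page` and `upsert` callables are hypothetical placeholders for your own HTTP and database code, and retries for rate-limited requests are assumed to live inside `fetch_page`.

```python
import re
from typing import Callable, List, Mapping, Optional, Tuple

# fetch_page(url) -> (records, response_headers). Hypothetical callable that
# performs the HTTP request, including any rate-limit retries.
FetchPage = Callable[[str], Tuple[List[dict], Mapping[str, str]]]

_NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(headers: Mapping[str, str]) -> Optional[str]:
    """Extract the rel="next" URL from a Link header, or None on the last page."""
    match = _NEXT_LINK.search(headers.get("Link", ""))
    return match.group(1) if match else None


def sync_resource(first_url: str, fetch_page: FetchPage, upsert: Callable[[dict], None]) -> int:
    """Walk every page starting at first_url, upsert each record, return the count."""
    url: Optional[str] = first_url
    count = 0
    while url is not None:
        records, headers = fetch_page(url)
        for record in records:
            upsert(record)
            count += 1
        url = next_page_url(headers)
    return count
```

Following the server-provided URL, rather than building page numbers yourself, is what makes this cursor-based: the cursor is opaque and you just pass it back.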

You also may need to make sub-requests for related resources of the data you want to access. For example, Shopify’s /admin/orders.json endpoint includes most of the order data, but not the fraud risk information for an order. If you need this data, you need to make an extra request per order to fetch the OrderRisk object from /admin/orders/1/risks.json. GraphQL query cost limits can also force this approach.

Devilish Details

Once you’ve got the basic sync logic in place, it’s time to make it work in the real world. Developers frequently learn that their initial sync implementation breaks when they start syncing real merchants, and it’s because they need to deal with all manner and shapes of merchant data.

Especially for old shops that have been in business for a long time, there are new and strange edge cases in the data that break seemingly working code. This can be really annoying in the context of a big sync though: one failing record probably shouldn’t break the whole thing. Breaking the entire application experience for a merchant because you’ve encountered a new case in the data sucks for the merchant. For this reason, we suggest developers add careful error handling when syncing that logs failures instead of throwing and interrupting the whole process.
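
One simple way to get that behaviour is to catch errors per record and collect them, as in this Python sketch. The `upsert_all` helper and the shape of its return value are made up for illustration, and `upsert` stands for whatever function writes a single record to your database.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("shopify_sync")


def upsert_all(
    records: Iterable[dict], upsert: Callable[[dict], None]
) -> Tuple[int, List[Tuple[object, str]]]:
    """Upsert every record, logging failures instead of aborting the whole sync."""
    succeeded = 0
    failures: List[Tuple[object, str]] = []
    for record in records:
        try:
            upsert(record)
            succeeded += 1
        except Exception as exc:
            # Log with traceback and keep going; one odd record should not
            # stop the remaining records from syncing.
            logger.exception("failed to sync record %s", record.get("id"))
            failures.append((record.get("id"), repr(exc)))
    return succeeded, failures
```

Returning the failures alongside the success count lets the caller decide what to do next, for example flagging the sync as partially complete or alerting when the failure rate looks suspicious.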

A second frequent issue with syncs is that they can take a really long time. When doing a historical sync of a big shop, you often just hit the rate limit no matter what, and no amount of tuning can escape the hard reality of 2 requests per second. For shops with a huge number of orders, customers, or products, reading everything at that rate can end up taking many hours! We’ve seen syncs that take upwards of 24 hours for giant shops after running almost full tilt at 2 requests/second the whole time.
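
To get a feel for the numbers, here is a back-of-the-envelope estimate in Python. The defaults are assumptions for the example: pages of 250 records and a steady 2 requests per second with no other traffic competing for the limit. The shop size below is invented.

```python
import math


def estimated_sync_hours(
    record_count: int,
    page_size: int = 250,
    subrequests_per_record: int = 0,
    requests_per_second: float = 2.0,
) -> float:
    """Estimate how long a sync takes when the rate limit is the bottleneck."""
    page_requests = math.ceil(record_count / page_size)
    total_requests = page_requests + record_count * subrequests_per_record
    return total_requests / requests_per_second / 3600


# 200,000 orders, list pages only: 800 requests, under 7 minutes.
print(round(estimated_sync_hours(200_000) * 60, 1), "minutes")
# Same shop with one extra request per order: 200,800 requests.
print(round(estimated_sync_hours(200_000, subrequests_per_record=1), 1), "hours")
```

The takeaway is that per-record sub-requests dominate everything else: paginating the list is cheap, but one extra call per order turns minutes into more than a day.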

Running super long background jobs is a major thorn in our sides as developers because these units are exposed to a much higher chance of transient failures. Momentary network blips, 500 errors from Shopify, query timeouts, and even deploys can interrupt and fail background jobs that run for this long. Making resilient background jobs is a deep topic, but Gadget has a few recommendations. First, if there’s any way to chunk the big job into many smaller, individually retriable jobs, that’s generally much more friendly to modern-day web architectures. Shopify implemented this solution for their own internal needs in Ruby. Other background job systems often support similar “workflow” or “dependency” setups. You can chunk up a sync job by resource, or enqueue one job per page of data in order to get finer-grained, retriable units. For super-advanced use cases, workflow systems like Temporal are a great fit for making these long-running jobs bulletproof.
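
As a sketch of the one-job-per-page idea, the Python below processes a single page per job and enqueues a follow-up job carrying the next cursor. The `jobs` deque stands in for a real background job queue, and `fetch_page` is a hypothetical callable that returns a page of records plus the cursor for the next page.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

# fetch_page(resource, cursor) -> (records, next_cursor). Hypothetical callable
# wrapping the API request for one page of one resource.
FetchPage = Callable[[str, Optional[str]], Tuple[List[dict], Optional[str]]]

jobs: Deque[dict] = deque()  # stand-in for a real background job queue


def enqueue_sync(shop: str, resources: List[str]) -> None:
    """Start a sync by enqueuing the first page of every resource."""
    for resource in resources:
        jobs.append({"shop": shop, "resource": resource, "cursor": None})


def run_page_job(job: dict, fetch_page: FetchPage, upsert: Callable[[dict], None]) -> None:
    """Sync exactly one page, then enqueue a job for the following page, if any."""
    records, next_cursor = fetch_page(job["resource"], job["cursor"])
    for record in records:
        upsert(record)
    if next_cursor is not None:
        jobs.append({**job, "cursor": next_cursor})
```

Each job now runs for seconds instead of hours, so a deploy or a network blip costs one page of work. As long as the writes are upserts, retrying a page that half-finished is harmless.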

Gadget uses Temporal internally to run Shopify syncs as robustly as possible

Gadget’s recommendations

  • Sync historical data by paginating it using API read requests with careful retries
  • Aggregate and log failures in syncs instead of letting them interrupt the whole thing
  • Prepare for long-running syncs, and try to chunk up long-running jobs into smaller ones with the patterns in your language

Drifting Back to Shore

A robust webhook processing pipeline and a robust historical sync put your application in great shape for serving data reliably. But, there’s one more insidious issue to grapple with: data drift. Despite our best efforts as developers, we will always be imperfect: we’ll have bugs, we’ll have outages, and Shopify will too! For this reason, webhooks (or event streaming) alone aren’t sufficient to keep a copy of Shopify’s data fully up to date, as some will invariably get missed, dropped, or ignored. In e-commerce, this is often actually a big deal, because you’re dealing with people’s money, and losing even 1-2% of it is an unacceptable error.

The good news is that any historical sync code you have already written is the perfect weapon for correcting drift! If you’ve written sync code, you can repurpose it to sync a smaller time range of a shop’s data periodically, detecting any differences between Shopify and your stored copy, and applying any missed updates. For maximum data quality, Gadget recommends running a smaller sync each night for each shop that compares the last 48 hours of data within your app and Shopify. This ensures any changes are correctly discovered.
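
A nightly reconciliation could be sketched in Python like this. The two helpers are illustrative: `window_start` computes the lower bound you would pass to the API as a filter such as `updated_at_min`, and `find_drift` compares what Shopify returned for that window with your local copy. The comparison assumes both sides store `updated_at` in the same normalized format, so that string order matches time order.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List


def window_start(now: datetime, hours: int = 48) -> str:
    """ISO-8601 lower bound for the reconciliation window."""
    return (now - timedelta(hours=hours)).isoformat()


def find_drift(remote: Iterable[dict], local_by_id: Dict[int, dict]) -> List[dict]:
    """Return remote records that are missing locally or newer than the local copy."""
    stale = []
    for record in remote:
        local = local_by_id.get(record["id"])
        if local is None or local["updated_at"] < record["updated_at"]:
            stale.append(record)
    return stale


# Example: compute the window for a run happening right now.
since = window_start(datetime.now(timezone.utc))
```

Whatever `find_drift` returns can go straight through the same upsert path the webhooks use. Counting how many records it finds each night is also a cheap health metric for the webhook pipeline.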

The same idea can also be used when upgrading Shopify API versions, or when importing data that you previously hadn’t had inside your application. If you squint, you can see missing fields as just more data drift that needs correcting and let your sync code handle populating the missing data from Shopify.

Gadget’s recommendations

  • Sync a small window of recent data frequently to combat data drift and recover from bugs

Really though?

The above list of recommendations may seem like a LOT of work, and it is! A lot of complicated engineering is required to get sync right. And, frustratingly, it’s all in the name of just getting access to the data Shopify already has, instead of actually building net new functionality into your app that merchants will pay for. Until Shopify changes this restriction, this seems to be what we’re stuck with.

Like anything, you should take all this with a grain of salt. Not every app needs a super reliable data pipe to Shopify, and for some simple use cases, direct webhook processing can be sufficient. That said, we find that most merchants don’t see a distinction between new data going forward and data they’ve already generated, and merchants expect data in their apps to match the rest of their store, especially for anything business-critical. We encourage developers to think about using a hosted solution that implements all this sync functionality right out of the box (like Gadget’s Shopify Connection), giving you your time back to do the actual building part of building an app and making ecommerce better.

Godspeed out there to all you syncers, and if you’d like to ask questions or make your own recommendations, please join us on Discord!

Gadget’s recommendations

  • Apply as many or as few of these recommendations as you see fit for your unique situation
  • Use a hosted product like Gadget if you don’t want to build it yourself