Wednesday, March 4, 2020

GraphQL MN tonight featured a tracing-for-lag exposition.

Jason Coleman is pictured in the center of the photo here. He works at CaringBridge, and his talk explained what CaringBridge is: in short, a builder app for people going through tough medical experiences, where they may blog about what they are going through and share it with others. The network has a base of thirty million users. At Jason's left, towards the right of the photo, is Christopher Bartling, who is the head of state of this group and who recommended the book "Production Ready GraphQL" by Marc-André Giroux. Mark Soule is pictured at Jason's right, towards the left side of the photo. He works at a consultancy called the Nerdery (which hosted this talk), and they have had him deployed to CaringBridge for about a year now to work with/for Jason. He actually worked on a lag issue and showed off some deeper tooling to that end. In the photo, all three men are struggling with the projector not projecting before the talks got rolling.

Ping Identity and Amazon Cognito, like Auth0, are ways to come up with tokens for user identities. EKS is Amazon's Elastic Kubernetes Service. Shopify has an example of some public-facing source code that has GraphQL magic in it. Zipkin and Jaeger are ways to do "open tracing" and thus to audit what is going on in a GraphQL orchestration to see where a pain point lies. This is better than just logging stuff to Splunk and trying to make sense of the timestamps; you are setting yourself up for a lot of work with the latter approach. The OpenTracing way to go records timing step by step at the various steps in a chain of events to allow you to find the lag. There is an OpenTelemetry community around this stuff. JMeter is a Java-based load testing tool. autocannon is a rival that was recommended in this space.
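
To make the step-by-step tracing idea concrete, here is a minimal sketch of timing each step in a chain of events and then asking which step was slowest. It is a hand-rolled toy, not the OpenTelemetry, Zipkin, or Jaeger API, and the Tracer class, the handleQuery function, the step names, and the sleep durations are all hypothetical stand-ins rather than anything shown in the talk.

```typescript
// Hypothetical hand-rolled tracer for illustration only.
// This is NOT the OpenTelemetry, Zipkin, or Jaeger API.
interface Span {
  name: string;
  parent: string | null;
  startMs: number;
  durationMs: number;
}

class Tracer {
  readonly spans: Span[] = [];

  // Time one step and record it as a span, even if the step throws.
  async trace<T>(name: string, parent: string | null, step: () => Promise<T>): Promise<T> {
    const startMs = Date.now();
    try {
      return await step();
    } finally {
      this.spans.push({ name, parent, startMs, durationMs: Date.now() - startMs });
    }
  }

  // Return the slowest direct child of the given parent span, if any.
  slowest(parent: string): Span | undefined {
    return this.spans
      .filter((s) => s.parent === parent)
      .sort((a, b) => b.durationMs - a.durationMs)[0];
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical request chain: the steps and their delays are invented.
async function handleQuery(tracer: Tracer): Promise<string> {
  return tracer.trace("query", null, async () => {
    await tracer.trace("auth", "query", () => sleep(5));
    const rows = await tracer.trace("database", "query", async () => {
      await sleep(60); // pretend this is the slow hop
      return ["post-1", "post-2"];
    });
    return tracer.trace("serialize", "query", async () => JSON.stringify(rows));
  });
}
```

After awaiting handleQuery(tracer), calling tracer.slowest("query") points straight at the step that ate the time, which is the kind of answer that is hard to dig out of loose log lines and timestamps. A real setup would hand the spans to a collector instead of keeping them in an array.
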
In Google Chrome Developer Tools, when you go to the "Performance" tab, record with the circular button at the upper left (it says "Record" when you mouse over it), and then stop recording, you will see tabs for Summary, Bottom-Up, Call Tree, and Event Log. Bottom-Up shows which activities directly took the most time in aggregate, and Call Tree shows which root activities caused the most work. Apollo Federation is the new way to aggregate schemas, and stitching is the old way. There are new keywords and terms with Apollo Federation. graphql-middleware allows queries to other things that already existed to pass through GraphQL.

Mark's solution for improving performance was ultimately more caching and more scaling (horizontal scaling with more instances). When he put his finger on the problem, he found that it was not so much the CaringBridge code as the supporting actors, apollo-server-core and GraphQL itself. prettyJSONstringify is pretty heavy, and there is no optional flag for turning it off at the time of this writing. There is just a lack of maturity in this space, which lends itself to some pain. Clinic.js is a neat diagnostics tool for this stuff. It has three modes: Doctor, Bubbleprof, and Flame. Bubbleprof draws bubbles to represent where time lag occurs in asynchronous operations, and Doctor and Flame offer more traditional chart layouts that are easier to understand. Flame reddens to accentuate the "burning" problems in its charting. It shows what is choking the event loop in Node and what could be moved to its own separate process, like prettyJSONstringify.
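
For a feel of what "choking the event loop" means, here is a small sketch that schedules a 0 ms timer and then serializes a large object synchronously. The plain JSON.stringify call is only a stand-in for a heavy serialization step; it is an assumption, not the actual apollo-server-core code, and buildPayload, measureEventLoopLag, and the payload shape are hypothetical names made up for this example.

```typescript
// Hypothetical payload builder; the record shape is invented for illustration.
function buildPayload(count: number): { id: number; title: string; body: string }[] {
  return Array.from({ length: count }, (_, i) => ({
    id: i,
    title: `Post ${i}`,
    body: "x".repeat(200),
  }));
}

interface LagReport {
  serializeMs: number; // time spent inside JSON.stringify
  timerLagMs: number;  // how late the 0 ms timer fired
  chars: number;       // length of the serialized string
}

// Schedule a 0 ms timer, then serialize synchronously. The timer callback
// cannot run until the synchronous work returns, so its lateness
// approximates how long the event loop was blocked.
function measureEventLoopLag(payload: unknown): Promise<LagReport> {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    let serializeMs = 0;
    let chars = 0;

    setTimeout(() => {
      resolve({ serializeMs, timerLagMs: Date.now() - scheduledAt, chars });
    }, 0);

    // Assumption: a big JSON.stringify stands in for the heavy step.
    const startedAt = Date.now();
    const json = JSON.stringify(payload);
    serializeMs = Date.now() - startedAt;
    chars = json.length;
  });
}
```

Awaiting measureEventLoopLag(buildPayload(20000)) returns a report in which timerLagMs is never smaller than serializeMs: every other request on that Node process waits at least as long as the serialization takes. That is the argument for moving such work to a separate process, or for scaling out to more instances.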
