Building analytic NLP workflows with CI/CD, containers, & python on cloud.gov

I began this journey on familiar footing: providing reliable methods that enable data scientists to deliver insights on institutional data. The core fundamentals are not much different from delivering any other software product. There is a defined sandbox of system constraints that shapes the design, build, and operation of the overall system. We start with the fundamentals and build out in layers: get code and decisions documented in version control, and make sure everyone is operating with the same primitives to iterate and deliver value. The prompt to deliver against is relatively simple: explore how natural language processing can be used to evaluate data and identify duplicate values. Our data scientist whipped up a solution in a day or two, so we’re done, right? Not really. Underneath the initial experiment is the question of how to build a clear process or framework for the team to continuously deliver and compound our insights capabilities on the existing product.

Most of this work is net new, so there is some flexibility in how to pursue a solution. When I’m working with greenfield technology products, I generally choose tooling that can be replicated locally and in the cloud. The paradigm I’m currently fond of: pick a comfortable software language, bundle it up into a container image, apply version tags to that image, orchestrate getting that image to some sort of compute environment, and you’re good to go. From there, developers can use the container image to build locally and know exactly what version is running in their various environments. Below, I explore the thought process and constraints in greater detail.

Check out the demo source code

The Constraints

Just to reiterate the problem to work through: build a greenfield mechanism to use NLP for processing entries with Python scripts, on cloud.gov infrastructure, with CircleCI as the CI/CD tooling.

Infrastructure

Cloud.gov

Let’s look into cloud.gov: why would teams build on this platform vs. AWS GovCloud? It has two major benefits from the start. First, they handle all the infrastructure management; second, they’re FedRAMP authorized. Teams inherently need fewer resources to build, deliver, and manage the application lifecycle.

Compliant from the start

Cloud.gov offers a fast way for federal agencies to host and update websites, APIs, and other applications. Employees and contractors can focus on developing mission-critical applications, leaving server infrastructure management to us.

FedRAMP authorized

Cloud.gov has a FedRAMP Joint Authorization Board (JAB) authorization, which means it complies with federal security requirements. When you build a system on cloud.gov, you leverage this compliance and reduce the amount of work you need to do.

Cloud.gov operates as a Platform as a Service (PaaS) abstraction on top of AWS. It enables developers to deliver applications efficiently without too much finagling with infrastructure. The good news is you don’t have to worry about most of the things SREs or Ops people build careers on worrying about, such as VPCs, subnets, ingress, egress, route tables, firewalls, DNS, EC2, ECS, EKS, access controls, or whatnot. Cloud.gov manages all of this by building their platform on Cloud Foundry.

Cloud Foundry

Cloud Foundry has a CLI tool that enables developers to provision resources within an organization. There is also a Terraform provider that extends this functionality to declarative Infrastructure as Code (IaC). A developer can easily add these to their CI/CD pipeline to automate the build, deploy, and maintenance processes of an application lifecycle hosted on cloud.gov. With cloud.gov operating as a proxy between the developer and AWS, there are some quirks to navigate.
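As a rough sketch, day-to-day interaction with the platform through the cf CLI looks something like this (the org, space, and app names are hypothetical):

```sh
# authenticate against the cloud.gov API (uses single sign-on)
cf login -a api.fr.cloud.gov --sso

# target an organization and space, then deploy
cf target -o my-org -s dev
cf push my-nlp-app
```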

Limitations

One limitation of operating within a PaaS is that you kind of get what you get. For example, I built out a demo app using the SAP BTP trial platform, which also happens to operate on the fundamentals provided by Cloud Foundry. SAP provides a free 90-day trial account, which was a great way to test some ideas. (By the way, incredible! No limits except for quotas; just build and use.)

There is usually a vendor-provided library/marketplace of available service brokers. For the SAP trial account, there were 46 available, the vast majority tailored as integrations into the SAP platform. For my needs, I used a more general-purpose broker for Postgres. With a couple of button clicks, or a string of commands to the CLI, a fully managed Postgres database is deployed. An application can then “bind” to a service to enable direct connectivity.
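On cloud.gov, provisioning and binding a Postgres broker looks roughly like this; the offering and plan names (aws-rds, micro-psql) are how I understand cloud.gov’s marketplace to be laid out, so verify against cf marketplace first:

```sh
# list the brokers and plans available in this org
cf marketplace

# provision a managed Postgres instance and bind it to the app
cf create-service aws-rds micro-psql my-postgres
cf bind-service my-nlp-app my-postgres
cf restage my-nlp-app
```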

With a deployed application, logs are proxied through the service platform. This can be plagued with vague errors or terminology, but for the most part a developer can fetch the necessary logs for the deployed application; for service brokers, not so much. Building any application with complex dependencies on service brokers could put you in a tricky spot trying to debug problems not directly caused by the deployed application.
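Fetching application logs through the platform is a single command:

```sh
# dump recent logs for the deployed app
cf logs my-nlp-app --recent
```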

Cloud Foundry runs on the linux/amd64 x86 architecture. Building across platforms can cause some headaches. I run an M1 Mac that uses darwin/arm64, which has produced numerous compatibility conflicts when building things. For the past year or so I’ve run into hiccups with architecture mismatches on some dependency or package, causing an unnecessary detour. The main thing that gets jammed up is that Cloud Foundry (the underlying open source project behind cloud.gov) uses buildpacks in order to deploy applications.
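When building locally on Apple silicon, explicitly pinning the target platform sidesteps most of these mismatches, at the cost of slower, emulated builds:

```sh
# build an amd64 image on an arm64 host via emulation
docker build --platform linux/amd64 -t my-nlp-app:local .
```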

I actually find buildpacks and builders fascinating; I believe they will be the way forward, eliminating any need to meticulously craft multi-stage container images. Meticulously crafting one is, of course, exactly how I first approached tackling this demo.

Open Container Initiative (OCI)

The Open Container Initiative (OCI) is a standard, established under the Linux Foundation, that ensures all container images follow the same protocol so they can be used everywhere. I’m a big fan of using containers to encapsulate code and run it in some sort of compute environment (pick your preference). After getting the NLP Python script from the resident data scientist, my first step was to bundle the code into a container; that’s usually where I start when jumping into a greenfield initiative like this one.

When it comes to building a container image, the most familiar pattern is to write a Dockerfile. There’s a ton of documentation out there, but in general: use multi-stage builds to shrink the final product that will run, and if you can eliminate any means for shell access, do that too. Specific to building a Python app, check out pythonspeed.com; the guy knows a thing or two about containers and Python, and he crushes it on guidance for building a custom Python container image. I’ve somewhat arrived at my own personal preference for building images, largely influenced by his guidance. It boils down to: use a base stage that updates and installs core dependencies, build the app, then copy all the goodies into a final image that is locked down from shell/root access.

It generally looks something like this:

```dockerfile
# stage 1: base with build tooling and an up-to-date pip
FROM python:3.11-slim AS build
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip wheel

# stage 2: install dependencies into a venv so you know what you actually need
FROM build AS package_installer
RUN python -m venv /venv
ENV PATH="/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install -r requirements.txt

# stage 3: copy the goodies into a minimal image with no shell or root access
# (the distroless Python version must line up with the build stage for the venv to work)
FROM gcr.io/distroless/python3 AS final
COPY --from=package_installer /venv /venv
COPY app.py .
ENTRYPOINT ["/venv/bin/python", "app.py"]
```

[Inspect your container with Dive][dive]

Buildpacks

This article provides a nice distinction between a Dockerfile and Buildpacks

TL;DR: buildpacks provide an opinionated abstraction for building a container from source code.

Instead of needing to meticulously handcraft the masterpiece that is your thousand-line Dockerfile, you can pick a container builder like [paketo][paketo] and leverage their Python buildpack to produce an image equivalent to what you were expecting from distroless. Paketo advertises: “Just bring your app and Paketo Buildpacks will detect what language your app is using, gather the required dependencies, and build it into an image.” There are always some edge cases where something falls through the cracks (for me, it was trying to build PyTorch with CPU on amd64 architecture).
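Trying this locally is a one-liner with the pack CLI; the builder below is one of Paketo’s published builders at the time of writing:

```sh
# build an image straight from source, no Dockerfile required
pack build my-nlp-app --builder paketobuildpacks/builder-jammy-base
```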

Cloud.gov also maintains and provides buildpacks. I can’t really speak to the efficiency of the provided buildpacks, but any managed service or tooling we can leverage reduces the complexity and overhead needed for long-term support or maintenance.

CI/CD

CircleCI
  • connecting the repo
  • env secrets
  • git bot account
  • SBOM
  • artifacts
  • publish to Cloud Foundry
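Stitching those bullets together, a pared-down CircleCI config might look like the sketch below. The org, space, and app names are hypothetical, the cf CLI install step follows Cloud Foundry’s packaged-release URL, and credentials live in CircleCI project environment variables, never in the repo:

```yaml
version: 2.1
jobs:
  deploy:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run:
          name: Install the cf CLI
          command: |
            curl -sL "https://packages.cloudfoundry.org/stable?release=linux64-binary&version=v8&source=github" | tar -xz
            sudo mv cf8 /usr/local/bin/cf
      - run:
          name: Deploy to cloud.gov
          command: |
            # CF_USERNAME and CF_PASSWORD are CircleCI env secrets
            cf api https://api.fr.cloud.gov
            cf auth "$CF_USERNAME" "$CF_PASSWORD"
            cf target -o my-org -s dev
            cf push my-nlp-app
workflows:
  build-and-deploy:
    jobs:
      - deploy
```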

First contact

My initial thought process was: crank out a container image, use the Cloud Foundry run-task command (which supports Docker images), and Bob’s your uncle, right? Clap our hands and we’re done. Of course, there would also need to be static code analysis, container image scanning, and potentially container runtime analysis to help alleviate most of the concerns presented by auditors. This could be paired with a Software Bill of Materials (SBOM) generated on demand or with releases. Containers and code can easily be version controlled by applying tags in your favorite container registry. This didn’t necessarily need to be a robust service (the current process is handled entirely on a laptop), so I figured that the CI/CD or the native CRUD app could run the job on demand or on some cron. But of course there’s that saying: no battle plan survives first contact…
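The run-task piece itself is a one-liner (the task name and script here are hypothetical):

```sh
# kick off the NLP job as a one-off task against the deployed app
cf run-task my-nlp-app --command "python detect_duplicates.py" --name nlp-scan
```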

After building the NLP container image, I mocked up an app to generate and populate fake data into a Postgres database. It randomly generated a string, inserted it into a table with a unique ID, and associated a unique user with that entry. Doing this allowed me to evaluate and see the NLP in action, identifying any duplicate records that were created. Sweet! This design also more or less worked on the first or second try. Next, I tried to tie that into CircleCI and deploy to Cloud Foundry. I immediately encountered disk quota issues. The container image being built by CircleCI was ~10 GB. At a loss as to why, I dove in to see what the discrepancy was. Of course, CPU architecture was to blame: arm64 vs. amd64. My local container image wound up somewhere between ~2 GB and 700 MB depending on how aggressive I got about cleaning up resources. Inspecting the image layers with [dive][dive], I was able to point fingers at the biggest culprit for the extra resources: the Nvidia CUDA libraries for GPU processing that are installed with amd64 versions of PyTorch. I now needed to update the custom image to source the CPU version of PyTorch. Not an inherently difficult obstacle to overcome: patch my Dockerfile and carry on. But I continued to encounter various issues when deploying the Docker image, and determined I needed to understand in more depth what cloud.gov does with that Docker image. It turns out it rebuilds the image with buildpacks.
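For reference, PyTorch publishes CPU-only wheels on a separate package index, so the Dockerfile patch amounts to changing the install step:

```sh
# install the CPU-only build of PyTorch, skipping the CUDA layers
pip install torch --index-url https://download.pytorch.org/whl/cpu
```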

Enter buildpacks, and what has become a proverbial mix of frustration and fascination with them.

When you deploy a Docker image (OCI compliant) to cloud.gov, it does not in fact deploy a Docker image.

Runtime differences

Pushing an application using a Docker image creates the same type of container in the same runtime as using a buildpack does. When you supply a Docker image for your application, Cloud Foundry:

  1. fetches the Docker image
  2. uses the image layers to construct a base filesystem
  3. uses the image metadata to determine the command to run, environment vars, user id, and port to expose (if any)
  4. creates an app specification based on the steps above
  5. passes the app specification on to diego (the multi-host container management system) to be run as a linux container.

No Docker components are involved in this process - your applications are run under the garden-runc runtime (versus containerd in Docker). Both garden-runc and containerd are layers built on top of the Open Container Initiative’s runc package. They have significant overlap in the types of problems they solve and in many of the ways they try to solve them. For example, both garden-runc and containerd:

  • use cgroups to limit resource usage
  • use process namespaces to isolate processes
  • combine image layers into a single root filesystem
  • use user namespaces to prevent users with escalated privileges in containers from gaining escalated privileges on hosts (this is an available option on containerd and is a default on garden-runc)

Additionally, since containers are running in Cloud Foundry, most or all of the other components of the Docker ecosystem are replaced with Cloud Foundry components, such as service discovery, process monitoring, virtual networking, routing, volumes, etc. This means most Docker-specific guidance, checklists, etc., will not be directly applicable for applications within Cloud Foundry, regardless of whether they’re pushed as Docker images or buildpack applications.

Using the buildpack

The last statement in the documentation is what really caught me off guard. This creates the common “works on my machine” scenario, the drift between devices that many teams fall into. I felt that this would break any idempotency of delivering the work bundled in a container image, and there was additional overhead of monitoring and validating container images that is not present for the existing app using buildpacks. I began thinking it might be easier to deliver a web app with the cloud.gov buildpack, as that’s what it’s designed to provide. Not having worked with buildpacks, I looked into what was needed. Turns out, not much. You really just need the code, a requirements.txt file, and an entrypoint to the application (a Procfile). I manually deployed this version of the application with the cloud.gov CLI and ran into the same issues as before.
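For a Python app, the whole buildpack contract really is that small. The Procfile is a single line telling the platform how to start the process (the module and app names here are placeholders):

```text
web: gunicorn app:app
```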

This is where I went down a couple of rabbit holes until settling on a solution. The first was evaluating whether I could correctly install PyTorch with Poetry (that’s a no) vs. using the more common package manager, pip. It turns out the way PyTorch manages their packages is just at odds with both: Poetry struggles with the nested explicit versions that exclude the GPU packages, and pip does the same, with mismatched dependencies between what gets installed and what needs to be used.

PyTorch vs spaCy

This led me to look for alternatives to the PyTorch sentence-transformer. I came across spaCy, which offers lightweight language models that can replace the functionality of what was originally used in the NLP model. Once that was swapped in, I was able to successfully deploy the model at well under the 2 GB of the container image I had been building.
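As a minimal sketch of the swapped-in approach, spaCy’s vector similarity can flag near-duplicate strings; the model, threshold, and sample entries below are my own illustration, not the team’s actual values:

```python
# duplicate detection via spaCy vector similarity
import spacy

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

def find_duplicates(entries: list[str], threshold: float = 0.95) -> list[tuple[int, int]]:
    """Return index pairs whose similarity exceeds the threshold."""
    docs = list(nlp.pipe(entries))
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if docs[i].similarity(docs[j]) > threshold:
                pairs.append((i, j))
    return pairs

print(find_duplicates(["Reset my password", "Please reset my password", "Update billing address"]))
```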

Building a Flask demo

I opted to mock out a demo for this analysis so the team could evaluate usage. I began with a lightweight Flask app that was easy to get started with. I connected to the database and generated data. From the mock data I was able to query a column and output all the duplicates. I slapped on a super simple UI with some templates, buttons, and frontend JavaScript, and of course set the background to NASA’s picture of the day. Run it locally, and now there’s a working demo of the data scientist’s NLP model assessing duplicates. How can this be useful beyond a demo? Make it an API.

The demo, in note form:

  • flask app: lightweight, simple to run locally
  • db connection: create a reusable component to manage the DB connection
  • sample data: control the outcomes
  • simple UI: templates, buttons, some frontend JavaScript, and NASA’s picture of the day
  • working demo: gunicorn delivers
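A minimal sketch of the demo’s shape, with a hypothetical entries table and a naive exact-match query standing in for the real NLP logic:

```python
# app.py: bare-bones Flask demo surfacing duplicate entries
from flask import Flask, render_template
import psycopg2

app = Flask(__name__)

def get_connection():
    # reusable DB connection component; the DSN is a local placeholder
    return psycopg2.connect("postgresql://user:pass@localhost:5432/demo")

@app.route("/duplicates")
def duplicates():
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT body, COUNT(*) FROM entries GROUP BY body HAVING COUNT(*) > 1"
        )
        rows = cur.fetchall()
    # assumes templates/duplicates.html exists, NASA backdrop and all
    return render_template("duplicates.html", rows=rows)

if __name__ == "__main__":
    app.run(debug=True)
```

In the deployed environment, gunicorn app:app serves the same module.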

Refactoring the Flask demo to FastAPI

Drop the Flask app and insert FastAPI. This required changing all the responses to JSON, but you get the added benefit of a self-documenting API page (the same as what you see with Swagger/OpenAPI).
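The duplicates route translated to FastAPI looks roughly like this, with the interactive docs coming for free at /docs:

```python
# main.py: FastAPI version of the duplicates endpoint, returning JSON
from fastapi import FastAPI

app = FastAPI(title="NLP duplicates demo")

@app.get("/duplicates")
def duplicates() -> list[dict]:
    # the real query runs against the bound Postgres service;
    # hard-coded here to keep the sketch self-contained
    return [{"body": "reset my password", "count": 2}]
```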

add auth mechanism

the hell that is a frontend auth token broker

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token") vs. HTTPBasic(); encryption; csrf_token; frontend login vs. API bearer token
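The API-token half of that is FastAPI’s stock OAuth2 password-bearer pattern; token validation is left out of this sketch:

```python
# a route protected by a bearer token that FastAPI extracts for us
from fastapi import Depends, FastAPI
from fastapi.security import OAuth2PasswordBearer

app = FastAPI()
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

@app.get("/me")
async def read_me(token: str = Depends(oauth2_scheme)):
    # look up / verify the token here before trusting it
    return {"token": token}
```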

tag-based routes (or at least that’s how I think it works)

separation of auth routes from common routes; segmentation just sounded right in my brain

structure things as modules
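APIRouter is what makes the tags and the module split work; a sketch of how I understand the segmentation, with module names of my own choosing:

```python
# routes/auth.py: auth endpoints live in their own module and tag
from fastapi import APIRouter

router = APIRouter(prefix="/auth", tags=["auth"])

@router.post("/token")
def issue_token():
    return {"access_token": "stub", "token_type": "bearer"}

# main.py then wires the module into the app:
#   from routes import auth
#   app.include_router(auth.router)
```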

should have started with tests, but now we’ve got ’em
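FastAPI ships a TestClient that exercises the app in-process, no running server required (this test assumes the main.py sketch above):

```python
# test_app.py: hit the API in-process
from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_duplicates_returns_json():
    response = client.get("/duplicates")
    assert response.status_code == 200
    assert isinstance(response.json(), list)
```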

middleware

HTTPSRedirect, csrf_token
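The HTTPS redirect is a single add_middleware call; CSRF protection is not built into FastAPI, so that piece depends on whichever library you bring:

```python
# bounce any plain-HTTP request over to HTTPS
from fastapi import FastAPI
from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware

app = FastAPI()
app.add_middleware(HTTPSRedirectMiddleware)
```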

Database things

Embracing first principles of agile: work doesn’t need to be perfect, but get it out. Deliver the MVP first, before having conversations about performance, durability, and whatnot. Adding any greenfield service to an existing product needs to be eased into. There’s no need to rush into optimizing for a caching queue until after first contact with the users.

Going to Prod

Taking an application to production requires a lot of careful planning and testing. The application needs to be stable, secure, and able to handle the expected load. Here are top-level considerations you should at least run through and think about:

  • Code Quality
  • Performance
  • Security
  • Scalability
  • Monitoring & Logging
  • Disaster Recovery & Backup Strategy

Remember that deploying to production is not the end of the development process but rather a new phase where you’ll need to monitor the system closely, gather user feedback, fix bugs, and continuously improve based on user needs and business goals.

[cloud.gov buildpacks]: https://cloud.gov/docs/getting-started/concepts/#buildpacks
[cloudfoundry-community-buildpacks]: https://github.com/cloudfoundry-community/cf-docs-contrib/wiki/Buildpacks
[dive]: https://github.com/wagoodman/dive
[paketo]: https://paketo.io/

#blog #nlp #cloud_gov