Documentation: https://troubleshoot.sh
Project repo: https://github.com/replicatedhq/troubleshoot
Roadmap: https://github.com/orgs/replicatedhq/projects/4/views/1
Subscribe to the Replicated Community Calendar for the latest updates on community meetings for Replicated Open Source projects
Join #app-troubleshoot on the Kubernetes Slack.
Intent: This meeting is an opportunity for folks involved in the Troubleshoot project to map out the project's future and roadmap, and to discuss ideas and features that are important to them.
Regular Agenda:
- Current state of the project
- Gaps and needs
- Prioritization
- meeting chairs
- open floor
Add items to the agenda in the following format:
$git_username: general outline of item for discussion
- enter any details in a sub-bullet
- add as much detail as you can
$git_username: item #2
https://replicated.zoom.us/j/88452766694?pwd=VFRXYmhxNHpzRGY1RkFTdythR0xpZz09
- Current state of the project
- Gaps and needs
- @cwyl02: Run hosts collector command
- @chris-sanders: thinking about automatic bundle uploading
- Prioritization
- meeting chairs
- open floor
- Current state of the project
- Stable API and Consolidated CLI in review
- Gaps and needs
- Prioritization
- meeting chairs
- open floor
- @danj-replicated: Uploading support bundles and preflights implementation requires review. Current implementation is fragile and also differs between preflight and support-bundle workflows.
- Current state of the project
- Enhancements to work better directly with Helm
- Exit code added so we can use it as a CLI tool in a script and detect the outcome
- Added stdin support to preflight that can handle a stream of output and just find the relevant manifests for itself
- Gaps and needs
- Combining support bundle and preflight into a single concept
- Make the schema for both the same
- Preflights for checking required versions
- Maybe Troubleshoot can already do this and we just need an example of the necessary templating
- Run all collectors in cluster
- Analyzers are too hard to write
- Writing new ones too hard
- Utilizing existing ones too hard
- More examples needed, e.g. finding the files to analyze
- Good example in slackernews: how would anyone not intimately familiar with the internals know how to analyze the Slack API return codes?
- Would be nice if sbctl handled older bundles better
- troubleshoot.io/ labels switch to troubleshoot.sh
- Prioritization
- meeting chairs
- open floor
- @drohnow
- Search support bundle for errors automatically
- @z4ce: Maybe use generative AI, e.g. a ChatGPT plugin
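The exit-code item under "Current state of the project" in these notes is what lets a script gate on the outcome of a run. A minimal sketch of that pattern, assuming only that the CLI exits with 0 on success and non-zero otherwise; the helper name `run_and_check` is hypothetical, and the meaning of individual non-zero codes should be taken from the Troubleshoot docs rather than from this example.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run a command and return (passed, exit_code).

    Assumption: exit code 0 means the checks passed and any non-zero
    code means they did not. Consult the CLI docs for specific codes.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.returncode


# Stand-in command so the sketch runs anywhere; a real script would pass
# the preflight invocation as the argument list instead.
ok, code = run_and_check([sys.executable, "-c", "raise SystemExit(0)"])
print("passed" if ok else f"failed with exit code {code}")
```

In CI, the returned code can be passed straight to `sys.exit` so the pipeline step fails when the checks do.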
- Current state of the project
- Gaps and needs
- @mhrabovcin: Improvements to the cluster resources collector
- Try to find a more generic way of collecting all resources with an option of having a "filtering" mechanism where we might want to ignore certain resources
- @banjoh: "kubectl cluster-info dump" does this already. We'd want to explore this
- Prioritization
- meeting chairs
- open floor
- @mhrabovcin: Troubleshoot live project
- https://github.com/mhrabovcin/troubleshoot-live
- @danj-replicated: We'd need to test this out to see how it works. sbctl has some limitations that troubleshoot-live solves such as surfacing CRDs
- Launches KAS and ETCD and creates all the resources via k8s API
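The "filtering" idea raised for the cluster resources collector can be sketched as follows. This is an illustration only: the `group/kind` naming convention, the wildcard syntax, and the function name `filter_resources` are all assumptions made for the example, not part of Troubleshoot.

```python
from fnmatch import fnmatchcase


def filter_resources(resources, ignore_patterns):
    """Return the resources whose "group/kind" name matches no ignore pattern.

    `resources` is a list of strings such as "apps/Deployment"; core
    resources use an empty group, e.g. "/Secret". Patterns use shell-style
    wildcards. Both conventions are hypothetical, chosen for this sketch.
    """
    return [
        r for r in resources
        if not any(fnmatchcase(r, p) for p in ignore_patterns)
    ]


discovered = ["/Pod", "/Secret", "apps/Deployment", "example.com/Widget"]
kept = filter_resources(discovered, ["/Secret", "example.com/*"])
print(kept)
```

Collecting everything by default and subtracting an ignore list keeps newly installed CRDs in the bundle without a spec change, which is the generic behaviour the item asks for.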
- Current state of the project
- Gaps and needs
- anyone have issues that aren't getting the priority they need?
- Dan: support-bundle hanging. Need more info; we can't get debugging output from the environment where we are seeing this, so it needs more investigation. It's possible this happens when the containers that run the collector go into ImagePullBackOff
- anyone have new issues/requests that aren't yet documented?
- @York Chen: (D2iq) have forked Troubleshoot, want to keep up to date
- Prioritization
- meeting chairs
- open floor
- Docs site search isn't intuitive
- no CLI docs, there are docs in the Troubleshoot repo but not in troubleshoot.sh
- maybe add a GitHub Actions workflow that updates the docs when the markdown in the repo changes.
Regular Agenda:
- Current state of the project
- sbctl and interest in getting more 'mock projects' to run on sbctl
- Gaps and needs
- Improvements to analyzers, they aren't getting a ton of use
- How can we make finding fields in returned objects better and more approachable?
- Even if objects are collected by different collectors
- Writing analyzers in Go for full logic control is cumbersome
- Can we do things like allow us to use shell commands?
- Maybe just run external tooling/scripts?
- This does pull in a lot of host dependencies and failure points
- Maybe instead we make it much easier to create boilerplate
- If we need 'jq' style selections is that what people are actually trying to get at anyway?
- Metrics collection
- Prometheus, metrics-server, Loki, etc collection?
- Can we capture data on things like cpu/memory over time?
- Alternately collect Alert Manager or Prometheus alert information?
- Ex: CPU starvation and OOM kills
- Loki or other out-of-cluster items would likely need user arguments
- Collectors
- HostCollector to check metadata for cloud providers
- Ex: Getting IMDS which is at least similar on a lot of cloud providers even OpenStack
- SDK/CLI consolidation
- Projects importing troubleshoot don't have a clearly defined stable API today
- The CLIs aren't consistent in usage
- Preflight and support-bundle may not have any difference between them anymore
- Consider building a troubleshoot analyse|collect|server|etc which would also become the reference for the SDK for how to import and use Troubleshoot
- Helm pre-flights / support bundles?
- Prioritization
- Getting sbctl and the CLI/SDK implemented feels like something we should consider high priority.
- These can both enable other projects and increase the user base which will in the end expedite good analyzer creation when we do polish it. More feedback from users is good.
- meeting chairs
- open floor
Regular Agenda:
- Current state of the project
- Gaps and needs
- Prioritization
- meeting chairs
- open floor
- @mhrabovcin: Kubernetes API version compatibility guarantees?
- interest in sbctl, using support bundle as a data layer for other applications
- would be nice to have support matrix so we know what APIs are supported in the bundle
- would like to run community inspection tools against the bundle snapshot
- @banjoh: Running external analysers against a collection of data in a bundle
- @mhrabovcin: k9s as an inspiration for crafting plugins
- @mhrabovcin: Another inspiration is Github actions API where one can define a set of variables that actions use to write output to
- Collector binary that receives a location and a set of other parameters to facilitate collecting data and storing for TS to package in a support bundle
- Spec with user defined variables passed in as-is to the collector
- Default values injected by troubleshoot e.g k8s identity, path to store data etc
- Use case: Use existing tools that already collect diagnostic data rather than trying to fit the implementation to troubleshoot's model
- @danj-replicated: Remote hosted community specs to make it easier to discover e.g helm-like workflow where you can add a repo then list/reference specs by name
- @banjoh: Add ability to define collectors and/or analysers as presets
Regular Agenda:
- Current state of the project
- Gaps and needs
- Prioritization
- meeting chairs
- volunteer to lead next month's community meeting
- open floor
- API stability matters
- @danj-replicated: When using troubleshoot as a library, parsing specs is not a very straightforward experience
- @danj-replicated: Reading files from support bundles not very intuitive. This would affect anyone who wants to discover what goes where. A suggestion was to have collectors be the source of truth of where they store their files.
- Current state of the project
- Gaps and needs
- Prioritization
- meeting chairs
- volunteer to lead next month's community meeting (US hours)
- For the next one, most of the folks at Replicated are away at an off-site
- open floor
- @xavpaice: Google Summer of Code - do we want to register as a project?
- Where to run collectors: running in cluster (i.e. in a pod) can give different results than running on the host. When running from the CLI, we can use runPod to collect info from inside the cluster, and other collectors might want that same functionality. However, if we're running Troubleshoot from inside a cluster/pod, we might not want to spin up a new pod for that.
- Discussion about Helm, using it for preflights and potentially vendoring in Troubleshoot. Current discussion is around calling other programs as a step during the install. Troubleshoot is controlled by the chart author rather than Helm.
- ability to use privileged containers would allow some of the host collector functionality to run from inside the cluster
- Conclusions:
- other projects are interested in importing Troubleshoot, having a stable API would help that immensely
- Current state of the project
- Gaps and needs
- etcd and apiserver checks
- allow optional collectors in the spec?
- Prioritization
- meeting chairs
- volunteer to lead next month's community meeting
- open floor
- Current state of the project
- Gaps and needs
- @adamancini: version info, "upgrade" subcommand
- can't get version from sbctl today, but nice to know if you can upgrade
- it's not on brew/pkg managers, so "get the latest release tarball" involves some clicking into github. we don't have a "latest" shortcut
- Additionally collect this into the bundle so you can tell what version it was collected from
- support-bundle --debug ./bundle.tar.gz should maybe tell me what version of support-bundle was used to generate the bundle, since some of the bundle features are in the tarball itself, like logs support
- @adamancini: one CLI to rule them all?
- Compound conditions
- Should we try to mirror how K8s does this?
- json compare: only has equality, would be valuable to add others
- We should consider how to use other libraries to enable a range of comparisons in all analyzers
- Prioritization
- Increasing adoption of new features in other projects (e.g. KOTS)
- Enabling analyzers to operate on kubectl API and not file paths
- multiple collector analyzers
- decouple analyzers from collectors
- Dependent on getting sbctl into the project
- meeting chairs
- volunteer to lead next month's community meeting
- open floor
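The compound-conditions and json-compare items above (equality only today, a wish for a wider range of comparisons) can be illustrated with a small evaluator. This is a sketch under stated assumptions: the condition shape (`path`, `op`, `value`, `all`, `any`) and the operator names are invented for the example and are not Troubleshoot's analyzer syntax.

```python
import operator

# Comparison operators a JSON-compare style analyzer could support.
# The operator names are assumptions made for this sketch.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def lookup(doc, path):
    """Follow a dotted path such as "status.replicas" through nested dicts."""
    for key in path.split("."):
        doc = doc[key]
    return doc


def evaluate(doc, condition):
    """Evaluate a simple or compound condition against a parsed JSON document.

    A simple condition is {"path": ..., "op": ..., "value": ...}.
    A compound condition is {"all": [...]} or {"any": [...]}.
    """
    if "all" in condition:
        return all(evaluate(doc, c) for c in condition["all"])
    if "any" in condition:
        return any(evaluate(doc, c) for c in condition["any"])
    compare = OPERATORS[condition["op"]]
    return compare(lookup(doc, condition["path"]), condition["value"])


doc = {"status": {"readyReplicas": 2, "replicas": 3}}
healthy = {
    "all": [
        {"path": "status.readyReplicas", "op": ">=", "value": 1},
        {"path": "status.replicas", "op": "==", "value": 3},
    ]
}
print(evaluate(doc, healthy))
```

Keeping the operator table separate from the evaluator is one way every analyzer could share the same set of comparisons, as the last sub-item suggests.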
- Current state of the project
- @xavpaice: Good momentum
- @xavpaice: Roadmap for the next three months to match goals of project
- Gaps and needs
- @xavpaice: Need to support multiple preflight specifications
- @z4ce: have support-bundle accept type: preflight specs
- alternately, consolidate preflight and support-bundle, so that they're the same thing but run at different times
- having a support bundle generated by preflights would be really useful
- Prioritization
- log collectors - limit collection by size, as well as lines/age. TODO: check the size of the task
- sbctl integration to the Troubleshoot repo (spec doc in progress)
- meeting chairs
- volunteer to lead next month's community meeting
- open floor
- Helm plugin for running preflights etc.; it was evaluated previously but determined not to be ideal (not sure of reasons)
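The log collector item under Prioritization asks for a size limit in addition to the existing line and age limits. A minimal sketch of combining a line limit with a byte limit, assuming the newest lines are the ones worth keeping; the function name and parameters are hypothetical, and the age limit is left out.

```python
def limit_log(text, max_lines=None, max_bytes=None):
    """Keep the newest part of a log, bounded by line count and byte size.

    Both limits are optional; when both are given the stricter one wins.
    Lines are dropped whole from the oldest end so no line is cut in half.
    """
    lines = text.splitlines(keepends=True)
    if max_lines is not None:
        lines = lines[-max_lines:] if max_lines > 0 else []
    if max_bytes is not None:
        kept, size = [], 0
        # Walk from newest to oldest until the byte budget is used up.
        for line in reversed(lines):
            size += len(line.encode("utf-8"))
            if size > max_bytes:
                break
            kept.append(line)
        lines = list(reversed(kept))
    return "".join(lines)


log = "a\nbb\nccc\n"
print(limit_log(log, max_lines=2, max_bytes=5))
```

Counting encoded bytes rather than characters keeps the limit meaningful for logs containing non-ASCII text.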
Join the meeting on Zoom using this link
- Current state of the project
- hunting through the tar for the pod logs is difficult
- changing the location where the logs collector stores logs is in progress, but there are difficulties with symlinks
- identifying file names is hard, especially when trying to write an analyzer or pre-flight
- This could help: https://github.com/replicatedhq/troubleshoot/pull/780
- would be nice to run sbctl in an analyzer
- There has been a discussion about moving sbctl into the project, and with that possibly using it as a way to write analyzers against the K8s API
- This might also benefit from adding a proper entrypoint and subcommands to the project at the same time, enabling subcommands more like kubectl w/ verbs and nouns
- Gaps and needs
- Prioritization
- log file access via sbctl (This has been in-flight with symlinks but has gotten harder than expected)
- API based access to Kubernetes data rather than file scraping (ex: sbctl for analyzers and users in-project)
- meeting chairs
- volunteer to lead next month's community meeting
- open floor
- Ian: What can/should we be considering to increase visibility for the community and get more members involved?
- Ada: More tutorials and blogs for periodic releases and presentations would help introduce people
- Ian: We should review and reach out to projects using Troubleshoot (D2IQ, EKS Anywhere, etc)
- @xavpaice: IP address redaction https://github.com/replicatedhq/troubleshoot/issues/735
- Options: replace with tokens, redact by default and have an option to disable, redact only when instructed to and not by default
- Either way this was a feature in a previous Troubleshoot and would be useful regardless of default decision
- consensus: discuss with KOTS about a deprecation period, best is to default to not redact IPs
- @xavpaice: any thoughts on reducing duplication of code (collection in particular), making a stable API, and general simplification?
- @chris-sanders: How/does this inform the idea of moving sbctl into the project proper?
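The "replace with tokens" option from the IP address redaction item above can be sketched as below. This is not the Troubleshoot redactor: the regular expression, the token format, and the `enabled` switch are assumptions for illustration, and the switch defaults to off to mirror the consensus recorded in these notes.

```python
import re

# Loose IPv4 pattern: four dot-separated groups of 1-3 digits. It does not
# check that each group is <= 255 and will also match dotted version
# strings such as 1.2.3.4, which is acceptable for a sketch.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_ips(text, enabled=False):
    """Replace each distinct IPv4 address with a stable token.

    The same address always maps to the same token within one call, so
    relationships between log lines stay visible after redaction.
    The token format is hypothetical.
    """
    if not enabled:
        return text
    tokens = {}

    def replace(match):
        ip = match.group(0)
        if ip not in tokens:
            tokens[ip] = f"***IP_{len(tokens) + 1}***"
        return tokens[ip]

    return IPV4.sub(replace, text)


line = "from 10.0.0.1 to 10.0.0.2, retry 10.0.0.1"
print(redact_ips(line, enabled=True))
```

Stable tokens are the reason this option is attractive compared with blanking addresses: a reader can still tell that two log lines refer to the same host.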
Join the meeting on Zoom using this link
Attendees: Martin Hrabovicin, Evans Mungai, Dan Jones, Edgar Lanting, Xav Paice
- Current state of the project
- Gaps and needs
- mailing list for the project? (TODO: Xav)
- IRC channel (TODO: Xav)
- https://github.com/mesosphere/troubleshoot fork
- idea to add a generic runtime arg option to specs, which could work in a similar way to CLI options
- add a dry-run option
- have some options to change default behavior in the spec (e.g. which default redactors run)
- pluggable collectors, a means to run a custom collector that's not upstream (see Velero for example)
- Prioritization
- IP address redaction change is a quick win
- concurrency of collectors
- design and understanding for pluggable collectors/analyzers/redactors
- meeting chairs
- volunteer to lead next month's community meeting
- open floor
- @xavpaice: Speed of collection
- @xavpaice: Efficiency of redaction
- Redaction is adding about 16 seconds when you are limited to 1 CPU.
- @xavpaice: Stable API for projects to import
- Projects (kots, kURL, EKS Anywhere) are importing parts of Troubleshoot and running them. We should consider a stable API so we do not make breaking changes that affect those projects.
- @xavpaice: Sbctl - do we replace this with a ‘real’ k8s API?
Join the meeting on Zoom using this link
- v0.32.0 release
- @OGtrilliams: KubeCon EU updates
- KubeCon webinar will be livestreamed on Replicated YouTube
- meeting chairs
- volunteer to lead next month's community meeting: @divolgin to chair next month's meeting (TENTATIVE)
- open floor
Join the meeting on Zoom using this link
- sbctl overview with @divolgin
- open floor
Interview with Chris Sanders
Zoom link
- Open issues
- LIVE bug-bash w/ @OGtrilliams
- open floor
Join the meeting using the following Zoom link: https://replicated.zoom.us/j/84125433779?pwd=ZHAwUFFid2thdzM2Rzdxek05cG1udz09 (ID: 84125433779, passcode: 6An1Rpp9)
- @programmerq: ability to specify namespace and/or selectors at runtime (templated support bundle definition?) to accommodate a bundle for a specific given instance of an application, for cases where there may be multiple instances of the application that vary by namespace, deployment names, labels, etc. https://github.com/replicatedhq/troubleshoot/issues/481
- @programmerq: ability to determine storageclass capabilities in analyzers. Conditions based on provisioner, allowVolumeExpansion, or anything else that may come up. https://github.com/replicatedhq/troubleshoot/issues/482
@ogtrilliams
- open floor
Join the meeting using the following Zoom link: https://replicated.zoom.us/j/81568123981?pwd=a0lFSXpoVXA4bkJVamVyUTdNdFZodz09
Join the meeting using the following Zoom link: https://replicated.zoom.us/j/89062276386?pwd=dHhMMmpBRWUyYzhOZDh5cEFLRFRsQT09
- @murphybytes: Discuss Remote Host Collector feature by @croomes
- open floor?
Action items
John Murphy (@murphybytes) has volunteered to work with @croomes to refine PR #392
Join the meeting with the following Zoom link: https://replicated.zoom.us/j/89062276386?pwd=dHhMMmpBRWUyYzhOZDh5cEFLRFRsQT09
- @dexhorthy: Aggregating awesome SupportBundle and Preflight specs from the wild
- @divolgin: better process for reviewing and merging community contributions
- e.g. https://github.com/replicatedhq/troubleshoot/pull/392 by @croomes
- currently being worked on by John Murphy
- ogtrilliams will work w/ John Murphy to update community guidelines & investigate CI/CD platforms
- dedicated reviewers list?
- develop pre-vetting process
- ogtrilliams: create process where potential contributors write out issue template with outline on proposed contribution that'll be sent to reviewer board. Once approved, PR can be submitted.
- @divolgin: Things that make support bundle hard to use
- Analyzers are hard to troubleshoot when they don't work.
- File names produced by collectors are hard to figure out when result is used with analyzers.
- Collectors may never complete and there is no global timeout.
- @emosbaugh: This is such a large change to host preflights. Is this a direction we want to take them? Should this be a "regular" preflight?
- @marccampbell: Replace the CLA with a DCO?
- will be implemented in ~1 week's time
- Open floor
action items
- Contributing guide will be started by John Murphy
- look into creating public version of design plans for troubleshoot.sh