# Debugging IPFS
## Profiling
* pprof
* goroutines
* cpu
* heap profile for memory use
* mutex profile - goroutines waiting on mutexes
* allocations
* first step is goroutines
* thousands in dht
* whyrusleeping/stackparse
* -summary
* go tool pprof against the daemon API's /debug/pprof endpoints (e.g. http://localhost:5001/debug/pprof/heap)
* go tool pprof --symbolize=remote - pulls down debugging symbols
* run a few profiles and look through
* kubo's bin/collect-profiles.sh gathers the standard set of profiles in one go
* run the mutex profile through collect-profiles.sh
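A minimal sketch of pulling these profiles by hand, assuming a local daemon with the API on the default 127.0.0.1:5001:

```bash
# Full goroutine dump -- the usual first step, feed it to stackparse
curl -s 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2' -o goroutines.txt

# Heap profile (memory in use) and a 30s CPU profile, summarized in the terminal
go tool pprof -top http://127.0.0.1:5001/debug/pprof/heap
go tool pprof -top -seconds 30 http://127.0.0.1:5001/debug/pprof/profile
```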
## Debuggers/replay
* delve - gdb-style debugger for Go
* gdelve
* plugin for gdb
* rr - record & replay debugger
* good for hard to capture scenarios
* not general purpose
* huge output, hundreds of gigs
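For live inspection of a running daemon, attaching delve by PID is usually enough; a sketch, assuming `dlv` is installed and the daemon process is named `ipfs`:

```bash
# Attach delve to the (newest) running ipfs daemon and drop into its REPL
dlv attach "$(pgrep -nx ipfs)"
```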
## Mutexes
* go-deadlock - drop-in replacement for sync.Mutex/RWMutex that reports potential deadlocks and long-held locks
## Logging
* lots of new logging added
* dht, bitswap
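Log levels can be raised per subsystem on a running daemon (exact subsystem names vary by version; `ipfs log ls` lists them); a sketch:

```bash
# List logging subsystems, then turn the interesting ones up to debug
ipfs log ls
ipfs log level dht debug
ipfs log level bitswap debug

# Alternatively, set levels before starting the daemon via go-log's env var
GOLOG_LOG_LEVEL="error,dht=debug,bitswap=debug" ipfs daemon
```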
## ipfs diag
* still useful? correct?
* unclear
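Quickest way to answer the question is to ask the installed binary what it still exposes:

```bash
# See which diag subcommands the installed kubo version still ships
ipfs diag --help
```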
## /debug/metrics/prometheus
* perf metrics
* network metrics
* send to grafana
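A quick look at what the endpoint exposes before wiring it into a Prometheus scrape job (assumes the default API address):

```bash
# Peek at the raw Prometheus metrics from a local daemon
curl -s http://127.0.0.1:5001/debug/metrics/prometheus | head -n 20

# Grep for a specific subsystem, e.g. bitswap counters
curl -s http://127.0.0.1:5001/debug/metrics/prometheus | grep -i bitswap | head
```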
## metrics
* opencensus
* opentracing
* being phased in
* only collecting for nodes we run
## network status
* ipfs swarm peers -v
* --enc=json | jq
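The two forms side by side; the `jq` path assumes the JSON shape current kubo returns (a top-level `Peers` array):

```bash
# Human-readable: connected peers plus latency, streams, direction
ipfs swarm peers -v

# Machine-readable: count connected peers
ipfs swarm peers --enc=json | jq '.Peers | length'
```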
## tracing
???
## Content availability
* https://pl-diagnose.on.fleek.co/#/diagnose/access-content?backend=https%3A%2F%2Fpl-diagnose.onrender.com
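Roughly the same checks from the CLI (placeholder CID and peer ID; this mirrors the checklist quoted in the addendum below):

```bash
# Is anyone advertising the CID in a routing system (DHT + indexers)?
ipfs routing findprovs <cid>

# Is a given provider actually reachable right now?
ipfs id <peerid>
```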
## Adin on perf debugging - Feb 27 2023
> There's a variety of reasons for potential performance issues. My general guideline goes like this:
>* Is the time to get any data at all large (i.e. >5s)? If so, probably no one is advertising the data
>* Is the throughput really slow? It might be that the peers who have the data aren't sending it to you fast enough
>* Does the data come quickly if you use a local kubo node (after you've run ipfs repo gc to make sure you don't have the data cached locally)? If so, it's probably an issue with the gateway you're using
> I was able to load it pretty quickly with my local kubo gateway and now see it on that link you posted. Although I think whoever was advertising the data might've just done it because when I looked for providers (https://pl-diagnose.on.fleek.co/ and https://check.ipfs.network are what I'd use for this if not using CLI tools)
> If this is for newly generated data you might want to evaluate how long it takes for your nodes (estuary or your own) to advertise the data you create and if that's taking a longer time than works for your use case
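A rough sketch of running that comparison yourself (placeholder CID; ipfs.io stands in for whichever gateway you're debugging):

```bash
# Time-to-data via a public gateway...
time curl -sL -o /dev/null https://ipfs.io/ipfs/<cid>

# ...versus a local kubo node, after clearing any locally cached copy
ipfs repo gc
time ipfs cat <cid> > /dev/null
```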
## Jorropo & Iand addendum - Feb 28 2023
> the only provider in the DHT is offline.
> "(this CID does not load)" is extremely rarely useful; in almost all cases it's not in the DHT, and users get tricked by bitswap broadcasting (because it sometimes works when it should not).
> The checklist for "CID does not load" should be something like this:
> Check `ipfs routing findprovs bafybeigfheaau27rao4lkchipu5zoq4liyhkbzrc2fvhyf6soj2262peou` (the thing you want to see is some hosts that announce it in a routing system).
> Then `ipfs id` the peer IDs you find (because they might be offline nodes). (The thing you want to see is hosts that are online.)
> If you found at least one host and they are online, then it's interesting to report.
> The fact that it works or not on a gateway, or that some different nodes load it or not, is completely irrelevant to whether it will work in the future, because you can't rely on a broken, dysfunctional random-walk implementation (bitswap was never meant to do a random walk, and thus it does not walk anything, it is just random).
> Also it may have been served once via a gateway node and the blocks will then be cached in that single node's blockstore. You might be lucky and hit the same one again and get an instant response or you might have to wait for bitswap discovery between peered gateway nodes. Then the node will GC and wipe it all anyway.
> `ipfs routing` is dht + indexers
> `ipfs dht` is dht
> check each peer id is online
> or at least a peer
> yes with `ipfs id` or `ipfs ping`
> `ipfs routing findprovs Qmfoo | while read peerid; do ipfs ping "${peerid}" & done; wait`
>
> will do this in parallel
> check.ipfs.network or https://pl-diagnose.on.fleek.co/? They're a bit out of date (they don't use indexers the way kubo does now and don't separately report on the indexers), but generally get the point across.
> Basically if the problem is they're not quite usable enough because of X, we need to invest in dealing with X :slightly_smiling_face:.
> Related note about gateways:
> With the Rhea endeavor, a bunch of the things that make data show up even when our diagnostic tools tell you you're going to have a problem are going away.
> Now that we get custom code to run (i.e. https://github.com/ipfs/bifrost-gateway) I'd also love some custom/better error messages to help users better understand what's happening (if/when this rolls out, devrel feedback would be appreciated).
> Similarly, I think it's probably about time we have a website like gateway.ipfs.io. There is nowhere for people to learn about the gateway as a service (e.g. like https://cloudflare-ipfs.com/), file bugs against gateway code, figure out where to send abuse reports, report service disruptions, etc. The status quo of reporting issues ad-hoc on Matrix/Slack or filing an issue on kubo seems particularly bad once Rhea changes the backing code for the gateway to not be kubo and instead be another service (Saturn).