Bring extensions under the umbrella.

Sean E. Russell 2021-03-02 03:16:40 -06:00
parent 94d5c2e33d
commit a44ced4bba
6 changed files with 531 additions and 15 deletions


@@ -35,23 +35,14 @@ If you install gotop by hand, or you download or create new layouts or colorsche
 ```
 - **OSX**: gotop is in *homebrew-core*. `brew install gotop`. Make sure to uninstall and untap any previous installations or taps.
 - **Prebuilt binaries**: Binaries for most systems can be downloaded from [the github releases page](https://github.com/xxxserxxx/gotop/releases). RPM and DEB packages are also provided.
-- **Prebuild binaries with extensions**:
-  - [NVidia GPU support](https://github.com/xxxserxxx/gotop-nvidia/releases)
-  - [Remote gotop support](https://github.com/xxxserxxx/gotop-remote/releases)
-- **Source**: This requires Go >= 1.14. `go get -u github.com/xxxserxxx/gotop/cmd/gotop`
+- **Source**: gotop requires Go >= 1.14: `go get -u github.com/xxxserxxx/gotop/cmd/gotop`

 ### Extension builds

-An evolving mechanism in gotop are extensions. This is designed to allow gotop to support feature sets that are not universally needed without blowing up the application for average users with unused features. Examples are support for specific hardware sets like video cards, or things that are just obviously not a core objective of the application, like remote server monitoring.
-The path to these extensions is a tool called [gotop-builder](https://github.com/xxxserxxx/gotop-builder). It is easy to use and depends only on having Go installed. You can read more about it on the project page, where you can also find binaries for Linux that have *all* extensions built in. If you want less than an all-inclusive build, or one for a different OS/architecture, you can use gotop-builder itself to create your own.
-There are currently two extensions:
-- Support for [NVidia GPUs](https://github.com/xxxserxxx/gotop-nvidia), which add GPU usage, memory, and temperature data to the respective widgets
-- Support for [remote devices](https://github.com/xxxserxxx/gotop-remote), which allows running gotop on a remote machine and seeing the sensors from that as if they were local sensors.
-There are builds for those binaries for Linux in each of the repositories.
+Extensions have proven problematic: Go plugins are not usable in real-world cases, and the solution I had running for a while was hacky at best. Consequently, the extensions have been moved into the main code base for now.
+- nvidia support: requires the `enable` flag. Detecting NVidia hardware, or rather the absence of NVidia hardware, can take seconds; this greatly slows down gotop's start-up time. To avoid this, the NVidia code will not be run unless it has been enabled with the `--enable nvidia` runtime flag.
+- remote: allows gotop to pull sensor data from applications exporting Prometheus metrics, including remote gotop instances themselves.

 ### Console Users

(New binary image file added, 182 KiB; contents not shown.)

devices/nvidia.go (new file)

@@ -0,0 +1,184 @@
package devices
import (
"bytes"
"encoding/csv"
"errors"
"fmt"
"os/exec"
"strconv"
"sync"
"time"
"github.com/xxxserxxx/opflag"
)
// Set up variables and register this plug-in with the main code.
// The functions Register*(f) tell gotop which of these plugin functions to
// call to update data; the RegisterStartup() function sets the function
// that gotop will call when everything else has been done and the plugin
// should start collecting data.
//
// In this plugin, one call to the nvidia program returns *all* the data
// we're looking for, but gotop will call each update function during each
// cycle. This means that the nvidia program would be called 3 (or more)
// times per update, which isn't very efficient. Therefore, we make this
// code more complex to run a job in the background that runs the nvidia
// tool periodically and puts the results into hashes; the update functions
// then just sync data from those hashes into the return data.
func init() {
opflag.BoolVarP(&nvidia, "nvidia", "", false, "Enable NVidia GPU support")
RegisterStartup(startNVidia)
}
// updateNvidiaTemp copies data from the local _temps cache into the passed-in
// return-value map. It is called once per cycle by gotop.
func updateNvidiaTemp(temps map[string]int) map[string]error {
nvidiaLock.Lock()
defer nvidiaLock.Unlock()
for k, v := range _temps {
temps[k] = v
}
return _errors
}
// updateNvidiaMem copies data from the local _mems cache into the passed-in
// return-value map. It is called once per cycle by gotop.
func updateNvidiaMem(mems map[string]MemoryInfo) map[string]error {
nvidiaLock.Lock()
defer nvidiaLock.Unlock()
for k, v := range _mems {
mems[k] = v
}
return _errors
}
// updateNvidiaUsage copies data from the local _cpus cache into the passed-in
// return-value map. It is called once per cycle by gotop.
func updateNvidiaUsage(cpus map[string]int, _ bool) map[string]error {
nvidiaLock.Lock()
defer nvidiaLock.Unlock()
for k, v := range _cpus {
cpus[k] = v
}
return _errors
}
// startNVidia is called once by gotop, and forks a thread to call the nvidia
// tool periodically and update the cached cpu, memory, and temperature
// values that are used by the update*() functions to return data to gotop.
//
// The vars argument contains command-line arguments to allow the plugin
// to change runtime options; the only option currently supported is the
// `nvidia-refresh` arg, which is expected to be a time.Duration value and
// sets how frequently the nvidia tool is called to refresh the data.
func startNVidia(vars map[string]string) error {
if !nvidia {
return nil
}
	_, err := exec.Command("nvidia-smi", "-L").Output()
	if err != nil {
		return fmt.Errorf("NVidia GPU error: %w", err)
	}
	_errors = make(map[string]error)
	_temps = make(map[string]int)
	_mems = make(map[string]MemoryInfo)
	_cpus = make(map[string]int)
RegisterTemp(updateNvidiaTemp)
RegisterMem(updateNvidiaMem)
RegisterCPU(updateNvidiaUsage)
nvidiaLock = sync.Mutex{}
// Get the refresh period from the passed-in command-line/config
// file options
refresh := time.Second
if v, ok := vars["nvidia-refresh"]; ok {
if refresh, err = time.ParseDuration(v); err != nil {
return err
}
}
// update once to populate the device names, for the widgets.
update()
// Fork off a long-running job to call the nvidia tool periodically,
// parse out the values, and put them in the cache.
go func() {
timer := time.Tick(refresh)
for range timer {
update()
}
}()
return nil
}
// Caches for the output from the nvidia tool; the update() functions pull
// from these and return the values to gotop when requested.
var (
_temps map[string]int
_mems map[string]MemoryInfo
_cpus map[string]int
// A cache of errors generated by the background job running the nvidia tool;
// these errors are returned to gotop when it calls the update() functions.
_errors map[string]error
)
var nvidiaLock sync.Mutex
// update calls the nvidia tool, parses the output, and caches the results
// in the various _* maps. The metric data parsed is: name, index,
// temperature.gpu, utilization.gpu, utilization.memory, memory.total,
// memory.free, memory.used
//
// If this function encounters an error calling `nvidia-smi`, it caches the
// error and returns immediately. We expect exec errors only when the tool
// isn't available, or when it fails for some reason; no exec error cases
// are recoverable. This does **not** stop the cache job; that will continue
// to run and continue to call update().
func update() {
bs, err := exec.Command(
"nvidia-smi",
"--query-gpu=name,index,temperature.gpu,utilization.gpu,memory.total,memory.used",
"--format=csv,noheader,nounits").Output()
	if err != nil {
		// Record the error under the lock; the update*() callbacks read
		// _errors concurrently.
		nvidiaLock.Lock()
		_errors["nvidia"] = err
		nvidiaLock.Unlock()
		//bs = []byte("GeForce GTX 1080 Ti, 0, 31, 9, 11175, 206")
		return
	}
csvReader := csv.NewReader(bytes.NewReader(bs))
csvReader.TrimLeadingSpace = true
records, err := csvReader.ReadAll()
	if err != nil {
		nvidiaLock.Lock()
		_errors["nvidia"] = err
		nvidiaLock.Unlock()
		return
	}
// Ensure we're not trying to modify the caches while they're being read by the update() functions.
nvidiaLock.Lock()
defer nvidiaLock.Unlock()
// Errors during parsing are recorded, but do not stop parsing.
for _, row := range records {
		// The device name is the nvidia-smi "<name>.<index>"
name := row[0] + "." + row[1]
if _temps[name], err = strconv.Atoi(row[2]); err != nil {
_errors[name] = err
}
if _cpus[name], err = strconv.Atoi(row[3]); err != nil {
_errors[name] = err
}
t, err := strconv.Atoi(row[4])
if err != nil {
_errors[name] = err
}
u, err := strconv.Atoi(row[5])
if err != nil {
_errors[name] = err
}
_mems[name] = MemoryInfo{
Total: 1048576 * uint64(t),
Used: 1048576 * uint64(u),
UsedPercent: (float64(u) / float64(t)) * 100.0,
}
}
}
var nvidia bool
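For reference, here is a small standalone sketch (not part of this commit; the `main` wrapper is only for illustration) of how a single `nvidia-smi` CSV row, in the field order requested by the `--query-gpu` string above, maps onto the cached values: the device key is `<name>.<index>`, temperature and utilization are plain integers, and the memory columns are MiB, hence the 1048576 scaling. The sample row is the commented-out test value from `update()`.

```
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strconv"
)

func main() {
	sample := []byte("GeForce GTX 1080 Ti, 0, 31, 9, 11175, 206")
	r := csv.NewReader(bytes.NewReader(sample))
	r.TrimLeadingSpace = true
	row, err := r.Read()
	if err != nil {
		panic(err)
	}
	// Errors from Atoi are ignored here only to keep the sketch short.
	name := row[0] + "." + row[1] // "GeForce GTX 1080 Ti.0"
	temp, _ := strconv.Atoi(row[2])
	usage, _ := strconv.Atoi(row[3])
	total, _ := strconv.Atoi(row[4])
	used, _ := strconv.Atoi(row[5])
	fmt.Printf("%s: temp=%d usage=%d%% mem=%d/%d bytes (%.1f%%)\n",
		name, temp, usage, 1048576*used, 1048576*total,
		float64(used)/float64(total)*100)
}
```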

devices/remote.go (new file)

@@ -0,0 +1,271 @@
package devices
import (
"bufio"
"log"
"net/http"
"net/url"
"strconv"
"strings"
"sync"
"time"
"github.com/xxxserxxx/opflag"
)
var name string
var remote_url string
var sleep time.Duration
var remoteLock sync.Mutex
// FIXME Widgets don't align values
// TODO remote network & disk aren't reported
// TODO network resiliency; I believe it currently crashes gotop when the network goes down
// TODO Replace custom decoder with https://github.com/prometheus/common/blob/master/expfmt/decode.go
// TODO MQTT / Stomp / MsgPack
func init() {
opflag.StringVarP(&name, "remote-name", "", "", "Remote: name of remote gotop")
opflag.StringVarP(&remote_url, "remote-url", "", "", "Remote: URL of remote gotop")
opflag.DurationVarP(&sleep, "remote-refresh", "", 0, "Remote: Frequency to refresh data, in seconds")
RegisterStartup(startup)
}
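// Remote describes one remote gotop to poll: the URL of its metrics endpoint
// and how often to fetch it.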
type Remote struct {
url string
refresh time.Duration
}
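// startup is registered with RegisterStartup. It builds the set of remotes
// from the command line and config file, registers the update callbacks,
// starts one polling goroutine per remote, and blocks until each remote has
// been polled once so the rest of the program knows which devices exist.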
func startup(vars map[string]string) error {
// Don't set anything up if there's nothing to do
if name == "" || remote_url == "" {
return nil
}
_cpuData = make(map[string]int)
_tempData = make(map[string]int)
_netData = make(map[string]float64)
_diskData = make(map[string]float64)
_memData = make(map[string]MemoryInfo)
remoteLock = sync.Mutex{}
remotes := parseConfig(vars)
if remote_url != "" {
r := Remote{
url: remote_url,
refresh: 2 * time.Second,
}
if name == "" {
name = "Remote"
}
if sleep != 0 {
r.refresh = sleep
}
remotes[name] = r
}
if len(remotes) == 0 {
log.Println("Remote: no remote URL provided; disabling extension")
return nil
}
RegisterTemp(updateTemp)
RegisterMem(updateMem)
RegisterCPU(updateUsage)
// We need to know what we're dealing with, so the following code does two
// things, one of them sneakily. It forks off background processes
// to periodically pull data from remote sources and cache the results for
// when the UI wants it. When it's run the first time, it sets up a WaitGroup
// so that it can hold off returning until it's received data from the remote
// so that the rest of the program knows how many cores, disks, etc. it needs
// to set up UI elements for. After the first run, each process discards the
// wait group.
w := &sync.WaitGroup{}
for n, r := range remotes {
n = n + "-"
var u *url.URL
w.Add(1)
go func(name string, remote Remote, wg *sync.WaitGroup) {
			for {
				res, err := http.Get(remote.url)
				if err != nil {
					// Don't touch res when the GET fails: it is nil, and
					// closing its body would crash gotop when the network
					// goes down.
					log.Printf("error fetching remote gotop data: %v", err)
				} else {
					if u, err = url.Parse(remote.url); err != nil {
						log.Print("error processing remote URL")
					} else if res.StatusCode == http.StatusOK {
						bi := bufio.NewScanner(res.Body)
						process(name, bi)
					} else {
						// Strip any credentials before logging the URL.
						u.User = nil
						log.Printf("unsuccessful connection to %s: http status %s", u.String(), res.Status)
					}
					res.Body.Close()
				}
				if wg != nil {
					wg.Done()
					wg = nil
				}
				time.Sleep(remote.refresh)
			}
}(n, r, w)
}
w.Wait()
return nil
}
var (
_cpuData map[string]int
_tempData map[string]int
_netData map[string]float64
_diskData map[string]float64
_memData map[string]MemoryInfo
)
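// process parses Prometheus text-format lines from a remote gotop's export
// endpoint and stores the values in the local caches, keyed by the host label
// plus the metric suffix (e.g. "Jerry-CPU0").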
func process(host string, data *bufio.Scanner) {
remoteLock.Lock()
for data.Scan() {
		line := data.Text()
		// Ignore blank lines, comments, and anything that isn't a gotop metric.
		if len(line) == 0 || line[0] == '#' {
			continue
		}
		if !strings.HasPrefix(line, _gotop) {
			continue
		}
sub := line[6:]
switch {
case strings.HasPrefix(sub, _cpu): // int gotop_cpu_CPU0
procInt(host, line, sub[4:], _cpuData)
case strings.HasPrefix(sub, _temp): // int gotop_temp_acpitz
procInt(host, line, sub[5:], _tempData)
		case strings.HasPrefix(sub, _net): // float gotop_net_recv
			parts := strings.Split(sub[4:], " ")
if len(parts) < 2 {
log.Printf(`bad data; not enough columns in "%s"`, line)
continue
}
val, err := strconv.ParseFloat(parts[1], 64)
if err != nil {
log.Print(err)
continue
}
_netData[host+parts[0]] = val
case strings.HasPrefix(sub, _disk): // float % gotop_disk_:dev:mmcblk0p1
parts := strings.Split(sub[5:], " ")
if len(parts) < 2 {
log.Printf(`bad data; not enough columns in "%s"`, line)
continue
}
val, err := strconv.ParseFloat(parts[1], 64)
if err != nil {
log.Print(err)
continue
}
_diskData[host+parts[0]] = val
case strings.HasPrefix(sub, _mem): // float % gotop_memory_Main
parts := strings.Split(sub[7:], " ")
if len(parts) < 2 {
log.Printf(`bad data; not enough columns in "%s"`, line)
continue
}
val, err := strconv.ParseFloat(parts[1], 64)
if err != nil {
log.Print(err)
continue
}
_memData[host+parts[0]] = MemoryInfo{
Total: 100,
				Used:        uint64(val),
UsedPercent: val,
}
default:
// NOP! This is a metric we don't care about.
}
}
remoteLock.Unlock()
}
func procInt(host, line, sub string, data map[string]int) {
parts := strings.Split(sub, " ")
if len(parts) < 2 {
log.Printf(`bad data; not enough columns in "%s"`, line)
return
}
val, err := strconv.Atoi(parts[1])
if err != nil {
log.Print(err)
return
}
data[host+parts[0]] = val
}
func updateTemp(temps map[string]int) map[string]error {
remoteLock.Lock()
for name, val := range _tempData {
temps[name] = val
}
remoteLock.Unlock()
return nil
}
// FIXME The units are wrong: getting bytes, assuming they're %
func updateMem(mems map[string]MemoryInfo) map[string]error {
remoteLock.Lock()
for name, val := range _memData {
mems[name] = val
}
remoteLock.Unlock()
return nil
}
func updateUsage(cpus map[string]int, _ bool) map[string]error {
remoteLock.Lock()
for name, val := range _cpuData {
cpus[name] = val
}
remoteLock.Unlock()
return nil
}
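// parseConfig pulls remote definitions out of the config-file variables.
// Recognized keys are remote-NAME-url and remote-NAME-refresh (in seconds);
// each NAME yields one Remote entry.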
func parseConfig(vars map[string]string) map[string]Remote {
rv := make(map[string]Remote)
for key, value := range vars {
if strings.HasPrefix(key, "remote-") {
parts := strings.Split(key, "-")
			if len(parts) != 3 {
log.Printf("malformed Remote extension configuration '%s'; must be 'remote-NAME-url' or 'remote-NAME-refresh'", key)
continue
}
name := parts[1]
remote, ok := rv[name]
if !ok {
remote = Remote{}
}
if parts[2] == "url" {
remote.url = value
} else if parts[2] == "refresh" {
sleep, err := strconv.Atoi(value)
if err != nil {
log.Printf("illegal Remote extension value for %s: '%s'. Must be a duration in seconds, e.g. '2'", key, value)
continue
}
remote.refresh = time.Duration(sleep) * time.Second
} else {
log.Printf("bad configuration option for Remote extension: '%s'; must be 'remote-NAME-url' or 'remote-NAME-refresh'", key)
continue
}
			if remote.refresh == 0 {
				// Default refresh rate, as documented in docs/remote-monitoring.md.
				remote.refresh = time.Second
			}
			rv[name] = remote
}
}
return rv
}
const (
_gotop = "gotop_"
_cpu = "cpu_"
_temp = "temp_"
_net = "net_"
_disk = "disk_"
_mem = "memory_"
)
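As a quick illustration of the exporter line format `process()` expects (`gotop_<kind>_<label> <value>`), here is a test-style sketch. It is not part of this commit; it assumes a `_test.go` file alongside remote.go in the `devices` package, and the host name and values are invented.

```
package devices

import (
	"bufio"
	"strings"
	"testing"
)

// TestProcessSketch feeds a few hand-written exporter-style lines through
// process() and checks that the values land in the caches keyed by
// "<host>-<label>".
func TestProcessSketch(t *testing.T) {
	_cpuData = make(map[string]int)
	_tempData = make(map[string]int)
	_memData = make(map[string]MemoryInfo)

	payload := `# HELP gotop_cpu_CPU0 CPU utilization
gotop_cpu_CPU0 42
gotop_temp_acpitz 51
gotop_memory_Main 37.5
`
	process("Jerry-", bufio.NewScanner(strings.NewReader(payload)))

	if _cpuData["Jerry-CPU0"] != 42 {
		t.Errorf("want CPU0=42, got %d", _cpuData["Jerry-CPU0"])
	}
	if _tempData["Jerry-acpitz"] != 51 {
		t.Errorf("want acpitz=51, got %d", _tempData["Jerry-acpitz"])
	}
	if _memData["Jerry-Main"].UsedPercent != 37.5 {
		t.Errorf("want Main=37.5%%, got %v", _memData["Jerry-Main"].UsedPercent)
	}
}
```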


@@ -1,9 +1,17 @@
 % Plugins

-# Extensions
+# Current state

-- Plugins will supply an `Init()` function that will call the appropriate
+First, there were Go plugins. This turned out to be impractical due to the limitations of plugins, which make them unsuitable for use outside of a small, strict, and (one could argue) useless use case.
+
+Then I tried external static extensions. This approach used a trick to copy and modify the gotop main executable, which then imported its own packages from upstream. This worked, but was awkward and required several steps to build.
+
+Currently, as I've only written two modules since I started down this path, and there's no clean, practical solution yet in Go, I've folded the extensions into the main codebase. This means there's no programmatic extension mechanism for gotop.
+
+# Devices
+
+- Devices supply an `Init()` function that will call the appropriate
   `Register\*()` functions in the `github.com/xxxserxxx/gotop/devices` package.
 - `devices` will supply:
   - RegisterCPU (opt)

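To make the registration pattern above concrete, here is a minimal sketch of a hypothetical device module wired up the same way nvidia.go and remote.go are. This is not part of the commit; everything except the `Register*()` functions and the update-callback signatures is invented for illustration.

```
package devices

// A hypothetical "example" device module: init() registers a startup hook,
// and the hook registers the per-cycle update callbacks once the device is
// known to be usable.
func init() {
	RegisterStartup(startExample)
}

func startExample(vars map[string]string) error {
	// A real module would probe its hardware or read its configuration from
	// vars here, and return an error (or quietly do nothing) if unavailable.
	RegisterTemp(updateExampleTemp)
	return nil
}

// updateExampleTemp is called once per refresh cycle; it fills in the passed
// map and returns a map of per-sensor errors (nil meaning no errors).
func updateExampleTemp(temps map[string]int) map[string]error {
	temps["example"] = 42 // an invented, constant reading
	return nil
}
```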
docs/remote-monitoring.md (new file)

@@ -0,0 +1,62 @@
# Remote monitoring extension for gotop
Show data from gotop running on remote servers in a locally-running gotop. This allows gotop to be used as a simple terminal dashboard for remote servers.
![Screenshot](/assets/screenshots/fourby.png)
## Configuration
gotop exports metrics on a local port with the `--export <port>` argument. This is a simple, read-only interface with the expectation that it will be run behind some proxy that provides security. A gotop built with this extension can read this data and render it as if the devices being monitored were on the local machine.
On the local side, gotop gets the remote information from a config file; it is not possible to pass this in on the command line. The recommended approach is to create a remote-specific config file, and then run gotop with the `-C <remote-config-filename>` option. Two options are available for each remote server; one of these, the connection URL, is required.
The configuration keys have the form `remote-SERVERNAME-url` and `remote-SERVERNAME-refresh`; `SERVERNAME` can be anything -- it doesn't have to reflect any real attribute of the server, but it will be used in the widget labels for data from that server. For example, CPU data from `remote-Jerry-url` will show up as `Jerry-CPU0`, `Jerry-CPU1`, and so on; memory data will be labeled `Jerry-Main` and `Jerry-Swap`. If the refresh rate option is omitted, it defaults to 1 second.
### An example
One way to set this up is to run gotop behind [Caddy](https://caddyserver.com). The `Caddyfile` would have something like this in it:
```
gotop.myserver.net {
basicauth / gotopusername supersecretpassword
proxy / http://localhost:8089
}
```
Then, gotop would be run in a persistent terminal session such as [tmux](https://github.com/tmux/tmux) with the following command:
```
gotop -x :8089
```
Then, on a local laptop, create a config file named `myserver.conf` with the following lines:
```
remote-myserver-url=https://gotopusername:supersecretpassword@gotop.myserver.net/metrics
remote-myserver-refresh=2
```
Note the `/metrics` at the end -- don't omit that, and don't strip it in Caddy. The refresh value is in seconds. Run gotop with:
```
gotop -C myserver.conf
```
and you should see your remote server's sensors as if they were attached to your local machine.
You can add as many remote servers as you like in the config file; just follow the naming pattern.
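For instance, a config file watching two machines (the names, hosts, and credentials here are invented) just repeats the pattern:
```
remote-web-url=https://user:password@gotop.web.example.net/metrics
remote-web-refresh=2
remote-db-url=https://user:password@gotop.db.example.net/metrics
remote-db-refresh=5
```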
## Why
This can combine multiple servers into one view, which makes it more practical to use a terminal-based monitor when you have more than a couple of servers, or when you don't want to dedicate an entire wide-screen monitor to a bunch of gotop instances. It's simple to set up, configure, and run, and reasonably resource efficient.
## How
Since v3.5.2, gotop has been able to export its sensor data as [Prometheus](https://prometheus.io/) metrics using the `--export` flag. Prometheus has the advantage of being simple to integrate into clients, and its on-demand design has the *aggregator* pull data from monitors rather than having clients push data to a server. In essence, it inverts the usual client/server relationship between the aggregating server and the things being monitored. In gotop's case, this means you can turn on `-x` and it won't impact your gotop instance at all until you actively poll it. It puts control of the measurement frequency in a single place -- your local gotop. It also means you can simply stop your local gotop instance (e.g., when you go to bed) and the demand on the servers you were monitoring drops to zero.
On the client (local) side, sensors are abstracted as devices that are read by widgets, and we've simply implemented virtual devices that poll data from remote Prometheus instances. At a finer grain, a single goroutine is spawned for each remote server; it periodically polls that server and caches the results. When a widget updates and asks the virtual device for data, the device consults the cache and provides the stored values as the measurement.
The next iteration will optimize the metrics transfer protocol; while it'll likely remain HTTP, optimizations may include HTTP/2.0 streams to reduce the HTTP connection overhead, and a binary payload format for the metrics -- although HTTP/2.0 compression may eliminate any benefit of doing that.