

Elastic vs Chaos: Normalising Control-M Logs for Observability

Categories: [elastic], [control-m], [observability], [logstash], [devops], [real-world]
Tags: [elastic], [control-m], [observability], [logstash], [devops], [real-world]


Part 1 – The Agent Frontlines


The Mess Behind the Scheduler

I’ve always had a soft spot for Control-M. It’s one of those quietly terrifying systems that just runs — until it doesn’t.
And when it doesn’t, you don’t get a tidy stack trace; you get a log file that looks like a ransom note written by three different developers and a COBOL job from 1998.

My first deep dive into Control-M observability started with a simple question from ops:

“Can we get this into Elastic so we can actually see what’s happening?”

I said yes before remembering how Control-M logs actually look.

The first agent I connected to greeted me with 17 GB of flat text spread across sysout, proclog, out, and err.
Some entries had timestamps. Some didn’t. Some were in UTC, others in whatever time zone the admin felt like that day.

You learn humility fast when you tail a Control-M joblog at 2 a.m.


The Jungle of Logs

Before you can make sense of it, you need to know where everything hides.

Log Type   | Typical Path                | Description
-----------|-----------------------------|--------------------------------------------------------
joblog     | $HOME/ctm/sysout/           | The “official” story of a job run (when you’re lucky).
out / err  | $HOME/ctm/out/              | Whatever your scripts decide to print.
sysout     | /opt/controlm/ctm_sys/      | Agent background noise; occasionally useful.
proclog    | /var/log/controlm/proclog/  | Process chatter, heartbeats, restarts.
statistics | Database tables, for later. | Where truth goes to die.

When you have 150 agents across 2 000 servers, these logs multiply like fruit flies.

A typical “nice” log looks like this:

==== JOB ENDED OK  (DAILY_REPORT) ====
Ended: 2025-10-27 22:14:37
CPU Time: 00:00:02

And then there’s the other kind — the one that just says:

JOBNAME: DAILY_REPORT
ENDED OK

That’s not a log; that’s a shrug in ASCII.

Reality check: Control-M was built to run jobs, not to be pretty. Parsing it means forgiving a lot of design sins.

Why Normalise at All?

You could throw those logs into Elastic raw, sure — but you’d get a sea of text and no insight.
What I wanted was structure without fantasy — something honest enough to represent the chaos, but consistent enough to let me build a dashboard.

So I mapped the journey:

  1. Collect – Filebeat on each agent.
  2. Normalise – Logstash filters that bend but don’t break.
  3. Enrich – Add metadata (hostnames, job IDs, timestamps).
  4. Index – ECS-friendly documents Kibana won’t choke on.
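Wired together, that journey is only a few stanzas of Logstash. What follows is a minimal sketch rather than our production config: the beats port, Elasticsearch host and index name are illustrative, and the filter stages are the ones unpacked in the rest of this post.

input {
  beats { port => 5044 }                        # 1. Collect: events shipped by Filebeat on each agent
}

filter {
  # 2. Normalise + 3. Enrich: dissect, grok, mutate and date filters (detailed below)
}

output {
  elasticsearch {
    hosts => ["https://elastic.example:9200"]
    index => "controlm-agent-%{+YYYY.MM.dd}"    # 4. Index: ECS-friendly documents for Kibana
  }
}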


Building a Pipeline That Doesn’t Cry

Here’s where the real work happens — the part that separates a pretty diagram from an operational system.
I didn’t want another brittle ingest that died the moment someone renamed a log. I wanted elasticity inside Elastic.

Logs came from a predictable chaos:

/opt/controlm/ctm/sysout/<AGENT>/<JOBNAME>/<DATE>/joblog

…but naming discipline varied by region, so the path itself became one of my best data sources.

Filebeat didn’t just read logs — it understood where it found them:

processors:
  # Derive agent and job names from the file's own path (log.file.path)
  - dissect:
      tokenizer: "/opt/controlm/ctm/sysout/%{ctm_agent}/%{ctm_job}/%{ctm_date}/joblog"
      field: "log.file.path"
      target_prefix: ""    # put ctm_agent / ctm_job / ctm_date at the event root

Each Filebeat instance carried a local name (filebeat.agent.hostname) reused as controlm.agent.id.
By the time data reached Logstash, I already knew which agent the file came from, which job it belonged to, and where on disk it lived.

That contextual metadata became the bones of the event — the part I could trust even when the text inside was nonsense.


The Logstash Philosophy

Every line was an unreliable witness.
The pipeline became a series of interrogations:

  1. Multiline joined all lines starting with ^==== JOB.
  2. Dissect turned file paths into controlm.job.name and controlm.agent.host.
  3. Grok parsed known patterns.
  4. Enrichment filled gaps from path fields.
  5. Date politely searched for timestamps, then defaulted to metadata.

The dissect stage, for example, leaned on the path itself:
dissect {
  mapping => {
    "[log][file][path]" => "/opt/controlm/ctm/sysout/%{controlm.agent.host}/%{controlm.job.name}/%{+event.date}/joblog"
  }
}

Then the extras:

mutate {
  add_field => {
    "controlm.environment" => "%{[agent][name]}"
    "controlm.pipeline" => "agent_ingest"
  }
}

Small win: At first we had 12 unique job names. After path enrichment, 3 987 — actual visibility.

Grok with Mercy

Better 80 % truth than 100 % fiction.
We used progressive grok — a hierarchy of patterns from strict to loose.

grok {
  match => [
    "message", "==== JOB ENDED %{WORD:controlm.job.status}\s+\(%{WORD:controlm.job.id}\).*",
    "message", "JOBNAME:\s+%{WORD:controlm.job.name}"
  ]
  tag_on_failure => ["_grok_fail"]
}

When parsing failed, secondary grok looked for anything timestamp-like.
Everything else landed in log.unparsed, later mined by ML.
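In practice that fallback is just a conditional on the failure tag. A rough sketch, assuming the _grok_fail tag from above (the controlm.fallback_time field and the _no_timestamp tag are illustrative names):

if "_grok_fail" in [tags] {
  # Salvage anything timestamp-like from otherwise unparseable lines
  grok {
    match => [ "message", "%{TIMESTAMP_ISO8601:controlm.fallback_time}" ]
    tag_on_failure => ["_no_timestamp"]
  }
  # Keep the raw text around so nothing is lost for later mining
  mutate {
    copy => { "message" => "[log][unparsed]" }
  }
}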


The Time Problem

Time in Control-M logs is a suggestion, not a guarantee.
So we built a hierarchy of trust:

  1. Ended: line → primary source.
  2. File mtime → fallback.
  3. Ingestion time → last resort.

No more jobs ending tomorrow.
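Expressed as Logstash filters, the hierarchy is just a pair of conditionals. A sketch under assumed field names: ctm_end_raw would hold the text captured from an Ended: line, and ctm_file_mtime the modification time shipped alongside the event.

# 1. Primary: the "Ended:" timestamp, when a grok captured it into ctm_end_raw
if [ctm_end_raw] {
  date {
    match  => ["ctm_end_raw", "yyyy-MM-dd HH:mm:ss"]
    target => "@timestamp"
  }
}

# 2. Fallback: the file's modification time, if the shipper provided one
if ![ctm_end_raw] and [ctm_file_mtime] {
  date {
    match  => ["ctm_file_mtime", "ISO8601", "UNIX"]
    target => "@timestamp"
  }
}

# 3. Last resort: if neither matched, @timestamp stays at ingestion time (Logstash's default)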

Enrichment as Context, Not Decoration

We attached job identity (name, ID, status), agent context (host, environment), event timing, and a fingerprint of the source file.

Result:

controlm:
  job:
    name: DAILY_REPORT
    id: J12345
    status: ENDED_OK
  agent:
    host: ctm-agent-01
    environment: PROD
event:
  start: 2025-10-27T22:14:35Z
  end:   2025-10-27T22:14:37Z
log:
  file:
    path: /opt/controlm/ctm/sysout/ctm-agent-01/DAILY_REPORT/20251027/joblog
  sha1: a1b2c3d4e5...
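
The sha1 at the bottom is a content fingerprint used for traceability and de-duplication. Logstash's fingerprint filter can produce one; here is a sketch, assuming the hash is taken over the raw message (hashing the whole file upstream would serve the same purpose):

fingerprint {
  source => "message"          # hash the raw log line
  method => "SHA1"
  target => "[log][sha1]"      # store it alongside the file path
}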

Lesson: Don’t enrich to impress — enrich to explain.

When Chaos Starts Behaving

The first time I saw it in Kibana, I smiled.
After weeks of grey text, dashboards felt like sunlight breaking through.

Top panels: jobs per hour, average duration, success vs failure.
Bottom: agents by volume, error types, no-timestamp count.


Patterns emerged.
We found an agent whose CPU time spiked every Thursday — a rogue backup job.
No one had noticed for months.
That’s the power of structure: not perfection, but visibility.

Things the Logs Taught Me

“Logs lie,” one of our ops guys said, “but they lie in predictable ways.”

By the time we stabilised, parsing accuracy rose from 40 % to 98 %.
Elastic finally started looking like something worth paying for.


The Moment of Clarity

A funny thing happens when you make chaos visible — it becomes boring.
And boring, in infrastructure, is the dream.

Two weeks without touching the pipeline. No errors. No surprises.
That’s when you know observability works: when it stops being exciting.

What Comes Next

This was just the agent side — the visible battlefield.
Part 2 goes deeper into the Control-M database, where job statistics hide behind opaque tables and numeric codes.
That’s where “structured” stops meaning “understandable.”