Analysing Wikipedia edits with IBM Event Processing

By Dale Lane posted 4 days ago

In this post, I’ll share a demo I gave today to explain some of the processing nodes in the palette of IBM Event Processing.

I’ve found that demonstrations of Event Processing are easier to understand when I don’t need to explain the stream of events I’m processing in the first place. This means I’m always looking for interesting real-world event streams that are widely understood, as they can make for the most effective demos.

With this in mind, today I tried explaining a few of the Event Processing nodes by using them with a live stream of events representing pages that are being created and edited in the English Wikipedia.

Each event contains:

  • title of the page
  • who made the edit (user ID if logged in, or IP address if anonymous)
  • was this the creation of a new page, or an edit of an existing page?

Every edit on Wikipedia results in an event on the Kafka topic, so there are typically a few events a second. It’s not a super-high-throughput topic in Kafka terms, but there are enough events to try out interesting ideas.
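
The demos below were built in IBM Event Processing's low-code canvas, but the flows run as Apache Flink jobs, so alongside each demo I'll sketch a rough Flink SQL equivalent. The sketches all assume a source table along these lines (the table and column names here are illustrative guesses based on the event description above, not the exact schema used in the demo):

    -- sketch of a possible Flink SQL source table for the edit events
    -- (names and types are illustrative, not the demo's actual schema)
    CREATE TABLE `wikipedia` (
        `title`      STRING,        -- title of the page
        `user`       STRING,        -- user ID if logged in, IP address if anonymous
        `userid`     BIGINT,        -- 0 indicates an anonymous user
        `type`       STRING,        -- 'new' for a page creation, 'edit' otherwise
        `event_time` TIMESTAMP(3),  -- when the edit was made
        WATERMARK FOR `event_time` AS `event_time` - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka'
        -- Kafka connection properties omitted
    );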

Here are a few of the demos I gave today.

This is by no means an exhaustive list of what you could do with this data, but it was enough to let me show what the most commonly-used tools in the palette can do.


How many Wikipedia edits are made per day?

The Aggregate node lets us easily count how many edits we can see in the event stream.

Aggregate node: edits per day

  • Time window: 1 day
  • Aggregate function: COUNT
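
Roughly, as Flink SQL (a sketch using the assumed `wikipedia` table from above):

    -- count edit events in one-day tumbling windows
    SELECT
        window_start,
        window_end,
        COUNT(*) AS `number of edits`
    FROM TABLE(
        TUMBLE(TABLE `wikipedia`, DESCRIPTOR(`event_time`), INTERVAL '1' DAY))
    GROUP BY window_start, window_end;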

Which Wikipedia pages were edited the most times each day?

Using the Aggregate node together with a Top-n node lets us count things, and then keep the ones with the highest counts.

For example, for each day, we can see which three pages had the most edit events.

Aggregate node: edits per page

  • Time window: 1 day
  • Aggregate function: COUNT
  • Group by: title

Top-n node: pages with most edits

  • Number of results to keep: 3
  • Ordered by: number of edits (descending)
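
As Flink SQL, this Aggregate + Top-n pair is roughly the window top-n pattern (again, a sketch against the assumed `wikipedia` table):

    -- keep the 3 most-edited page titles from each one-day window
    SELECT window_start, window_end, `title`, `number of edits`
    FROM (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY window_start, window_end
                ORDER BY `number of edits` DESC) AS rownum
        FROM (
            SELECT
                window_start,
                window_end,
                `title`,
                COUNT(*) AS `number of edits`
            FROM TABLE(
                TUMBLE(TABLE `wikipedia`, DESCRIPTOR(`event_time`), INTERVAL '1' DAY))
            GROUP BY window_start, window_end, `title`
        )
    )
    WHERE rownum <= 3;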

Who made the most edits on Wikipedia each day?

Adding a Filter node before the Aggregate node means we only count the events that are relevant to our query – and then the Top-n node lets us keep the results with the highest counts.

For example, for each day, we can see which logged-in users produced the most edit events.

Filter node: ignore anon users & bots

  • userid <> 0 (Wikipedia uses 0 to indicate anonymous users)
  • user <> ‘bot name’ (repeat this for the most popular bots, such as “Citation bot”, “InternetArchiveBot”, “WikiCleanerBot”, etc.)

Aggregate node: edits per user

  • Time window: 1 day
  • Aggregate function: COUNT
  • Group by: user

Top-n node: users with most edits

  • Number of results to keep: 3
  • Ordered by: number of edits (descending)
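
In Flink SQL terms, only the aggregation that feeds the top-n changes: it gains a WHERE clause and groups by user instead of title. A sketch of that inner query (the surrounding top-n query is the same as in the previous example):

    -- filtered, per-user aggregation that feeds the top-n
    SELECT
        window_start,
        window_end,
        `user`,
        COUNT(*) AS `number of edits`
    FROM TABLE(
        TUMBLE(TABLE `wikipedia`, DESCRIPTOR(`event_time`), INTERVAL '1' DAY))
    WHERE `userid` <> 0
      AND `user` NOT IN ('Citation bot', 'InternetArchiveBot', 'WikiCleanerBot')
    GROUP BY window_start, window_end, `user`;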

Where are most of the anonymous Wikipedia editors?

Combining the same Filter, Aggregate, and Top-n nodes lets us see, for each day, the IP address that most of the anonymous edits were made from.

Filter node: anonymous users

  • userid = 0 (Wikipedia uses 0 to indicate anonymous users)

Aggregate node: count edits per location

  • Time window: 1 day
  • Aggregate function: COUNT
  • Group by: user (for anonymous edits, this field contains the IP address)

Top-n node: locations with most anon edits

  • Number of results to keep: 1
  • Ordered by: number of edits (descending)
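
The Flink SQL sketch is the same shape again, with the filter inverted and only the single highest count kept:

    -- single most active IP address per one-day window
    SELECT window_start, window_end, `user` AS `location`, `number of edits`
    FROM (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY window_start, window_end
                ORDER BY `number of edits` DESC) AS rownum
        FROM (
            SELECT window_start, window_end, `user`,
                   COUNT(*) AS `number of edits`
            FROM TABLE(
                TUMBLE(TABLE `wikipedia`, DESCRIPTOR(`event_time`), INTERVAL '1' DAY))
            WHERE `userid` = 0
            GROUP BY window_start, window_end, `user`
        )
    )
    WHERE rownum <= 1;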

How many anonymous Wikipedia editors have an IPv6 address?

Using a Transform node lets us derive new properties from the existing event attributes.

For example, using regular expressions on the IP address in the events for anonymous edits lets us recognise and count the number of edits from IPv4 and IPv6 addresses.

Filter node: anonymous users

  • userid = 0 (Wikipedia uses 0 to indicate anonymous users)

Transform node: check IP address type

  • isIPv4 =
    IF(REGEXP(`user`, '\b(?:\d{1,3}\.){3}\d{1,3}\b'), 1, 0)
  • isIPv6 =
    IF(REGEXP(`user`, '\b([0-9a-fA-F]{1,4}:){7}([0-9a-fA-F]{1,4})\b'), 1, 0)

Aggregate node: count

  • Time window: 1 day
  • Aggregate function: SUM isIPv4
  • Aggregate function: SUM isIPv6
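
As Flink SQL, the Transform node becomes a projection that the windowed aggregation then sums over. A sketch, using a temporary view to keep the event-time attribute available to the window:

    -- flag each anonymous edit as coming from an IPv4 or IPv6 address
    CREATE TEMPORARY VIEW `anonymous edits` AS
        SELECT
            `event_time`,
            IF(REGEXP(`user`, '\b(?:\d{1,3}\.){3}\d{1,3}\b'), 1, 0) AS isIPv4,
            IF(REGEXP(`user`, '\b([0-9a-fA-F]{1,4}:){7}([0-9a-fA-F]{1,4})\b'), 1, 0) AS isIPv6
        FROM `wikipedia`
        WHERE `userid` = 0;

    -- sum the flags per one-day window
    SELECT
        window_start,
        window_end,
        SUM(isIPv4) AS `edits from IPv4 addresses`,
        SUM(isIPv6) AS `edits from IPv6 addresses`
    FROM TABLE(
        TUMBLE(TABLE `anonymous edits`, DESCRIPTOR(`event_time`), INTERVAL '1' DAY))
    GROUP BY window_start, window_end;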

A Transform node also lets us transform results into a form that is easier to consume.

For example, we can take the raw counts from the previous example and turn them into percentages.

Transform node: calculate percentages

  • IPv4 edits (%) =
    ROUND(100 * CAST(`edits from IPv4 addresses` AS DOUBLE) / (`edits from IPv4 addresses` + `edits from IPv6 addresses`), 0)
  • IPv6 edits (%) =
    ROUND(100 * CAST(`edits from IPv6 addresses` AS DOUBLE) / (`edits from IPv4 addresses` + `edits from IPv6 addresses`), 0)
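
As a Flink SQL sketch, this is just another projection over the previous result (written here against a hypothetical `ip version counts` view holding the output of the aggregation above):

    -- turn the raw counts into rounded percentages
    SELECT
        window_start,
        window_end,
        ROUND(100 * CAST(`edits from IPv4 addresses` AS DOUBLE)
            / (`edits from IPv4 addresses` + `edits from IPv6 addresses`), 0) AS `IPv4 edits (%)`,
        ROUND(100 * CAST(`edits from IPv6 addresses` AS DOUBLE)
            / (`edits from IPv4 addresses` + `edits from IPv6 addresses`), 0) AS `IPv6 edits (%)`
    FROM `ip version counts`;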

Which new Wikipedia pages received the most edits in the first hour after creation?

Using an Interval join lets us make time-based correlations between event streams.

For example, if we split the Wikipedia events into events about creation of new pages, and events about edits of existing pages, we can correlate to see which new pages received the most edits.

Filter node: new page

  • type = ‘new’

Filter node: edits

  • type = ‘edit’

Interval join node: edits of new pages

  • Join condition: `new pages`.`title` = `edits`.`title`
  • Time window: 1 hour from the `new pages` event_time

Aggregate node: number of edits per page

  • Time window: 1 day
  • Aggregate function: COUNT
  • Group by: page title

Top-n node: new pages with the most edits in the first hour

  • Number of results to keep: 1
  • Ordered by: number of edits (descending)
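
A Flink SQL sketch of the interval join itself, reusing the assumed `wikipedia` table (the downstream Aggregate and Top-n steps follow the same patterns shown earlier):

    -- split the stream into page creations and edits
    CREATE TEMPORARY VIEW `new pages` AS
        SELECT `title`, `event_time` FROM `wikipedia` WHERE `type` = 'new';
    CREATE TEMPORARY VIEW `edits` AS
        SELECT `title`, `event_time` FROM `wikipedia` WHERE `type` = 'edit';

    -- pair each edit with a creation of the same page within the previous hour
    SELECT
        `new pages`.`title`,
        `edits`.`event_time` AS `edit time`
    FROM `new pages`, `edits`
    WHERE `new pages`.`title` = `edits`.`title`
      AND `edits`.`event_time` BETWEEN `new pages`.`event_time`
          AND `new pages`.`event_time` + INTERVAL '1' HOUR;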

Want to try this for yourself?

If you’d like to recreate this demo for yourself, I have instructions for how to get access to this stream of events at github.com/dalelane/kafka-demos.
