Rusty Advent of Cyber 2023: Day 2

Time for some Polars data science!


Day 2 of Advent of Cyber 2023, done in Rust as much as possible! I was surprised to find that Rust actually had a crate for data science, given that historically, that's been Python's domain. However, with the Advent (pardon the pun) of efficiency and memory safety, Rust is definitely a solid candidate for what we have today.

If you want to follow along, make sure you have Rust installed (I’m using 1.74.1 for reference).

Obligatory disclaimer: This is for educational purposes only. I am not responsible for any irresponsible or unethical use of these techniques.

With that out of the way, let's rock!

Day 2: Scenario and Recon

Most of this module is made up of 3 Jupyter notebooks, designed to get you up to speed with Python, Pandas, and Jupyter.

Cool, I'm working with Rust so I don't need all that.

The fourth notebook is where the meat of our task comes in, as we need to analyze some packet capture metadata. We have a .csv file on the machine that we need to analyze. All of this is looking good, but there's one tiny problem we need to resolve before we can proceed.

How do we get the file onto our local machine?

You weren't about to work with Rust in the VM, were you? Personally, I like having a development environment under my control that isn't time-boxed to under 2 hours (unless you keep refreshing every hour or so). So how can we get that CSV onto our machine? We're going to need VPN access for this, so make sure you're linked up.

File Exfiltration

One of the simplest ways to send files back and forth between systems is an HTTP server. Install a quick module with upload functionality (I typically use this for my purposes), fire up the server, and you're in business.

Luckily, we have an HTTP server with upload functionality available to us thanks to this super handy crate! I wasn't about to write a custom server this early on; I had enough of that in Day 5.

Let's install simple-http-server real quick. We already have Rust on our system, so we just need to install the crate.

cargo install simple-http-server
rehash

Bingo. Now, what do we need out of this server? We'll need upload functionality since we're exfiltrating. Everything else about the defaults looks pretty good, though!

Since we're going to want the CSV file in the root of our Rust project, let's make that now. We’ll get the server up and running right away as well.

cargo new aocyber-day2
cd aocyber-day2
simple-http-server -u

This'll open an HTTP server on port 8000; if that's taken, you can swap the port with the -p flag. In a separate terminal, grab your IP address on the VPN. This command will show us the IPv4 address of our local machine on tun0, the virtual network interface OpenVPN creates (in other words, it's kinda like a virtual Ethernet port).

ip -4 addr show dev tun0

Copy the address and paste it into your VM's browser, along with our port. The final URL should look like: http://xx.yy.zz.aa:8000. Obviously, replace xx.yy.zz.aa with your machine IP.

Once you see the web page, click on the Browse button and find our network traffic CSV. It'll be under the 4th notebook. Upload it and let's get back to home base: we have some frosty Rust to write.

Polars: Rusty Data Science

First off, let’s make sure our CSV file is somewhere out of the way: I made a data folder so it’s located at data/network_traffic.csv. Here’s a look at the project structure.

Let’s also get Polars onboard. We’ll need lazy loading for this, which will make sure we don’t actually process data until we make a final call in the program.

cargo add polars -F lazy
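If "lazy loading" sounds exotic, standard Rust iterators actually work the same way: adapter calls like map only describe the work to be done, and nothing executes until a terminal call like collect. Here's a minimal standalone sketch of that idea (no Polars involved):

```rust
fn main() {
    // Nothing runs here yet -- `map` just builds up a plan,
    // much like a Polars lazy frame describing a query.
    let plan = (1..=5).map(|x| {
        println!("processing {x}");
        x * 2
    });

    // The closure only fires now, at the terminal `collect` call --
    // analogous to Polars' final `.collect()?`.
    let doubled: Vec<i32> = plan.collect();
    println!("{doubled:?}");
}
```

The payoff is the same in both worlds: the library sees the whole plan up front and can optimize it before touching any data.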

Alright, time to get cracking. First things first, let’s load in our CSV. Our main function should look like this:

use polars::error::PolarsResult;
use polars::{lazy::dsl::count, prelude::*};

fn main() -> PolarsResult<()> {
    let df = CsvReader::from_path("data/network_traffic.csv")?.finish()?;

    Ok(())
}

Let’s check to make sure everything’s in working order (it might take a second, Polars is a sizable library):

cargo run

If that completed with no errors, we should be golden. Let’s move on and run some numbers.

Task 1: Packet Counting

For Task 1, we just need to count some packets. This should be easy. Add the following to your main function, just after loading in the CSV:

//Task 1: Count the Packets
let processed1: DataFrame = df.clone().lazy().select([count()]).collect()?;

//Get the actual result out of the "count" column
if let Some(res1) = processed1["count"].iter().next() {
    println!("Number of packets in file: {}", res1);
} else {
    println!("Something went wrong with task 1.");
}

The first bit is fairly cut and dried: we're just making a count column that counts up our rows and packing it into a data frame that we can process. The weird part is this if let stuff. What's going on here?

Well, in Rust, you need a way to represent data that doesn't exist. However, null pointers aren't allowed, so we need to represent that a different way. Enter Rust's Option enum, which either holds Some data or None. None is about what it sounds like: no data whatsoever. In Rust, whenever there's ambiguity about whether data exists or not, we generally return an Option.
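As a quick standalone illustration (plain Vec, no Polars), here's Option and if let working together; the values are made up for the example:

```rust
fn main() {
    // Hypothetical packet counts, just for illustration.
    let packets = vec![120, 45, 300];

    // `first()` returns Option<&i32>: Some(&120) here, or None if empty.
    if let Some(count) = packets.first() {
        println!("First count: {}", count);
    } else {
        println!("No data at all.");
    }

    // On an empty Vec, `first()` hands back None,
    // so we'd fall into the else branch instead.
    let empty: Vec<i32> = Vec::new();
    assert!(empty.first().is_none());
}
```

Same shape as the Polars snippet above: the iterator either hands us a value or it doesn't, and if let routes us accordingly.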

Now, the part that’s causing the weirdness:

if let Some(res1) = processed1["count"].iter().next() {
    // BRANCH ALPHA
} else {
    // BRANCH BRAVO
}

What this is essentially saying is: if we actually get something out of whatever is on the right-hand side of that equals sign (=), we'll store it in res1 and go down BRANCH ALPHA. Otherwise (i.e., we get None), we go down BRANCH BRAVO.

So that actually gets us our number of packets! Great. Let’s move on.

Task 2: Who’s Feeling Chatty?

Task 2 has us trying to figure out which IP address had the highest number of packets sent. So we need to make sure we bundle up our data so we can find out who sent how many packets, count those packets sent, and grab the highest.

//Task 2: IP Address with the most packets sent
let processed2: DataFrame = df
    .clone()
    .lazy()
    .group_by(["Source"])
    .agg([count()])
    .sort(
        "count",
        SortOptions {
            descending: true,
            nulls_last: true,
            ..Default::default()
        },
    )
    .limit(1)
    .collect()?;

if let Some(res2) = processed2["Source"].iter().next() {
    println!("Most frequent sender: {}", res2.get_str().unwrap());
} else {
    println!("Something went wrong with task 2.");
}

If you’ve done anything with data science or databases, the phrase “group by” should be familiar. You pick a column of data, mark that as your groups, and then process data accordingly. So, say if I had some data about cars, I could group by manufacturer to find out who sold the most cars, as an example. Or, I could group by vehicle type to see what was most popular that year.

The code above has this in action. We're grouping by the "Source" column, which has our source IP address for each packet. Then, we're using an aggregator to count up the packets by their source IP. Finally, we'll sort by those packet counts and snag the first one which, since we're sorting in descending order, should give us our highest.

Extracting the data is a similar deal, but we convert it to a string slice (via the get_str() method) just so it reads a little cleaner.

Task 3: Mr. Popular Protocol

Task 3 is just about finding out what the most commonly used protocol was in the packet capture.

Fun fact: you only have to change, like, 2 values in the source code, and they're both the column name.

//Task 3: Most Popular Protocol
let processed3 = df
    .clone()
    .lazy()
    .group_by(["Protocol"])
    .agg([count()])
    .sort(
        "count",
        SortOptions {
            descending: true,
            nulls_last: true,
            ..Default::default()
        },
    )
    .limit(1)
    .collect()?;

if let Some(res3) = processed3["Protocol"].iter().next() {
    println!("Most frequent protocol: {}", res3.get_str().unwrap());
} else {
    println!("Something went wrong with task 3.");
}

A Quick Refactor

Now that I think about it, we can probably clean this up and make it a little less copy-paste-y. Let’s offload our most popular call chain into its own function:

fn most_popular(column: &str, df: &DataFrame) -> PolarsResult<DataFrame> {
    Ok(df
        .clone()
        .lazy()
        .group_by([column])
        .agg([count()])
        .sort(
            "count",
            SortOptions {
                descending: true,
                nulls_last: true,
                ..Default::default()
            },
        )
        .limit(1)
        .collect()?)
}

Unfortunately I wasn’t able to return the iterator result directly because of borrow checking, so we still have to do that somewhat repetitively. Modifying our main function:

//Task 2: IP Address with the most packets sent
let processed2: DataFrame = most_popular("Source", &df)?;
if let Some(res2) = processed2["Source"].iter().next() {
    println!("Most frequent sender: {}", res2.get_str().unwrap());
} else {
    println!("Something went wrong with task 2.");
}

//Task 3: Most Popular Protocol
let processed3: DataFrame = most_popular("Protocol", &df)?;
if let Some(res3) = processed3["Protocol"].iter().next() {
    println!("Most frequent protocol: {}", res3.get_str().unwrap());
} else {
    println!("Something went wrong with task 3.");
}

This way, if we ever needed to figure out what the most common target was, or if we had some additional data we needed the mode of, or we ever needed to get the top 5 of the most popular source IPs, or what have you, we have a way of going about that much more easily.
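That "top 5" idea, for instance, is just sort-then-take-N instead of sort-then-take-1. Here's a plain-Rust sketch of that shape on already-aggregated data (the IPs and counts are hypothetical, and no Polars is involved):

```rust
fn main() {
    // Hypothetical (source IP, packet count) pairs, already counted up.
    let mut by_source = vec![
        ("10.0.0.5", 120),
        ("10.0.0.9", 340),
        ("10.0.0.2", 87),
        ("10.0.0.7", 210),
    ];

    // Sort descending by count, then keep only the top 3 --
    // the same shape as sorting descending and swapping limit(1) for limit(3).
    by_source.sort_by(|a, b| b.1.cmp(&a.1));
    by_source.truncate(3);

    for (ip, n) in &by_source {
        println!("{ip}: {n} packets");
    }
}
```

In the Polars version, a top-N variant of most_popular would just take the N as a parameter and hand it to limit.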

The Payoff

Let’s test and make sure what we have works:

Bingo! I specifically redacted the outputs so that you don’t get the flags for free (come on, you thought I was gonna let you have it that easy?), but that is our data processed.

Conclusion

This has been day 2 of my Rusty Advent of Cyber! A new data science library I had never heard of, file exfiltration, and some handy error handling.

If you want to test out the full versions of these programs, I have the full repo on Github!

Next time, we’re going to try our hand at brute forcing a PIN and writing our own mini version of Hydra. See you then!
