March 7, 2021

Your Meta-Data

This website has been up for a while now, and I can genuinely say that I have been very happy with it. Hugo has proven to be a wonderful static-site generator. At this point, if I had to choose between it and WordPress, which I use to run The Glowing Fool website, I think I would go with Hugo nine times out of ten.

When you are running your own website, though, it can be very easy to get hung up on the stats. How many people have visited? How are the posts performing? How are people discovering the site? Perhaps this is just a personal issue, but once data is presented to me, I can get lost in it. On WordPress, site stats are presented very clearly; it’s one of the built-in standard functions. But that functionality is just not available when you use a static-site generator. That’s not to say that the data doesn’t exist. Whenever any one of us visits a website, that traffic is almost always going to be logged. It is just a matter of what logs it, and how it is stored.

In the case of this website, logs are stored by a reverse-proxy called nginx.

A reverse-proxy is an intermediary server that sits behind a firewall and acts as a go-between for clients (you) and applications (this website), routing each request to the application the client is asking for.

An nginx server re-routes users to what they are looking for.

As shown in the diagram above, any traffic that accesses any of my applications needs to go through nginx. And nginx has a logging function that takes note of the meta-data that is transmitted.

Meta-data is “data about data”. It doesn’t pertain to what is actually being transmitted (the “data” itself, i.e. the contents of a message, an image, a video, etc.) but rather to details about that transmission, such as when the data was sent or how big the files were. Each time someone accesses an application, a meta-data entry is created and stored somewhere for later reference.

So what does this meta-data look like? A default nginx meta-data entry for a single interaction, saved in a log (usually a text file), would look something like this:

14.0.152.0 - - [01/Feb/2021:19:02:39 +0800] "GET /post/password-manager/ HTTP/1.1" 304 0 "https://www.linkedin.com/" "Mozilla/5.0 (Linux; Android 10; SM-G9XXX) AppleWebKit/537.00 (KHTML, like Gecko) Chrome/88.0.4324.00 Mobile Safari/537.00"

This entry was logged when I accessed this website via my mobile phone. The information can be interpreted as follows:

| Data Type | Description | Example (taken from above) | Inference |
|---|---|---|---|
| remote_addr | The IP address that made the connection to the server. | 14.0.152.0 | This is a Hong Kong IP address, so the client is connecting from Hong Kong. |
| remote_user | The user name, if any login information was given on the page. | - | |
| time_local | The time of access. | [01/Feb/2021:19:02:39 +0800] | |
| request | The content that was requested. Usually the web page. | "GET /post/password-manager/ HTTP/1.1" | The client was looking at the “Use a Password Manager” post. |
| status | The HTTP response status code. | 304 | |
| body_bytes_sent | The size of the response body sent to the client, in bytes. | 0 | |
| http_referer | The referring website, i.e. the website that sent the user to this page. | "https://www.linkedin.com/" | The client found this post through LinkedIn. |
| http_user_agent | The type of device, OS, and application that the requester is using. | "Mozilla/5.0 (Linux; Android 10; SM-G9XXX) AppleWebKit/537.00 (KHTML, like Gecko) Chrome/88.0.4324.00 Mobile Safari/537.00" | The client is using a Samsung phone and Google Chrome to access the site. |
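To make the breakdown above a little more concrete, here is a rough sketch of how a log line like the one above could be split into those named fields with Python. The regular expression and the `parse_line` helper are my own illustration, not something nginx ships, and the pattern assumes the log uses nginx's default format.

```python
import re

# Illustrative pattern for one line in nginx's default log format.
# The group names mirror the field names used in the table.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The example entry from this post, split over several string literals.
sample = (
    '14.0.152.0 - - [01/Feb/2021:19:02:39 +0800] '
    '"GET /post/password-manager/ HTTP/1.1" 304 0 '
    '"https://www.linkedin.com/" '
    '"Mozilla/5.0 (Linux; Android 10; SM-G9XXX) AppleWebKit/537.00 '
    '(KHTML, like Gecko) Chrome/88.0.4324.00 Mobile Safari/537.00"'
)

entry = parse_line(sample)
for name, value in entry.items():
    print(f"{name}: {value}")
```

Every value comes back as a string, so anything numeric (like `status` or `body_bytes_sent`) would still need converting before doing any math on it.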

If I were to collect all the entries that nginx logs over a period of time I would be able to theoretically analyze:

  1. What content people are looking at on my website.
  2. When people access my website.
  3. From what country/continent people access my website.
  4. What OS/device people use to access my content.
  5. What sites referred people to my site.
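As a rough illustration of items 1, 2, and 5 in the list above, the sketch below tallies requested pages, hours of access, and referrers with Python's `collections.Counter`. The `summarize` function and the two extra log lines are made up for the example (only the first line is the real entry from this post), and it assumes the default nginx log format and an English locale for the month names. Items 3 and 4 are left out because they would need an IP geolocation database and a user-agent parser.

```python
import re
from collections import Counter
from datetime import datetime

# Illustrative pattern for nginx's default log format.
LINE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def summarize(lines):
    """Count requested paths, hours of access, and referrers."""
    pages, hours, referrers = Counter(), Counter(), Counter()
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        parts = match["request"].split()
        if len(parts) >= 2:
            pages[parts[1]] += 1  # e.g. "GET /path HTTP/1.1" -> "/path"
        when = datetime.strptime(match["time_local"], "%d/%b/%Y:%H:%M:%S %z")
        hours[when.hour] += 1
        if match["http_referer"] != "-":  # "-" means no referrer was sent
            referrers[match["http_referer"]] += 1
    return pages, hours, referrers


sample_lines = [
    # The real entry from this post.
    '14.0.152.0 - - [01/Feb/2021:19:02:39 +0800] '
    '"GET /post/password-manager/ HTTP/1.1" 304 0 "https://www.linkedin.com/" '
    '"Mozilla/5.0 (Linux; Android 10; SM-G9XXX) AppleWebKit/537.00 '
    '(KHTML, like Gecko) Chrome/88.0.4324.00 Mobile Safari/537.00"',
    # Two made-up entries, using reserved documentation IP addresses.
    '203.0.113.7 - - [01/Feb/2021:19:05:10 +0800] '
    '"GET /post/password-manager/ HTTP/1.1" 200 5120 "-" "ExampleAgent/1.0"',
    '203.0.113.9 - - [01/Feb/2021:21:40:00 +0800] '
    '"GET / HTTP/1.1" 200 2048 "https://www.linkedin.com/" "ExampleAgent/1.0"',
]

pages, hours, referrers = summarize(sample_lines)
print("Pages:", pages.most_common(3))
print("Hours:", hours.most_common(3))
print("Referrers:", referrers.most_common(3))
```

In practice the lines would come from reading the log file itself rather than from a hard-coded list.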

Then, using a tool called GoAccess, I can create graphs that chart out this information for me. Some examples are here:

Number of hits in a day.

Countries of visitors in a day. Based on IP address.

OS used by visitors.

All of this is data that a host is able to gather whenever you visit their website. It is unfortunately just another aspect of browsing the internet. The “price of admission” so to speak. And outside of using a privacy-centric browser or a VPN, you can’t really mask your presence on the internet.

But on the bright side, the meta-data that is logged is not truly personal, nor can it, by itself, be used to identify you as an individual. The most identifiable piece of meta-data you hand over when you visit a website is your IP address. But your IP address changes depending on which network you access from. If you browse a website from a coffee shop, the IP address that is logged belongs to the shop. And even if you access a website from home, most home internet plans don’t have fixed IP addresses, which means that your IP address changes every so often anyway.

That being said, it is just handy to know that this is information that websites collect on you whenever you visit them. Knowing what information you hand over to companies is the first major step in maintaining good cyber hygiene.