
Unique URLs on a dataframe column in Python

Written by Arthur Camberlein | Published on & updated on

It might be a simple trick, but I wanted to share something I use on a regular basis: getting the unique URLs from a dataframe column, and exporting them!

I will use pandas, so start by importing the library: import pandas as pd

I am labelling this as Python SEO because I mostly use it to extract 404s, create redirects, and avoid redirect loops in files.

Load your dataframe from the CSV

In this case I am loading the data from a CSV file: df = pd.read_csv('pages.csv')
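If the CSV has many columns and only the URL column matters, a possible variation is to load just that column. This is a minimal sketch: the inline CSV text and the Status column are made up so the example runs without a real pages.csv file.

```python
import io

import pandas as pd

# Inline CSV text standing in for pages.csv (made-up content)
csv_text = (
    "Source url,Status\n"
    "https://example.com/a,404\n"
    "https://example.com/b,200\n"
)

# usecols limits the load to the column that is actually needed
df = pd.read_csv(io.StringIO(csv_text), usecols=["Source url"])

print(df.columns.tolist())
```

With a real file, the first argument would simply be the path, as in pd.read_csv('pages.csv', usecols=['Source url']).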

unique_urls = df['Source url'].unique()

Using .unique() will save you some time: unique_urls = df['Source url'].unique(), where Source url is the name of the column.
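To see what .unique() gives back, here is a minimal sketch on a small made-up dataframe (the example.com URLs are placeholders, not real data). It keeps each distinct value once, in order of first appearance, and an empty cell counts as a value of its own, so dropping empty cells first can be worth it.

```python
import pandas as pd

# Hypothetical sample data standing in for pages.csv
df = pd.DataFrame({
    "Source url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",  # duplicate of the first row
        None,                     # empty cell
    ]
})

# Distinct values, in order of first appearance (the empty cell is included)
unique_urls = df["Source url"].unique()

# Dropping missing values first keeps only real URLs
clean_urls = df["Source url"].dropna().unique()

print(len(unique_urls))
print(list(clean_urls))
```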

Exporting the unique URLs to a text file

I didn't find a simpler solution than a short loop, where output_filepath is the path of the text file (for example 'unique_urls.txt'):

with open(output_filepath, 'w') as file:
    for url in unique_urls:
        file.write(url + '\n')
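As a possible alternative to the loop, pandas can write the column itself. This is only a sketch under a few assumptions: the sample data is made up, the file goes to a temporary folder so nothing real gets overwritten, and empty cells are dropped first (the loop would fail on an empty cell, because a missing value is not a string).

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; in the article this comes from pages.csv
df = pd.DataFrame({
    "Source url": [
        "https://example.com/a",
        "https://example.com/a",
        None,
        "https://example.com/b",
    ]
})

# Temporary folder so the sketch does not overwrite a real file
output_filepath = os.path.join(tempfile.mkdtemp(), "unique_urls.txt")

# dropna() removes empty cells, drop_duplicates() keeps the first copy of each URL
urls = df["Source url"].dropna().drop_duplicates()

# One URL per line, without the index or the column header
urls.to_csv(output_filepath, index=False, header=False)

# Read the file back to check the result
with open(output_filepath) as file:
    lines = file.read().splitlines()
print(lines)
```

One caveat: to_csv follows CSV quoting rules, so a URL containing a comma would be wrapped in quotes, while the loop writes the raw string.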

Bonus point: print how many URLs were exported

Using a print function and len(): print(f"{len(unique_urls)} unique URLs have been written to {output_filepath}")
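One thing to keep in mind when counting: the length of the .unique() result and .nunique() can disagree when the column has empty cells. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical column with one duplicated URL and one empty cell
df = pd.DataFrame({
    "Source url": ["https://example.com/a", "https://example.com/a", None]
})

unique_urls = df["Source url"].unique()

# len() counts the empty cell as one of the values
count_with_missing = len(unique_urls)

# nunique() ignores missing values by default
count_without_missing = df["Source url"].nunique()

print(count_with_missing, count_without_missing)
```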

Finally, the whole script!

import pandas as pd

# Load the DataFrame from a CSV file
df = pd.read_csv('pages.csv')

# Extract unique URLs from the 'Source url' column
unique_urls = df['Source url'].unique()

# Path for the output text file
output_filepath = 'unique_urls.txt'

# Exporting the unique URLs to a text file, each URL on a new line
with open(output_filepath, 'w') as file:
    for url in unique_urls:
        file.write(url + '\n')

print(f"{len(unique_urls)} Unique URLs have been written to {output_filepath}")

Learn more with the article FAQ

Unique URLs on a dataframe column in Python - FAQ

What is a unique URL?

A unique URL is a URL that appears only once in a dataframe column.

Why is it good to only have unique URLs?

Having unique URLs helps you avoid counting data twice. The larger the dataframe, the more columns and rows you have, and the more errors you can run into.

Imagine having several URLs with ~100 monthly organic sessions counted several times while you "only" have a few thousand sessions per month on your website. It would introduce a large error margin.
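The double counting described above can be sketched in a few lines. The sessions column and the numbers are made up for illustration; the idea is to keep one row per URL before summing.

```python
import pandas as pd

# Made-up export where the same URL is listed twice (numbers are illustrative)
df = pd.DataFrame({
    "Source url": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ],
    "sessions": [100, 100, 50],
})

# Summing as-is counts the duplicated URL twice
total_with_duplicates = df["sessions"].sum()

# Keep the first row for each URL, then sum
deduplicated = df.drop_duplicates(subset="Source url")
total_deduplicated = deduplicated["sessions"].sum()

print(total_with_duplicates, total_deduplicated)
```

Here drop_duplicates is used instead of .unique() because it keeps the other columns of each row.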

Why use Python?

Because I can, and because it's a well-known language used by many people on the web. You could also use R (or even Google Sheets or MS Excel) to achieve the same result!

Blog post tagged in: Python, Python SEO
