Programming Archives - Go Fish Digital

How SEOs Can Identify Low-Quality Pages with Python & Compression Ratios

In our SEO efforts, we’re always on the lookout for innovative ways to assess page quality. Recently, an article on Search Engine Journal got us thinking about a unique approach: using compression ratios as a signal for low-quality content. Inspired by this concept, as well as a 2006 research paper on spam detection, we decided to explore whether page compressibility could reveal potential quality issues on our own site.

To test this out, we drafted a Python script to analyze a page’s compression ratio. The basic idea is that pages with redundant or low-value content tend to compress far more than high-quality, informative pages, and that kind of redundancy is a hallmark of spammy pages and thin SEO content.

We ran this on every page of Go Fish Digital’s website. The results show 157 pages on our site scoring above 4.0, the threshold at which the study suggests a page is more likely than not to be low quality. Below, we’ll walk through the script we created to score a webpage and explain each part so you can use it to analyze your own pages.

Interested in scoring your entire site to identify content quality issues? We can help.

Request a Custom Proposal

Understanding Compression Ratios as a Quality Metric

The theory is simple: compression algorithms like gzip reduce file sizes by eliminating redundant data. If a page compresses significantly, it likely has a lot of repetitive or boilerplate content. According to the research we reviewed, high compression ratios can indicate lower-quality or spammy pages, as they often contain repeated phrases, excessive keywords, or general “filler” content. By measuring this ratio, we can identify pages that might be impacting the overall quality of a site.
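
As a quick, standalone illustration of the idea (our own sketch, separate from the script below), compare how gzip handles keyword-stuffed, repetitive text with how it handles varied prose:

import gzip

def ratio(text):
    # Original byte size divided by gzip-compressed byte size
    raw = text.encode('utf-8')
    return len(raw) / len(gzip.compress(raw))

# Repetitive, keyword-stuffed text gives gzip a lot of redundancy to remove...
repetitive = "best cheap widgets buy best cheap widgets online today " * 20

# ...while varied prose compresses far less.
varied = (
    "Compression ratios climb when the same phrases appear over and over. "
    "A page that introduces new information in each sentence gives the "
    "algorithm much less repetition to exploit, so its ratio stays lower. "
    "Real pages are longer than this snippet, but the contrast still holds "
    "when you compare documents of roughly similar length."
)

print(f"Repetitive text: {ratio(repetitive):.2f}")  # noticeably higher
print(f"Varied text:     {ratio(varied):.2f}")      # noticeably lower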

The Python Code: Analyzing Page Compression Ratios

We drafted Python code that fetches a page, extracts its main content, compresses it, and then calculates the compression ratio. Below, we’ll break down each function; the full script is provided at the end.

Breaking Down the Code

Let’s take a closer look at each function and explain how it works. We’ll walk through the Python modules needed, how we request a page and extract the text, how we calculate the compression ratio, and how we print the results. Finally, we’ll share the entire script at the end.

Step 1: Import the needed Python modules

To get started, we’ll want to use requests, BeautifulSoup, and gzip (the first two can be installed with pip as requests and beautifulsoup4; gzip ships with Python’s standard library). We added these imports to the top of our script.

import requests
from bs4 import BeautifulSoup
import gzip

Step 2: Fetch and Parse the Webpage

We then created a fetch_and_parse function that sends a request to the URL and parses the HTML content using BeautifulSoup. We also remove unnecessary tags (<head>, <header>, <footer>, <script>, <style>, and <meta>) to focus on the main body of the content. Removing these tags helps us avoid compressing unrelated HTML or scripts, allowing us to analyze only the visible text content.

# Function to fetch and parse a webpage with headers to mimic a browser
def fetch_and_parse(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Use a session to reuse the connection
    with requests.Session() as session:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

    # Remove non-content tags (head, header, footer, script, style, meta)
    for tag in soup(['head', 'header', 'footer', 'script', 'style', 'meta']):
        tag.decompose()  # Remove the tag and its contents

    return soup
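
One optional tweak (our addition, not part of the original script) is to set a request timeout and raise on HTTP errors, so a slow or broken page fails loudly instead of quietly producing an empty score:

# A slightly hardened variant of fetch_and_parse (our tweak, not in the original article)
def fetch_and_parse_safe(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    with requests.Session() as session:
        # Give up after 15 seconds and raise an exception on 4xx/5xx responses
        response = session.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

    # Remove non-content tags, same as the original function
    for tag in soup(['head', 'header', 'footer', 'script', 'style', 'meta']):
        tag.decompose()

    return soup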

Step 3: Extract the Text Content

Next, the extract_text_selectively function extracts the main text content from the tags that typically hold visible text: paragraphs (<p>), list items (<li>), headings (<h1> through <h6>), table rows, and any text sitting directly inside container elements such as <div> and <section>. We concatenate the text from these tags into a single string, which keeps empty or non-text content from skewing the ratio.

# Function to extract text from a soup object with selective combining
def extract_text_selectively(soup):
    individual_tags = {'p', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'table', 'tr'}
    container_tags = {'div', 'section', 'article', 'main'}
    excluded_tags = {'style', 'script', 'meta', 'body', 'html', '[document]', 'button'}
    inline_tags = {'a', 'strong', 'b', 'i', 'em', 'span'}

    text_lines = []
    processed_elements = set()

    for element in soup.find_all(True, recursive=True):
        if element in processed_elements or element.name in excluded_tags:
            continue

        # Handle table rows (<tr>) and their child elements
        if element.name == 'tr':
            row_text = []
            for cell in element.find_all(['th', 'td']):
                cell_text = format_cell_text(cell)
                if cell_text:
                    row_text.append(cell_text)

            if row_text:
                combined_row_text = ', '.join(row_text)
                text_lines.append(('tr', combined_row_text))

            for descendant in element.descendants:
                processed_elements.add(descendant)
            processed_elements.add(element)
            continue

        elif element.name in individual_tags:
            if element.name == 'p' and element.find_parent('td') and element.find_parent('tr'):
                continue
            if element.parent in processed_elements:
                continue

            inline_text = ''
            if any(element.find(tag) for tag in inline_tags):
                inline_text = ' '.join(element.stripped_strings)
            else:
                inline_text = ' '.join(element.find_all(text=True, recursive=False)).strip()

            if inline_text and element not in processed_elements:
                text_lines.append((element.name, inline_text))
                processed_elements.add(element)

        elif element.name in container_tags:
            if element.parent in processed_elements:
                continue
            direct_text = ' '.join([t for t in element.find_all(text=True, recursive=False) if t.strip()]).strip()
            if direct_text and element not in processed_elements:
                text_lines.append((element.name, direct_text))
                processed_elements.add(element)

    combined_text = ' '.join(line for _, line in text_lines)
    return combined_text
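
The row-handling branch above calls a small format_cell_text helper that flattens each table cell’s text and normalizes its whitespace. It appears again in the full script further down, but here it is for completeness:

# Function to properly format cell content by ensuring spaces between header and value
def format_cell_text(cell):
    text = cell.get_text(separator=' ', strip=True)
    parts = text.split()
    formatted_text = ' '.join(parts)
    return formatted_text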

Step 4: Calculate the Compression Ratio

With the extracted text, we now calculate the compression ratio using the calculate_compression_ratio function. Here, we:

  1. Measure the original size of the text in bytes.
  2. Compress the text using gzip and get the compressed size.
  3. Divide the original size by the compressed size to obtain the compression ratio.

A higher compression ratio indicates that the text was more compressible, potentially hinting at repetitive or low-quality content.
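
For example, if a page’s extracted text is 10,000 bytes and gzip shrinks it to 2,500 bytes, the compression ratio is 10,000 / 2,500 = 4.0, right at the threshold referenced earlier.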

def calculate_compression_ratio(text):
    original_size = len(text.encode('utf-8'))
    compressed_data = gzip.compress(text.encode('utf-8'))
    compressed_size = len(compressed_data)
    compression_ratio = original_size / compressed_size
    return compression_ratio

Step 5: User Input, Calling Functions, Printing Results

Now that we have the functions we need, we can pull everything together by asking the user to enter the URL, fetching it with our functions, compressing the text, and showing the compression ratio for the URL.

# Prompt the user to enter the URL
url = input("Please enter the URL: ")

# Fetch, parse, and extract text
soup = fetch_and_parse(url)
combined_text = extract_text_selectively(soup)

# Print extracted text
print(f"\nExtracted text from the page:\n{combined_text}\n")

# Inform that compression is about to start
print("Compressing text...")

# Calculate and print the compression ratio
compression_ratio = calculate_compression_ratio(combined_text)
print(f"\nCompression ratio for {url}: {compression_ratio:.2f}")

Full Python Script for Calculating Compression Ratio

Here is the full script you can run or modify locally to get the compression ratio for pages on your site.

import requests
from bs4 import BeautifulSoup
import gzip

# Function to fetch and parse a webpage with headers to mimic a browser
def fetch_and_parse(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Use a session to reuse the connection
    with requests.Session() as session:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

    # Remove non-content tags (head, header, footer, script, style, meta)
    for tag in soup(['head', 'header', 'footer', 'script', 'style', 'meta']):
        tag.decompose()  # Remove the tag and its contents

    return soup

# Function to properly format cell content by ensuring spaces between header and value
def format_cell_text(cell):
    text = cell.get_text(separator=' ', strip=True)
    parts = text.split()
    formatted_text = ' '.join(parts)
    return formatted_text

# Function to extract text from a soup object with selective combining
def extract_text_selectively(soup):
    individual_tags = {'p', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'table', 'tr'}
    container_tags = {'div', 'section', 'article', 'main'}
    excluded_tags = {'style', 'script', 'meta', 'body', 'html', '[document]', 'button'}
    inline_tags = {'a', 'strong', 'b', 'i', 'em', 'span'}

    text_lines = []
    processed_elements = set()

    for element in soup.find_all(True, recursive=True):
        if element in processed_elements or element.name in excluded_tags:
            continue

        # Handle table rows (<tr>) and their child elements
        if element.name == 'tr':
            row_text = []
            for cell in element.find_all(['th', 'td']):
                cell_text = format_cell_text(cell)
                if cell_text:
                    row_text.append(cell_text)

            if row_text:
                combined_row_text = ', '.join(row_text)
                text_lines.append(('tr', combined_row_text))

            for descendant in element.descendants:
                processed_elements.add(descendant)
            processed_elements.add(element)
            continue

        elif element.name in individual_tags:
            if element.name == 'p' and element.find_parent('td') and element.find_parent('tr'):
                continue
            if element.parent in processed_elements:
                continue

            inline_text = ''
            if any(element.find(tag) for tag in inline_tags):
                inline_text = ' '.join(element.stripped_strings)
            else:
                inline_text = ' '.join(element.find_all(text=True, recursive=False)).strip()

            if inline_text and element not in processed_elements:
                text_lines.append((element.name, inline_text))
                processed_elements.add(element)

        elif element.name in container_tags:
            if element.parent in processed_elements:
                continue
            direct_text = ' '.join([t for t in element.find_all(text=True, recursive=False) if t.strip()]).strip()
            if direct_text and element not in processed_elements:
                text_lines.append((element.name, direct_text))
                processed_elements.add(element)

    combined_text = ' '.join(line for _, line in text_lines)
    return combined_text

# New function to compress text and calculate compression ratio
def calculate_compression_ratio(text):
    original_size = len(text.encode('utf-8'))
    compressed_data = gzip.compress(text.encode('utf-8'))
    compressed_size = len(compressed_data)
    compression_ratio = original_size / compressed_size
    return compression_ratio

# Prompt the user to enter the URL
url = input("Please enter the URL: ")

# Fetch, parse, and extract text
soup = fetch_and_parse(url)
combined_text = extract_text_selectively(soup)

# Print extracted text
print(f"\nExtracted text from the page:\n{combined_text}\n")

# Inform that compression is about to start
print("Compressing text...")

# Calculate and print the compression ratio
compression_ratio = calculate_compression_ratio(combined_text)
print(f"\nCompression ratio for {url}: {compression_ratio:.2f}")

Running the Script and Interpreting the Results

To use the script, save the code above to a Python file. Next, run that file from your terminal or command line and enter a URL when prompted. The script will display the extracted text, let you know it’s compressing the text, and then show the compression ratio. A high ratio could indicate low-quality content, though it’s helpful to establish a baseline by analyzing a variety of pages on your site.

[Image: terminal results of the compression script]

Interpreting and Applying Compression Ratios for SEO

After running this script across several pages, you’ll start to see patterns in the compression ratio of your pages. According to Search Engine Journal and the 2006 research paper, pages with high compression ratios often correlate with spammy or low-quality content. While there’s no universal threshold, if you see pages with significantly higher ratios than others, it may be worth reviewing those pages for redundant or keyword-stuffed content.

[Image: compression ratio chart showing the correlation with low-quality or spam pages]
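
To build that baseline without entering URLs one at a time, the same functions can be wrapped in a simple batch loop. The sketch below is our own extension of the script above (the URL list is a placeholder for your own pages, and the 4.0 threshold comes from the study discussed earlier):

# Batch-score a list of URLs and flag likely low-quality pages (our own extension)
urls_to_check = [
    "https://example.com/page-one/",
    "https://example.com/page-two/",
]

THRESHOLD = 4.0  # ratio above which the study suggests a page is more likely to be low quality

results = []
for page_url in urls_to_check:
    try:
        page_soup = fetch_and_parse(page_url)
        page_text = extract_text_selectively(page_soup)
        results.append((page_url, calculate_compression_ratio(page_text)))
    except Exception as exc:
        print(f"Skipping {page_url}: {exc}")

# Most compressible (and therefore most suspect) pages first
for page_url, page_ratio in sorted(results, key=lambda item: item[1], reverse=True):
    flag = "  <-- review for redundant or thin content" if page_ratio > THRESHOLD else ""
    print(f"{page_ratio:.2f}  {page_url}{flag}")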

Limitations and Final Thoughts

Compression ratio analysis shouldn’t be used in isolation, as it’s just one indicator of quality. However, it’s a quick, data-driven way to identify potential issues, especially on large sites where manual review of every page isn’t feasible. By using this technique alongside traditional SEO audits, you can identify opportunities to enhance your site’s overall content quality and possibly improve rankings by identifying and refining lower-quality pages.

Try running the script across pages on your site and see what insights you discover. Or, if you’d like help reviewing the content quality of your entire site, contact us and let us know you’re interested in a content quality assessment.

How SEOs Can Identify Low-Quality Pages with Python & Compression Ratios is an original blog post first published on Go Fish Digital.

Universal Web Design: A Guide to WCAG Compliance

Universal Design for the Web

Accessibility is an extremely important yet often overlooked component of web design and development. Many companies don’t think about the importance of Universal Web Design until they are troubleshooting client issues or facing a potentially costly lawsuit. I have put together this quick start guide to help companies understand their compliance needs as well as simple guidelines on how to accomplish them.

What is Universal Web Design?

Universal Web Design is intended to ensure that information and communication technology (ICT) can be accessed, understood, and used to the greatest extent possible by all people, regardless of disability. The primary international accessibility standards for the World Wide Web are set by the World Wide Web Consortium (W3C), which has published the Web Content Accessibility Guidelines (WCAG 2.0 and 2.1).


In the United States, Section 508 of the Rehabilitation Act of 1973 covers accessibility requirements. The Section 508 guidelines reference the WCAG and require the specific techniques within for compliance.

Who needs to be compliant with the WCAG?

All organizations, Federal and State agencies, and educational institutions should follow the WCAG. In the United States, federal agencies and their contractors are required to conform with WCAG 2.0 (A & AA). Many other countries and international organizations also require compliance with WCAG 2.0 or 2.1. 

Levels of Compliance

The Web Content Accessibility Guidelines (WCAG) are categorized into three levels of compliance:

  • A – the minimum level of conformance
  • AA – the typical level of conformance required (includes A requirements)
  • AAA – the highest level of conformance (includes A & AA requirements)

Getting Started with WCAG Compliance

Many of us have developed bad compliance habits without realizing the consequences of our choices. I, for one, have been guilty of hiding the outline on the :focus indicator in the past as I found it unpleasant from a design perspective. I have put together the following list of recommendations to help prevent designers and developers from making similar mistakes.

Guidelines for Universal Web Design & Development

  • Always have alt text for visual elements
  • Orient content in a meaningful sequence with appropriate heading tags for each section
  • Use HTML rather than CSS when adding emphasis to text (ex. <em>, <strong>)
  • Use valid autocomplete attributes which correspond to the label on input fields
  • Use visual elements, such as underline or icon, to signify a link or use a change in contrast of 3:1 or greater
  • Use visual elements, such as underline or asterisk (*), to signify an error in a form
  • Use a text contrast ratio of at least 4.5:1 to meet AA standards (see the contrast-ratio sketch after this list)
  • Use a text contrast ratio of 7:1 or more whenever possible to meet AAA standards
  • Ensure all text can be zoomed up to 200%
  • Use em units, named font sizes, or percentages for font sizes
  • Make all elements of your website responsive
  • Ensure visual elements have a contrast ratio of at least 3:1 to meet AA standards
  • Ensure your website can be navigated via keyboard only
  • Provide options for users to pause scrolling or automatic content
  • Provide more than one way to navigate to each web page whenever possible
  • Use techniques to show the user’s current location within your website such as:
    • Breadcrumbs
    • Site Map
    • Visual cues in navigation
    • Titles to indicate parent pages
  • Specify the language attribute in HTML (ex. <html lang="fr">)
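
Several of the contrast guidelines above can be checked programmatically. WCAG 2.0 defines the contrast ratio as (L1 + 0.05) / (L2 + 0.05), where L1 and L2 are the relative luminances of the lighter and darker colors. Here is a small Python sketch of that formula (our own illustration, not an official tool):

# WCAG 2.0 contrast ratio between two hex colors (our own illustration)
def _linear_channel(value_8bit):
    c = value_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color):
    hex_color = hex_color.lstrip('#')
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear_channel(r) + 0.7152 * _linear_channel(g) + 0.0722 * _linear_channel(b)

def contrast_ratio(color_a, color_b):
    lighter, darker = sorted((relative_luminance(color_a), relative_luminance(color_b)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

print(round(contrast_ratio('#767676', '#ffffff'), 2))  # about 4.5, just meets AA for normal text
print(round(contrast_ratio('#000000', '#ffffff'), 2))  # 21.0, the maximum possible contrast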

Common WCAG Compliance Mistakes

Avoid these common mistakes to maintain universal design during your development process:

  • Do not use CSS to create “headings.” Always use the HTML tags (ex. <h2>)
  • Do not use CSS to add non-decorative images
  • Do not use :before or :after for non-decorative content
  • Do not lock orientation to either portrait or landscape view
  • Do not use color alone to signify a link or to show an error in a form
  • Do not allow text to become unreadable while viewed at 200%:
    • Avoid setting overflow to hidden on absolute elements
    • Avoid creating popups or modals with limited height properties and no scroll
    • Avoid adding height properties to paragraphs
  • Do not use images for text (unless it’s a logo or purely decorative)
  • Do not use CSS to turn off the visual focus indicator (ex. :focus {outline: none})
  • Do not set time limits on user interaction
  • Do not add anything that flashes more than three times in any one second period

View the How to Meet WCAG (Quick Reference) for more information.

If you are using video, audio, pdfs, gestures (swipe, pinch to zoom, etc.) or have other additional components to your site, read more on compliance at w3.org.

Need to know which disabilities may be impacted by non-conformance? Click here for a helpful chart.

Tools for Universal Design & Compliance:
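
Automated checkers can catch many of the issues above before they reach production, and even a short script can flag some of the most common ones. Below is a minimal sketch (our own example, not an official WCAG tool) that uses requests and BeautifulSoup to report images with no alt attribute and a missing lang attribute:

import requests
from bs4 import BeautifulSoup

def quick_accessibility_check(url):
    # Flag two common WCAG issues: images without an alt attribute and a missing lang attribute
    response = requests.get(url, timeout=15)
    soup = BeautifulSoup(response.content, 'html.parser')

    html_tag = soup.find('html')
    if html_tag is None or not html_tag.get('lang'):
        print('Missing lang attribute on <html> (ex. <html lang="en">)')

    # Note: purely decorative images may legitimately use an empty alt="" value,
    # so we only flag images where the attribute is absent entirely.
    missing_alt = [img for img in soup.find_all('img') if img.get('alt') is None]
    print(f"{len(missing_alt)} <img> tags with no alt attribute")
    for img in missing_alt[:10]:  # show the first few offenders
        print("  ", img.get('src', '(no src)'))

quick_accessibility_check("https://example.com/")  # replace with a page on your site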

Universal Web Design: A Guide to WCAG Compliance is an original blog post first published on Go Fish Digital.

I’m Considered “Non-Technical” – Should I Learn How to Code?

Yes!

I get a lot of questions from coworkers, friends, and family about how and why I learned to code. I did not complete a Computer Science degree in college, my job title does not contain “Developer”, and many would consider my role to be a “Non-Technical” one. Yet I invested significant time and effort to learn to code and to manage infrastructure and data, both on my own computer and in cloud platforms. I have come to believe that every person, in almost every position, can benefit from coding and understanding data. What follows are the tenets that have helped me come to this understanding.


Coding is about making decisions.

When you think about it, a computer’s main job is to help us make decisions. Many people who have never learned programming have trouble seeing this because they view a computer as an unfeeling blank slate and treat that blankness as an impediment. I would argue the opposite. Writing code allows us to instill exactly what we want into the decision-making process by including data, excluding outliers, and creating or eliminating bias. The blank slate that a computer represents is actually the perfect starting point to help you create and organize logic.

Along with that, computers help us make decisions incredibly fast (once you can translate your desires into a language they understand). Think about some decisions you make every day:

  • Who should I assign this task to?
  • When do I need to leave to make the train on time?
  • How effective is my team?

Coding effectively translates these types of questions into language the computer can understand. With the computer’s help, you can resolve those decisions millions of times per second. This allows you to answer important questions more completely than you could with your own intuition, freeing up your time for other tasks. Coding is about organizing logic, and it is one of the best ways we have to make you better at being you.

Coding can solve your repeatable tasks.

Computers are good at making a lot of decisions really quickly, especially when they have identical or similar decision-making criteria. Indeed, this is the exact area where we as humans start to feel stress and burnout. We are often faced with a mountain of work that is mostly similar, but just different enough that we have to invest significant amounts of effort to make the decisions required to complete it.

The key to coding is that when you can define your logic in a programming language, you only ever have to do it once. Once the task is defined, the action of carrying it out is infinitely repeatable. When we can automate this mostly repeatable work away, we can focus on more important things that don’t fit this paradigm. Some examples:

  • Data Analysis, Excel Sheets, Pivot Tables, Summarization
  • Categorization, Tagging, Labeling
  • Copying, Storing, Uploading, Downloading
  • Modifying, Reformatting, Transforming
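
To make that concrete, here is a tiny, hypothetical example: a few lines of Python that summarize a spreadsheet export (the tasks.csv file and its owner column are placeholders for whatever data you work with):

import csv
from collections import Counter

# Hypothetical example: count how many tasks are assigned to each owner in an
# exported spreadsheet. "tasks.csv" and the "owner" column are placeholders.
with open("tasks.csv", newline="") as csv_file:
    owners = Counter(row["owner"] for row in csv.DictReader(csv_file))

for owner, count in owners.most_common():
    print(f"{owner}: {count} tasks")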

I’ll bet that everyone reading this can relate at least one of the topics in that list to the work they do every day. Imagine if you could step back and take a look at the bigger picture instead of worrying about the minutiae. This is what coding allows you to do.

Coding trades one-time effort for indefinite benefit.

We have this irrational feeling that the effort we put into our work is what defines its quality. The painful truth is that the only thing that defines the quality of our work is how it is perceived by others. Pulling an all-nighter is pretty meaningless if what you deliver doesn’t solve your problem. Coding allows you to redefine how you value work. Effort is a subjective metric that you can really only attribute to yourself. Time, however, is an objective commodity that you can’t trade, you can’t get back, and everyone experiences exactly the same.

When I started to think about the work I did in terms of how much time it would save, and not how much effort would be expended, I realized that programming is often the most efficient way to trade one for the other. I put in a tremendous effort to learn how to program and how to apply it to the work I do both at my job and at home. Because I put in the time to learn something that has a lot of inherent value, I now have much more time to pursue the things I like to do than I ever would have if I had left those tasks un-automated. I traded one-time effort for repeatable saved time. I’ve always identified with this quote often attributed to Bill Gates (but probably not):

I will always choose a lazy person to do a difficult job because a lazy person will find an easy way to do it.

Conclusion

I’ve found that “Non-Technical” is sort of an oxymoron because, in my experience, people in the least technical positions can often influence their job performance the most with the use of technology. The bottom line is this: your job title does not define you. Your ability to achieve results most certainly does. I’ve learned that understanding programming has made me undeniably better at achieving results, which has, in turn, led to better results than I ever could have produced without it. I wholeheartedly recommend that everyone give programming a shot.

I’m Considered “Non-Technical” – Should I Learn How to Code? is an original blog post first published on Go Fish Digital.
