Open Source Intelligence (OSINT) refers to the collection and analysis of information from publicly available sources. In security assessments and penetration testing, OSINT is typically the first phase of reconnaissance, providing valuable insights about a target without direct interaction.

This guide covers essential OSINT techniques and tools for gathering information ethically and legally.

Before conducting any OSINT research:

  • Ensure you have proper authorization if targeting an organization
  • Respect privacy laws and terms of service
  • Document your findings and methodology
  • Never access systems or data without permission
  • Be aware of jurisdiction-specific regulations

OSINT should only gather publicly available information. Crossing into unauthorized access is illegal.

Domain and DNS Intelligence

WHOIS Lookups

WHOIS provides registration information for domains:

whois example.com

Information typically includes:

  • Registrar details
  • Registration and expiration dates
  • Name servers
  • Contact information (if not privacy-protected)
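
Live whois output varies widely by registrar, so a sketch of pulling out the useful fields from a saved copy (the sample record below is fabricated for illustration) might look like:

```shell
# Save a fabricated whois record to work against offline; real output
# differs per registrar, so treat these field names as illustrative
cat > whois_sample.txt <<'EOF'
Registrar: Example Registrar, Inc.
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2026-08-13T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET
EOF

# Pull out the registrar and expiry fields
grep -E '^(Registrar|Registry Expiry Date):' whois_sample.txt
```

In practice you would pipe `whois example.com` straight into the grep instead of using a saved file.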

DNS Enumeration

Gather DNS records to understand the target’s infrastructure:

# All DNS records (many servers now refuse or minimize ANY queries per RFC 8482)
dig example.com ANY

# Specific record types
dig example.com MX
dig example.com TXT
dig example.com NS

# Zone transfer attempt (rarely works)
dig axfr @ns1.example.com example.com

Subdomain Discovery

Finding subdomains reveals additional attack surface:

# Using amass
amass enum -d example.com

# Using subfinder
subfinder -d example.com

# Using dnsenum
dnsenum example.com

Manual methods:

  • Certificate Transparency logs (crt.sh)
  • Search engines with site:*.example.com
  • Brute forcing with wordlists
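
The brute-forcing approach can be sketched as a loop over a wordlist. The five words below are placeholders (real lists such as SecLists run to thousands of entries), and the resolution step is commented out so the sketch stays offline:

```shell
# Build a small illustrative wordlist (stand-in for a real subdomain list)
printf '%s\n' www mail dev staging vpn > words.txt

# Generate candidate hostnames; the dig line is commented out so this
# sketch makes no live queries
while read -r word; do
  host="$word.example.com"
  echo "$host"
  # dig +short "$host"   # uncomment to resolve each candidate
done < words.txt > candidates.txt

cat candidates.txt
```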

Certificate Transparency

Search certificate logs for issued certificates:

curl -s "https://crt.sh/?q=%25.example.com&output=json" | jq -r '.[].name_value' | sort -u

This often reveals subdomains, internal hostnames, and related domains.

Search Engine Intelligence

Google Dorking

Advanced search operators help find specific information:

Operator    Purpose              Example
site:       Limit to domain      site:example.com
filetype:   Specific file types  filetype:pdf
intitle:    Text in page title   intitle:"index of"
inurl:      Text in URL          inurl:admin
intext:     Text in page body    intext:password
ext:        File extension       ext:sql

Useful combinations:

site:example.com filetype:pdf
site:example.com inurl:admin
site:example.com ext:sql OR ext:bak
site:example.com intitle:"index of"

Finding exposed files:

site:example.com filetype:log
site:example.com filetype:conf
site:example.com filetype:env
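
Query lists like the ones above can be generated per target rather than typed by hand. This sketch only builds the strings; paste them into a search engine manually, since automating queries usually violates the engine's terms of service:

```shell
# Build a list of dork queries for one domain; the dork set here just
# mirrors the combinations shown above
domain="example.com"
for dork in 'filetype:pdf' 'inurl:admin' 'ext:sql OR ext:bak' 'intitle:"index of"'; do
  echo "site:$domain $dork"
done > dorks.txt

cat dorks.txt
```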

Other Search Engines

Different search engines index different content:

  • Bing: Sometimes indexes content Google misses
  • DuckDuckGo: Draws largely on Bing’s index with its own ranking
  • Yandex: Better coverage of Russian and Eastern European sites
  • Baidu: Chinese content

Email Intelligence

Email Format Discovery

Common patterns:

  • first.last@company.com
  • flast@company.com
  • firstl@company.com
  • first@company.com
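
Given one known name, the patterns above can be expanded into candidate addresses mechanically. "Jane Doe" at company.com is a made-up example:

```shell
# Expand the common patterns for one person (fabricated example person)
first=jane; last=doe; domain=company.com
fi=$(printf %s "$first" | cut -c1)   # first initial
li=$(printf %s "$last" | cut -c1)    # last initial

{
  echo "$first.$last@$domain"   # first.last
  echo "$fi$last@$domain"       # flast
  echo "$first$li@$domain"      # firstl
  echo "$first@$domain"         # first
} > email_candidates.txt

cat email_candidates.txt
```

Feed the resulting list to a verification tool rather than mailing the addresses directly.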

Tools for verification:

# Using theHarvester
theHarvester -d example.com -b all

Email Header Analysis

Email headers reveal:

  • Originating IP addresses
  • Mail servers used
  • Authentication results (SPF, DKIM, DMARC)
  • Routing path

Analyze headers with online tools or manually:

Received: from mail.sender.com (192.168.1.1)
X-Originating-IP: [10.0.0.5]
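
Extracting the addresses from saved headers is a one-line grep. The sample headers below just repeat the two lines above; save real ones with your mail client's "show original" feature:

```shell
# Save sample headers to work against (same two lines as above)
cat > headers.txt <<'EOF'
Received: from mail.sender.com (192.168.1.1)
X-Originating-IP: [10.0.0.5]
EOF

# Pull every IPv4 address out of the headers, deduplicated
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' headers.txt | sort -u
```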

Social Media Intelligence

LinkedIn

Valuable for:

  • Employee names and roles
  • Technology stack (from job postings)
  • Organizational structure
  • Company size and growth

Search techniques:

  • Current employees at company
  • Past employees who might share information
  • Job postings revealing technologies used

GitHub and Code Repositories

Developers often accidentally expose sensitive information:

# Search for company repositories
# Look for:
# - API keys and tokens
# - Configuration files
# - Internal documentation
# - Employee usernames

Things to look for in repositories:

  • Commit history (deleted secrets may still exist)
  • Issues and pull requests
  • Contributor information
  • README files with internal details
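
A first pass over a checkout can be a recursive grep for likely credential patterns. The sample file and key value below are fabricated (AKIAIOSFODNN7EXAMPLE is AWS's documented placeholder key), and the patterns are a starting point to tune, not a complete set:

```shell
# Create a fabricated repo file containing a placeholder AWS key
mkdir -p repo
cat > repo/config.py <<'EOF'
AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"
DEBUG = True
EOF

# AKIA... is the prefix AWS uses for access key IDs; also match generic
# "api key" / "secret" strings. For commit history, run the same grep
# over `git log -p` instead of the working tree.
grep -rEn 'AKIA[0-9A-Z]{16}|api[_-]?key|secret' repo/
```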

Document Metadata

Documents often contain metadata with usernames, software versions, and internal paths:

# Extract metadata with exiftool
exiftool document.pdf

# Using metagoofil
metagoofil -d example.com -t pdf,doc,xls -o output/

Common metadata fields:

  • Author
  • Creation software
  • Internal file paths
  • Revision history

Infrastructure Mapping

IP Address Intelligence

Gather information about IP addresses:

# WHOIS for IP (use the target's public IP; 203.0.113.10 is a documentation placeholder)
whois 203.0.113.10

# Reverse DNS
dig -x 203.0.113.10

# ASN lookup via Team Cymru (the leading space in the quoted argument is required)
whois -h whois.cymru.com " -v 203.0.113.10"

Shodan

Shodan indexes internet-connected devices:

# Basic search
shodan search hostname:example.com

# Search by organization
shodan search org:"Example Company"

# Filter by port
shodan search hostname:example.com port:22

Shodan reveals:

  • Open ports and services
  • Banner information
  • SSL certificate details
  • Potential vulnerabilities

Censys

Similar to Shodan with different data:

  • Focuses on certificates and hosts
  • Good for finding assets by certificate attributes
  • Historical data available

Web Application Reconnaissance

Technology Fingerprinting

Identify technologies used:

# Using whatweb
whatweb example.com

# Using wappalyzer (browser extension)
# Identifies CMS, frameworks, libraries

Web Archives

The Wayback Machine preserves historical snapshots:

  • View old versions of websites
  • Find removed content
  • Track changes over time
  • Discover old endpoints

# Get archived URLs
curl "http://web.archive.org/cdx/search/cdx?url=example.com/*&output=text&fl=original&collapse=urlkey"
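
The CDX output can be thousands of URLs, so a follow-up filter for interesting extensions helps. The four sample URLs below are fabricated; in practice, pipe the curl command above into the grep:

```shell
# Fabricated sample of CDX output to filter offline
cat > archived_urls.txt <<'EOF'
http://example.com/
http://example.com/backup.sql
http://example.com/admin/login.php
http://example.com/about.html
EOF

# Keep only URLs ending in extensions worth a closer look
grep -E '\.(sql|bak|php|env|conf)(\?|$)' archived_urls.txt
```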

robots.txt and sitemap.xml

These files often reveal hidden paths:

curl https://example.com/robots.txt
curl https://example.com/sitemap.xml
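
The Disallow entries are usually the interesting part of robots.txt. A sketch of extracting them, using a fabricated sample file in place of the live fetch above:

```shell
# Fabricated robots.txt to parse offline; fetch the real one with curl
cat > robots.txt <<'EOF'
User-agent: *
Disallow: /admin/
Disallow: /backup/
Allow: /public/
EOF

# Print just the disallowed paths
awk -F': ' '/^Disallow:/ {print $2}' robots.txt
```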

Data Aggregation Tools

theHarvester

Comprehensive email and subdomain gathering:

theHarvester -d example.com -b all -l 500

Sources include:

  • Search engines
  • PGP key servers
  • LinkedIn
  • DNS brute force

Maltego

Visual link analysis tool for:

  • Mapping relationships
  • Discovering connections
  • Visualizing infrastructure

SpiderFoot

Automated OSINT collection:

# Run a scan from the CLI; the -u use-case flag (all/footprint/investigate/
# passive) is available in SpiderFoot 3.x -- check --help for your version
spiderfoot -s example.com -u all

Collects data from over 100 sources automatically.

Organizing Findings

Documentation

Maintain detailed records:

  • Source of each piece of information
  • Date collected
  • Relevance to assessment
  • Confidence level
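
One lightweight way to keep such records is a flat CSV appended as you go; the column names mirror the checklist above, and the two sample rows are purely illustrative:

```shell
# Append findings to a simple CSV log; create the header on first use
log=findings.csv
[ -f "$log" ] || echo 'date,source,finding,confidence' > "$log"

# Fabricated example entries
echo "2024-01-15,crt.sh,dev.example.com,high" >> "$log"
echo "2024-01-15,LinkedIn,IT team uses Jira,medium" >> "$log"

cat "$log"
```

A spreadsheet or a dedicated tool works just as well; the point is that every finding carries its source, date, and a confidence rating.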

Data Validation

Cross-reference findings:

  • Verify from multiple sources
  • Check for outdated information
  • Distinguish confirmed from speculative data

Reporting Structure

Organize findings by category:

  1. Domain and DNS information
  2. Network infrastructure
  3. Personnel and organizational data
  4. Technology stack
  5. Potential vulnerabilities
  6. Social media presence

Operational Security

When conducting OSINT:

  • Use VPN or Tor for anonymity when appropriate
  • Create separate research accounts
  • Clear browser data and cookies
  • Be aware of logging and tracking
  • Don’t interact directly with target systems

Practical Workflow

A typical OSINT workflow:

  1. Define scope and objectives
  2. Passive reconnaissance (no target interaction)
  3. Domain and DNS enumeration
  4. Search engine intelligence
  5. Social media research
  6. Document and metadata collection
  7. Technology identification
  8. Aggregate and analyze findings
  9. Document and report

Limitations

OSINT has limitations:

  • Information may be outdated
  • Not all data is accurate
  • Privacy protections hide information
  • Requires interpretation and analysis
  • Cannot replace active reconnaissance

Conclusion

OSINT is a critical first step in security assessments. By gathering publicly available information, you can map a target’s attack surface with little or no direct interaction, keeping your footprint small.

Key principles:

  1. Stay within legal boundaries
  2. Document everything
  3. Verify findings from multiple sources
  4. Organize data systematically
  5. Protect your own operational security

The more thorough your OSINT phase, the more effective your subsequent testing will be. Practice these techniques regularly to develop intuition for where valuable information hides.