OSINT Reconnaissance: Information Gathering Techniques
Open Source Intelligence (OSINT) refers to the collection and analysis of information from publicly available sources. In security assessments and penetration testing, OSINT is typically the first phase of reconnaissance, providing valuable insights about a target without direct interaction.
This guide covers essential OSINT techniques and tools for gathering information ethically and legally.
Legal and Ethical Considerations
Before conducting any OSINT research:
- Ensure you have proper authorization if targeting an organization
- Respect privacy laws and terms of service
- Document your findings and methodology
- Never access systems or data without permission
- Be aware of jurisdiction-specific regulations
OSINT should only gather publicly available information. Crossing into unauthorized access is illegal.
Domain and DNS Intelligence
WHOIS Lookups
WHOIS provides registration information for domains:
whois example.com
Information typically includes:
- Registrar details
- Registration and expiration dates
- Name servers
- Contact information (if not privacy-protected)
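Once captured, WHOIS output can be filtered for the fields above. This is a minimal sketch using a made-up sample of output (a live run would pipe `whois example.com` instead of the inline text):

```shell
# Hypothetical WHOIS output for illustration; a real run would be:
#   whois example.com
whois_output='Registrar: Example Registrar, Inc.
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2026-08-13T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET'

# Pull out only the fields of interest
echo "$whois_output" | grep -E '^(Registrar|Creation Date|Registry Expiry Date|Name Server):'
```

The same grep works on saved output files, which helps when documenting many domains.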
DNS Enumeration
Gather DNS records to understand the target’s infrastructure:
# All DNS records (many servers now refuse ANY queries per RFC 8482)
dig example.com ANY
# Specific record types
dig example.com MX
dig example.com TXT
dig example.com NS
# Zone transfer attempt (rarely works)
dig axfr @ns1.example.com example.com
Subdomain Discovery
Finding subdomains reveals additional attack surface:
# Using amass
amass enum -d example.com
# Using subfinder
subfinder -d example.com
# Using dnsenum
dnsenum example.com
Manual methods:
- Certificate Transparency logs (crt.sh)
- Search engines with site:*.example.com
- Brute forcing with wordlists
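The wordlist approach can be split into two steps: generating candidate hostnames (shown below, offline) and resolving them (which needs dig and network access). The domain and wordlist here are placeholders:

```shell
# Sketch: build candidate subdomains from a small inline wordlist.
domain="example.com"   # placeholder target
wordlist="www
mail
vpn
dev
staging"

candidates=$(echo "$wordlist" | while read -r sub; do
  echo "${sub}.${domain}"
done)
echo "$candidates"

# A live check would then resolve each candidate, e.g.:
#   dig +short "$name" | grep -q . && echo "$name resolves"
```

In practice a wordlist of thousands of entries (e.g. from SecLists) replaces the inline list.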
Certificate Transparency
Search certificate logs for issued certificates:
curl -s "https://crt.sh/?q=%.example.com&output=json" | jq '.[] | .name_value' | sort -u
This often reveals subdomains, internal hostnames, and related domains.
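The crt.sh name_value field can pack several hostnames into one entry separated by literal \n, and wildcard entries add noise. A small cleanup pass helps; the sample below is made-up jq output standing in for a live query:

```shell
# Hypothetical jq output from the crt.sh query above
ct_names='"www.example.com"
"mail.example.com\nwww.example.com"
"*.example.com"
"dev.example.com"'

# Expand embedded \n, strip quotes, drop wildcards, dedupe
hosts=$(printf '%b\n' "$ct_names" | tr -d '"' | grep -v '^\*' | sort -u)
echo "$hosts"
```

The deduped list feeds directly into later DNS or web reconnaissance steps.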
Search Engine Intelligence
Google Dorking
Advanced search operators help find specific information:
| Operator | Purpose | Example |
|---|---|---|
| site: | Limit to domain | site:example.com |
| filetype: | Specific file types | filetype:pdf |
| intitle: | Text in page title | intitle:"index of" |
| inurl: | Text in URL | inurl:admin |
| intext: | Text in page body | intext:password |
| ext: | File extension | ext:sql |
Useful combinations:
site:example.com filetype:pdf
site:example.com inurl:admin
site:example.com (ext:sql OR ext:bak)
site:example.com intitle:"index of"
Finding exposed files:
site:example.com filetype:log
site:example.com filetype:conf
site:example.com filetype:env
Other Search Engines
Different search engines index different content:
- Bing: Sometimes indexes content Google misses
- DuckDuckGo: Uses Bing’s index with different ranking
- Yandex: Better coverage of Russian and Eastern European sites
- Baidu: Chinese content
Email Intelligence
Email Format Discovery
Common patterns:
- first.last@company.com
- flast@company.com
- firstl@company.com
- first@company.com
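Given a known employee name, the patterns above can be expanded mechanically. This sketch uses placeholder values (jane, doe, company.com) and bash substring expansion:

```shell
# Sketch: expand one name into the common address patterns listed above.
first="jane"; last="doe"; domain="company.com"   # placeholders

emails=$(printf '%s\n' \
  "${first}.${last}@${domain}" \
  "${first:0:1}${last}@${domain}" \
  "${first}${last:0:1}@${domain}" \
  "${first}@${domain}")
echo "$emails"
```

Run over a list of names harvested from social media, this produces candidates for verification.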
Tools for verification:
# Using theHarvester
theHarvester -d example.com -b all
Email Header Analysis
Email headers reveal:
- Originating IP addresses
- Mail servers used
- Authentication results (SPF, DKIM, DMARC)
- Routing path
Analyze headers at online tools or manually:
Received: from mail.sender.com (203.0.113.25)
X-Originating-IP: [198.51.100.7]
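Extracting candidate IPs from a saved header block can be done with a simple pattern match. The header text and addresses below are made-up examples (documentation-range IPs):

```shell
# Sketch: pull IPv4 addresses out of saved email headers.
headers='Received: from mail.sender.example (203.0.113.25)
	by mx.receiver.example; Mon, 1 Jan 2024 00:00:00 +0000
X-Originating-IP: [198.51.100.7]'

ips=$(echo "$headers" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}')
echo "$ips"
```

Each extracted address can then go through the IP intelligence steps later in this guide.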
Social Media Intelligence
Valuable for:
- Employee names and roles
- Technology stack (from job postings)
- Organizational structure
- Company size and growth
Search techniques:
- Current employees at company
- Past employees who might share information
- Job postings revealing technologies used
GitHub and Code Repositories
Developers often accidentally expose sensitive information:
# Search for company repositories
# Look for:
# - API keys and tokens
# - Configuration files
# - Internal documentation
# - Employee usernames
Things to look for in repositories:
- Commit history (deleted secrets may still exist)
- Issues and pull requests
- Contributor information
- README files with internal details
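A first pass over cloned repository content can flag common credential formats. The AKIA and ghp_ prefixes are the documented AWS access-key and GitHub token formats; the values in this sample are fake:

```shell
# Sketch: flag common secret patterns in repository text.
sample='db_password = "hunter2"
aws_key = "AKIAIOSFODNN7EXAMPLE"
token = "ghp_0123456789abcdef0123456789abcdef0123"'

hits=$(echo "$sample" | grep -nE 'AKIA[0-9A-Z]{16}|ghp_[0-9A-Za-z]{36}|password')
echo "$hits"
```

For commit history, the same patterns can be run over `git log -p` output, since deleted secrets often survive in earlier commits.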
Document Metadata
Documents often contain metadata with usernames, software versions, and internal paths:
# Extract metadata with exiftool
exiftool document.pdf
# Using metagoofil
metagoofil -d example.com -t pdf,doc,xls -o output/
Common metadata fields:
- Author
- Creation software
- Internal file paths
- Revision history
Infrastructure Mapping
IP Address Intelligence
Gather information about IP addresses:
# WHOIS for an IP (203.0.113.10 is a documentation-range placeholder)
whois 203.0.113.10
# Reverse DNS
dig -x 203.0.113.10
# ASN lookup via Team Cymru
whois -h whois.cymru.com " -v 203.0.113.10"
Shodan
Shodan indexes internet-connected devices:
# Basic search
shodan search hostname:example.com
# Search by organization
shodan search org:"Example Company"
# Filter by port
shodan search hostname:example.com port:22
Shodan reveals:
- Open ports and services
- Banner information
- SSL certificate details
- Potential vulnerabilities
Censys
Similar to Shodan with different data:
- Focuses on certificates and hosts
- Good for finding assets by certificate attributes
- Historical data available
Web Application Reconnaissance
Technology Fingerprinting
Identify technologies used:
# Using whatweb
whatweb example.com
# Using wappalyzer (browser extension)
# Identifies CMS, frameworks, libraries
Web Archives
The Wayback Machine preserves historical snapshots:
- View old versions of websites
- Find removed content
- Track changes over time
- Discover old endpoints
# Get archived URLs
curl "http://web.archive.org/cdx/search/cdx?url=example.com/*&output=text&fl=original&collapse=urlkey"
robots.txt and sitemap.xml
These files often reveal hidden paths:
curl https://example.com/robots.txt
curl https://example.com/sitemap.xml
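Disallow entries are the paths an administrator wanted crawlers to skip, which makes them worth listing explicitly. The robots.txt body below is a made-up example standing in for the curl output:

```shell
# Hypothetical robots.txt content for illustration
robots='User-agent: *
Disallow: /admin/
Disallow: /backup/
Allow: /public/'

paths=$(echo "$robots" | awk -F': ' '/^Disallow:/ {print $2}')
echo "$paths"
```

Each extracted path becomes a candidate for later (authorized) web application testing.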
Data Aggregation Tools
theHarvester
Comprehensive email and subdomain gathering:
theHarvester -d example.com -b all -l 500
Sources include:
- Search engines
- PGP key servers
- DNS brute force
Maltego
Visual link analysis tool for:
- Mapping relationships
- Discovering connections
- Visualizing infrastructure
SpiderFoot
Automated OSINT collection:
# Run a scan (flags vary by version; check spiderfoot --help)
spiderfoot -s example.com -t all
Collects data from over 100 sources automatically.
Organizing Findings
Documentation
Maintain detailed records:
- Source of each piece of information
- Date collected
- Relevance to assessment
- Confidence level
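A plain CSV is often enough to track these fields. This is a minimal sketch; the column names and helper function are illustrative, not a prescribed format:

```shell
# Sketch: append findings to a CSV log with source, date, and confidence.
log="$(mktemp)"
echo "finding,source,date,confidence" > "$log"

# record <finding> <source> <confidence>
record() { printf '%s,%s,%s,%s\n' "$1" "$2" "$(date +%F)" "$3" >> "$log"; }

record "mail.example.com" "crt.sh" "high"
record "jdoe@example.com" "theHarvester" "medium"
cat "$log"
```

Keeping the source alongside each finding makes cross-referencing (next section) much easier.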
Data Validation
Cross-reference findings:
- Verify from multiple sources
- Check for outdated information
- Distinguish confirmed from speculative data
Reporting Structure
Organize findings by category:
- Domain and DNS information
- Network infrastructure
- Personnel and organizational data
- Technology stack
- Potential vulnerabilities
- Social media presence
Operational Security
When conducting OSINT:
- Use VPN or Tor for anonymity when appropriate
- Create separate research accounts
- Clear browser data and cookies
- Be aware of logging and tracking
- Don’t interact directly with target systems
Practical Workflow
A typical OSINT workflow:
1. Define scope and objectives
2. Passive reconnaissance (no target interaction)
3. Domain and DNS enumeration
4. Search engine intelligence
5. Social media research
6. Document and metadata collection
7. Technology identification
8. Aggregate and analyze findings
9. Document and report
Limitations
OSINT has limitations:
- Information may be outdated
- Not all data is accurate
- Privacy protections hide information
- Requires interpretation and analysis
- Cannot replace active reconnaissance
Conclusion
OSINT is a critical first step in security assessments. By gathering publicly available information, you can map much of a target's attack surface without directly interacting with its systems.
Key principles:
- Stay within legal boundaries
- Document everything
- Verify findings from multiple sources
- Organize data systematically
- Protect your own operational security
The more thorough your OSINT phase, the more effective your subsequent testing will be. Practice these techniques regularly to develop intuition for where valuable information hides.