Added new yml
119
README.md
@@ -1,37 +1,104 @@
-RSS to Bluesky - in Python
---------------------------
+# post2bsky

-This is a proof-of-concept implementation for posting RSS/Atom content to Bluesky. Some hacking may be required. Issues and pull requests welcome to improve the system.
+post2bsky is a Python-based tool for automatically posting content from RSS feeds and Twitter accounts to Bluesky (AT Protocol). It supports both RSS-to-Bluesky and Twitter-to-Bluesky synchronization, with configurable workflows for various sources.

+## Features

-## Built with:
+- **RSS to Bluesky**: Parse RSS feeds and post new entries to Bluesky with proper formatting and media handling.
+- **Twitter to Bluesky**: Scrape tweets from specified Twitter accounts and repost them to Bluesky, including media attachments.
+- **Daemon Mode**: Run as a background service for continuous monitoring and posting.
+- **Configurable Workflows**: Use YAML-based workflows to define sources, schedules, and posting rules.
+- **Media Support**: Handle images, videos, and other media from feeds and tweets.
+- **Deduplication**: Prevent duplicate posts using state tracking.
+- **Logging**: Comprehensive logging for monitoring and debugging.

-* [arrow](https://arrow.readthedocs.io/) - Time handling for humans
-* [atproto](https://github.com/MarshalX/atproto) - AT protocol implementation for Python. The API of the library is still unstable, but the version is pinned in requirements.txt
-* [fastfeedparser](https://github.com/kagisearch/fastfeedparser) - For feed parsing with a unified API
-* [httpx](https://www.python-httpx.org/) - For grabbing remote media
+## Installation

+1. Clone the repository:
+```bash
+git clone https://github.com/yourusername/post2bsky.git
+cd post2bsky
+```

-## Features:
+2. Install Python dependencies:
+```bash
+pip install -r requeriments.txt
+```

-* Deduplication: The script queries the target timeline and only posts RSS items that are more recent than the latest top-level post by the handle.
-* Filters: Easy to extend code to support filters on RSS contents for simple transformations and limiting cross-posts.
-* Minimal rich-text support (links): Rich text is represented in a typed hierarchy in the AT protocol. This script currently performs post-processing on filtered string content of the input feeds to support links as long as they stand as a single line in the text. This definitely needs some improvement.
-* Threading for long posts
-* Tags
-* Image references: Can forward image links from RSS to Bsky
+3. Set up environment variables:
+Create a `.env` file with your Bluesky credentials:
+```
+BSKY_USERNAME=your_bluesky_handle
+BSKY_PASSWORD=your_bluesky_password
+```

-## Usage and configuration
+For Twitter scraping, additional setup may be required (see Configuration).

-1. Start by installing the required libraries `pip install -r requirements.txt`
-2. Copy the configuration file and then edit it `cp config.json.sample config.json`
-3. Run the script like `python rss2bsky.py`
+## Configuration

-The configuration file accepts the configuration of:
+### RSS Feeds
+Use `rss2bsky.py` to post from RSS feeds. Configure the feed URL and other options via command-line arguments.

-* a feed URL
-* bsky parameters for a handle, username, and password
-* Handle is like name.bsky.social
-* Username is the email address associated with the account.
-* Password is your password. If you have a literal quote it can be escaped with a backslash like `\"`
-* sleep - the amount of time to sleep while running
+Example:
+```bash
+python rss2bsky.py --feed-url https://example.com/rss --bsky-handle your_handle
+```
+
+### Twitter Accounts
+Use `twitter2bsky_daemon.py` for Twitter-to-Bluesky posting. It requires browser automation for scraping.
+
+Configure Twitter accounts in the script or via environment variables.
+
+### Workflows
+The `workflows/` directory contains Jenkins pipeline configurations for automated runs. Each `.yml` file defines a pipeline for a specific source (e.g., `324.yml` for 324 RSS feed).
+
+To run a workflow manually, use the `sync_runner.sh` script or execute the Python scripts directly.
+
+## Usage
+
+### Running RSS Sync
+```bash
+python rss2bsky.py [options]
+```
+
+Options:
+- `--feed-url`: URL of the RSS feed
+- `--bsky-handle`: Your Bluesky handle
+- Other options for filtering, formatting, etc.
+
+### Running Twitter Daemon
+```bash
+python twitter2bsky_daemon.py [options]
+```
+
+Options:
+- Configure Twitter accounts and Bluesky credentials
+- Run in daemon mode for continuous operation
+
+### Using Sync Runner
+```bash
+./sync_runner.sh
+```
+
+This script can be used to run multiple syncs or integrate with cron jobs.
+
+## Dependencies
+
+All Python dependencies are listed in `requeriments.txt`. Key packages include:
+- `atproto`: For Bluesky API interaction
+- `fastfeedparser`: For RSS parsing
+- `playwright`: For browser automation (Twitter scraping)
+- `beautifulsoup4`: For HTML parsing
+- And many others for media processing, logging, etc.
+
+## License
+
+This project is licensed under the GNU General Public License v3.0. See [LICENSE](LICENSE) for details.
+
+## Contributing
+
+Contributions are welcome! Please open issues or submit pull requests on GitHub.
+
+## Disclaimer
+
+This tool is for personal use and automation. Ensure compliance with the terms of service of Bluesky, Twitter, and any RSS sources you use. Respect rate limits and avoid spamming.
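Note on the new README's `.env` step: the README names two variables, `BSKY_USERNAME` and `BSKY_PASSWORD`, but does not show how the scripts consume them. A minimal sketch of one way to read and validate them follows. The helper name `read_bsky_credentials` is hypothetical and not taken from this commit; it assumes the variables are already in the process environment (for example after `load_dotenv()`, which the Python file in this commit imports).

```python
import os


def read_bsky_credentials(env=None):
    """Return (username, password) from a mapping, defaulting to os.environ.

    Hypothetical helper for illustration; the variable names come from
    the README's .env example.
    """
    env = os.environ if env is None else env
    required = ("BSKY_USERNAME", "BSKY_PASSWORD")
    # Treat unset and empty values the same way: both are unusable.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["BSKY_USERNAME"], env["BSKY_PASSWORD"]
```

Failing early with the names of the missing variables tends to be easier to debug than a login error from the Bluesky client later in the run.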
@@ -8,6 +8,7 @@ import httpx
 import time
 import os
 import subprocess
+import tempfile
 from urllib.parse import urlparse
 from dotenv import load_dotenv
 from atproto import Client, client_utils, models
@@ -21,6 +22,9 @@ SCRAPE_TWEET_LIMIT = 30
 DEDUPE_BSKY_LIMIT = 30
 TWEET_MAX_AGE_DAYS = 3

+STATE_MAX_ENTRIES = 5000
+STATE_MAX_AGE_DAYS = 180
+
 # --- Logging Setup ---
 logging.basicConfig(
     format="%(asctime)s [%(levelname)s] %(message)s",
@@ -261,6 +265,27 @@ def build_text_media_key(normalized_text, media_fingerprint):
     return hashlib.sha256(f"{normalized_text}||{media_fingerprint}".encode("utf-8")).hexdigest()


+def safe_remove_file(path):
+    if path and os.path.exists(path):
+        try:
+            os.remove(path)
+            logging.debug(f"🧹 Removed temp file: {path}")
+        except Exception as e:
+            logging.warning(f"⚠️ Could not remove temp file {path}: {e}")
+
+
+def build_temp_video_output_path(tweet):
+    """
+    Create a unique temp mp4 path for this tweet.
+    """
+    canonical_url = canonicalize_tweet_url(tweet.tweet_url) or ""
+    seed = canonical_url or f"{tweet.created_on}_{tweet.text[:50]}"
+    suffix = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12]
+
+    temp_dir = tempfile.gettempdir()
+    return os.path.join(temp_dir, f"twitter2bsky_{suffix}.mp4")
+
+
 # --- Local State Management ---
 def default_state():
     return {
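Note on `build_temp_video_output_path`: the path is derived from a SHA-256 prefix of the tweet's canonical URL (or of its timestamp and text), so it is deterministic per tweet rather than random. The sketch below isolates that naming scheme so its behavior is easy to check. The function name `temp_video_path_for` and the example seeds are assumptions for illustration; only the hashing and path construction mirror the hunk above.

```python
import hashlib
import os
import tempfile


def temp_video_path_for(seed):
    """Map a seed string (e.g. a canonical tweet URL) to a temp .mp4 path."""
    # The first 12 hex characters (48 bits) of the digest keep names short.
    suffix = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12]
    return os.path.join(tempfile.gettempdir(), f"twitter2bsky_{suffix}.mp4")


# The same seed always yields the same path; different seeds yield
# different paths unless their 48-bit hash prefixes happen to collide.
first = temp_video_path_for("https://x.com/example/status/1")
second = temp_video_path_for("https://x.com/example/status/2")
```

Compared with the previous fixed name `temp_video.mp4`, two tweets processed in the same cycle no longer share one file. Two processes handling the same tweet at the same time would still write to the same path, which may or may not matter for how the daemon is deployed.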
@@ -357,47 +382,59 @@ def candidate_matches_state(candidate, state):
     if canonical_tweet_url and canonical_tweet_url in posted_tweets:
         return True, "state:tweet_url"

-    for _, record in posted_tweets.items():
+    for record in posted_tweets.values():
         if record.get("text_media_key") == text_media_key:
             return True, "state:text_media_fingerprint"

-    for _, record in posted_tweets.items():
+    for record in posted_tweets.values():
         if record.get("normalized_text") == normalized_text:
             return True, "state:normalized_text"

     return False, None


-def prune_state(state, max_entries=5000):
+def prune_state(state, max_entries=STATE_MAX_ENTRIES, max_age_days=STATE_MAX_AGE_DAYS):
     """
     Keep state file from growing forever.
-    Prunes oldest records by posted_at if necessary.
+    Prunes:
+    - entries older than max_age_days
+    - entries beyond max_entries, keeping newest first
+    - orphan posted_by_bsky_uri keys
     """
     posted_tweets = state.get("posted_tweets", {})
+    cutoff = arrow.utcnow().shift(days=-max_age_days)

-    if len(posted_tweets) <= max_entries:
-        return state
+    kept_items = []

-    sortable = []
     for key, record in posted_tweets.items():
-        posted_at = record.get("posted_at") or ""
-        sortable.append((key, posted_at))
+        posted_at_raw = record.get("posted_at")
+        keep = True

-    sortable.sort(key=lambda x: x[1], reverse=True)
-    keep_keys = {key for key, _ in sortable[:max_entries]}
+        if posted_at_raw:
+            try:
+                posted_at = arrow.get(posted_at_raw)
+                if posted_at < cutoff:
+                    keep = False
+            except Exception:
+                pass

-    new_posted_tweets = {}
-    for key, record in posted_tweets.items():
-        if key in keep_keys:
-            new_posted_tweets[key] = record
+        if keep:
+            kept_items.append((key, record))

-    new_posted_by_bsky_uri = {}
-    for bsky_uri, key in state.get("posted_by_bsky_uri", {}).items():
-        if key in keep_keys:
-            new_posted_by_bsky_uri[bsky_uri] = key
+    kept_items.sort(key=lambda item: item[1].get("posted_at", ""), reverse=True)
+    kept_items = kept_items[:max_entries]
+
+    keep_keys = {key for key, _ in kept_items}
+
+    state["posted_tweets"] = {key: record for key, record in kept_items}
+
+    posted_by_bsky_uri = state.get("posted_by_bsky_uri", {})
+    state["posted_by_bsky_uri"] = {
+        bsky_uri: key
+        for bsky_uri, key in posted_by_bsky_uri.items()
+        if key in keep_keys
+    }

-    state["posted_tweets"] = new_posted_tweets
-    state["posted_by_bsky_uri"] = new_posted_by_bsky_uri
     return state


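Note on the new `prune_state`: the early return for states under `max_entries` is gone, so the age cutoff now applies on every call, and URI index entries that point at pruned tweets are dropped. The sketch below reproduces that three-step pruning with only the standard library so it can be run in isolation. It is an approximation under stated assumptions: `prune_state_sketch` and its `now` parameter are hypothetical, `datetime.fromisoformat` stands in for `arrow.get` and accepts fewer formats, and naive timestamps are assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone


def prune_state_sketch(state, max_entries=5000, max_age_days=180, now=None):
    """Drop old records, cap the record count, and drop orphaned URI keys."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)

    kept = []
    for key, record in state.get("posted_tweets", {}).items():
        raw = record.get("posted_at")
        keep = True
        if raw:
            try:
                posted_at = datetime.fromisoformat(raw)
                if posted_at.tzinfo is None:
                    # Assumption: naive timestamps are UTC.
                    posted_at = posted_at.replace(tzinfo=timezone.utc)
                keep = posted_at >= cutoff
            except ValueError:
                pass  # Unparsable timestamps are kept, as in the diff.
        if keep:
            kept.append((key, record))

    # Newest first. String ordering matches time ordering only when all
    # timestamps share one ISO 8601 format and offset.
    kept.sort(key=lambda item: item[1].get("posted_at") or "", reverse=True)
    kept = kept[:max_entries]
    keep_keys = {key for key, _ in kept}

    state["posted_tweets"] = dict(kept)
    state["posted_by_bsky_uri"] = {
        uri: key
        for uri, key in state.get("posted_by_bsky_uri", {}).items()
        if key in keep_keys
    }
    return state
```

One small difference from the hunk: the sort key here uses `or ""` instead of `.get("posted_at", "")`, so a record that stores `posted_at` as `None` still sorts alongside string timestamps. Whether such records can occur in the real state file is not visible in this commit.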
@@ -898,12 +935,8 @@ def download_and_crop_video(video_url, output_path):
         return None

     finally:
-        for path in [temp_input, temp_output]:
-            if os.path.exists(path):
-                try:
-                    os.remove(path)
-                except Exception:
-                    pass
+        safe_remove_file(temp_input)
+        safe_remove_file(temp_output)


 def candidate_matches_existing_bsky(candidate, recent_bsky_posts):
@@ -938,6 +971,8 @@ def sync_feeds(args):
     logging.info("🔄 Starting sync cycle...")
     try:
         state = load_state(STATE_PATH)
+        state = prune_state(state)
+        save_state(state, STATE_PATH)

         tweets = scrape_tweets_via_playwright(
             args.twitter_username,
@@ -1028,7 +1063,7 @@ def sync_feeds(args):
             return

         new_posts = 0
-        state_file = "twitter_browser_state.json"
+        browser_state_file = "twitter_browser_state.json"

         with sync_playwright() as p:
             browser = p.chromium.launch(
@@ -1043,8 +1078,8 @@ def sync_feeds(args):
                 ),
                 "viewport": {"width": 1920, "height": 1080},
             }
-            if os.path.exists(state_file):
-                context_kwargs["storage_state"] = state_file
+            if os.path.exists(browser_state_file):
+                context_kwargs["storage_state"] = browser_state_file

             context = browser.new_context(**context_kwargs)

@@ -1078,7 +1113,7 @@ def sync_feeds(args):
                         logging.warning("⚠️ Tweet has video marker but no tweet URL. Skipping video.")
                         continue

-                    temp_video_path = "temp_video.mp4"
+                    temp_video_path = build_temp_video_output_path(tweet)

                     try:
                         real_video_url = extract_video_url_from_tweet_page(context, tweet.tweet_url)
@@ -1099,8 +1134,9 @@ def sync_feeds(args):
                         video_embed = build_video_embed(video_blob, dynamic_alt)

                     finally:
-                        if os.path.exists(temp_video_path):
-                            os.remove(temp_video_path)
+                        safe_remove_file(temp_video_path)
+                        safe_remove_file(temp_video_path.replace(".mp4", "_source.mp4"))
+                        safe_remove_file(temp_video_path.replace(".mp4", "_cropped.mp4"))

                 try:
                     post_result = None
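Note on the `finally` cleanup above: the sibling paths are built with `str.replace(".mp4", ...)`, which rewrites every occurrence of `.mp4` in the string, not only the file extension. With paths produced by `build_temp_video_output_path` this is unlikely to matter, but a suffix-based helper avoids the question entirely. The sketch below is one possible alternative, not the commit's method; `derived_temp_path` and `remove_quietly` are hypothetical names, and it assumes the intermediate files are named `<stem>_source.mp4` and `<stem>_cropped.mp4` as the hunk implies.

```python
import logging
import os
import tempfile


def derived_temp_path(path, tag):
    """Insert `_tag` before the extension: clip.mp4 -> clip_tag.mp4."""
    root, ext = os.path.splitext(path)
    return f"{root}_{tag}{ext}"


def remove_quietly(path):
    """Delete a file if it exists; log instead of raising on failure."""
    if not path or not os.path.exists(path):
        return
    try:
        os.remove(path)
    except OSError as exc:
        logging.warning("Could not remove temp file %s: %s", path, exc)


# Usage: clean up the main file and its derived siblings in a finally block.
fd, video_path = tempfile.mkstemp(suffix=".mp4")
os.close(fd)
try:
    pass  # download, crop, and upload would happen here
finally:
    for candidate in (
        video_path,
        derived_temp_path(video_path, "source"),
        derived_temp_path(video_path, "cropped"),
    ):
        remove_quietly(candidate)
```

`os.path.splitext` only splits off the final extension, so a directory or file stem that happens to contain `.mp4` is left untouched.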
@@ -1116,7 +1152,7 @@ def sync_feeds(args):
                     bsky_uri = getattr(post_result, "uri", None)

                     remember_posted_tweet(state, candidate, bsky_uri=bsky_uri)
-                    state = prune_state(state, max_entries=5000)
+                    state = prune_state(state)
                     save_state(state, STATE_PATH)

                     recent_bsky_posts.insert(0, {
@@ -1186,4 +1222,4 @@ def main():


 if __name__ == "__main__":
-    main()
+    main()