Added new yml

This commit is contained in:
Guillem Hernandez Sola
2026-04-05 09:04:46 +02:00
parent cce6ff558c
commit 6fdd588179
2 changed files with 164 additions and 61 deletions

README.md

@@ -1,37 +1,104 @@
-RSS to Bluesky - in Python
---------------------------
-
-This is a proof-of-concept implementation for posting RSS/Atom content to Bluesky. Some hacking may be required. Issues and pull requests welcome to improve the system.
-
-## Built with:
-
-* [arrow](https://arrow.readthedocs.io/) - Time handling for humans
-* [atproto](https://github.com/MarshalX/atproto) - AT protocol implementation for Python. The API of the library is still unstable, but the version is pinned in requirements.txt
-* [fastfeedparser](https://github.com/kagisearch/fastfeedparser) - For feed parsing with a unified API
-* [httpx](https://www.python-httpx.org/) - For grabbing remote media
-
-## Features:
-
-* Deduplication: The script queries the target timeline and only posts RSS items that are more recent than the latest top-level post by the handle.
-* Filters: Easy to extend code to support filters on RSS contents for simple transformations and limiting cross-posts.
-* Minimal rich-text support (links): Rich text is represented in a typed hierarchy in the AT protocol. This script currently performs post-processing on filtered string content of the input feeds to support links as long as they stand as a single line in the text. This definitely needs some improvement.
-* Threading for long posts
-* Tags
-* Image references: Can forward image links from RSS to Bsky
-
-## Usage and configuration
-
-1. Start by installing the required libraries `pip install -r requirements.txt`
-2. Copy the configuration file and then edit it `cp config.json.sample config.json`
-3. Run the script like `python rss2bsky.py`
-
-The configuration file accepts the configuration of:
-
-* a feed URL
-* bsky parameters for a handle, username, and password
-  * Handle is like name.bsky.social
-  * Username is the email address associated with the account.
-  * Password is your password. If you have a literal quote it can be escaped with a backslash like `\"`
-* sleep - the amount of time to sleep while running
+# post2bsky
+
+post2bsky is a Python-based tool for automatically posting content from RSS feeds and Twitter accounts to Bluesky (AT Protocol). It supports both RSS-to-Bluesky and Twitter-to-Bluesky synchronization, with configurable workflows for various sources.
+
+## Features
+
+- **RSS to Bluesky**: Parse RSS feeds and post new entries to Bluesky with proper formatting and media handling.
+- **Twitter to Bluesky**: Scrape tweets from specified Twitter accounts and repost them to Bluesky, including media attachments.
+- **Daemon Mode**: Run as a background service for continuous monitoring and posting.
+- **Configurable Workflows**: Use YAML-based workflows to define sources, schedules, and posting rules.
+- **Media Support**: Handle images, videos, and other media from feeds and tweets.
+- **Deduplication**: Prevent duplicate posts using state tracking.
+- **Logging**: Comprehensive logging for monitoring and debugging.
+
+## Installation
+
+1. Clone the repository:
+```bash
+git clone https://github.com/yourusername/post2bsky.git
+cd post2bsky
+```
+2. Install Python dependencies:
+```bash
+pip install -r requeriments.txt
+```
+3. Set up environment variables. Create a `.env` file with your Bluesky credentials:
+```
+BSKY_USERNAME=your_bluesky_handle
+BSKY_PASSWORD=your_bluesky_password
+```
+For Twitter scraping, additional setup may be required (see Configuration).
+
+## Configuration
+
+### RSS Feeds
+
+Use `rss2bsky.py` to post from RSS feeds. Configure the feed URL and other options via command-line arguments.
+
+Example:
+```bash
+python rss2bsky.py --feed-url https://example.com/rss --bsky-handle your_handle
+```
+
+### Twitter Accounts
+
+Use `twitter2bsky_daemon.py` for Twitter-to-Bluesky posting. It requires browser automation for scraping. Configure Twitter accounts in the script or via environment variables.
+
+### Workflows
+
+The `workflows/` directory contains pipeline configurations for automated runs. Each `.yml` file defines a pipeline for a specific source (e.g., `324.yml` for the 324 RSS feed).
+
+To run a workflow manually, use the `sync_runner.sh` script or execute the Python scripts directly.
+
+## Usage
+
+### Running RSS Sync
+
+```bash
+python rss2bsky.py [options]
+```
+
+Options:
+- `--feed-url`: URL of the RSS feed
+- `--bsky-handle`: your Bluesky handle
+- other options for filtering, formatting, etc.
+
+### Running the Twitter Daemon
+
+```bash
+python twitter2bsky_daemon.py [options]
+```
+
+Options:
+- configure Twitter accounts and Bluesky credentials
+- run in daemon mode for continuous operation
+
+### Using the Sync Runner
+
+```bash
+./sync_runner.sh
+```
+
+This script can run multiple syncs or be integrated with cron jobs.
+
+## Dependencies
+
+All Python dependencies are listed in `requeriments.txt`. Key packages include:
+
+- `atproto`: Bluesky API interaction
+- `fastfeedparser`: RSS parsing
+- `playwright`: browser automation (Twitter scraping)
+- `beautifulsoup4`: HTML parsing
+- and others for media processing, logging, etc.
+
+## License
+
+This project is licensed under the GNU General Public License v3.0. See [LICENSE](LICENSE) for details.
+
+## Contributing
+
+Contributions are welcome! Please open issues or submit pull requests on GitHub.
+
+## Disclaimer
+
+This tool is for personal use and automation. Ensure compliance with the terms of service of Bluesky, Twitter, and any RSS sources you use. Respect rate limits and avoid spamming.
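The deduplication described above keys each post on a SHA-256 hash of the normalized text plus a media fingerprint, as in the daemon's `build_text_media_key` helper. A minimal sketch — the whitespace/lowercase normalization and the fingerprint values shown here are illustrative assumptions:

```python
import hashlib

def normalize_text(text):
    # Collapse whitespace and lowercase so trivially reformatted
    # duplicates hash to the same key (normalization policy assumed).
    return " ".join(text.split()).lower()

def build_text_media_key(normalized_text, media_fingerprint):
    # Same text with different media (or vice versa) yields a new key.
    return hashlib.sha256(
        f"{normalized_text}||{media_fingerprint}".encode("utf-8")
    ).hexdigest()

key_a = build_text_media_key(normalize_text("Hello  World"), "img:abc123")
key_b = build_text_media_key(normalize_text("hello world"), "img:abc123")
key_c = build_text_media_key(normalize_text("hello world"), "img:def456")
assert key_a == key_b  # same text + media -> treated as duplicate
assert key_a != key_c  # different media -> treated as a new post
```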

View File

@@ -8,6 +8,7 @@ import httpx
 import time
 import os
 import subprocess
+import tempfile
 from urllib.parse import urlparse
 from dotenv import load_dotenv
 from atproto import Client, client_utils, models
@@ -21,6 +22,9 @@ SCRAPE_TWEET_LIMIT = 30
 DEDUPE_BSKY_LIMIT = 30
 TWEET_MAX_AGE_DAYS = 3
+STATE_MAX_ENTRIES = 5000
+STATE_MAX_AGE_DAYS = 180
 
 # --- Logging Setup ---
 logging.basicConfig(
     format="%(asctime)s [%(levelname)s] %(message)s",
@@ -261,6 +265,27 @@ def build_text_media_key(normalized_text, media_fingerprint):
     return hashlib.sha256(f"{normalized_text}||{media_fingerprint}".encode("utf-8")).hexdigest()
 
+def safe_remove_file(path):
+    if path and os.path.exists(path):
+        try:
+            os.remove(path)
+            logging.debug(f"🧹 Removed temp file: {path}")
+        except Exception as e:
+            logging.warning(f"⚠️ Could not remove temp file {path}: {e}")
+
+def build_temp_video_output_path(tweet):
+    """
+    Create a unique temp mp4 path for this tweet.
+    """
+    canonical_url = canonicalize_tweet_url(tweet.tweet_url) or ""
+    seed = canonical_url or f"{tweet.created_on}_{tweet.text[:50]}"
+    suffix = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12]
+    temp_dir = tempfile.gettempdir()
+    return os.path.join(temp_dir, f"twitter2bsky_{suffix}.mp4")
+
 # --- Local State Management ---
 def default_state():
     return {
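The new `build_temp_video_output_path` helper derives a collision-free temp file name by hashing a per-tweet seed. The same idea in isolation — this sketch takes a plain seed string rather than a tweet object:

```python
import hashlib
import os
import tempfile

def build_temp_video_output_path(seed):
    # A short, stable suffix from the seed: the same tweet always maps
    # to the same temp file, and different tweets never collide.
    suffix = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12]
    return os.path.join(tempfile.gettempdir(), f"twitter2bsky_{suffix}.mp4")

path_a = build_temp_video_output_path("https://twitter.com/user/status/1")
path_b = build_temp_video_output_path("https://twitter.com/user/status/2")
assert path_a != path_b
assert path_a == build_temp_video_output_path("https://twitter.com/user/status/1")
```

Compared with the previous hard-coded `temp_video.mp4`, this keeps two concurrent daemon runs from clobbering each other's downloads.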
@@ -357,47 +382,59 @@ def candidate_matches_state(candidate, state):
     if canonical_tweet_url and canonical_tweet_url in posted_tweets:
         return True, "state:tweet_url"
 
-    for _, record in posted_tweets.items():
+    for record in posted_tweets.values():
         if record.get("text_media_key") == text_media_key:
             return True, "state:text_media_fingerprint"
 
-    for _, record in posted_tweets.items():
+    for record in posted_tweets.values():
         if record.get("normalized_text") == normalized_text:
             return True, "state:normalized_text"
 
     return False, None
 
-def prune_state(state, max_entries=5000):
+def prune_state(state, max_entries=STATE_MAX_ENTRIES, max_age_days=STATE_MAX_AGE_DAYS):
     """
     Keep state file from growing forever.
-    Prunes oldest records by posted_at if necessary.
+    Prunes:
+    - entries older than max_age_days
+    - entries beyond max_entries, keeping newest first
+    - orphan posted_by_bsky_uri keys
     """
     posted_tweets = state.get("posted_tweets", {})
+    cutoff = arrow.utcnow().shift(days=-max_age_days)
 
-    if len(posted_tweets) <= max_entries:
-        return state
-
-    sortable = []
-    for key, record in posted_tweets.items():
-        posted_at = record.get("posted_at") or ""
-        sortable.append((key, posted_at))
-    sortable.sort(key=lambda x: x[1], reverse=True)
-    keep_keys = {key for key, _ in sortable[:max_entries]}
-
-    new_posted_tweets = {}
-    for key, record in posted_tweets.items():
-        if key in keep_keys:
-            new_posted_tweets[key] = record
-
-    new_posted_by_bsky_uri = {}
-    for bsky_uri, key in state.get("posted_by_bsky_uri", {}).items():
-        if key in keep_keys:
-            new_posted_by_bsky_uri[bsky_uri] = key
-
-    state["posted_tweets"] = new_posted_tweets
-    state["posted_by_bsky_uri"] = new_posted_by_bsky_uri
+    kept_items = []
+    for key, record in posted_tweets.items():
+        posted_at_raw = record.get("posted_at")
+        keep = True
+        if posted_at_raw:
+            try:
+                posted_at = arrow.get(posted_at_raw)
+                if posted_at < cutoff:
+                    keep = False
+            except Exception:
+                pass
+        if keep:
+            kept_items.append((key, record))
+
+    kept_items.sort(key=lambda item: item[1].get("posted_at", ""), reverse=True)
+    kept_items = kept_items[:max_entries]
+
+    keep_keys = {key for key, _ in kept_items}
+    state["posted_tweets"] = {key: record for key, record in kept_items}
+
+    posted_by_bsky_uri = state.get("posted_by_bsky_uri", {})
+    state["posted_by_bsky_uri"] = {
+        bsky_uri: key
+        for bsky_uri, key in posted_by_bsky_uri.items()
+        if key in keep_keys
+    }
 
     return state
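The rewritten `prune_state` applies three rules: drop entries older than `max_age_days`, cap the total at `max_entries` keeping the newest, and discard `posted_by_bsky_uri` index entries whose target was pruned. A condensed sketch of the same policy using only the standard library (the daemon itself parses timestamps with arrow):

```python
from datetime import datetime, timedelta, timezone

def prune_posted(posted, posted_by_uri, max_entries=2, max_age_days=180):
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    kept = []
    for key, record in posted.items():
        raw = record.get("posted_at")
        keep = True
        if raw:
            try:
                if datetime.fromisoformat(raw) < cutoff:
                    keep = False  # rule 1: aged out
            except ValueError:
                pass  # unparsable timestamps are kept, as in the daemon
        if keep:
            kept.append((key, record))
    # Rule 2: newest first, capped at max_entries.
    kept.sort(key=lambda item: item[1].get("posted_at", ""), reverse=True)
    kept = kept[:max_entries]
    keep_keys = {key for key, _ in kept}
    # Rule 3: drop orphan bsky-URI index entries.
    by_uri = {uri: k for uri, k in posted_by_uri.items() if k in keep_keys}
    return dict(kept), by_uri

now = datetime.now(timezone.utc)
posted = {
    "t1": {"posted_at": (now - timedelta(days=365)).isoformat()},  # too old
    "t2": {"posted_at": (now - timedelta(days=2)).isoformat()},
    "t3": {"posted_at": (now - timedelta(days=1)).isoformat()},
}
tweets, by_uri = prune_posted(posted, {"at://u/1": "t1", "at://u/2": "t2"})
assert set(tweets) == {"t2", "t3"}   # t1 aged out
assert by_uri == {"at://u/2": "t2"}  # orphan index entry dropped
```

Unlike the old version, which only acted once the entry count exceeded the cap, this prunes by age on every call, which is why the sync loop can now run it at startup as well as after each post.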
@@ -898,12 +935,8 @@ def download_and_crop_video(video_url, output_path):
         return None
     finally:
-        for path in [temp_input, temp_output]:
-            if os.path.exists(path):
-                try:
-                    os.remove(path)
-                except Exception:
-                    pass
+        safe_remove_file(temp_input)
+        safe_remove_file(temp_output)
 
 def candidate_matches_existing_bsky(candidate, recent_bsky_posts):
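`safe_remove_file` replaces the inline cleanup loop with an idempotent helper that tolerates missing files and `None` paths, logging a warning instead of raising. A close variant (catching `OSError` rather than bare `Exception`) and a check of that contract:

```python
import logging
import os
import tempfile

def safe_remove_file(path):
    # Remove a file if it exists; never raise, only warn.
    if path and os.path.exists(path):
        try:
            os.remove(path)
        except OSError as e:
            logging.warning("Could not remove temp file %s: %s", path, e)

fd, path = tempfile.mkstemp(suffix=".mp4")
os.close(fd)
safe_remove_file(path)
assert not os.path.exists(path)
safe_remove_file(path)  # second call is a harmless no-op
safe_remove_file(None)  # None is tolerated too
```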
@@ -938,6 +971,8 @@ def sync_feeds(args):
     logging.info("🔄 Starting sync cycle...")
     try:
         state = load_state(STATE_PATH)
+        state = prune_state(state)
+        save_state(state, STATE_PATH)
 
         tweets = scrape_tweets_via_playwright(
             args.twitter_username,
@@ -1028,7 +1063,7 @@ def sync_feeds(args):
         return
 
     new_posts = 0
-    state_file = "twitter_browser_state.json"
+    browser_state_file = "twitter_browser_state.json"
 
     with sync_playwright() as p:
         browser = p.chromium.launch(
@@ -1043,8 +1078,8 @@ def sync_feeds(args):
             ),
             "viewport": {"width": 1920, "height": 1080},
         }
-        if os.path.exists(state_file):
-            context_kwargs["storage_state"] = state_file
+        if os.path.exists(browser_state_file):
+            context_kwargs["storage_state"] = browser_state_file
 
         context = browser.new_context(**context_kwargs)
@@ -1078,7 +1113,7 @@ def sync_feeds(args):
                 logging.warning("⚠️ Tweet has video marker but no tweet URL. Skipping video.")
                 continue
 
-            temp_video_path = "temp_video.mp4"
+            temp_video_path = build_temp_video_output_path(tweet)
             try:
                 real_video_url = extract_video_url_from_tweet_page(context, tweet.tweet_url)
@@ -1099,8 +1134,9 @@ def sync_feeds(args):
                 video_embed = build_video_embed(video_blob, dynamic_alt)
             finally:
-                if os.path.exists(temp_video_path):
-                    os.remove(temp_video_path)
+                safe_remove_file(temp_video_path)
+                safe_remove_file(temp_video_path.replace(".mp4", "_source.mp4"))
+                safe_remove_file(temp_video_path.replace(".mp4", "_cropped.mp4"))
 
             try:
                 post_result = None
@@ -1116,7 +1152,7 @@ def sync_feeds(args):
         bsky_uri = getattr(post_result, "uri", None)
         remember_posted_tweet(state, candidate, bsky_uri=bsky_uri)
-        state = prune_state(state, max_entries=5000)
+        state = prune_state(state)
         save_state(state, STATE_PATH)
 
         recent_bsky_posts.insert(0, {