@itsericwoodward Thanks! To be clear, my contribution was literally adding that sentence to the documentation, after other people did the work.
James Cook. Time-space trader and software hipster.
@lyse Oops, I guess the new text is a bit obscure. If you follow the link, the text is a bit more explicit, but you still need to know what a lexical scope is. Anyway, this is part of Perl moving very carefully toward being UTF-8 by default while also not breaking code written in the 90s. If you name a recent version like "use v5.42;" then Perl stops letting you use non-ASCII characters unless you also say "use utf8;". The "lexically" part basically means that strictness continues until the next "}", or the end of the program. That lets you fix up old code one block at a time, if you aren't ready to apply the new strictness to a whole file at once.
@lyse Thank you for the suggestions. I will probably do some of that when I have time. For the thumbnails, I'm also thinking about trying the loading="lazy" img attribute. Top on my mind is actually understanding why the big images don't load. Maybe my VPS's network connection is saturated, for example. I've never needed to worry about such things until now. I'm looking forward to spending some time on it.
@rdlmda I am reasonably happy with jenny. If I find time for a twtxt project, I would like to make a web page that works as follows: you point it to your own twtxt feed (as a URL parameter), and then it shows you all the feeds referenced by your "# follow =" lines. So, if I put this up, anyone could use it to view their own feed, with no login required. (Probably a difficult project. For example, I'd want to make sure the backend couldn't be tricked into helping ddos a web server by trying to fetch lots of "feeds" from it. Anyway, I have too many other projects.)
@lyse Thanks for letting me know. HTML checkers seem happy now. I'm not sure what to do about the images not loading. The photos have three sizes (thumbnail, photo page, and original if you click the img tag on the photo page); can you at least see the smaller two sizes? Maybe I will do some experimental fetches and/or start measuring things on my web server.
@bender Oops, missed this. I haven't done any client work since my brief experiment modifying jenny a while back.
@movq Wow, I use Firefox and didn't realize this existed! Thanks for pointing it out. I noticed at least one bug cited a webcompat.com report; I wonder if someone at Mozilla monitors those. https://webcompat.com/issues?page=1&per_page=50&state=open&stage=all&sort=created&direction=desc
@lyse Thanks for taking a look, and for pointing out the mixture of tabs and spaces.
I think I'll leave reachability.c alone, since my intention there was to use an indent level of one tab, and the spaces are just there to line up a few extra things. I fixed reachability_with_stack.cc though.
@prologic Those aren't actually serving anything public-facing. I've thought about it, but for now I'm sticking with VPSs, partly because I don't relish the risk of weeks of downtime if something goes wrong while I'm travelling.
@lyse I don't remember exactly. They might have been growing all winter. The trick is to have a badly insulated extension to the house.
@lyse I am a big fan of "obvious" math facts that turn out to be wrong. If you want to understand how reusing space actually works, you are mostly stuck reading complexity theory papers right now. Ian wrote a good survey: https://iuuk.mff.cuni.cz/~iwmertz/papers/m23.reusing_space.pdf . It's written for complexity theorists, but some of it will make sense to programmers comfortable with math. Alternatively, I wrote an essay a few years ago explaining one technique, with (math-loving) programmers as the intended audience: https://www.falsifian.org/blog/2021/06/04/catalytic/ .
@sorenpeter Sorry, I realized that shortly after posting. Here's another attempt to post the images:

@prologic Have you tried Google's robots.txt report? https://support.google.com/webmasters/answer/6062598?hl=en . I would expect Google to be pretty good about this sort of thing. If you have the energy to dig into it and, for example, post on support.google.com, I'd be curious to hear what you find out.
@eapl.me I like this idea. Another option would be to show a limited number of posts, with an option to see the omitted ones by user. Either way, I wonder how well that works with threading.
@lyse Thanks for sharing. I really enjoyed it. The beginning part about the history of life on Earth was fun to watch having just read Dawkins's old book The Selfish Gene, and now I want to read more about archaea. The end of the talk about what might be going on on Mars made me a bit hopeful someone will find some good evidence.
@movq Looks fun. Also kind of looks like APL and Forth had a baby on Jupyter.
@lyse Beautiful pictures, and beautiful HTML for a photo album!
@prologic I'm grateful for this accident. I find browsing twtxt.net useful even though I don't have an account there. I do it when I can't use Jenny because I only have my phone, or if I want to see messages I might have missed. I know it's not guaranteed to catch everything, but it's pretty good, even if it's not intentional.
@Codebuzz I use Jenny to add to a local copy of my twtxt.txt file, and then manually push it to my web servers. I prefer timestamps to end with "Z" rather than "+00:00" so I modified Jenny to use that format. I mostly follow conversations using Jenny, but sometimes I check twtxt.net, which could catch twts I missed.
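For anyone curious what the "Z" versus "+00:00" difference looks like in practice, here is a minimal Python sketch. It is not jenny's actual code or the modification described above; the function name twt_timestamp is made up for illustration.

```python
from datetime import datetime, timezone


def twt_timestamp(moment: datetime) -> str:
    """Format a datetime as a UTC timestamp ending in "Z" (hypothetical helper)."""
    # Convert to UTC and drop sub-second precision.
    utc = moment.astimezone(timezone.utc).replace(microsecond=0)
    # isoformat() spells UTC as "+00:00"; swap that suffix for "Z".
    return utc.isoformat().replace("+00:00", "Z")


print(twt_timestamp(datetime.now(timezone.utc)))
```

Both spellings denote the same instant, so this is purely a formatting preference.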
@bender I try to avoid editing. I guess I would write 5/4, 6/4, etc, and hopefully my audience would be sympathetic to my failing.
Anyway, I don't think my eccentric decision to number my twts in the style of other social media platforms is the only context where someone might write 1/4 not meaning a quarter. E.g. January 4, to Americans.
I'm happy to keep overthinking this for as long as you are :-P
@bender @prologic I'm not exactly asking yarnd to change. If you are okay with the way it displayed my twts, then by all means, leave it as is. I hope you won't mind if I continue to write things like 1/4 to mean "first out of four".
What has text/markdown got to do with this? I don't think Markdown says anything about replacing 1/4 with ¼, or other similar transformations. It's not needed, because ¼ is already a unicode character that can simply be directly inserted into the text file.
What's wrong with my original suggestion of doing the transformation before the text hits the twtxt.txt file? @prologic, I think it would achieve what you are trying to achieve with this content-type thing: if someone writes 1/4 on a yarnd instance or any other client that wants to do this, it would get transformed, and other clients simply wouldn't do the transformation. Every client that supports displaying unicode characters, including Jenny, would then display ¼ as ¼.
Alternatively, if you prefer yarnd to pretty-print all twts nicely, even ones from simpler clients, that's fine too and you don't need to change anything. My 1/4 -> ¼ thing is nothing more than a minor irritation which probably isn't worth overthinking.
@prologic I'm not a yarnd user, so it doesn't matter a whole lot to me, but FWIW I'm not especially keen on changing how I format my twts to work around yarnd's quirks.
I wonder if this kind of postprocessing would fit better between composing (via yarnd's UI) and publishing. So, if a yarnd user types 1/4, it could get changed to ¼ in the twtxt.txt file for everyone to see, not just people reading through yarnd. But when I type 1/4, meaning first out of four, as a non-yarnd user, the meaning wouldn't get corrupted. I can always type ¼ directly if that's what I really intend.
(This twt might be easier to understand if you read it without any transformations :-P)
Anyway, again, I'm not a yarnd user, so do what you will, just know you might not be seeing exactly what I meant.
@prologic I wrote 1/4 (one slash four) by which I meant "the first out of four". twtxt.net is showing it as ¼, a single character that IMO doesn't have that same meaning (it means 0.25). Similarly, 3/4 got replaced with ¾ in another twt. It's not a big deal. It just looks a little wrong, especially beside the 2/4 and 4/4 in my other two twts.
@prologic One could argue twtxt.net's display formatting is a little over-eager here.
@prologic I think printf is a more portable option than echo -e for interpreting \t as tab. E.g. printf '%s\t%s\t%s' "$url" "$time" "$text". In general I always prefer printf over echo for anything non-trivial in unix shell scripts. See last paragraph of https://en.wikipedia.org/wiki/Echo_(command)#History
@aelaraji You could just remove the {getuser()} part because you added ~.
@bender It's the experience of an ordinary person in a strange place where memories are disappearing with the help of the Memory Police. The setting feels contemporary (to the book's 1994 publication date) rather than futuristic, except for some unexplained stuff about memories.
Yes, that is exactly what I meant. I like that collection and "twtxt v2" feels like a departure.
Maybe there's an advantage to grouping it into one spec, but IMO that shouldn't be done at the same time as introducing new untested ideas.
See https://yarn.social (especially this section: https://yarn.social/#self-host) -- It really doesn't get much simpler than this 🤣
Again, I like this existing simplicity. (I would even argue you don't need the metadata.)
That page says "For the best experience your client should also support some of the Twtxt Extensions..." but it is clear you don't need to. I would like it to stay that way, and publishing a big long spec and calling it "twtxt v2" feels like a departure from that. (I think the content of the document is valuable; I'm just carping about how it's being presented.)
@prologic Done. Also, I went ahead and made two changes: changed hexadecimal to base64 for hashes (wasn't sure if anyone objected), and changed "MUST follow the chain" to "SHOULD follow the chain".
@prologic Thanks for pointing out it lasts four hours. That's a big window! I wonder when most people will be on. I might aim for halfway through unless I hear otherwise. (12:00Z is a bit early for me.)
@movq Yes, the tools are surprisingly fast. Still, magrep takes about 20 seconds to search through my archive of 140K emails, so to speed things up I would probably combine it with an indexer like mu, mairix or notmuch.
@bender Ha! Maybe I should get on the Markdown train. You're taking away my excuses.
Sorry, you're right, I should have used numbers!
I don't understand what "preserve the original hash" could mean other than "make sure there's still a twt in the feed with that hash". Maybe the text could be clarified somehow.
I'm also not sure what you mean by markdown already being part of it. Of course people can already use Markdown, just like presumably nothing stopped people from using (twt subjects) before they were formally described. But it's not universal; e.g. as a jenny user I just see the plain text.
@prologic Do you feel the same about published vs. privately stored data?
For me there's a distinction. I feel very strongly that I should be able to retain whatever private information I like. On the other hand, I do have some sympathy for requests not to publish or propagate (though I personally feel it's still morally acceptable to ignore such requests).
@lyse I'd suggest making the whole content-type thing a SHOULD, to accommodate people just using some hosting service they don't have much control over. (The same situation could make detecting followers hard, but IMO "please email me if you follow me" is still legit twtxt, even if inconvenient.)
@prologic Thanks for writing that up!
I hope it can remain a living document (or sequence of draft revisions) for a good long time while we figure out how this stuff works in practice.
I am not sure how I feel about all this being done at once, vs. letting conventions arise.
For example, even today I could reply to twt abc1234 with "(#abc1234) Edit: ..." and I think all you humans would understand it as an edit to (#abc1234). Maybe eventually it would become a common enough convention that clients would start to support it explicitly.
Similarly we could just start using 11-digit hashes. We should iron out whether it's sha256 or whatever but there's no need to get all the other stuff right at the same time.
I have similar thoughts about how some users could try out location-based replies in a backward-compatible way (append the replyto: stuff after the legacy (#hash) style).
However I recognize that I'm not the one implementing this stuff, and it's less work to just have everything determined up front.
Misc comments (I haven't read the whole thing):
Did you mean to make hashes hexadecimal? You lose 11 bits that way compared to base32. I'd suggest gaining 11 bits with base64 instead.
"Clients MUST preserve the original hash" --- do you mean they MUST preserve the original twt?
Thanks for phrasing the bit about deletions so neutrally.
I don't like the MUST in "Clients MUST follow the chain of reply-to references...". If someone writes a client as a 40-line shell script that requires the user to piece together the threading themselves, IMO we shouldn't declare the client non-conforming just because they didn't get to all the bells and whistles.
Similarly I don't like the MUST for user agents. For one thing, you might want to fetch a feed without revealing your identity. Also, it raises the bar for a minimal implementation (I'm again thinking of the 40-line shell script).
For "who follows" lists: why must the long, random tokens be only valid for a limited time? Do you have a scenario in mind where they could leak?
Why can't feeds be served over HTTP/1.0? Again, thinking about simple software. I recently tried implementing HTTP/1.1 and it wasn't too bad, but 1.0 would have been slightly simpler.
Why get into the nitty-gritty about caching headers? This seems like generic advice for HTTP servers and clients.
I'm a little sad about other protocols being not recommended.
I don't know how I feel about including markdown. I don't mind too much that yarn users emit twts full of markdown, but I'm more of a plain text kind of person. Also it adds to the length. I wonder if putting it in a separate document would make more sense; that would also help with the length.
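The 11-bit figures in the hexadecimal versus base32 comment above come from simple counting. Here is a small Python sketch of that arithmetic, assuming the 11-character hashes mentioned earlier in the thread and counting every character at its full width; the names are made up for illustration.

```python
HASH_LENGTH = 11  # assumption: the 11-character hashes discussed in this thread


def bits_carried(alphabet_size: int, length: int = HASH_LENGTH) -> int:
    """Bits encoded by `length` characters drawn from a power-of-two alphabet."""
    bits_per_char = (alphabet_size - 1).bit_length()  # 16 -> 4, 32 -> 5, 64 -> 6
    return length * bits_per_char


for name, size in [("hexadecimal", 16), ("base32", 32), ("base64", 64)]:
    print(f"{name}: {bits_carried(size)} bits")
```

So at a fixed length of 11 characters, hexadecimal carries 44 bits, base32 carries 55, and base64 carries 66: 11 bits lost or gained relative to base32.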
@prologic I have no specifics, only hopes. (I have seen some articles explaining the GDPR doesn't apply to a "purely personal or household activity" but I don't really know what that means.)
I don't know if it's worth giving much thought to the issue unless either you expect to get big enough for the GDPR to matter a lot (I imagine making money is a prerequisite) or someone specifically brings it up. Unless you enjoy thinking through this sort of thing, of course.
@david Thanks, that's good feedback to have. I wonder to what extent this already exists in registry servers and yarn pods. I haven't really tried digging into the past in either one.
How interested would you be in changes in metadata and other comments in the feeds? I'm thinking of just permanently saving every version of each twtxt file that gets pulled, not just the twts. It wouldn't be hard to do (though presenting the information in a sensible way is another matter). Compression should make storage a non-issue unless someone does something weird with their feed like shuffle the comments around every time I fetch it.
@movq I don't think it has to be like that. Just make sure the new version of the twt is always appended to your current feed, and have some convention for indicating it's an edit and which twt it supersedes. Keep the original twt as-is (or delete it if you don't want new followers to see it); doesn't matter if it's archived because you aren't changing that copy.
@prologic Do you have a link to some past discussion?
Would the GDPR apply to a one-person client like jenny? I seriously hope not. If someone asks me to delete an email they sent me, I don't think I have to honour that request, no matter how European they are.
I am really bothered by the idea that someone could force me to delete my private, personal record of my interactions with them. Would I have to delete my journal entries about them too if they asked?
Maybe a public-facing client like yarnd needs to consider this, but that also bothers me. I was actually thinking about making an Internet Archive style twtxt archiver, letting you explore past twts, including long-dead feeds, see edit histories, deleted twts, etc.
@david Well, I wouldn't recommend using my code for your main jenny use anyway. If you want to try it out, set XDG_CONFIG_HOME and XDG_CACHE_HOME to some sandbox directories and only run my code there. If @movq is interested in any of this getting upstreamed, I'd be happy to try rebasing the changes, but otherwise it's a proof of concept and fun exercise.
I forgot to git add a new test file. Added to the patch now at https://www.falsifian.org/a/oDtr/patch0.txt
BTW this code doesn't incorporate existing twts into jenny's database. It's best used starting from scratch. I've been testing it using a custom XDG_CACHE_HOME and XDG_CONFIG_HOME to avoid messing with my "real" jenny data.
@prologic Wikipedia claims sha1 is vulnerable to a "chosen-prefix attack", which I gather means I can write any two twts I like, and then cause them to have the exact same sha1 hash by appending something. I guess a twt ending in random junk might look suspicious, but perhaps the junk could be worked into an image URL like
. If that's not possible now maybe it will be later.
git only uses sha1 because they're stuck with it: migrating is very hard. There was an effort to move git to sha256 but I don't know its status. I think there is progress being made with Game Of Trees, a git clone that uses the same on-disk format.
I can't imagine any benefit to using sha1, except that maybe some very old software might support sha1 but not sha256.
@movq Agreed that hashes have a benefit. I came up with a similar example when I twted about an 11-character hash collision. Perhaps hashes could be made optional somehow. Like, you could use the "replyto" idea and then additionally put a hash somewhere if you want to lock in which version of the twt you are replying to.
@quark Oh, sure, it would be nice if edits didn't break threads. I was just pondering the circumstances under which I get annoyed about data being irrecoverably deleted or otherwise lost.
@quark I don't really mind if the twt gets edited before I even fetch it. I think it's the idea of my computer discarding old versions it's fetched, especially if it's shown them to me, that bugs me.
But I do like @movq's suggestion on this thread that feeds could contain both the original and the edited twt. I guess it would be up to the author.
@quark None. I like being able to see edit history for the same reason.
@prologic Why sha1 in particular? There are known attacks on it. sha256 seems pretty widely supported if you're worried about support.
@prologic I wouldn't want my client to honour delete requests. I like my computer's memory to be better than mine, not worse, so it would bug me if I remember seeing something and my computer can't find it.
There's a simple reason all the current hashes end in a or q: the hash is 256 bits, the base32 encoding chops that into groups of 5 bits, and 256 isn't divisible by 5. The last character of the base32 encoding just has that left-over single bit (256 mod 5 = 1).
So I agree with #3 below, but do you have a source for #1, #2 or #4? I would expect any lack of variability in any part of a hash function's output would make it more vulnerable to attacks, so designers of hash functions would want to make the whole output vary as much as possible.
Other than the divisible-by-5 thing, my current intuition is it doesn't matter what part you take.
1. Hash Structure: Hashes are typically designed so that their outputs have specific statistical properties. The first few characters often have more entropy or variability, meaning they are less likely to have patterns. The last characters may not maintain this randomness, especially if the encoding method has a tendency to produce less varied endings.
2. Collision Resistance: When using hashes, the goal is to minimize the risk of collisions (different inputs producing the same output). By using the first few characters, you leverage the full distribution of the hash. The last characters may not distribute in the same way, potentially increasing the likelihood of collisions.
3. Encoding Characteristics: Base32 encoding has a specific structure and padding that might influence the last characters more than the first. If the data being hashed is similar, the last characters may be more similar across different hashes.
4. Use Cases: In many applications (like generating unique identifiers), the beginning of the hash is often the most informative and varied. Relying on the end might reduce the uniqueness of generated identifiers, especially if a prefix has a specific context or meaning.
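The "a or q" observation above is easy to check. The Python sketch below assumes a 256-bit BLAKE2b digest encoded as lowercase, unpadded base32, which is my understanding of the twt hash extension; the same result should hold for any 256-bit digest. The input strings are arbitrary.

```python
import base64
import hashlib


def full_hash(data: bytes) -> str:
    """Lowercase, unpadded base32 encoding of a 256-bit BLAKE2b digest."""
    digest = hashlib.blake2b(data, digest_size=32).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


# 256 bits split into 5-bit groups: 51 full groups plus 1 left-over bit.
full_groups, leftover_bits = divmod(256, 5)

# The final character holds the left-over bit followed by four zero bits,
# so its value is either 0 ("a") or 16 ("q") in the base32 alphabet.
last_chars = {full_hash(f"example twt {i}".encode())[-1] for i in range(200)}

print(full_groups, leftover_bits, sorted(last_chars))
```

Every other character position draws on five digest bits, which fits the intuition that, apart from the last character, it should not matter which part of the hash you take.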
@quark It looks like the part about traditional topics has been removed from that page. Here is an old version that mentions it: https://web.archive.org/web/20221211165458/https://dev.twtxt.net/doc/twtsubjectextension.html . Still, I don't see any description of what is actually allowed between the parentheses. May be worth noting that twtxt.net is displaying the twts with the subject stripped, so some piece of code is recognizing it as a subject (or, at least, something to be removed).
It should be fixed now. Just needed some unusual quoting in my httpd.conf: https://mail-archive.com/misc@openbsd.org/msg169795.html
@lyse Sorry, I don't think I ever had charset=utf8. I just noticed that a few days ago. OpenBSD's httpd might not support including a parameter with the mime type, unfortunately. I'm going to look into it.
Maybe I’m being a bit too purist/minimalistic here. As I said before (in one of the 1372739 posts on this topic – or maybe I didn’t even send that twt, I don’t remember 😅), I never really liked hashes to begin with. They aren’t super hard to implement but they are kind of against the beauty of the original twtxt – because you need special client support for them. It’s not something that you could write manually in your twtxt.txt file. With @sorenpeter’s proposal, though, that would be possible.
Tangentially related, I was a bit disappointed to learn that the twt subject extension is now never used except with hashes. Manually-written subjects sounded so beautifully ad-hoc and organic as a way to disambiguate replies. Maybe I'll try it some time just for fun.
@falsifian You mean the idea of being able to inline # url = changes in your feed?
Yes, that one. But @lyse pointed out it suffers a compatibility issue, since currently the first listed url is used for hashing, not the last. Unless your feed is in reverse chronological order. Heh, I guess another metadata field could indicate which version to use.
Or maybe url changes could somehow be combined with the archive feeds extension? Could the url metadata field be local to each archive file, so that to switch to a new url all you need to do is archive everything you've got and start a new file at the new url?
I don't think it's that likely my feed url will change.
@mckinley Yes, changing domains is a problem if you tie your identity to an https url. But I also worry about being stuck with a key I can't rotate. Whatever gets used, it would be nice to be able to rotate identities. I like @lyse's idea for that.
@prologic Brute force. I just hashed a bunch of versions of both tweets until I found a collision.
I mostly just wanted an excuse to write the program. I don't know how I feel about actually using super-long hashes; could make the twts annoying to read if you prefer to view them untransformed.
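For readers wondering what such a brute-force search can look like, here is a rough Python sketch. It is not the program mentioned above. The hash recipe (BLAKE2b-256 over url, timestamp and text joined by newlines, base32, last few characters) is my understanding of the twt hash extension, and the way variants are generated (appending a counter) is purely an assumption for illustration, as are the example URL and texts.

```python
import base64
import hashlib
from itertools import count


def twt_hash(url: str, timestamp: str, text: str, length: int) -> str:
    """Truncated twt hash (assumed recipe, see the note above)."""
    payload = f"{url}\n{timestamp}\n{text}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return encoded[-length:]


def find_collision(url, timestamp, text_a, text_b, length):
    """Try variants of both texts until one of each shares a truncated hash."""
    seen_a = {}  # truncated hash -> variant of text_a that produced it
    seen_b = {}  # truncated hash -> variant of text_b that produced it
    for i in count():
        variant_a = f"{text_a} [{i}]"
        variant_b = f"{text_b} [{i}]"
        hash_a = twt_hash(url, timestamp, variant_a, length)
        hash_b = twt_hash(url, timestamp, variant_b, length)
        seen_a[hash_a] = variant_a
        seen_b[hash_b] = variant_b
        if hash_a in seen_b:
            return variant_a, seen_b[hash_a], hash_a
        if hash_b in seen_a:
            return seen_a[hash_b], variant_b, hash_b


# A short length keeps the demo fast; longer hashes need far more attempts.
a, b, shared = find_collision(
    "https://example.com/twtxt.txt", "2024-01-01T00:00:00Z",
    "I love hashes", "I hate hashes", length=4,
)
print(shared, repr(a), repr(b))
```

Remembering every variant tried so far means the expected number of attempts grows roughly with the square root of the number of possible hashes, which is why short hashes fall quickly.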
@prx I haven't messed with rdomains, but still it might help if you included the command that produced that error (and whether you ran it as root).
They're in Section 6:
Receiver should adopt UDP GRO. (Something about saving CPU processing UDP packets; I'm a bit fuzzy about it.) And they have suggestions for making GRO more useful for QUIC.
Some other receiver-side suggestions: "sending delayed QUIC ACKs"; "using recvmsg to read multiple UDP packets in a single system call".
Use multiple threads when receiving large files.
HTTPS is supposed to do [verification] anyway.
TLS provides verification that nobody is tampering with or snooping on your connection to a server. It doesn't, for example, verify that a file downloaded from server A is from the same entity as the one from server B.
I was confused by this response for a while, but now I think I understand what you're getting at. You are pointing out that with signed feeds, I can verify the authenticity of a feed without accessing the original server, whereas with HTTPS I can't verify a feed unless I download it myself from the origin server. Is that right?
I.e. if the HTTPS origin server is online and I don't mind taking the time and bandwidth to contact it, then perhaps signed feeds offer no advantage, but if the origin server might not be online, or I want to download a big archive of lots of feeds at once without contacting each server individually, then I need signed feeds.
feed locations [being] URLs gives some flexibility
It does give flexibility, but perhaps we should have made them URIs instead for even more flexibility. Then, you could use a tag URI, urn:uuid:*, or a regular old URL if you wanted to. The spec seems to indicate that the url tag should be a working URL that clients can use to find a copy of the feed, optionally at multiple locations. I'm not very familiar with IP{F,N}S but if it ensures you own an identifier forever and that identifier points to a current copy of your feed, it could be a great way to fix it on an individual basis without breaking any specs :)
I'm also not very familiar with IPFS or IPNS.
I haven't been following the other twts about signatures carefully. I just hope whatever you smart people come up with will be backwards-compatible so it still works if I'm too lazy to change how I publish my feed :-)
@xuu Thanks for the link. I found a pdf on one of the authors' home pages: https://ahmadhassandebugs.github.io/assets/pdf/quic_www24.pdf . I wonder how the protocol was evaluated closer to the time it became a standard, and whether anything has changed. I wonder if network speeds have grown faster than CPU speeds since then. The paper says the performance is around the same below around 600 Mbps.
To be fair, I don't think QUIC was ever expected to be faster for transferring a single stream of data. I think QUIC is supposed to reduce the impact of a dropped packet by making sure it only affects the stream it's part of. I imagine QUIC still has that advantage, and this paper is showing the other side of a tradeoff.
@lyse This looks like a nice way to do it.
Another thought: if clients can't agree on the url (for example, if we switch to this new way, but some old clients still do it the old way), that could be mitigated by computing many hashes for each twt: one for every url in the feed. So, if a feed has three URLs, every twt is associated with three hashes when it comes time to put threads together.
A client still needs to choose one url to use for the hash when composing a reply, but this might add some breathing room if there's a period when clients are doing different things.
(From what I understand of jenny, this would be difficult to implement there since each pseudo-email can only have one msgid to match to the in-reply-to headers. I don't know about other clients.)
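The "one hash per url" idea above could look roughly like this in Python. It is only a sketch: the hash recipe is my understanding of the twt hash extension (BLAKE2b-256, base32, last 7 characters), timestamp normalization is ignored, and the URLs and function names are made up.

```python
import base64
import hashlib


def twt_hash(url: str, timestamp: str, text: str, length: int = 7) -> str:
    """Truncated twt hash (assumed recipe, see the note above)."""
    payload = f"{url}\n{timestamp}\n{text}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return encoded[-length:]


def hashes_for_all_urls(urls, timestamp, text):
    """Every hash the twt could be known by: one per url listed in the feed."""
    return {twt_hash(url, timestamp, text) for url in urls}


def is_reply_to(subject_hash, urls, timestamp, text):
    """True if a reply's (#hash) subject matches the twt under any feed url."""
    return subject_hash in hashes_for_all_urls(urls, timestamp, text)


# Hypothetical feed with three url metadata entries.
feed_urls = [
    "https://example.com/twtxt.txt",
    "https://www.example.com/twtxt.txt",
    "gemini://example.com/twtxt.txt",
]
old_style = twt_hash(feed_urls[0], "2024-01-01T00:00:00Z", "Hello")
new_style = twt_hash(feed_urls[-1], "2024-01-01T00:00:00Z", "Hello")
print(is_reply_to(old_style, feed_urls, "2024-01-01T00:00:00Z", "Hello"))
print(is_reply_to(new_style, feed_urls, "2024-01-01T00:00:00Z", "Hello"))
```

A reply built from the first url and one built from the last both land in the same thread. The cost is more candidate hashes per twt, and so slightly more room for accidental collisions.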
@movq Another idea: just hash the feed url and time, without the message content. And don't twt more than once per second.
Maybe you could even just use the time, and rely on @-mentions to disambiguate. Not sure how that would work out.
Though I kind of like the idea of twts being immutable. At least, it's clear which version of a twt you're replying to (assuming nobody is engineering hash collisions).
In fact, maybe your public key idea is compatible with my last point. Just come up with a url scheme that means "this feed's primary URL is actually a public key", and then feed authors can optionally switch to that.
@prologic Some criticisms and a possible alternative direction:
Key rotation. I'm not a security person, but my understanding is that it's good to be able to give keys an expiry date and replace them with new ones periodically.
It makes maintaining a feed more complicated. Now instead of just needing to put a file on a web server (and scan the logs for user agents) I also need to do this. What brought me to twtxt was its radical simplicity.
Instead, maybe we should think about a way to allow old urls to be rotated out? Like, my metadata could somehow say that X used to be my primary URL, but going forward from date D onward my primary url is Y. (Or, if you really want to use public key cryptography, maybe something similar could be used for key rotation there.)
It's nice that your scheme would add a way to verify the twts you download, but https is supposed to do that anyway. If you don't trust https to do that (maybe you don't like relying on root CAs?) then maybe your preferred solution should be reflected by your primary feed url. E.g. if you prefer the security offered by IPFS, then maybe an IPNS url would do the trick. The fact that feed locations are URLs gives some flexibility. (But then rotation is still an issue, if I understand ipns right.)
@movq @prologic Another option would be: when you edit a twt, prefix the new one with (#[old hash]) and some indication that it's an edited version of the original tweet with that hash. E.g. if the hash used to be abcd123, the new version should start "(#abcd123) (redit)".
What I like about this is that clients that don't know this convention will still stick it in the same thread. And I feel it's in the spirit of the old pre-hash (subject) convention, though that's before my time.
I guess it may not work when the edited twt itself is a reply, and there are replies to it. Maybe that could be solved by letting twts have more than one (subject) prefix.
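To illustrate why a client that knows nothing about the edit convention would still thread the edited twt correctly, here is a small Python sketch of subject parsing. It also peels off more than one leading (#hash) prefix, as floated above. The regular expression and function name are assumptions, not taken from any existing client.

```python
import re

# A leading "(#hash)" subject, followed by optional whitespace.
SUBJECT = re.compile(r"^\(#([a-z0-9]+)\)\s*")


def split_subjects(text: str):
    """Peel leading (#hash) prefixes; return (list of hashes, remaining text)."""
    hashes = []
    while (match := SUBJECT.match(text)):
        hashes.append(match.group(1))
        text = text[match.end():]
    return hashes, text


# An edited twt, using the marker from the example above.
print(split_subjects("(#abcd123) (redit) Fixed a typo"))
# A twt carrying two subject prefixes.
print(split_subjects("(#abcd123) (#zyxw987) Replying to the edit"))
```

The client only needs the first hash to place the twt in the thread; the edit marker is just ordinary text to it.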
But the great thing about the current system is that nobody can spoof message IDs.
I don't think twtxt hashes are long enough to prevent spoofing.
@prologic Perfect, thanks. For my own future reference: curl -H 'Accept: application/json' https://twtxt.net/twt/st3wsda
@bender So far I've been following feeds fairly liberally. I'll check to see if we have anything in common and lean toward following, just because this is new to me and it feels like a small community. But I'm still figuring out what I want. Later I'll probably either trim my follower list or come up with some way to prioritize the feeds I'm more interested in.
@prologic Specifically, I could view yarnd's copy here, but only as rendered for a human to view: https://twtxt.net/twt/st3wsda
@movq thanks for getting to the bottom of it. @prologic is there a way to view yarnd's copy of the raw twt? The edit didn't result in a visible change; being able to see what yarnd originally downloaded would have helped me debug.
The actual end-user problem is that I can't see the thread properly when using neomutt+jenny.
@prologic One of your twts begins with (#st3wsda): https://twtxt.net/twt/bot5z4q
Based on the twtxt.net web UI, it seems to be in reply to a twt by @cuaxolotl which begins "I’ve been sketching out...".
But jenny thinks the hash of that twt is 6mdqxrq. At least, there's a twt in their feed with that hash that has the same text as appears on yarn.social (except with ' instead of ’).
Based on this, it appears jenny and yarnd disagree about the hash of the twt, or perhaps the twt was edited (though I can't see any difference, assuming ' vs ’ is just a rendering choice).
@prologic I believe you when you say registries as designed today do not crawl. But when I first read the spec, it conjured in my mind a search engine. Now I don't know how things work out in practice, but just based on reading, I don't see why it can't be an API for a crawling search engine. (In fact I don't see anything in the spec indicating registry servers shouldn't crawl.)
(I also noticed that https://twtxt.readthedocs.io/en/latest/user/registry.html recommends "The registries should sync each others user list by using the users endpoint". If I understood that right, registering with one should be enough to appear on others, even if they don't crawl.)
Does yarnd provide an API for finding twts? Is it similar?
@prologic I guess I thought they were search engines. Anyway, the registry API looks like a decent one for searching for tweets. Could/should yarn.social pods implement the same API?
I just manually followed the steps at https://dev.twtxt.net/doc/twthashextension.html and got 6mdqxrq. I wonder what happened. Did @cuaxolotl edit the twt in some subtle way after twtxt.net downloaded it? I couldn't spot a diff, other than ' appearing as ’ on yarn.social, which I assume is a transformation done by twtxt.net.
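For reference, here is a minimal sketch of those manual steps as I understand the twt hash extension: BLAKE2b-256 over the feed URL, timestamp and text joined by newlines, then base32 without padding, lowercased, keeping the last 7 characters. The function name is made up, and the timestamp is assumed to already be normalized the way the extension requires; check the spec before relying on this.

```python
import base64
import hashlib


def twt_hash(feed_url, timestamp, text):
    """Sketch of a twt hash; inputs are used exactly as given."""
    payload = "\n".join((feed_url, timestamp, text)).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    # Base32, drop the "=" padding, lowercase, keep the last 7 characters.
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return encoded[-7:]
```

Because the text is hashed byte for byte, a twt written with ' and the same twt written with ’ get different hashes, so a client that rewrites quotes before hashing would disagree with one that does not.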
@prologic What's the difference between search.twtxt.net and the /api/plain/tweets endpoint of a registry? In my mind, a registry is a twtxt search engine. Or are registries not supposed to do their own crawling to discover new feeds?
@prologic How does yarn.social's API fix the problem of centralization? I still need to know whose API to use.
Say I see a twt beginning (#hash) and I want to look up the start of the thread. Is the idea that if that twt is hosted by a yarn.social pod, it is likely to know the thread start, so I should query that particular pod for the hash? But what if no yarn.social pods are involved?
The community seems small enough that a registry server should be able to keep up, and I can have a couple of others as backups. Or I could crawl the list of feeds followed by whoever emitted the twt that prompted my query.
I have successfully used registry servers a little bit, e.g. to find a feed that mentioned a tag I was interested in. Was even thinking of making my own, if I get bored of my too many other projects :-)
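The "crawl the list of feeds followed by whoever emitted the twt" idea could start from something like the sketch below. It assumes the feed advertises its follows as "# follow = nick url" comment lines, which not every feed does; the function name is made up, and actually fetching the feeds is left out.

```python
def followed_feeds(feed_text):
    """Return {nick: url} from '# follow = nick url' comment lines."""
    feeds = {}
    for line in feed_text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue  # a twt or a blank line, not metadata
        key, sep, value = line[1:].partition("=")
        if not sep or key.strip() != "follow":
            continue  # some other comment or metadata field
        parts = value.split()
        if len(parts) == 2:
            feeds[parts[0]] = parts[1]
    return feeds
```

Each URL returned would then be a candidate feed to search for the twt whose hash matches the (#hash) prefix.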
@movq Thanks, it works!
But when I tried it out on a twt from @prologic, I discovered jenny and yarn.social seem to disagree about the hash of this twt: https://twtxt.net/twt/st3wsda . jenny assigned it a hash of 6mdqxrq but the URL and prologic's reply suggest yarn.social thinks the hash is st3wsda. (And as a result, jenny --fetch-context didn't work on prologic's twt.)
@movq Thanks! Looking forward to trying it out. Sorry for the silence; I have become unexpectedly busy so no time for twtxt these past few days.
@prologic Yes, fetching the twt by hash from some service could be a good alternative, in case the twt I have does not @-mention the source. (Besides yarnd, maybe this should be part of the registry API? I don't see fetch-by-hash in the registry API docs.)
@movq I don't know if I'd want to discard the twts. I think what I'm looking for is a command "jenny -g https://host.org/twtxt.txt" to fetch just that one feed, even if it's not in my follow list. I could wrap that in a shell script so that when I see a twt in reply to a feed I don't follow, I can just tap a key and the feed will get added to my maildir. I guess the script would look for a mention at the start of a selected twt and call jenny -g on the feed.
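The wrapper could look roughly like this, written in Python rather than shell. It assumes the selected twt still contains the raw twtxt mention syntax, @<nick url> or @<url>, possibly after one or more (subject) prefixes. Note that "jenny -g" is only the option wished for above, not something jenny is known to support, and the function names are made up.

```python
import re

# Optional "(...)" prefixes, then a raw mention: @<nick url> or @<url>.
_LEADING_MENTION = re.compile(
    r"^\s*(?:\([^()]*\)\s*)*@<(?:([^\s<>]+)\s+)?([^\s<>]+)>"
)


def leading_mention_url(twt_text):
    """Return the feed URL of a mention at the start of the twt, or None."""
    match = _LEADING_MENTION.match(twt_text)
    return match.group(2) if match else None


def fetch_command(twt_text):
    """Build the hypothetical 'jenny -g URL' command line, or None."""
    url = leading_mention_url(twt_text)
    if url is None:
        return None
    return ["jenny", "-g", url]  # "-g" is the proposed flag, not a real one
```

The returned list could be passed to subprocess.run once such an option exists; until then it only shows which feed would be fetched.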
(@anth's feed almost never works, but I keep it because they told me they want to fix their server some time.)
I guess I can configure neomutt to hide the feeds I don't care about.
@bender Based on my experience so far, as a user, I would be upset if my client dropped someone from my follow list, i.e. stopped fetching their feed, without me asking for that to happen.
@bender I'm not a yarnd user, but automatically unfollowing on 404 doesn't seem right. Besides @lyse's example, I could imagine just accidentally renaming my own twtxt file, or forgetting to push it when I point my DNS to a new web server. I'd rather not lose all my yarnd followers in a situation like that (and hopefully they feel the same).
@prologic @bender Exponential backoff? Seems like the right thing to do when a server isn't accepting your connections at all, and might also be a reasonable compromise if you consider 404 to be a temporary failure.
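A minimal sketch of what exponential backoff could mean here: double the wait after each consecutive failed fetch, up to a cap, and never drop the feed. The one-hour base and one-week cap are arbitrary illustrative values, not anything yarnd or jenny is known to use.

```python
def next_fetch_delay(consecutive_failures,
                     base_seconds=3600,
                     max_seconds=7 * 24 * 3600):
    """Seconds to wait before the next attempt at fetching a feed."""
    if consecutive_failures <= 0:
        return base_seconds  # healthy feed: normal polling interval
    # Double the interval per failure, but never wait longer than the cap.
    return min(base_seconds * 2 ** consecutive_failures, max_seconds)
```

Resetting the failure count to zero on the first successful fetch puts the feed back on its normal schedule, so a temporary 404 costs some delay but no followers.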
@prologic The headline is interesting and sent me down a rabbit hole understanding what the paper (https://aclanthology.org/2024.acl-long.279/) actually says.
The result is interesting, but the Neuroscience News headline greatly overstates it. If I've understood right, they are arguing (with strong evidence) that the simple technique of making neural nets bigger and bigger isn't quite as magically effective as people say --- if you use it on its own. In particular, they evaluate LLMs without two common enhancements, in-context learning and instruction tuning. Both of those involve using a small number of examples of the particular task to improve the model's performance, and they turn them off because they are not part of what is called "emergence": "an ability to solve a task which is absent in smaller models, but present in LLMs".
They show that these restricted LLMs only outperform smaller models (i.e. demonstrate emergence) on certain tasks, and then (end of Section 4.1) discuss the nature of those few tasks that showed emergence.
I'd love to hear more from someone more familiar with this stuff. (I've done research that touches on ML, but neural nets and especially LLMs aren't my area at all.) In particular, how compelling is this finding that zero-shot learning (i.e. without in-context learning or instruction tuning) remains hard as model size grows?
@movq Variable names used with -eq in [[ ]] are automatically expanded even without $ as explained in the "ARITHMETIC EVALUATION" section of the bash man page. Interesting. Trying this on OpenBSD's ksh, it seems "set -u" doesn't affect that substitution.
@prologic I don't know what you mean when you call them stochastic parrots, or how you define understanding. It's certainly true that current language models show an obvious lack of understanding in many situations, but I find the trend impressive. I would love to see someone achieve similar results with much less power or training data.
@prologic I thought "stochastic parrot" meant a complete lack of understanding.
@movq The success of large neural nets. People love to criticize today's LLMs and image models, but if you compare them to what we had before, the progress is astonishing.
@prologic Thanks. It's from a non-Euclidean geometry project: https://www.falsifian.org/blog/2022/01/17/s3d/
@prologic Thanks for the invitation. What time of day?
@prologic Fair enough! I just added some metadata.
Thanks @prologic! I like the way Yarn.social is making all of twtxt stronger, not just Yarn.social pods.