Trouble Running Single Finger Test - TriFinger Robot

Regarding the blinking LED:

Behaviour Meaning
Off System is disabled.
On System is enabled. Both motors are disabled.
Slow Blinking (1Hz) System and at least one motor are enabled.
Fast Blinking (4Hz) Same as above + at least one motor is currently aligned (happens when the motor is enabled for the first time after power cycling and takes around 5 seconds).

So blinking means that the initial communication is working but then it gets stuck for some reason.
You can monitor the messages on the CAN bus with candump (from the can-utils package):

candump -e can0

If I recall correctly, you are using a ā€œPCAN-PCI Express FD Four Channelā€, is this correct?
We recently started to test with one of those ourselves (so far, we always used the ones without ā€œFDā€) and are running into similar issues (sometimes the connection is interrupted a moment after the motor is enabled, sometimes it is running but has some notable delays).
The motor boards do not support CAN FD but in theory it should be backward compatible. It might be that the firmware needs to be configured a bit differently but I unfortunately havenā€™t figured out yet what the problem is.

I had some hope that our card simply has some damage but given that you have the same problem, it seems to be more fundamental :confused:
There is an ongoing discussion about this in the PEAK Forum.

So Iā€™m afraid I canā€™t help with this at the moment :frowning: However, Iā€™m still investigating. To my understanding it has to work with the FD card, it might just be a matter of finding the correct configuration.

1 Like

Hi Felix,

That is correct, I do have the ā€œPCAN-PCI Express FD Four Channelā€ CAN cards. Iā€™m hoping to figure something out to make these work as I donā€™t think I can return them at this point.

Iā€™ll go through the thread on the PEAK forums and see what I think, but let me know if thereā€™s anything specific I can do to help figure this out!

Thanks again,
Ross

Hi Ross,

I think I have good news! With the help of the PEAK support, I got it working. It seems there is some problem with the driver of the CAN FD cards, so we need to change some parameters there (unfortunately this involves rebuilding the kernel). See the PEAK forum.

I didnā€™t do any extensive tests yet and I think it needs some fine tuning (and a proper write-up from my side), but I wanted to share the good news immediately :slight_smile:

I donā€™t have time now but will provide more information on how to update the driver tomorrow, in case you want to test it yourself.

2 Likes

Wow, this is great news! I donā€™t have time to test it out today, but Iā€™ll hopefully be able to make it into the lab tomorrow and see.

Thanks for your tireless work on this project!!
Ross

Here are the steps I followed to update the driver:

To rebuild the kernel with the modified parameter I used the setup from installing the preempt_rt kernel.

I still had the ~/Downloads/rt_preempt_kernel_install directory in which the install script downloads and builds the kernel source. So I just went there and modified the parameter in the file linux-XXX/drivers/net/can/peak_canfd/peak_pciefd_main.c.

Then I modified the install_rt_preempt_kernel.sh to skip the part that downloads the source and applies the preempt_rt patch, so that I could use it to just rebuild and install the kernel with the modified source. Here is a diff of what I commented:

37,39c37,39
< wget -nc https://mirrors.edge.kernel.org/pub/linux/kernel/v$VERSION_MAJOR.x/linux-$VERSION.tar.xz
< wget -nc http://cdn.kernel.org/pub/linux/kernel/projects/rt/$VERSION_MAJOR.$VERSION_SECOND/older/patch-$VERSION_PATCH.patch.xz
< xz -cd linux-$VERSION.tar.xz | tar xvf -
---
> #wget -nc https://mirrors.edge.kernel.org/pub/linux/kernel/v$VERSION_MAJOR.x/linux-$VERSION.tar.xz
> #wget -nc http://cdn.kernel.org/pub/linux/kernel/projects/rt/$VERSION_MAJOR.$VERSION_SECOND/older/patch-$VERSION_PATCH.patch.xz
> #xz -cd linux-$VERSION.tar.xz | tar xvf -
43c43
< xzcat ../patch-$VERSION_PATCH.patch.xz | patch -p1
---
> #xzcat ../patch-$VERSION_PATCH.patch.xz | patch -p1

I havenā€™t tested this yet but I assume that the part for re-creating the config can be skipped as well.

Then I simple re-ran install_rt_preempt_kernel.sh and rebooted.

I also applied a fix in the blmc_drivers package (see pull request) which might be relevant when using a CAN FD card.

Further, I changed the CAN bit timing on the motor board which I am using for testing (see the discussion in the PEAK forum), but this might actually not be relevant.

Unfortunately, Iā€™m still running into issues that sometimes it getā€™s stuck in the ā€œwaiting for board and motors to be readyā€ phase. It seems the communication on the CAN bus just stops in this case for some reason. Turning the board off and on again and then restarting the application usually helps but itā€™s still an issue I need to investigate.

So to clarify: The driver fix Iā€™m talking about above fixes an issue that once the robot was running, messages are sometimes delayed, resulting in instable motor control. It seems that the ā€œgets stuck in the beginningā€-issue is a separate problem :(.

Hi Felix,

No problem ā€“ I understand that these things are very complex and can be difficult to solve. I was wondering if there has been any movement on this in the last couple weeks?

Also, is there anything that I can help out with ā€“ maybe providing some data or other troubleshooting? I would love to step in to help out where I can.

Thanks again,
Ross

Hi Ross,

I donā€™t really have news yet. The problem (at least for me) seems to be that the CAN communication is unstable and sometimes the CAN controller on the board is shutting down due to some error (probably because sent messages are not acknowledged by the PC for some reason). I have not yet figured out why, though. On the latest tests, I actually wasnā€™t able to reproduce the issue, i.e. the robot was running fine. We did some modifications on the CAN cable for debugging, not sure if this had some influence.

Right now Iā€™m on vacation, so I didnā€™t work on it in the last days but I plan to continue when I am back in mid September.

One thing you could test, if not done already:
Connect the board, power it on (without starting the robot software on the PC) and then monitor the CAN interface with candump -e can0. It should at this point receive one message per second (a status message from the board). If you see this, it means that the communication is generally working. Then start the robot software and see what happens in the can dump. In the cases where it is not working for me, the communication simply stops after a few moments (due to error on the board).

Best,
Felix

Hi Felix,

Sorry itā€™s been a bit since I have responded ā€“ things were hectic with the school year starting back up.
So here is what I did:

I first plugged in one single finger into CAN ports ā€œ4ā€ and ā€œ5ā€. I ran the above command, and then I turned the power onto the system. After which, this was output:Screenshot from 2021-09-24 16-44-03

I believe this is similar to what you see, with an error statement appearing once per second.

One odd thing is that I see NO messages coming through can5, which is the other port that I am plugged into.

Second, I started the ā€œsingle_finger_testā€ while monitoring can4, which output this:

There were no further messages after this, just like you said. It just output this error and then stopped outputting anything.

Hope this helps and let me know if you want me to run any other tests!!
Ross

Hi Ross,

Thanks for running the test. Your issue with can4 looks indeed like the issue I am observing. Unfortunately, I still donā€™t have a solution for this.

Itā€™s weird that you donā€™t receive anything on can5. Maybe there is something wrong with the cable or the corresponding board is not flashed correctly?

Best,
Felix

Hi Ross,

whenever you have time, could you repeat the test with the following modification in the initialize_can_bus.sh?
Append sample-point 0.667 fd off fd-non-iso off to the ip commands configuring the bitrate, e.g. for can0:

sudo ip link set can0 type can bitrate 1000000 sample-point 0.667 fd off fd-non-iso off

The sample point that is set there is actually not matching with the one used by the motor boards (which is 0.867) but when I tested today, it was mostly working for me with this setting (interestingly, it did not work when I set 0.867 explicitly).
Iā€™m not yet sure if this is really improving something or if it was just coincidence (this all seems to be a bit non-deterministicā€¦) so Iā€™ll do some more tests in the next days but in any case it would be interesting to see if this improves anything for you.

Instead of running single_finger_test, you can also use this script, which just enables a single board without sending any further commands. When running it, simply pass the can interface (e.g. ā€œcan0ā€) as argument.

One important note: While it was always working for me today, running the script linked above, I observed a somewhat weird behaviour when running an application on the whole finger: It always failed on the first attempt after initialising the CAN interfaces. However, when killing the application, power-cycling the finger and trying again (without re-initialising the CAN interfaces) it was working. Could you try that as well?

For all the tests, I had candump running so I could immediately see when the communication stopped.

1 Like

Hi Felix,

I got into the lab today to run this test. First, I changed the initialize_can_bus.sh script as you stated. I changed each line, even though right now I am only using can4 and can5.

While running the candump -e can4 command, I plugged the boards into the power supply and turned it on, which began producing many messages of can4 010 [1] 01.

Next, I ran the linked script (start_motors_via_can.sh) which produced the following stream of errors, after which the communication stopped:

Next, as you stated, I turned the power supply off, waited a bit, turned it back on, and re-ran the start_motors_via_can.sh script, at which point a never-ending stream of messages began appearing on screen:

Similar to this, but unlike before, the communication never stopped ā€“ it just continued to display errors extremely quickly (100s of lines per second) until killed the process.

Sorry itā€™s been a couple days here ā€“ hope this helps and please let me know if there are any other tests youā€™d like me to run!!

Thanks,
Ross

Thanks, this is good news! Those messages (IDs 020-050) are no errors but the sensor measurements (position, velocity and current) of the motors (i.e. it means that the board is sending data as requested). So it seems you are getting the same behaviour as I.

Can you try again with this configuration?

ip link set can4 type can \
    tq 12 prop-seg 26 phase-seg1 26 phase-seg2 27 sjw 2 \
    fd off fd-non-iso off

This worked best for me in the last tests.

Do you still have the issue with can5? In case this is fixed, you could try running the single_finger_test again.

1 Like

Oh, thatā€™s great news, yes! I did get a chance to get into the lab and test this new configuration out.
When running it, both can4 and can5 would receive consistent communication with the 020-050 messages.

I decided to run the single_finger_test and it ran the program, but didnā€™t really work as expected:

For example, the Single Finger Test Application screen (seen above) would come on, display that there were no errors, and just sort of jostle between -0.01 to 0.01 torque at each joint without changing the position. The actual robot itself didnā€™t physically move even though you could see it sort of ā€œjoltingā€ to life when you begin the test.

The ā€œqā€ button (for ā€œquitā€) worked successfully, and after quitting, I could restart the Single Finger Test again and get similar results.

Does your single_finger_test work with this configuration or is there something else wrong with mine?

Thanks!
Ross

Glad to hear that!
The single_finger_test just runs a position controller to hold the finger at its initial position, i.e. as long as you donā€™t touch it, there is no motion to be expected. However, when you manually move the finger away from the initial position, the controller should increase the torque, trying to bring it back. Likewise, you should see the position and velocity of the observation changing in this case. Does this happen?

Hi Felix,

Sorry itā€™s been a while since Iā€™ve been working on this project, but Iā€™m back on it again.

When running the single_finger_test, demo_single_finger_position_control, or demo_sparse_position_control, I get the following error.

Start homing.
Reached end stop.
BlmcJointModule::update_homing(): ERROR: Failed to find index with joint [0].
terminate called after throwing an instance of ā€˜std::runtime_errorā€™
what(): Robot needs to be initialized before applying actions. Run the initialize() method.

So, the initialize method is being called earlier in these scripts, but for whatever reason it isnā€™t able to finish initialization, and is throwing an error. Iā€™ve looked at this part in the cpp code, but I havenā€™t been able to figure out why itā€™s happening. This is a consistent and repeatable error which shows up even after power cycling and restarting the computer.

What Iā€™ve done so far is first confirm that all of the hardware is built correctly ā€“ I took the entire finger apart, double checked all of the connections with a multimeter to make sure they were working and everything was plugged in correctly, etc. I checked every port, every wire, etc. ā€“ everything looked good.

Next, I used the ā€˜candumpā€™ command along with the ā€˜start_motors_via_can.shā€™ script to observe the movement of the motors. I can see the data coming through the CAN ports and changing when I physically move each motor ā€“ this also looks good to me.

It looks like weā€™ve already solved everything with the CAN communication, so I tend to think it might be a hardware issue, but the hardware looks good to me, so Iā€™m a bit confusedā€¦Have you ever seen this error or have any idea what this might be?

Thanks again for your help!
Ross

Hi Ross,

Glad to see youā€™re still on it! :slight_smile:

The homing relies on the encoder index (a special encoder tick that appears only once per revolution). Maybe there is some problem with it on the upper joint (ā€œjoint [0]ā€).

The position of the index tick is transmitted via CAN using id 0x060 every time it is observed (see documentation).
So you should be able to check if it is detected using candump (ideally filtering, so it only shows the 0x060 messages). When you manually move the joint, it should send the message once per motor revolution.

If this is not the case, then it is probably either an issue with the encoder itself (e.g. damaged or dirty encoder disc) or with the cables connecting the encoder with the board. Maybe you can attach an oscilloscope to the ā€œIā€ line of the encoder cables (see here) to check if the index shows up there.

Wow, thank you so much for this! I didnā€™t see this documentation get launched, but itā€™s super helpful. Iā€™ll take a look at what else Iā€™ve missed in the last couple of months in Github.

I was able to quickly see that the 0x060 messages are coming through for both motors on can1 (joints 1 and 2), but nothing comes through with id 0x060 through can0 (joint 0). I will take this part of the finger apart and inspect it, test the ā€œIā€ line, potentially put a different encoder in there.

Iā€™ll let you know what I find.

Thank you for this help!!
Ross